CN109256152A - Speech assessment method and device, electronic equipment, storage medium - Google Patents
- Publication number
- CN109256152A CN109256152A CN201811327485.XA CN201811327485A CN109256152A CN 109256152 A CN109256152 A CN 109256152A CN 201811327485 A CN201811327485 A CN 201811327485A CN 109256152 A CN109256152 A CN 109256152A
- Authority
- CN
- China
- Prior art keywords
- voice data
- model
- data
- sample
- sample voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The present disclosure relates to a speech assessment method and device, an electronic device, and a storage medium, in the field of computer technology. The method comprises: extracting sample features from sample voice data, and training an acoustic model with the sample features to obtain a trained acoustic model; constructing a language model from standard text data corresponding to the sample voice data, and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data; training a scoring model with the acoustic features and prosodic features of the sample voice data, and scoring target speech data with the trained scoring model to obtain a score for the target speech data. The disclosure enables target speech data to be scored accurately.
Description
Technical field
This disclosure relates to field of computer technology, in particular to a kind of speech assessment method, speech assessment device,
Electronic equipment and computer readable storage medium.
Background
With the development of computer technology, oral evaluation systems can automatically recognize and assess students' speech.
Most speech evaluation systems in the related art are built on acoustic models trained from adult audio. Children's speech differs considerably from adult acoustic models, which leads to inaccurate recognition. In addition, most speech evaluation systems on the market rely on acoustic scores alone, combining acoustic features through linear regression or support vector machines to produce the final score. A linear combination of single acoustic scores, however, cannot match the scores given by professional teachers, so evaluation is inefficient and the resulting scores are not accurate enough.
It should be noted that the information disclosed in the above background section is provided only to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
The present disclosure aims to provide a speech assessment method and device, an electronic device, and a storage medium, so as to overcome, at least to some extent, the problem that speech cannot be scored accurately due to the limitations and defects of the related art.
Other features and advantages of the disclosure will become apparent from the following detailed description, or may be learned in part through practice of the disclosure.
According to one aspect of the disclosure, a speech assessment method is provided, comprising: extracting sample features from sample voice data, and training an acoustic model with the sample features to obtain a trained acoustic model; constructing a language model from standard text data corresponding to the sample voice data, and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data; training a scoring model with the acoustic features and prosodic features of the sample voice data, and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
In an exemplary embodiment of the disclosure, extracting sample features from sample voice data comprises: taking online voice data that has been manually scored and whose scoring results satisfy a preset condition as the sample voice data; and extracting Fbank features from the sample voice data as the sample features.
In an exemplary embodiment of the disclosure, training an acoustic model with the sample features to obtain a trained acoustic model comprises: training the acoustic model offline on the Fbank features to obtain a deep neural network–hidden Markov model (DNN-HMM) acoustic model.
In an exemplary embodiment of the disclosure, constructing a language model from standard text data corresponding to the sample voice data comprises: retaining the preset characters contained in the standard text data and mapping words outside the pronunciation dictionary to noise, thereby preprocessing the standard text data to obtain preprocessed standard text data; and constructing the language model from the preprocessed standard text data, the language model being a bigram language model.
In an exemplary embodiment of the disclosure, the method further comprises: obtaining prosodic features of the sample voice data from the sample voice data and the standard text data, the prosodic features including at least one of volume, tone, speech rate, fluency, and completeness.
In an exemplary embodiment of the disclosure, decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data comprises: decoding the sample voice data with the language model and the trained acoustic model to obtain decoded sample voice data; and extracting the score and duration features of each phoneme in the decoded sample voice data, and determining the acoustic features from the score and the duration features of each phoneme.
In an exemplary embodiment of the disclosure, the acoustic features include at least one of phoneme average score, phoneme score standard deviation, 39-dimensional phoneme performance features, phoneme average duration, and phoneme duration standard deviation.
In an exemplary embodiment of the disclosure, extracting the score and duration features of each phoneme in the decoded sample voice data comprises: obtaining a forced alignment result between the sample voice data and the text data from the text data and a pronunciation dictionary; and obtaining the score and duration features of each phoneme in the forced alignment result by a preset scoring rule.
In an exemplary embodiment of the disclosure, training a scoring model with the acoustic features and the prosodic features of the sample voice data comprises: training the scoring model with the acoustic features, the prosodic features, and manually labeled data to obtain a trained scoring model.
In an exemplary embodiment of the disclosure, scoring target speech data with the trained scoring model to obtain a score for the target speech data comprises: analyzing the acoustic features and prosodic features of the target speech data with the trained scoring model to obtain a score corresponding to the target speech data.
According to one aspect of the disclosure, a speech assessment device is provided, comprising: an acoustic model training module for extracting sample features from sample voice data and training an acoustic model with the sample features to obtain a trained acoustic model; an acoustic feature acquisition module for constructing a language model from standard text data corresponding to the sample voice data and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data; and a voice data evaluation module for training a scoring model with the acoustic features and prosodic features of the sample voice data and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
According to one aspect of the disclosure, an electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any of the speech assessment methods described above by executing the executable instructions.
According to one aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing any of the speech assessment methods described above when executed by a processor.
In the speech assessment method, speech assessment device, electronic device, and computer-readable storage medium provided in the exemplary embodiments of the disclosure, on the one hand, an acoustic model and a language model corresponding to the sample voice data are obtained from the sample voice data, and a scoring model is then obtained from the acoustic features and prosodic features of the sample voice data, so that the scoring model better fits the sample voice data and speech recognition accuracy is improved; on the other hand, because the scoring model combines acoustic features and prosodic features and supports fast recognition, the scores produced by the trained scoring model for target speech data are more accurate, improving scoring accuracy.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the disclosure.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification, illustrate embodiments consistent with the disclosure, and together with the specification serve to explain the principles of the disclosure. Obviously, the drawings described below are only some embodiments of the disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 schematically shows a speech assessment method in an exemplary embodiment of the disclosure;
Fig. 2 schematically shows a detailed flowchart of speech assessment in an exemplary embodiment of the disclosure;
Fig. 3 schematically shows a block diagram of a speech assessment device in an exemplary embodiment of the disclosure;
Fig. 4 schematically shows a block diagram of an electronic device in an exemplary embodiment of the disclosure;
Fig. 5 schematically shows a program product in an exemplary embodiment of the disclosure.
Detailed description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to those set forth herein; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of the embodiments of the disclosure. Those skilled in the art will recognize, however, that the technical solutions of the disclosure may be practiced without one or more of these specific details, or with other methods, components, devices, steps, and so on. In other cases, well-known solutions are not shown or described in detail to avoid obscuring aspects of the disclosure.
In addition, the drawings are merely schematic illustrations of the disclosure and are not necessarily drawn to scale. Identical reference numerals in the drawings denote identical or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
This example embodiment first provides a speech assessment method, which can be applied, for example, to evaluating children's spoken foreign language. The speech assessment method is described in detail below with reference to Fig. 1.
In step S110, sample features are extracted from sample voice data, and an acoustic model is trained with the sample features to obtain a trained acoustic model.
In this example embodiment, the sample voice data may be online voice data; to score children's speech, the sample voice data may be children's online voice data. The sample voice data may be English voice data or voice data in another language; English voice data is used as an example here. To make the resulting model more accurate, the sample voice data may be children's online voice data that has been manually scored and whose manual scoring results satisfy a preset condition; the manual scoring may, for example, be scoring by professional teachers. The sample voice data may be of the same type as the voice data to be evaluated, or of a different type; this is not specifically limited here. The preset condition can be used to filter multiple items of children's online voice data, and may specifically be a full mark for completeness and a high score for pronunciation accuracy in the manual scoring. If multiple items have high pronunciation-accuracy scores, the children's online voice data can be sorted in descending order of pronunciation-accuracy score, and the top N items selected as the sample voice data. For example, if online voice data 1 is manually scored with a full mark for completeness and 90 points for pronunciation accuracy, and online voice data 2 is manually scored with a full mark for completeness and 98 points for pronunciation accuracy, online voice data 2 can be chosen as the sample voice data. The acquired sample voice data may be PCM (Pulse Code Modulation) format audio; if the acquired sample voice data is in MP3 or another format, it must first be converted to PCM format before further processing.
After the sample voice data is selected, sample features can be extracted from it so that an acoustic model targeted at the sample voice data can be trained on those features. The sample features may be Fbank features or MFCCs (Mel Frequency Cepstral Coefficients). Because Fbank features retain more correlation than MFCC features and are better suited to model training, Fbank features are used as the sample features in this example embodiment.
Extracting the Fbank features of the sample voice data may specifically involve a Fourier transform, computation of the energy spectrum, Mel filtering, and similar steps. Next, an acoustic model can be trained on the extracted Fbank features. The acoustic model may be a deep neural network–hidden Markov model (DNN-HMM), a traditional GMM-HMM, or another suitable machine learning model. Because a GMM-HMM is inferior to the DNN-HMM combination in word error rate and system robustness, a DNN-HMM is used as the acoustic model in this example embodiment. Note that the parameters of the acoustic model can be fitted to the Fbank features of the sample voice data during training to obtain well-performing parameters, yielding a trained DNN-HMM acoustic model. Further, the Fbank features of the sample voice data can be fed into the trained acoustic model, which outputs, for each phoneme in the sample voice data, the probability that it corresponds to each phonetic symbol.
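As an illustration, a minimal Fbank extraction sketch in Python is given below; it assumes 16 kHz PCM audio and uses librosa's Mel filter bank. The frame length, hop size, and 40-filter configuration are illustrative choices, not values fixed by the disclosure.

```python
import librosa
import numpy as np

def extract_fbank(path, sr=16000, n_mels=40):
    """Log Mel filter-bank (Fbank) features: STFT -> energy spectrum -> Mel filtering -> log."""
    y, sr = librosa.load(path, sr=sr)          # decode the audio to a waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400, hop_length=160,             # 25 ms frames shifted by 10 ms at 16 kHz
        n_mels=n_mels, power=2.0)              # energy spectrum through a Mel filter bank
    fbank = np.log(mel + 1e-10).T              # (frames, n_mels), log compression
    return fbank
```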
Next, in step S120, a language model is constructed from standard text data corresponding to the sample voice data, and the sample voice data is decoded with the language model and the trained acoustic model to obtain the acoustic features of the sample voice data.
In this example embodiment, the standard text data is the correct text of the sample voice data. For example, the standard text data corresponding to sample voice data 1 is: "I got it from my best friend". The text data corresponding to the sample voice data needs to be compared against the standard text data to determine the accuracy of the sample voice data. The language model describes the probability of a sentence occurring; it can effectively combine syntactic and semantic knowledge and describe the internal relations between words, thereby improving the recognition rate and narrowing the search space.
Specifically, when constructing the language model, the standard text data can first be preprocessed, so that a more accurate language model is obtained from the preprocessed standard text data. The preprocessing may involve two steps. First, the preset characters contained in the standard text data are retained, where the preset characters are characters outside ASCII encoding, such as punctuation marks and characters of other languages (for example, Chinese characters). That is, the punctuation marks and Chinese characters contained in the standard text data can be retained first and added back into subsequent results; retaining the preset characters in the standard text data guarantees consistency between input and output. Second, words that do not appear in the pronunciation dictionary or word list, such as compound words, misspelled words, and invented words, can uniformly be treated as unrecognizable words (Unknown); these unrecognizable words can further be mapped uniformly to noise, so that their influence is ignored when the voice data is finally scored.
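A minimal preprocessing sketch is shown below; it assumes the pronunciation dictionary is available as a set of known words, and the `<NOISE>` token name is an illustrative choice rather than something specified by the disclosure.

```python
import re

def preprocess_text(text, lexicon):
    """Keep non-ASCII preset characters and punctuation; map OOV words to a noise token."""
    kept = []
    for tok in re.findall(r"\S+", text):
        word = tok.strip(".,!?;:\"'").lower()
        if not word or not word.isascii():
            kept.append(tok)                 # retain punctuation-only / non-ASCII characters as-is
        elif word in lexicon:
            kept.append(tok)                 # in-dictionary word, kept unchanged
        else:
            kept.append("<NOISE>")           # out-of-dictionary word mapped to noise
    return " ".join(kept)
```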
Further, the language model is constructed from the preprocessed standard text data, that is, the preprocessed standard text data is analyzed grammatically and semantically to obtain the language model. The language model is a bigram language model: in the standard text data, the N-th word depends only on the (N-1)-th word immediately before it and is independent of all other words. The probability of each word in the standard text data occurring can then be evaluated, and the probability of the whole sentence occurring is obtained as the product of the occurrence probabilities of the individual words. For example, suppose S denotes the standard text data, composed of a sequence of words (w1, w2, ..., wm) in a particular order, where m is the length of the standard text data, i.e. the number of words. The probability P(S) that the standard text data S occurs in the whole corpus, i.e. P(w1, w2, ..., wm), can then be expressed under the bigram language model as

P(S) = P(w1, w2, ..., wm) = P(w1) · P(w2 | w1) · P(w3 | w2) · ... · P(wm | wm-1)
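For illustration, a minimal bigram estimator in Python follows; it uses maximum-likelihood counts with add-one smoothing, which is one common choice and an assumption of this sketch, not a detail fixed by the disclosure.

```python
from collections import Counter
import math

def train_bigram(sentences):
    """MLE unigram/bigram counts over tokenized sentences, with <s>/</s> boundary markers."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        seq = ["<s>"] + words + ["</s>"]
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    return uni, bi

def sentence_logprob(words, uni, bi, vocab_size):
    """log P(S) = sum_i log P(w_i | w_{i-1}), with add-one smoothing."""
    seq = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(seq, seq[1:]))
```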
After the language model is obtained, it can be combined with the acoustic model trained in step S110 to form the decoder shown in Fig. 2, which decodes the sample voice data to obtain the text data corresponding to the sample voice data; that text data may or may not be identical to the standard text data. The decoder is mainly used, given a determined input feature sequence, to search the space composed of four knowledge sources — the acoustic model, acoustic context, pronunciation dictionary, and language model — by Viterbi search to find the best word string, i.e. the text data that best matches the sample voice data.
Because the size of the language model directly affects decoding speed, decoding with a general-purpose language model is too slow — such a model is far too large — and it cannot analyze cases where the user repeats or skips words. In this example embodiment, the sample voice data is decoded with the bigram language model and the acoustic model, which yields the probability of individual words and of word sequences occurring in the sample voice data, so that skipped and stressed or repeated words can be detected; at the same time, because the language model is small, decoding speed is also guaranteed.
On this basis, after the sample voice data is decoded, the score and duration features of each phoneme in the decoded sample voice data can be extracted. Specifically, the sample voice data and its corresponding text data can be force-aligned according to the text data and the pronunciation dictionary. Forced alignment means cutting a piece of sample voice data into phoneme segments and obtaining the start and end time of each phoneme using the DNN-HMM model; in the forced-alignment result, the start and end time of each phoneme within the sample voice data are fully determined.
Forced alignment of audio can be realized by the Viterbi decoding algorithm. Specifically, the audio can be cut, with overlapping shifts, into very short frames to obtain multiple samples of the audio; the frame length can be, for example, 5 ms, 10 ms, or another value. For instance, the frame length can be 25 ms, with successive frames shifted backward by 10 ms to obtain the samples. Features are extracted from each sample of the audio and compared against the standard-pronunciation target features; bi(Ot) denotes the similarity between the t-th sample and the model of the i-th phonetic symbol. δt(i) denotes the maximum probability that the audio reaches phonetic symbol i at sampling time t; the result δt+1(i) for time t+1 can then be derived from the t-th sample. During decoding, t is incremented from 0 until the audio ends, finally yielding δN(i) for each phonetic symbol i. Because of the simplicity of forced alignment, it usually has high accuracy.
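A minimal forced-alignment sketch in Python follows. It assumes a matrix of per-frame log-likelihoods logb[t][i] for the known phone sequence (the δ recursion above), with only self-loop and advance transitions and at least as many frames as phones; transition costs are omitted for brevity, which is an assumption of this sketch.

```python
import numpy as np

def force_align(logb):
    """Viterbi alignment of T frames to P phones in fixed order.
    logb: (T, P) log-likelihood of frame t under phone i (assumes T >= P).
    Returns, per phone, its (start_frame, end_frame)."""
    T, P = logb.shape
    delta = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    delta[0, 0] = logb[0, 0]                             # must start in the first phone
    for t in range(1, T):
        for i in range(P):
            stay = delta[t - 1, i]                       # self-loop: remain in phone i
            move = delta[t - 1, i - 1] if i else -np.inf # advance from phone i-1
            back[t, i] = 0 if stay >= move else 1
            delta[t, i] = max(stay, move) + logb[t, i]
    path, i = [P - 1], P - 1                             # backtrace from the final phone
    for t in range(T - 1, 0, -1):
        i -= back[t, i]
        path.append(i)
    path.reverse()
    bounds = {}
    for t, i in enumerate(path):                         # first/last frame per phone
        s, _ = bounds.get(i, (t, t))
        bounds[i] = (s, t)
    return [bounds[i] for i in range(P)]
```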
After forced alignment is complete, it is known which segment of the sample voice data each phoneme of the text data corresponds to, as well as the user's actual pronunciation in the sample voice data. On this basis, a preset scoring rule can be used to measure how accurately each phoneme in the text data corresponding to the sample voice data is pronounced. The preset scoring rule can be the GOP (Goodness of Pronunciation) algorithm. Let qi be the phoneme in the text data currently being scored, A the speech segment corresponding to qi after forced alignment, and NF(A) the number of frames of that segment; the GOP score is then

GOP(qi) = (1 / NF(A)) · log P(qi | A)

The GOP score is in fact a conditional probability: it describes the probability that this speech segment corresponds to phoneme qi given the observed user speech A. The higher this probability, the more accurate the pronunciation; the lower it is, the worse the pronunciation. That is, the GOP algorithm makes it possible to score the pronunciation of each phoneme in the text data corresponding to the sample voice data, and the forced-alignment result also identifies the speech frames corresponding to each phoneme, so the positions of pronunciation errors can be located.
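A sketch of frame-level GOP computation follows; it assumes the DNN acoustic model exposes per-frame phone posteriors and normalizes the forced phone's log posterior by the best competing phone, which is a common GOP variant and an assumption here rather than the exact rule of the disclosure.

```python
import numpy as np

def gop_score(post, start, end, phone_id):
    """GOP(q) = (1/NF) * sum_t [log p(q|o_t) - log max_p p(p|o_t)]
    post: (T, n_phones) per-frame phone posteriors from the acoustic model.
    start, end: frame span of this phone from forced alignment."""
    seg = post[start:end + 1]                        # the NF(A) frames aligned to phone q
    log_target = np.log(seg[:, phone_id] + 1e-10)    # posterior of the expected phone
    log_best = np.log(seg.max(axis=1) + 1e-10)       # best competing phone per frame
    return float(np.mean(log_target - log_best))     # 0 is perfect; more negative is worse
```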
The preset scoring rule yields the score and duration features of each phoneme. The score of a phoneme refers to its pronunciation accuracy; the duration feature of a phoneme refers to its duration, i.e. the articulation rate. Further, the acoustic features of the sample voice data can be determined from the score and duration features of each phoneme. The acoustic features include, but are not limited to, one or more of the phoneme average score, the phoneme score standard deviation, the 39-dimensional phoneme performance features, the phoneme average duration, and the phoneme duration standard deviation. The phoneme average score is the average of the GOP scores of all phonemes, computed from the pronunciation dictionary and the forced alignment of the decoding; it reflects the overall pronunciation level of the sample voice data and is the most fundamental index for evaluating pronunciation. The phoneme score standard deviation is the standard deviation of all phoneme scores and reflects the stability of the user's pronunciation. The 39-dimensional phoneme performance features are the average scores obtained for each of the 39 phonemes — the dimensionality is the number of phonemes — and characterize the user's performance on each phoneme. The phoneme average duration is the average of the phoneme durations and indicates how fast the user speaks. The phoneme duration standard deviation is the standard deviation of all phoneme durations and indicates the stability of the user's pronunciation. The preset scoring rule thus makes it possible to obtain the score and duration features of each phoneme more accurately.
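As an illustration, the aggregation of per-phoneme scores and durations into the acoustic feature vector can be sketched as follows; the 39-phoneme inventory and its ordering are assumptions matching the 39-dimensional feature described above.

```python
import numpy as np

def acoustic_features(phones, scores, durations, n_phones=39):
    """Build the acoustic feature vector: mean/std of phoneme scores and durations,
    plus a 39-dim per-phoneme average score. phones: int ids in [0, n_phones)."""
    phones = np.asarray(phones)
    scores = np.asarray(scores)
    per_phone = np.zeros(n_phones)
    for p in range(n_phones):
        mask = phones == p
        if mask.any():
            per_phone[p] = scores[mask].mean()       # 39-dim phoneme performance features
    stats = [scores.mean(), scores.std(),            # phoneme average score / score std
             np.mean(durations), np.std(durations)]  # average duration / duration std
    return np.concatenate([stats, per_phone])        # 4 + 39 = 43-dim feature vector
```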
In step S130, a scoring model is trained with the acoustic features and the prosodic features of the sample voice data, and target speech data is scored with the trained scoring model to obtain a score for the target speech data.
In this example embodiment, the prosodic features corresponding to the sample voice data are important indicators for judging speech. Specifically, the prosodic features of the sample voice data can be obtained from the sample voice data and its corresponding standard text data; they can include, but are not limited to, one or more of volume, tone, speech rate, fluency, and completeness.
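A sketch of simple prosodic feature extraction follows; volume is approximated as RMS energy, tone as the median pitch from librosa's pYIN tracker, and speech rate as words per second from the reference transcript — all illustrative stand-ins for whatever measures an implementation actually uses.

```python
import librosa
import numpy as np

def prosodic_features(path, transcript, sr=16000):
    """Volume (RMS energy), tone (median F0), and speech rate (words/second)."""
    y, sr = librosa.load(path, sr=sr)
    volume = float(np.mean(librosa.feature.rms(y=y)))        # loudness proxy
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    tone = float(np.nanmedian(f0)) if np.any(voiced) else 0.0
    rate = len(transcript.split()) / (len(y) / sr)           # words per second
    return np.array([volume, tone, rate])
```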
After the prosodic features are obtained, a scoring model for evaluating voice data can be trained with the acoustic features, the prosodic features, and manually labeled data of the sample voice data. The manually labeled data refers to the scores given to the sample voice data by human raters. Specifically, the acoustic features and prosodic features can be fitted to the manually labeled data to train the scoring model and obtain a trained scoring model. The scoring model in this example embodiment can be a regression tree model, such as an XGBoost model. For example, if sample voice data 1 receives a manually labeled score of 98, the acoustic features and prosodic features corresponding to sample voice data 1 can be trained until the model output fits 98, and a trained scoring model is obtained from the fitted parameters. In this example embodiment, training the scoring model by fitting acoustic features and prosodic features to manually labeled data improves the accuracy of the resulting scoring model, so that sample voice data can be scored more accurately. Because the prosodic features of the sample data are added, the inaccuracy caused by evaluation with single acoustic features alone is avoided; the scores are close to the ground truth obtained by manual scoring, the agreement with manual scores is higher, the accuracy and reference value of speech evaluation are improved, and user satisfaction is increased.
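A minimal training sketch using XGBoost's scikit-learn interface is shown below; the hyperparameters are illustrative defaults, not values specified by the disclosure.

```python
import numpy as np
from xgboost import XGBRegressor

# X: one row per utterance = acoustic features concatenated with prosodic features;
# y: the human rater's score for that utterance (the manually labeled data).
def train_scoring_model(X, y):
    model = XGBRegressor(
        n_estimators=300, max_depth=4,
        learning_rate=0.05, objective="reg:squarederror")
    model.fit(np.asarray(X), np.asarray(y))   # fit the features to the human scores
    return model

# Scoring a target utterance:
# score = train_scoring_model(X, y).predict(target_features.reshape(1, -1))[0]
```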
After a well-performing trained scoring model is obtained, target speech data to be evaluated can be fed into the trained scoring model, which analyzes the acoustic features and prosodic features of the target speech data to obtain a score corresponding to the target speech data. The target speech data can be of the same type as the sample voice data, e.g. children's voice data. The method for extracting the acoustic features and prosodic features of the target speech data is similar to that in steps S110 and S120 and is not repeated here. Note that the standard text of the target speech data still needs to be preprocessed — retaining the preset characters contained in the standard text data and mapping words absent from the pronunciation dictionary to noise — before scoring.
Note also that the scoring criteria for target speech data may differ across application scenarios, so the scoring model's output can be shifted up or down as needed, so that target speech data is scored accurately in different scenarios. Specifically, a corresponding nonlinear mapping can be set for the evaluation score according to different scenarios and product requirements, so as to provide scores on different dimensions, as sketched below.
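As one possible realization of that nonlinear mapping, a logistic rescaling of the raw model output to a 0–100 product scale is sketched here; the midpoint and slope are per-scenario tuning knobs and purely illustrative assumptions.

```python
import math

def map_score(raw, midpoint=60.0, slope=0.1, lo=0.0, hi=100.0):
    """Nonlinear mapping of a raw scoring-model output to a product-facing scale."""
    return lo + (hi - lo) / (1.0 + math.exp(-slope * (raw - midpoint)))
```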
Fig. 2 shows the detailed flow of speech assessment in the speech assessment system, which contains the acoustic model, the language model, and the scoring model. The specific steps are:
Step S201: obtain sample voice data; the sample voice data can be PCM format audio.
Step S202: extract sample features from the sample voice data; the sample features can be Fbank features.
Step S203: train the acoustic model on the sample features; the acoustic model can be a DNN-HMM model.
Step S204: obtain the standard text data corresponding to the sample voice data.
Step S205: generate the language model from the standard text data corresponding to the sample voice data; the model can be a bigram model.
Step S206: form a decoder from the language model and the acoustic model to decode the sample voice data.
Step S207: evaluate the decoded sample voice data with the GOP algorithm to obtain the acoustic features.
Step S208: generate the prosodic features corresponding to the sample voice data from the standard text data and the PCM format audio.
Step S209: generate the scoring model from the acoustic features, the prosodic features, and the manually labeled data — the scoring model can be an XGBoost model — and obtain the score of the target speech data with the trained scoring model.
With the method in Fig. 2, an acoustic model corresponding to the sample voice data can be constructed, the sample voice data can then be decoded quickly with the acoustic model and language model, and the scoring model can be trained on the acoustic features and prosodic features, making the trained scoring model more accurate, so that target speech data is scored accurately.
The disclosure further provides a speech assessment device. As shown in Fig. 3, the speech assessment device 300 may include:
an acoustic model training module 301, which can be used to extract sample features from sample voice data and train an acoustic model with the sample features to obtain a trained acoustic model;
an acoustic feature acquisition module 302, which can be used to construct a language model from standard text data corresponding to the sample voice data and decode the sample voice data with the language model and the trained acoustic model to obtain the acoustic features of the sample voice data;
a voice data evaluation module 303, which can be used to train a scoring model with the acoustic features and the prosodic features of the sample voice data and score target speech data with the trained scoring model to obtain the score of the target speech data.
The details of each module in the above speech assessment device have been described in detail in the corresponding speech assessment method and are not repeated here.
It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, this division is not mandatory. In fact, according to embodiments of the disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied in multiple modules or units.
In addition, although the steps of the method in the disclosure are described in a particular order in the drawings, this does not require or imply that the steps must be executed in that particular order, or that all of the steps shown must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step, and/or one step may be decomposed into multiple steps, and so on.
In an exemplary embodiment of the disclosure, an electronic device capable of implementing the above method is also provided.
A person skilled in the art will appreciate that various aspects of the invention can be implemented as a system, a method, or a program product. Therefore, various aspects of the invention can be embodied in the following forms: a complete hardware embodiment, a complete software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may collectively be referred to here as a "circuit", "module", or "system".
The electronic device 400 according to this embodiment of the invention is described below with reference to Fig. 4. The electronic device 400 shown in Fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the invention.
As shown in Fig. 4, the electronic device 400 takes the form of a general-purpose computing device. The components of the electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, and a bus 430 connecting the different system components (including the storage unit 420 and the processing unit 410).
The storage unit stores program code, which can be executed by the processing unit 410 so that the processing unit 410 performs the steps of the various exemplary embodiments of the invention described in the "Exemplary methods" part of this specification. For example, the processing unit 410 can perform the steps shown in Fig. 1: in step S110, extracting sample features from sample voice data and training an acoustic model with the sample features to obtain a trained acoustic model; in step S120, constructing a language model from standard text data corresponding to the sample voice data and decoding the sample voice data with the language model and the trained acoustic model to obtain the acoustic features of the sample voice data; in step S130, training a scoring model with the acoustic features and the prosodic features of the sample voice data and scoring target speech data with the trained scoring model to obtain the score of the target speech data.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory (RAM) unit 4201 and/or a cache memory unit 4202, and may further include a read-only memory (ROM) unit 4203.
The storage unit 420 may also include a program/utility 4204 having a set of (at least one) program modules 4205, such program modules 4205 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 430 may represent one or more of several classes of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, the processing unit, or a local bus using any of a variety of bus structures.
The display unit 440 can be a display with a display function, so that the processing results obtained by the processing unit 410 executing the method in this example embodiment are shown through the display. The display includes, but is not limited to, a liquid crystal display or another display.
The electronic device 400 can also communicate with one or more external devices 600 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 400 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 450. Also, the electronic device 400 can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 460. As shown, the network adapter 460 communicates with the other modules of the electronic device 400 through the bus 430. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, etc.
In an exemplary embodiment of the disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, various aspects of the invention can also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the various exemplary embodiments of the invention described in the "Exemplary methods" part of this specification.
As shown in Fig. 5, a program product 500 for implementing the above method according to an embodiment of the invention is described. It can adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the invention is not limited to this; in this document, a readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by, or in conjunction with, an instruction execution system, apparatus, or device.
The program product can adopt any combination of one or more readable media. The readable medium can be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. A readable signal medium can also be any readable medium other than a readable storage medium; the readable medium can send, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any appropriate combination of the above.
Program code for carrying out the operations of the invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In situations involving a remote computing device, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the invention, and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the temporal order of these processes. It is also easy to understand that these processes can be executed synchronously or asynchronously, for example, in multiple modules.
Those skilled in the art will readily think of other embodiments of the disclosure after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include common knowledge or conventional techniques in the art not disclosed by the disclosure. The specification and examples are to be considered exemplary only, and the true scope and spirit of the disclosure are indicated by the claims.
Claims (13)
1. A speech assessment method, characterized by comprising:
extracting sample features from sample voice data, and training an acoustic model with the sample features to obtain a trained acoustic model;
constructing a language model from standard text data corresponding to the sample voice data, and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data;
training a scoring model with the acoustic features and prosodic features of the sample voice data, and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
2. The speech assessment method according to claim 1, characterized in that extracting sample features from sample voice data comprises:
taking online voice data that has been manually scored and whose scoring results satisfy a preset condition as the sample voice data;
extracting Fbank features from the sample voice data as the sample features.
3. The speech assessment method according to claim 2, characterized in that training an acoustic model with the sample features to obtain a trained acoustic model comprises:
training the acoustic model offline on the Fbank features to obtain a deep neural network–hidden Markov model acoustic model.
4. The speech assessment method according to claim 1, characterized in that constructing a language model from standard text data corresponding to the sample voice data comprises:
retaining the preset characters contained in the standard text data, and mapping words outside the pronunciation dictionary to noise, thereby preprocessing the standard text data to obtain preprocessed standard text data;
constructing the language model from the preprocessed standard text data, the language model being a bigram language model.
5. The speech assessment method according to claim 1, characterized in that the method further comprises:
obtaining prosodic features of the sample voice data from the sample voice data and the standard text data, the prosodic features including at least one of volume, tone, speech rate, fluency, and completeness.
6. The speech assessment method according to claim 1, characterized in that decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data comprises:
decoding the sample voice data with the language model and the trained acoustic model to obtain decoded sample voice data;
extracting the score and duration features of each phoneme in the decoded sample voice data, and determining the acoustic features from the score and the duration features of each phoneme.
7. The speech assessment method according to claim 6, characterized in that the acoustic features include at least one of phoneme average score, phoneme score standard deviation, 39-dimensional phoneme performance features, phoneme average duration, and phoneme duration standard deviation.
8. The speech assessment method according to claim 6, characterized in that extracting the score and duration features of each phoneme in the decoded sample voice data comprises:
obtaining a forced alignment result between the sample voice data and the text data from the text data and a pronunciation dictionary;
obtaining the score and duration features of each phoneme in the forced alignment result by a preset scoring rule.
9. The speech assessment method according to claim 1, characterized in that training a scoring model with the acoustic features and the prosodic features of the sample voice data comprises:
training the scoring model with the acoustic features, the prosodic features, and manually labeled data to obtain a trained scoring model.
10. The speech assessment method according to claim 1, characterized in that scoring target speech data with the trained scoring model to obtain a score for the target speech data comprises:
analyzing the acoustic features and prosodic features of the target speech data with the trained scoring model to obtain a score corresponding to the target speech data.
11. A speech assessment device, characterized by comprising:
an acoustic model training module for extracting sample features from sample voice data and training an acoustic model with the sample features to obtain a trained acoustic model;
an acoustic feature acquisition module for constructing a language model from standard text data corresponding to the sample voice data and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data;
a voice data evaluation module for training a scoring model with the acoustic features and prosodic features of the sample voice data and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
12. An electronic device, characterized by comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the speech assessment method of any one of claims 1-10 via execution of the executable instructions.
13. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the speech assessment method of any one of claims 1-10 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811327485.XA | 2018-11-08 | 2018-11-08 | Speech assessment method and device, electronic equipment, storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811327485.XA | 2018-11-08 | 2018-11-08 | Speech assessment method and device, electronic equipment, storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109256152A true CN109256152A (en) | 2019-01-22 |
Family
ID=65042980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811327485.XA (Pending) | Speech assessment method and device, electronic equipment, storage medium | 2018-11-08 | 2018-11-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109256152A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197084A (en) * | 2007-11-06 | 2008-06-11 | 安徽科大讯飞信息科技股份有限公司 | Automatic spoken English evaluating and learning system |
CN103065626A (en) * | 2012-12-20 | 2013-04-24 | 中国科学院声学研究所 | Automatic grading method and automatic grading equipment for read questions in test of spoken English |
CN103151042A (en) * | 2013-01-23 | 2013-06-12 | 中国科学院深圳先进技术研究院 | Full-automatic oral language evaluating management and scoring system and scoring method thereof |
CN107945788A (en) * | 2017-11-27 | 2018-04-20 | 桂林电子科技大学 | A kind of relevant Oral English Practice pronunciation error detection of text and quality score method |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583906B (en) * | 2019-02-18 | 2023-08-15 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice session |
CN111583906A (en) * | 2019-02-18 | 2020-08-25 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice conversation |
CN111640452B (en) * | 2019-03-01 | 2024-05-07 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN111640452A (en) * | 2019-03-01 | 2020-09-08 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110289015B (en) * | 2019-05-27 | 2021-09-17 | 北京大米科技有限公司 | Audio processing method, device, server, storage medium and system |
CN110289015A (en) * | 2019-05-27 | 2019-09-27 | 北京大米科技有限公司 | A kind of audio-frequency processing method, device, server, storage medium and system |
CN110490428A (en) * | 2019-07-26 | 2019-11-22 | 合肥讯飞数码科技有限公司 | Job of air traffic control method for evaluating quality and relevant apparatus |
CN110853628A (en) * | 2019-11-18 | 2020-02-28 | 苏州思必驰信息科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN112951277A (en) * | 2019-11-26 | 2021-06-11 | 新东方教育科技集团有限公司 | Method and device for evaluating speech |
CN112951277B (en) * | 2019-11-26 | 2023-01-13 | 新东方教育科技集团有限公司 | Method and device for evaluating speech |
CN111128181A (en) * | 2019-12-09 | 2020-05-08 | 科大讯飞股份有限公司 | Recitation question evaluation method, device and equipment |
CN110992938A (en) * | 2019-12-10 | 2020-04-10 | 同盾控股有限公司 | Voice data processing method and device, electronic equipment and computer readable medium |
CN115066679B (en) * | 2020-03-25 | 2024-02-20 | 苏州七星天专利运营管理有限责任公司 | Method and system for extracting self-made terms in professional field |
CN115066679A (en) * | 2020-03-25 | 2022-09-16 | 苏州七星天专利运营管理有限责任公司 | Method and system for extracting self-made terms in professional field |
CN111627445A (en) * | 2020-05-26 | 2020-09-04 | 福建省海峡智汇科技有限公司 | Matching method and system for site or personnel |
CN111627445B (en) * | 2020-05-26 | 2023-07-07 | 福建省海峡智汇科技有限公司 | Matching method and system for sites or personnel |
CN112201225A (en) * | 2020-09-30 | 2021-01-08 | 北京大米科技有限公司 | Corpus obtaining method and device, readable storage medium and electronic equipment |
CN112201225B (en) * | 2020-09-30 | 2024-02-02 | 北京大米科技有限公司 | Corpus acquisition method and device, readable storage medium and electronic equipment |
CN112257407A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Method and device for aligning text in audio, electronic equipment and readable storage medium |
CN112257407B (en) * | 2020-10-20 | 2024-05-14 | 网易(杭州)网络有限公司 | Text alignment method and device in audio, electronic equipment and readable storage medium |
CN112581939A (en) * | 2020-12-06 | 2021-03-30 | 中国南方电网有限责任公司 | Intelligent voice analysis method applied to power dispatching normative evaluation |
CN112397048A (en) * | 2020-12-10 | 2021-02-23 | 标贝(北京)科技有限公司 | Pronunciation stability evaluation method, device and system for speech synthesis and storage medium |
CN112397048B (en) * | 2020-12-10 | 2023-07-14 | 标贝(北京)科技有限公司 | Speech synthesis pronunciation stability evaluation method, device and system and storage medium |
CN112767932A (en) * | 2020-12-11 | 2021-05-07 | 北京百家科技集团有限公司 | Voice evaluation system, method, device, equipment and computer readable storage medium |
CN112669810B (en) * | 2020-12-16 | 2023-08-01 | 平安科技(深圳)有限公司 | Speech synthesis effect evaluation method, device, computer equipment and storage medium |
CN112669810A (en) * | 2020-12-16 | 2021-04-16 | 平安科技(深圳)有限公司 | Speech synthesis effect evaluation method and device, computer equipment and storage medium |
CN112668617A (en) * | 2020-12-21 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Power grid employee work satisfaction evaluation method and device |
CN112908359A (en) * | 2021-01-31 | 2021-06-04 | 云知声智能科技股份有限公司 | Voice evaluation method and device, electronic equipment and computer readable medium |
CN112802456A (en) * | 2021-04-14 | 2021-05-14 | 北京世纪好未来教育科技有限公司 | Voice evaluation scoring method and device, electronic equipment and storage medium |
CN115346421A (en) * | 2021-05-12 | 2022-11-15 | 北京猿力未来科技有限公司 | Spoken language fluency scoring method, computing device and storage medium |
US11869483B2 (en) * | 2021-10-07 | 2024-01-09 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US11769481B2 (en) | 2021-10-07 | 2023-09-26 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US20230110905A1 (en) * | 2021-10-07 | 2023-04-13 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
CN115223588A (en) * | 2022-03-24 | 2022-10-21 | 华东师范大学 | Child voice phrase matching method based on pinyin distance and sliding window |
CN115273897A (en) * | 2022-08-05 | 2022-11-01 | 北京有竹居网络技术有限公司 | Method, apparatus, device and storage medium for processing voice data |
CN115359808A (en) * | 2022-08-22 | 2022-11-18 | 北京有竹居网络技术有限公司 | Method for processing voice data, model generation method, model generation device and electronic equipment |
Similar Documents
Publication | Title |
---|---|
CN109256152A (en) | Speech assessment method and device, electronic equipment, storage medium | |
US8504367B2 (en) | Speech retrieval apparatus and speech retrieval method | |
CN111833853B (en) | Voice processing method and device, electronic equipment and computer readable storage medium | |
US20160314783A1 (en) | Method for building language model, speech recognition method and electronic apparatus | |
CN109741732A (en) | Name entity recognition method, name entity recognition device, equipment and medium | |
CN103677729B (en) | Voice input method and system | |
US20140039896A1 (en) | Methods and System for Grammar Fitness Evaluation as Speech Recognition Error Predictor | |
CN101551947A (en) | Computer system for assisting spoken language learning | |
CN109697988B (en) | Voice evaluation method and device | |
CN112397056B (en) | Voice evaluation method and computer storage medium | |
CN109377981B (en) | Phoneme alignment method and device | |
CN112466279B (en) | Automatic correction method and device for spoken English pronunciation | |
CN110782880B (en) | Training method and device for prosody generation model | |
CN110503956A (en) | Audio recognition method, device, medium and electronic equipment | |
WO2023093295A1 (en) | Artificial intelligence-based audio processing method and apparatus, electronic device, computer program product, and computer-readable storage medium | |
CN115116428B (en) | Prosodic boundary labeling method, device, equipment, medium and program product | |
CN109697975B (en) | Voice evaluation method and device | |
CN108364655A (en) | Method of speech processing, medium, device and computing device | |
CN114299930A (en) | End-to-end speech recognition model processing method, speech recognition method and related device | |
Thennattil et al. | Phonetic engine for continuous speech in Malayalam | |
CN112309429A (en) | Method, device and equipment for explosion loss detection and computer readable storage medium | |
CN114420159A (en) | Audio evaluation method and device and non-transient storage medium | |
CN111489742B (en) | Acoustic model training method, voice recognition device and electronic equipment | |
Li et al. | English sentence pronunciation evaluation using rhythm and intonation | |
Carson-Berndsen | Multilingual time maps: portable phonotactic models for speech technology |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-01-22 |