CN109256152A - Speech assessment method and device, electronic equipment, storage medium - Google Patents
- Publication number
- CN109256152A CN109256152A CN201811327485.XA CN201811327485A CN109256152A CN 109256152 A CN109256152 A CN 109256152A CN 201811327485 A CN201811327485 A CN 201811327485A CN 109256152 A CN109256152 A CN 109256152A
- Authority
- CN
- China
- Prior art keywords
- voice data
- model
- data
- sample
- sample voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The present disclosure relates to a speech assessment method and device, an electronic device, and a storage medium, in the field of computer technology. The method comprises: extracting sample features from sample voice data, and training an acoustic model with the sample features to obtain a trained acoustic model; constructing a language model from standard text data corresponding to the sample voice data, and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data; training a scoring model with the acoustic features and prosodic features of the sample voice data, and scoring target speech data with the trained scoring model to obtain a score for the target speech data. The disclosure enables target speech data to be scored accurately.
Description
Technical field
This disclosure relates to field of computer technology, in particular to a kind of speech assessment method, speech assessment device,
Electronic equipment and computer readable storage medium.
Background
With the development of computer technology, oral evaluation systems can automatically recognize and assess students' speech.
Most speech evaluation systems in the related art are built on acoustic models trained from adult audio. Children's speech differs considerably from adult acoustic models, which leads to inaccurate recognition. In addition, most speech evaluation systems on the market rely on acoustic scores alone, combining acoustic features through linear regression or support vector machines to produce the final score. A linear combination of single acoustic scores, however, cannot match the scores given by professional teachers, so evaluation is inefficient and the resulting scores are not accurate enough.
It should be noted that the information disclosed in the above background section is provided only to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
The present disclosure aims to provide a speech assessment method and device, an electronic device, and a storage medium, so as to overcome, at least to some extent, the problem that speech cannot be scored accurately due to the limitations and defects of the related art.
Other features and advantages of the disclosure will become apparent from the following detailed description, or may be learned in part through practice of the disclosure.
According to one aspect of the disclosure, a speech assessment method is provided, comprising: extracting sample features from sample voice data, and training an acoustic model with the sample features to obtain a trained acoustic model; constructing a language model from standard text data corresponding to the sample voice data, and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data; training a scoring model with the acoustic features and prosodic features of the sample voice data, and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
In an exemplary embodiment of the disclosure, extracting sample features from sample voice data comprises: taking online voice data that has been manually scored and whose scoring results satisfy a preset condition as the sample voice data; and extracting Fbank features from the sample voice data as the sample features.
In an exemplary embodiment of the disclosure, training an acoustic model with the sample features to obtain a trained acoustic model comprises: training the acoustic model offline on the Fbank features to obtain a deep neural network–hidden Markov model (DNN-HMM) acoustic model.
In an exemplary embodiment of the disclosure, constructing a language model from standard text data corresponding to the sample voice data comprises: retaining the preset characters contained in the standard text data and mapping words outside the pronunciation dictionary to noise, thereby preprocessing the standard text data to obtain preprocessed standard text data; and constructing the language model from the preprocessed standard text data, the language model being a bigram language model.
In an exemplary embodiment of the disclosure, the method further comprises: obtaining prosodic features of the sample voice data from the sample voice data and the standard text data, the prosodic features including at least one of volume, tone, speech rate, fluency, and completeness.
In an exemplary embodiment of the disclosure, decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data comprises: decoding the sample voice data with the language model and the trained acoustic model to obtain decoded sample voice data; and extracting the score and duration features of each phoneme in the decoded sample voice data, and determining the acoustic features from the score and the duration features of each phoneme.
In an exemplary embodiment of the disclosure, the acoustic features include at least one of phoneme average score, phoneme score standard deviation, 39-dimensional phoneme performance features, phoneme average duration, and phoneme duration standard deviation.
In an exemplary embodiment of the disclosure, extracting the score and duration features of each phoneme in the decoded sample voice data comprises: obtaining a forced alignment result between the sample voice data and the text data from the text data and a pronunciation dictionary; and obtaining the score and duration features of each phoneme in the forced alignment result by a preset scoring rule.
In an exemplary embodiment of the disclosure, training a scoring model with the acoustic features and the prosodic features of the sample voice data comprises: training the scoring model with the acoustic features, the prosodic features, and manually labeled data to obtain a trained scoring model.
In an exemplary embodiment of the disclosure, scoring target speech data with the trained scoring model to obtain a score for the target speech data comprises: analyzing the acoustic features and prosodic features of the target speech data with the trained scoring model to obtain a score corresponding to the target speech data.
According to one aspect of the disclosure, a speech assessment device is provided, comprising: an acoustic model training module for extracting sample features from sample voice data and training an acoustic model with the sample features to obtain a trained acoustic model; an acoustic feature acquisition module for constructing a language model from standard text data corresponding to the sample voice data and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data; and a voice data evaluation module for training a scoring model with the acoustic features and prosodic features of the sample voice data and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
According to one aspect of the disclosure, an electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any of the speech assessment methods described above by executing the executable instructions.
According to one aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing any of the speech assessment methods described above when executed by a processor.
In the speech assessment method, speech assessment device, electronic device, and computer-readable storage medium provided in the exemplary embodiments of the disclosure, on the one hand, an acoustic model and a language model corresponding to the sample voice data are obtained from the sample voice data, and a scoring model is then obtained from the acoustic features and prosodic features of the sample voice data, so that the scoring model better fits the sample voice data and speech recognition accuracy is improved; on the other hand, because the scoring model combines acoustic features and prosodic features and supports fast recognition, the scores produced by the trained scoring model for target speech data are more accurate, improving scoring accuracy.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the disclosure.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification, illustrate embodiments consistent with the disclosure, and together with the specification serve to explain the principles of the disclosure. Obviously, the drawings described below are only some embodiments of the disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 schematically shows a speech assessment method in an exemplary embodiment of the disclosure;
Fig. 2 schematically shows a detailed flowchart of speech assessment in an exemplary embodiment of the disclosure;
Fig. 3 schematically shows a block diagram of a speech assessment device in an exemplary embodiment of the disclosure;
Fig. 4 schematically shows a block diagram of an electronic device in an exemplary embodiment of the disclosure;
Fig. 5 schematically shows a program product in an exemplary embodiment of the disclosure.
Detailed description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to those set forth herein; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of the embodiments of the disclosure. Those skilled in the art will recognize, however, that the technical solutions of the disclosure may be practiced without one or more of these specific details, or with other methods, components, devices, steps, and so on. In other cases, well-known solutions are not shown or described in detail to avoid obscuring aspects of the disclosure.
In addition, the drawings are merely schematic illustrations of the disclosure and are not necessarily drawn to scale. Identical reference numerals in the drawings denote identical or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
This example embodiment first provides a speech assessment method, which can be applied, for example, to evaluating children's spoken foreign language. The speech assessment method is described in detail below with reference to Fig. 1.
In step S110, sample features are extracted from sample voice data, and an acoustic model is trained with the sample features to obtain a trained acoustic model.
In this example embodiment, the sample voice data may be online voice data; to score children's speech, the sample voice data may be children's online voice data. The sample voice data may be English voice data or voice data in another language; English voice data is used as an example here. To make the resulting model more accurate, the sample voice data may be children's online voice data that has been manually scored and whose manual scoring results satisfy a preset condition; the manual scoring may, for example, be scoring by professional teachers. The sample voice data may be of the same type as the voice data to be evaluated, or of a different type; this is not specifically limited here. The preset condition can be used to filter multiple items of children's online voice data, and may specifically be a full mark for completeness and a high score for pronunciation accuracy in the manual scoring. If multiple items have high pronunciation-accuracy scores, the children's online voice data can be sorted in descending order of pronunciation-accuracy score, and the top N items selected as the sample voice data. For example, if online voice data 1 is manually scored with a full mark for completeness and 90 points for pronunciation accuracy, and online voice data 2 is manually scored with a full mark for completeness and 98 points for pronunciation accuracy, online voice data 2 can be chosen as the sample voice data. The acquired sample voice data may be PCM (Pulse Code Modulation) format audio; if the acquired sample voice data is in MP3 or another format, it must first be converted to PCM format before further processing.
After the sample voice data is selected, sample features can be extracted from it so that an acoustic model targeted at the sample voice data can be trained on those features. The sample features may be Fbank features or MFCCs (Mel Frequency Cepstral Coefficients). Because Fbank features retain more correlation than MFCC features and are better suited to model training, Fbank features are used as the sample features in this example embodiment.
Extracting the Fbank features of the sample voice data may specifically involve a Fourier transform, computation of the energy spectrum, Mel filtering, and similar steps. Next, an acoustic model can be trained on the extracted Fbank features. The acoustic model may be a deep neural network–hidden Markov model (DNN-HMM), a traditional GMM-HMM, or another suitable machine learning model. Because a GMM-HMM is inferior to the DNN-HMM combination in word error rate and system robustness, a DNN-HMM is used as the acoustic model in this example embodiment. Note that the parameters of the acoustic model can be fitted to the Fbank features of the sample voice data during training to obtain well-performing parameters, yielding a trained DNN-HMM acoustic model. Further, the Fbank features of the sample voice data can be fed into the trained acoustic model, which outputs, for each phoneme in the sample voice data, the probability that it corresponds to each phonetic symbol.
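As an illustration, a minimal Fbank extraction sketch in Python is given below; it assumes 16 kHz PCM audio and uses librosa's Mel filter bank. The frame length, hop size, and 40-filter configuration are illustrative choices, not values fixed by the disclosure.

```python
import librosa
import numpy as np

def extract_fbank(path, sr=16000, n_mels=40):
    """Log Mel filter-bank (Fbank) features: STFT -> energy spectrum -> Mel filtering -> log."""
    y, sr = librosa.load(path, sr=sr)          # decode the audio to a waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400, hop_length=160,             # 25 ms frames shifted by 10 ms at 16 kHz
        n_mels=n_mels, power=2.0)              # energy spectrum through a Mel filter bank
    fbank = np.log(mel + 1e-10).T              # (frames, n_mels), log compression
    return fbank
```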
Next, in step S120, a language model is constructed from standard text data corresponding to the sample voice data, and the sample voice data is decoded with the language model and the trained acoustic model to obtain the acoustic features of the sample voice data.
In this example embodiment, the standard text data is the correct text of the sample voice data. For example, the standard text data corresponding to sample voice data 1 is: "I got it from my best friend". The text data corresponding to the sample voice data needs to be compared against the standard text data to determine the accuracy of the sample voice data. The language model describes the probability of a sentence occurring; it can effectively combine syntactic and semantic knowledge and describe the internal relations between words, thereby improving the recognition rate and narrowing the search space.
Specifically, when constructing the language model, the standard text data can first be preprocessed, so that a more accurate language model is obtained from the preprocessed standard text data. The preprocessing may involve two steps. First, the preset characters contained in the standard text data are retained, where the preset characters are characters outside ASCII encoding, such as punctuation marks and characters of other languages (for example, Chinese characters). That is, the punctuation marks and Chinese characters contained in the standard text data can be retained first and added back into subsequent results; retaining the preset characters in the standard text data guarantees consistency between input and output. Second, words that do not appear in the pronunciation dictionary or word list, such as compound words, misspelled words, and invented words, can uniformly be treated as unrecognizable words (Unknown); these unrecognizable words can further be mapped uniformly to noise, so that their influence is ignored when the voice data is finally scored.
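A minimal preprocessing sketch is shown below; it assumes the pronunciation dictionary is available as a set of known words, and the `<NOISE>` token name is an illustrative choice rather than something specified by the disclosure.

```python
import re

def preprocess_text(text, lexicon):
    """Keep non-ASCII preset characters and punctuation; map OOV words to a noise token."""
    kept = []
    for tok in re.findall(r"\S+", text):
        word = tok.strip(".,!?;:\"'").lower()
        if not word or not word.isascii():
            kept.append(tok)                 # retain punctuation-only / non-ASCII characters as-is
        elif word in lexicon:
            kept.append(tok)                 # in-dictionary word, kept unchanged
        else:
            kept.append("<NOISE>")           # out-of-dictionary word mapped to noise
    return " ".join(kept)
```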
Further, the language model is constructed from the preprocessed standard text data, that is, the preprocessed standard text data is analyzed grammatically and semantically to obtain the language model. The language model is a bigram language model: in the standard text data, the N-th word depends only on the (N-1)-th word immediately before it and is independent of all other words. The probability of each word in the standard text data occurring can then be evaluated, and the probability of the whole sentence occurring is obtained as the product of the occurrence probabilities of the individual words. For example, suppose S denotes the standard text data, composed of a sequence of words (w1, w2, ..., wm) in a particular order, where m is the length of the standard text data, i.e. the number of words. The probability P(S) that the standard text data S occurs in the whole corpus, i.e. P(w1, w2, ..., wm), can then be expressed under the bigram language model as

P(S) = P(w1, w2, ..., wm) = P(w1) · P(w2 | w1) · P(w3 | w2) · ... · P(wm | wm-1)
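For illustration, a minimal bigram estimator in Python follows; it uses maximum-likelihood counts with add-one smoothing, which is one common choice and an assumption of this sketch, not a detail fixed by the disclosure.

```python
from collections import Counter
import math

def train_bigram(sentences):
    """MLE unigram/bigram counts over tokenized sentences, with <s>/</s> boundary markers."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        seq = ["<s>"] + words + ["</s>"]
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    return uni, bi

def sentence_logprob(words, uni, bi, vocab_size):
    """log P(S) = sum_i log P(w_i | w_{i-1}), with add-one smoothing."""
    seq = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(seq, seq[1:]))
```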
After the language model is obtained, it can be combined with the acoustic model trained in step S110 to form the decoder shown in Fig. 2, which decodes the sample voice data to obtain the text data corresponding to the sample voice data; that text data may or may not be identical to the standard text data. The decoder is mainly used, given a determined input feature sequence, to search the space composed of four knowledge sources — the acoustic model, acoustic context, pronunciation dictionary, and language model — by Viterbi search to find the best word string, i.e. the text data that best matches the sample voice data.
Because the size of the language model directly affects decoding speed, decoding with a general-purpose language model is too slow — such a model is far too large — and it cannot analyze cases where the user repeats or skips words. In this example embodiment, the sample voice data is decoded with the bigram language model and the acoustic model, which yields the probability of individual words and of word sequences occurring in the sample voice data, so that skipped and stressed or repeated words can be detected; at the same time, because the language model is small, decoding speed is also guaranteed.
On this basis, after the sample voice data is decoded, the score and duration features of each phoneme in the decoded sample voice data can be extracted. Specifically, the sample voice data and its corresponding text data can be force-aligned according to the text data and the pronunciation dictionary. Forced alignment means cutting a piece of sample voice data into phoneme segments and obtaining the start and end time of each phoneme using the DNN-HMM model; in the forced-alignment result, the start and end time of each phoneme within the sample voice data are fully determined.
Forced alignment of audio can be realized by the Viterbi decoding algorithm. Specifically, the audio can be cut, with overlapping shifts, into very short frames to obtain multiple samples of the audio; the frame length can be, for example, 5 ms, 10 ms, or another value. For instance, the frame length can be 25 ms, with successive frames shifted backward by 10 ms to obtain the samples. Features are extracted from each sample of the audio and compared against the standard-pronunciation target features; bi(Ot) denotes the similarity between the t-th sample and the model of the i-th phonetic symbol. δt(i) denotes the maximum probability that the audio reaches phonetic symbol i at sampling time t; the result δt+1(i) for time t+1 can then be derived from the t-th sample. During decoding, t is incremented from 0 until the audio ends, finally yielding δN(i) for each phonetic symbol i. Because of the simplicity of forced alignment, it usually has high accuracy.
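A minimal forced-alignment sketch in Python follows. It assumes a matrix of per-frame log-likelihoods logb[t][i] for the known phone sequence (the δ recursion above), with only self-loop and advance transitions and at least as many frames as phones; transition costs are omitted for brevity, which is an assumption of this sketch.

```python
import numpy as np

def force_align(logb):
    """Viterbi alignment of T frames to P phones in fixed order.
    logb: (T, P) log-likelihood of frame t under phone i (assumes T >= P).
    Returns, per phone, its (start_frame, end_frame)."""
    T, P = logb.shape
    delta = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    delta[0, 0] = logb[0, 0]                             # must start in the first phone
    for t in range(1, T):
        for i in range(P):
            stay = delta[t - 1, i]                       # self-loop: remain in phone i
            move = delta[t - 1, i - 1] if i else -np.inf # advance from phone i-1
            back[t, i] = 0 if stay >= move else 1
            delta[t, i] = max(stay, move) + logb[t, i]
    path, i = [P - 1], P - 1                             # backtrace from the final phone
    for t in range(T - 1, 0, -1):
        i -= back[t, i]
        path.append(i)
    path.reverse()
    bounds = {}
    for t, i in enumerate(path):                         # first/last frame per phone
        s, _ = bounds.get(i, (t, t))
        bounds[i] = (s, t)
    return [bounds[i] for i in range(P)]
```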
After forced alignment is complete, it is known which segment of the sample voice data each phoneme of the text data corresponds to, as well as the user's actual pronunciation in the sample voice data. On this basis, a preset scoring rule can be used to measure how accurately each phoneme in the text data corresponding to the sample voice data is pronounced. The preset scoring rule can be the GOP (Goodness of Pronunciation) algorithm. Let qi be the phoneme in the text data currently being scored, A the speech segment corresponding to qi after forced alignment, and NF(A) the number of frames of that segment; the GOP score is then

GOP(qi) = (1 / NF(A)) · log P(qi | A)

The GOP score is in fact a conditional probability: it describes the probability that this speech segment corresponds to phoneme qi given the observed user speech A. The higher this probability, the more accurate the pronunciation; the lower it is, the worse the pronunciation. That is, the GOP algorithm makes it possible to score the pronunciation of each phoneme in the text data corresponding to the sample voice data, and the forced-alignment result also identifies the speech frames corresponding to each phoneme, so the positions of pronunciation errors can be located.
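A sketch of frame-level GOP computation follows; it assumes the DNN acoustic model exposes per-frame phone posteriors and normalizes the forced phone's log posterior by the best competing phone, which is a common GOP variant and an assumption here rather than the exact rule of the disclosure.

```python
import numpy as np

def gop_score(post, start, end, phone_id):
    """GOP(q) = (1/NF) * sum_t [log p(q|o_t) - log max_p p(p|o_t)]
    post: (T, n_phones) per-frame phone posteriors from the acoustic model.
    start, end: frame span of this phone from forced alignment."""
    seg = post[start:end + 1]                        # the NF(A) frames aligned to phone q
    log_target = np.log(seg[:, phone_id] + 1e-10)    # posterior of the expected phone
    log_best = np.log(seg.max(axis=1) + 1e-10)       # best competing phone per frame
    return float(np.mean(log_target - log_best))     # 0 is perfect; more negative is worse
```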
The preset scoring rule yields the score and duration features of each phoneme. The score of a phoneme refers to its pronunciation accuracy; the duration feature of a phoneme refers to its duration, i.e. the articulation rate. Further, the acoustic features of the sample voice data can be determined from the score and duration features of each phoneme. The acoustic features include, but are not limited to, one or more of the phoneme average score, the phoneme score standard deviation, the 39-dimensional phoneme performance features, the phoneme average duration, and the phoneme duration standard deviation. The phoneme average score is the average of the GOP scores of all phonemes, computed from the pronunciation dictionary and the forced alignment of the decoding; it reflects the overall pronunciation level of the sample voice data and is the most fundamental index for evaluating pronunciation. The phoneme score standard deviation is the standard deviation of all phoneme scores and reflects the stability of the user's pronunciation. The 39-dimensional phoneme performance features are the average scores obtained for each of the 39 phonemes — the dimensionality is the number of phonemes — and characterize the user's performance on each phoneme. The phoneme average duration is the average of the phoneme durations and indicates how fast the user speaks. The phoneme duration standard deviation is the standard deviation of all phoneme durations and indicates the stability of the user's pronunciation. The preset scoring rule thus makes it possible to obtain the score and duration features of each phoneme more accurately.
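As an illustration, the aggregation of per-phoneme scores and durations into the acoustic feature vector can be sketched as follows; the 39-phoneme inventory and its ordering are assumptions matching the 39-dimensional feature described above.

```python
import numpy as np

def acoustic_features(phones, scores, durations, n_phones=39):
    """Build the acoustic feature vector: mean/std of phoneme scores and durations,
    plus a 39-dim per-phoneme average score. phones: int ids in [0, n_phones)."""
    phones = np.asarray(phones)
    scores = np.asarray(scores)
    per_phone = np.zeros(n_phones)
    for p in range(n_phones):
        mask = phones == p
        if mask.any():
            per_phone[p] = scores[mask].mean()       # 39-dim phoneme performance features
    stats = [scores.mean(), scores.std(),            # phoneme average score / score std
             np.mean(durations), np.std(durations)]  # average duration / duration std
    return np.concatenate([stats, per_phone])        # 4 + 39 = 43-dim feature vector
```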
In step S130, a scoring model is trained with the acoustic features and the prosodic features of the sample voice data, and target speech data is scored with the trained scoring model to obtain a score for the target speech data.
In this example embodiment, the prosodic features corresponding to the sample voice data are important indicators for judging speech. Specifically, the prosodic features of the sample voice data can be obtained from the sample voice data and its corresponding standard text data; they can include, but are not limited to, one or more of volume, tone, speech rate, fluency, and completeness.
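A sketch of simple prosodic feature extraction follows; volume is approximated as RMS energy, tone as the median pitch from librosa's pYIN tracker, and speech rate as words per second from the reference transcript — all illustrative stand-ins for whatever measures an implementation actually uses.

```python
import librosa
import numpy as np

def prosodic_features(path, transcript, sr=16000):
    """Volume (RMS energy), tone (median F0), and speech rate (words/second)."""
    y, sr = librosa.load(path, sr=sr)
    volume = float(np.mean(librosa.feature.rms(y=y)))        # loudness proxy
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    tone = float(np.nanmedian(f0)) if np.any(voiced) else 0.0
    rate = len(transcript.split()) / (len(y) / sr)           # words per second
    return np.array([volume, tone, rate])
```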
After the prosodic features are obtained, a scoring model for evaluating voice data can be trained with the acoustic features, the prosodic features, and manually labeled data of the sample voice data. The manually labeled data refers to the scores given to the sample voice data by human raters. Specifically, the acoustic features and prosodic features can be fitted to the manually labeled data to train the scoring model and obtain a trained scoring model. The scoring model in this example embodiment can be a regression tree model, such as an XGBoost model. For example, if sample voice data 1 receives a manually labeled score of 98, the acoustic features and prosodic features corresponding to sample voice data 1 can be trained until the model output fits 98, and a trained scoring model is obtained from the fitted parameters. In this example embodiment, training the scoring model by fitting acoustic features and prosodic features to manually labeled data improves the accuracy of the resulting scoring model, so that sample voice data can be scored more accurately. Because the prosodic features of the sample data are added, the inaccuracy caused by evaluation with single acoustic features alone is avoided; the scores are close to the ground truth obtained by manual scoring, the agreement with manual scores is higher, the accuracy and reference value of speech evaluation are improved, and user satisfaction is increased.
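A minimal training sketch using XGBoost's scikit-learn interface is shown below; the hyperparameters are illustrative defaults, not values specified by the disclosure.

```python
import numpy as np
from xgboost import XGBRegressor

# X: one row per utterance = acoustic features concatenated with prosodic features;
# y: the human rater's score for that utterance (the manually labeled data).
def train_scoring_model(X, y):
    model = XGBRegressor(
        n_estimators=300, max_depth=4,
        learning_rate=0.05, objective="reg:squarederror")
    model.fit(np.asarray(X), np.asarray(y))   # fit the features to the human scores
    return model

# Scoring a target utterance:
# score = train_scoring_model(X, y).predict(target_features.reshape(1, -1))[0]
```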
After a well-performing trained scoring model is obtained, target speech data to be evaluated can be fed into the trained scoring model, which analyzes the acoustic features and prosodic features of the target speech data to obtain a score corresponding to the target speech data. The target speech data can be of the same type as the sample voice data, e.g. children's voice data. The method for extracting the acoustic features and prosodic features of the target speech data is similar to that in steps S110 and S120 and is not repeated here. Note that the standard text of the target speech data still needs to be preprocessed — retaining the preset characters contained in the standard text data and mapping words absent from the pronunciation dictionary to noise — before scoring.
Note also that the scoring criteria for target speech data may differ across application scenarios, so the scoring model's output can be shifted up or down as needed, so that target speech data is scored accurately in different scenarios. Specifically, a corresponding nonlinear mapping can be set for the evaluation score according to different scenarios and product requirements, so as to provide scores on different dimensions, as sketched below.
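As one possible realization of that nonlinear mapping, a logistic rescaling of the raw model output to a 0–100 product scale is sketched here; the midpoint and slope are per-scenario tuning knobs and purely illustrative assumptions.

```python
import math

def map_score(raw, midpoint=60.0, slope=0.1, lo=0.0, hi=100.0):
    """Nonlinear mapping of a raw scoring-model output to a product-facing scale."""
    return lo + (hi - lo) / (1.0 + math.exp(-slope * (raw - midpoint)))
```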
Fig. 2 shows the detailed flow of speech assessment in the speech assessment system, which contains the acoustic model, the language model, and the scoring model. The specific steps are:
Step S201: obtain sample voice data; the sample voice data can be PCM format audio.
Step S202: extract sample features from the sample voice data; the sample features can be Fbank features.
Step S203: train the acoustic model on the sample features; the acoustic model can be a DNN-HMM model.
Step S204: obtain the standard text data corresponding to the sample voice data.
Step S205: generate the language model from the standard text data corresponding to the sample voice data; the model can be a bigram model.
Step S206: form a decoder from the language model and the acoustic model to decode the sample voice data.
Step S207: evaluate the decoded sample voice data with the GOP algorithm to obtain the acoustic features.
Step S208: generate the prosodic features corresponding to the sample voice data from the standard text data and the PCM format audio.
Step S209: generate the scoring model from the acoustic features, the prosodic features, and the manually labeled data — the scoring model can be an XGBoost model — and obtain the score of the target speech data with the trained scoring model.
With the method in Fig. 2, an acoustic model corresponding to the sample voice data can be constructed, the sample voice data can then be decoded quickly with the acoustic model and language model, and the scoring model can be trained on the acoustic features and prosodic features, making the trained scoring model more accurate, so that target speech data is scored accurately.
The disclosure further provides a speech assessment device. As shown in Fig. 3, the speech assessment device 300 may include:
an acoustic model training module 301, which can be used to extract sample features from sample voice data and train an acoustic model with the sample features to obtain a trained acoustic model;
an acoustic feature acquisition module 302, which can be used to construct a language model from standard text data corresponding to the sample voice data and decode the sample voice data with the language model and the trained acoustic model to obtain the acoustic features of the sample voice data;
a voice data evaluation module 303, which can be used to train a scoring model with the acoustic features and the prosodic features of the sample voice data and score target speech data with the trained scoring model to obtain the score of the target speech data.
The details of each module in the above speech assessment device have been described in detail in the corresponding speech assessment method and are not repeated here.
It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, this division is not mandatory. In fact, according to embodiments of the disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied in multiple modules or units.
In addition, although the steps of the method in the disclosure are described in a particular order in the drawings, this does not require or imply that the steps must be executed in that particular order, or that all of the steps shown must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step, and/or one step may be decomposed into multiple steps, and so on.
In an exemplary embodiment of the disclosure, an electronic device capable of implementing the above method is also provided.
A person skilled in the art will appreciate that various aspects of the invention can be implemented as a system, a method, or a program product. Therefore, various aspects of the invention can be embodied in the following forms: a complete hardware embodiment, a complete software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may collectively be referred to here as a "circuit", "module", or "system".
The electronic device 400 according to this embodiment of the invention is described below with reference to Fig. 4. The electronic device 400 shown in Fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the invention.
As shown in Fig. 4, the electronic device 400 takes the form of a general-purpose computing device. The components of the electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, and a bus 430 connecting the different system components (including the storage unit 420 and the processing unit 410).
The storage unit stores program code, which can be executed by the processing unit 410 so that the processing unit 410 performs the steps of the various exemplary embodiments of the invention described in the "Exemplary methods" part of this specification. For example, the processing unit 410 can perform the steps shown in Fig. 1: in step S110, extracting sample features from sample voice data and training an acoustic model with the sample features to obtain a trained acoustic model; in step S120, constructing a language model from standard text data corresponding to the sample voice data and decoding the sample voice data with the language model and the trained acoustic model to obtain the acoustic features of the sample voice data; in step S130, training a scoring model with the acoustic features and the prosodic features of the sample voice data and scoring target speech data with the trained scoring model to obtain the score of the target speech data.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory (RAM) unit 4201 and/or a cache memory unit 4202, and may further include a read-only memory (ROM) unit 4203.
The storage unit 420 may also include a program/utility 4204 having a set of (at least one) program modules 4205, such program modules 4205 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 430 may represent one or more of several classes of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, the processing unit, or a local bus using any of a variety of bus structures.
The display unit 440 can be a display with a display function, so that the processing results obtained by the processing unit 410 executing the method in this example embodiment are shown through the display. The display includes, but is not limited to, a liquid crystal display or another display.
The electronic device 400 can also communicate with one or more external devices 600 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 400 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 450. Also, the electronic device 400 can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 460. As shown, the network adapter 460 communicates with the other modules of the electronic device 400 through the bus 430. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, etc.
In an exemplary embodiment of the disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, various aspects of the invention can also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the various exemplary embodiments of the invention described in the "Exemplary methods" part of this specification.
As shown in Fig. 5, a program product 500 for implementing the above method according to an embodiment of the invention is described. It can adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the invention is not limited to this; in this document, a readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by, or in conjunction with, an instruction execution system, apparatus, or device.
The program product can adopt any combination of one or more readable media. The readable medium can be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. A readable signal medium can also be any readable medium other than a readable storage medium; the readable medium can send, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any appropriate combination of the above.
Program code for carrying out the operations of the invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In situations involving a remote computing device, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the invention, and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the temporal order of these processes. It is also easy to understand that these processes can be executed synchronously or asynchronously, for example, in multiple modules.
Those skilled in the art will readily think of other embodiments of the disclosure after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include common knowledge or conventional techniques in the art not disclosed by the disclosure. The specification and examples are to be considered exemplary only, and the true scope and spirit of the disclosure are indicated by the claims.
Claims (13)
1. A speech assessment method, characterized by comprising:
extracting sample features from sample voice data, and training an acoustic model with the sample features to obtain a trained acoustic model;
constructing a language model from standard text data corresponding to the sample voice data, and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data;
training a scoring model with the acoustic features and prosodic features of the sample voice data, and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
2. The speech assessment method according to claim 1, characterized in that extracting sample features from sample voice data comprises:
taking online voice data that has been manually scored and whose scoring results satisfy a preset condition as the sample voice data;
extracting Fbank features from the sample voice data as the sample features.
3. The speech assessment method according to claim 2, characterized in that training an acoustic model with the sample features to obtain a trained acoustic model comprises:
training the acoustic model offline on the Fbank features to obtain a deep neural network–hidden Markov model acoustic model.
4. The speech assessment method according to claim 1, characterized in that constructing a language model from standard text data corresponding to the sample voice data comprises:
retaining the preset characters contained in the standard text data, and mapping words outside the pronunciation dictionary to noise, thereby preprocessing the standard text data to obtain preprocessed standard text data;
constructing the language model from the preprocessed standard text data, the language model being a bigram language model.
5. The speech assessment method according to claim 1, characterized in that the method further comprises:
obtaining prosodic features of the sample voice data from the sample voice data and the standard text data, the prosodic features including at least one of volume, tone, speech rate, fluency, and completeness.
6. The speech assessment method according to claim 1, characterized in that decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data comprises:
decoding the sample voice data with the language model and the trained acoustic model to obtain decoded sample voice data;
extracting the score and duration features of each phoneme in the decoded sample voice data, and determining the acoustic features from the score and the duration features of each phoneme.
7. The speech assessment method according to claim 6, characterized in that the acoustic features include at least one of phoneme average score, phoneme score standard deviation, 39-dimensional phoneme performance features, phoneme average duration, and phoneme duration standard deviation.
8. The speech assessment method according to claim 6, characterized in that extracting the score and duration features of each phoneme in the decoded sample voice data comprises:
obtaining a forced alignment result between the sample voice data and the text data from the text data and a pronunciation dictionary;
obtaining the score and duration features of each phoneme in the forced alignment result by a preset scoring rule.
9. The speech assessment method according to claim 1, characterized in that training a scoring model with the acoustic features and the prosodic features of the sample voice data comprises:
training the scoring model with the acoustic features, the prosodic features, and manually labeled data to obtain a trained scoring model.
10. The speech assessment method according to claim 1, characterized in that scoring target speech data with the trained scoring model to obtain a score for the target speech data comprises:
analyzing the acoustic features and prosodic features of the target speech data with the trained scoring model to obtain a score corresponding to the target speech data.
11. A speech assessment device, characterized by comprising:
an acoustic model training module for extracting sample features from sample voice data and training an acoustic model with the sample features to obtain a trained acoustic model;
an acoustic feature acquisition module for constructing a language model from standard text data corresponding to the sample voice data and decoding the sample voice data with the language model and the trained acoustic model to obtain acoustic features of the sample voice data;
a voice data evaluation module for training a scoring model with the acoustic features and prosodic features of the sample voice data and scoring target speech data with the trained scoring model to obtain a score for the target speech data.
12. An electronic device, characterized by comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the speech assessment method of any one of claims 1-10 via execution of the executable instructions.
13. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the speech assessment method of any one of claims 1-10 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811327485.XA | 2018-11-08 | 2018-11-08 | Speech assessment method and device, electronic equipment, storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811327485.XA | 2018-11-08 | 2018-11-08 | Speech assessment method and device, electronic equipment, storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109256152A true CN109256152A (en) | 2019-01-22 |
Family
ID=65042980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811327485.XA (Pending) | Speech assessment method and device, electronic equipment, storage medium | 2018-11-08 | 2018-11-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109256152A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197084A (en) * | 2007-11-06 | 2008-06-11 | 安徽科大讯飞信息科技股份有限公司 | Automatic spoken English evaluating and learning system |
CN103065626A (en) * | 2012-12-20 | 2013-04-24 | 中国科学院声学研究所 | Automatic grading method and automatic grading equipment for read questions in test of spoken English |
CN103151042A (en) * | 2013-01-23 | 2013-06-12 | 中国科学院深圳先进技术研究院 | Full-automatic oral language evaluating management and scoring system and scoring method thereof |
CN107945788A (en) * | 2017-11-27 | 2018-04-20 | 桂林电子科技大学 | A kind of relevant Oral English Practice pronunciation error detection of text and quality score method |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583906B (en) * | 2019-02-18 | 2023-08-15 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice session |
CN111583906A (en) * | 2019-02-18 | 2020-08-25 | 中国移动通信有限公司研究院 | Role recognition method, device and terminal for voice conversation |
CN111640452B (en) * | 2019-03-01 | 2024-05-07 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN111640452A (en) * | 2019-03-01 | 2020-09-08 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110289015B (en) * | 2019-05-27 | 2021-09-17 | 北京大米科技有限公司 | Audio processing method, device, server, storage medium and system |
CN110289015A (en) * | 2019-05-27 | 2019-09-27 | 北京大米科技有限公司 | A kind of audio-frequency processing method, device, server, storage medium and system |
CN110490428A (en) * | 2019-07-26 | 2019-11-22 | 合肥讯飞数码科技有限公司 | Job of air traffic control method for evaluating quality and relevant apparatus |
CN110853628A (en) * | 2019-11-18 | 2020-02-28 | 苏州思必驰信息科技有限公司 | Model training method and device, electronic equipment and storage medium |
CN112951277A (en) * | 2019-11-26 | 2021-06-11 | 新东方教育科技集团有限公司 | Method and device for evaluating speech |
CN112951277B (en) * | 2019-11-26 | 2023-01-13 | 新东方教育科技集团有限公司 | Method and device for evaluating speech |
CN111128181A (en) * | 2019-12-09 | 2020-05-08 | 科大讯飞股份有限公司 | Recitation question evaluation method, device and equipment |
CN110992938A (en) * | 2019-12-10 | 2020-04-10 | 同盾控股有限公司 | Voice data processing method and device, electronic equipment and computer readable medium |
CN115066679B (en) * | 2020-03-25 | 2024-02-20 | 苏州七星天专利运营管理有限责任公司 | Method and system for extracting self-made terms in professional field |
CN115066679A (en) * | 2020-03-25 | 2022-09-16 | 苏州七星天专利运营管理有限责任公司 | Method and system for extracting self-made terms in professional field |
CN111627445A (en) * | 2020-05-26 | 2020-09-04 | 福建省海峡智汇科技有限公司 | Matching method and system for site or personnel |
CN111627445B (en) * | 2020-05-26 | 2023-07-07 | 福建省海峡智汇科技有限公司 | Matching method and system for sites or personnel |
CN112201225A (en) * | 2020-09-30 | 2021-01-08 | 北京大米科技有限公司 | Corpus obtaining method and device, readable storage medium and electronic equipment |
CN112201225B (en) * | 2020-09-30 | 2024-02-02 | 北京大米科技有限公司 | Corpus acquisition method and device, readable storage medium and electronic equipment |
CN112257407A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Method and device for aligning text in audio, electronic equipment and readable storage medium |
CN112257407B (en) * | 2020-10-20 | 2024-05-14 | 网易(杭州)网络有限公司 | Text alignment method and device in audio, electronic equipment and readable storage medium |
CN112581939A (en) * | 2020-12-06 | 2021-03-30 | 中国南方电网有限责任公司 | Intelligent voice analysis method applied to power dispatching normative evaluation |
CN112397048A (en) * | 2020-12-10 | 2021-02-23 | 标贝(北京)科技有限公司 | Pronunciation stability evaluation method, device and system for speech synthesis and storage medium |
CN112397048B (en) * | 2020-12-10 | 2023-07-14 | 标贝(北京)科技有限公司 | Speech synthesis pronunciation stability evaluation method, device and system and storage medium |
CN112767932A (en) * | 2020-12-11 | 2021-05-07 | 北京百家科技集团有限公司 | Voice evaluation system, method, device, equipment and computer readable storage medium |
CN112669810B (en) * | 2020-12-16 | 2023-08-01 | 平安科技(深圳)有限公司 | Speech synthesis effect evaluation method, device, computer equipment and storage medium |
CN112669810A (en) * | 2020-12-16 | 2021-04-16 | 平安科技(深圳)有限公司 | Speech synthesis effect evaluation method and device, computer equipment and storage medium |
CN112668617A (en) * | 2020-12-21 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Power grid employee work satisfaction evaluation method and device |
CN112908359A (en) * | 2021-01-31 | 2021-06-04 | 云知声智能科技股份有限公司 | Voice evaluation method and device, electronic equipment and computer readable medium |
CN112802456A (en) * | 2021-04-14 | 2021-05-14 | 北京世纪好未来教育科技有限公司 | Voice evaluation scoring method and device, electronic equipment and storage medium |
CN115346421A (en) * | 2021-05-12 | 2022-11-15 | 北京猿力未来科技有限公司 | Spoken language fluency scoring method, computing device and storage medium |
US11869483B2 (en) * | 2021-10-07 | 2024-01-09 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US11769481B2 (en) | 2021-10-07 | 2023-09-26 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
US20230110905A1 (en) * | 2021-10-07 | 2023-04-13 | Nvidia Corporation | Unsupervised alignment for text to speech synthesis using neural networks |
CN115223588A (en) * | 2022-03-24 | 2022-10-21 | 华东师范大学 | Child voice phrase matching method based on pinyin distance and sliding window |
CN115273897A (en) * | 2022-08-05 | 2022-11-01 | 北京有竹居网络技术有限公司 | Method, apparatus, device and storage medium for processing voice data |
CN115359808A (en) * | 2022-08-22 | 2022-11-18 | 北京有竹居网络技术有限公司 | Method for processing voice data, model generation method, model generation device and electronic equipment |
Similar Documents
Publication | Title |
---|---|
CN109256152A (en) | Speech assessment method and device, electronic equipment, storage medium | |
US8504367B2 (en) | Speech retrieval apparatus and speech retrieval method | |
CN111833853B (en) | Voice processing method and device, electronic equipment and computer readable storage medium | |
US20160314783A1 (en) | Method for building language model, speech recognition method and electronic apparatus | |
CN109741732A (en) | Name entity recognition method, name entity recognition device, equipment and medium | |
CN103677729B (en) | Voice input method and system | |
US20140039896A1 (en) | Methods and System for Grammar Fitness Evaluation as Speech Recognition Error Predictor | |
CN101551947A (en) | Computer system for assisting spoken language learning | |
CN109697988B (en) | Voice evaluation method and device | |
CN112397056B (en) | Voice evaluation method and computer storage medium | |
CN109377981B (en) | Phoneme alignment method and device | |
CN112466279B (en) | Automatic correction method and device for spoken English pronunciation | |
CN110782880B (en) | Training method and device for prosody generation model | |
CN110503956A (en) | Audio recognition method, device, medium and electronic equipment | |
WO2023093295A1 (en) | Artificial intelligence-based audio processing method and apparatus, electronic device, computer program product, and computer-readable storage medium | |
CN115116428B (en) | Prosodic boundary labeling method, device, equipment, medium and program product | |
CN109697975B (en) | Voice evaluation method and device | |
CN108364655A (en) | Method of speech processing, medium, device and computing device | |
CN114299930A (en) | End-to-end speech recognition model processing method, speech recognition method and related device | |
Thennattil et al. | Phonetic engine for continuous speech in Malayalam | |
CN112309429A (en) | Method, device and equipment for explosion loss detection and computer readable storage medium | |
CN114420159A (en) | Audio evaluation method and device and non-transient storage medium | |
CN111489742B (en) | Acoustic model training method, voice recognition device and electronic equipment | |
Li et al. | English sentence pronunciation evaluation using rhythm and intonation | |
Carson-Berndsen | Multilingual time maps: portable phonotactic models for speech technology |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-01-22 |