CN109215632A - Speech evaluation method, apparatus, device and readable storage medium - Google Patents
Speech evaluation method, apparatus, device and readable storage medium
- Publication number: CN109215632A (application number CN201811162964.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- voice
- evaluated
- feature
- answer text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
Abstract
This application discloses a speech evaluation method, apparatus, device and readable storage medium. The application obtains a speech to be evaluated and an answer text serving as the evaluation standard. Based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the alignment information between the speech to be evaluated and the answer text can be determined. It can be understood that the alignment information reflects the alignment relation between the speech to be evaluated and the answer text, so an evaluation result of the speech relative to the answer text can be determined automatically according to the alignment information. Since no manual evaluation is needed, the method both avoids interference from human subjectivity in the evaluation result and reduces labor cost.
Description
Technical field
This application relates to the field of speech processing, and more specifically to a speech evaluation method, apparatus, device and readable storage medium.
Background art
With the continuing deepening of educational reform, spoken-language examinations are being carried out throughout the country. Compared with written examinations, a spoken-language examination can assess an examinee's oral proficiency.
In existing spoken-language examinations, professional teachers mostly evaluate an examinee's answer against the correct-answer information for each question. This manual evaluation is highly susceptible to human subjectivity, so the results suffer from human interference, and it also consumes a large amount of labor cost.
Summary of the invention
In view of this, this application provides a speech evaluation method, apparatus, device and readable storage medium to overcome the drawbacks of the existing manual evaluation of spoken-language examinations.
To achieve the above goal, the proposed solutions are as follows:
A speech evaluation method, comprising:
obtaining a speech to be evaluated, and an answer text serving as the evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Preferably, the process of acquiring the acoustic feature of the speech to be evaluated comprises:
obtaining a spectral feature of the speech to be evaluated as the acoustic feature;
or,
obtaining a spectral feature of the speech to be evaluated;
obtaining, as the acoustic feature, the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature.
Preferably, the process of acquiring the text feature of the answer text comprises:
obtaining a vector of the answer text as the text feature;
or,
obtaining a vector of the answer text;
obtaining, as the text feature, the hidden-layer feature produced by a hidden layer of a neural network model from the vector.
Preferably, determining the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech to be evaluated and the text feature of the answer text comprises:
determining a frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the frame-level attention matrix comprising, for each text unit in the answer text, the alignment probability of each speech frame in the speech to be evaluated to that text unit.
Preferably, determining the frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text comprises:
processing the acoustic feature of the speech to be evaluated and the text feature of the answer text with a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
Preferably, determining the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech to be evaluated and the text feature of the answer text further comprises:
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic feature, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text feature, the word-level attention matrix comprising, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
Preferably, determining the word-level attention matrix based on the word-level acoustic alignment matrix and the text feature comprises:
processing the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
Preferably, determining, according to the alignment information, the evaluation result of the speech to be evaluated relative to the answer text comprises:
determining a matching degree between the speech to be evaluated and the answer text according to the alignment information;
determining, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
Preferably, determining the matching degree between the speech to be evaluated and the answer text according to the alignment information comprises:
processing the alignment information with a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
Preferably, determining, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text comprises:
processing the matching degree with a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluation apparatus, comprising:
a data acquisition unit, configured to obtain a speech to be evaluated and an answer text serving as the evaluation standard;
an alignment information determination unit, configured to determine alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
an evaluation result determination unit, configured to determine, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Preferably, the apparatus further comprises an acoustic feature acquisition unit, comprising:
a first acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated as the acoustic feature;
or,
a second acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated;
a third acoustic feature acquisition subunit, configured to obtain, as the acoustic feature, the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature.
Preferably, the apparatus further comprises a text feature acquisition unit, comprising:
a first text feature acquisition subunit, configured to obtain a vector of the answer text as the text feature;
or,
a second text feature acquisition subunit, configured to obtain a vector of the answer text;
a third text feature acquisition subunit, configured to obtain, as the text feature, the hidden-layer feature produced by a hidden layer of a neural network model from the vector.
Preferably, the alignment information determination unit comprises:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the frame-level attention matrix comprising, for each text unit in the answer text, the alignment probability of each speech frame in the speech to be evaluated to that text unit.
Preferably, the frame-level attention matrix determination unit comprises:
a first fully connected layer processing unit, configured to process the acoustic feature and the text feature with a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
Preferably, the alignment information determination unit further comprises:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic feature, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights;
a word-level attention matrix determination unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text feature, the word-level attention matrix comprising, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
Preferably, the word-level attention matrix determination unit comprises:
a second fully connected layer processing unit, configured to process the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
Preferably, the evaluation result determination unit comprises:
a matching degree determination unit, configured to determine a matching degree between the speech to be evaluated and the answer text according to the alignment information;
a matching degree application unit, configured to determine, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
Preferably, the matching degree determination unit comprises:
a convolution unit processing unit, configured to process the alignment information with a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
Preferably, the matching degree application unit comprises:
a third fully connected layer processing unit, configured to process the matching degree with a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluation device, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech evaluation method described above.
A readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the speech evaluation method described above.
It can be seen from the above technical solutions that the speech evaluation method provided in the embodiments of this application obtains a speech to be evaluated and an answer text serving as the evaluation standard, and can determine the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech and the text feature of the answer text. It can be understood that the alignment information reflects the alignment relation between the speech to be evaluated and the answer text, so an evaluation result of the speech relative to the answer text can be determined automatically according to the alignment information. Since no manual evaluation is needed, the method both avoids interference from human subjectivity in the evaluation result and reduces labor cost.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description are merely embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a speech evaluation method disclosed in an embodiment of this application;
Fig. 2 is a schematic flow diagram of speech evaluation performed by one neural network model;
Fig. 3 is a schematic flow diagram of speech evaluation performed by another neural network model;
Fig. 4 is a schematic structural diagram of a speech evaluation apparatus disclosed in an embodiment of this application;
Fig. 5 is a hardware structure block diagram of a speech evaluation device disclosed in an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
To solve the problems that existing spoken-language assessment is performed manually, which subjects the results to human interference and wastes labor cost, the inventors first proposed a solution: recognize the speech to be evaluated with a speech recognition model to obtain a recognized text, extract keywords from the answer text, compute the hit rate of the recognized text against the keywords, and determine the evaluation result of the speech from that hit rate; the higher the hit rate, the higher the evaluation score.
Further study showed, however, that because this solution must first recognize the speech to be evaluated as text, it depends on a speech recognition model. If a general-purpose speech recognition model is used to recognize speech from different examination scenarios, recognition accuracy can be low, making the evaluation result inaccurate. If a separate speech recognition model is trained for each examination scenario, training data must be manually labeled ahead of every examination, which consumes a large amount of labor cost.
On this basis, the inventors studied further and finally arrived at a solution that performs automated speech evaluation from the angle of actively discovering the alignment information between the speech to be evaluated and the answer text. The speech evaluation method of this application can be implemented on an electronic device with data processing capability, such as a smart terminal, a server, or a cloud platform.
The speech evaluation scheme of this application is applicable to spoken-language examination scenarios and to other scenarios that involve evaluating pronunciation level.
Next, the speech evaluation method of this application is described with reference to Fig. 1. The method may include:
Step S100: obtaining a speech to be evaluated, and an answer text serving as the evaluation standard.
Specifically, taking a spoken-language examination scenario as an example, the speech to be evaluated may be a recording of the spoken answer given by an examinee. Correspondingly, the answer text serving as the evaluation standard may be preset in this embodiment. Taking a read-aloud question as an example, the answer text serving as the evaluation standard may be the text extracted from the reading material. For other types of questions, the answer text serving as the evaluation standard may be the answer content corresponding to the question.
In this step, the speech to be evaluated may be captured by a recording device, which may include a microphone, such as a headset microphone.
Step S110: determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text.
The acoustic feature of the speech to be evaluated reflects its acoustic information, and the text feature of the answer text reflects its textual information. There are many possible types of acoustic features, and likewise many possible types of text features.
In this embodiment, the alignment information between the speech to be evaluated and the answer text is actively discovered based on the acoustic feature and the text feature; it reflects the alignment relation between the two. It can be understood that a speech that meets the evaluation standard should align with the answer text very completely, whereas a speech that does not meet the evaluation standard aligns with the answer text very incompletely.
Step S120: determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
As discussed above, the alignment information reflects the alignment relation between the speech to be evaluated and the answer text, which is related to whether, and how well, the speech meets the evaluation standard. Therefore, in this step the evaluation result of the speech relative to the answer text can be determined according to the alignment information.
The speech evaluation method provided in this embodiment can thus automatically determine the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information. Since no manual evaluation is needed, it both avoids interference from human subjectivity in the evaluation result and reduces labor cost.
Further, since this solution determines the evaluation result from the angle of actively discovering the alignment information between the speech to be evaluated and the answer text, it needs neither speech recognition of the speech to be evaluated nor computation of the keyword hit rate between the recognized text and the answer text. This avoids the inaccurate evaluation results caused by inaccurate speech recognition, so the evaluation is more accurate. The scheme is applicable to various speech evaluation scenarios, is more robust, and needs no extra manual scoring to build training data for each scenario, saving labor cost.
In another embodiment of this application, the processes of acquiring the acoustic feature of the speech to be evaluated and the text feature of the answer text mentioned in step S110 are described.
The acquisition of the acoustic feature of the speech to be evaluated is introduced first:
In one optional way, a spectral feature of the speech to be evaluated can be obtained directly and used as the acoustic feature of the speech.
The spectral feature may be a Mel-Frequency Cepstral Coefficient (MFCC) feature, a Perceptual Linear Prediction (PLP) feature, or the like.
For ease of presentation, the speech to be evaluated is defined to contain T frames.
When obtaining the spectral feature of the speech to be evaluated, the speech can first be divided into frames, pre-emphasis can be applied to the framed speech, and the spectral feature of each frame can then be extracted.
In another optional way, the spectral feature of the speech to be evaluated can be obtained, and then the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature can be obtained and used as the acoustic feature.
Here, the neural network model may take various structural forms, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), or a GRU (Gated Recurrent Unit).
Converting the spectral feature through the hidden layer of the neural network model applies a deep mapping to it; the resulting hidden-layer feature is at a deeper level than the spectral feature and better captures the acoustic characteristics of the speech to be evaluated, so the hidden-layer feature can be used as the acoustic feature.
The acoustic feature can be expressed in the following matrix form:

$$H = [h_1, h_2, \ldots, h_T]^{\top}$$

where $h_t$ ($t = 1, 2, \ldots, T$) denotes the acoustic feature of the t-th speech frame; the dimension of each frame's acoustic feature is fixed and defined as m.
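A minimal sketch of obtaining the hidden-layer acoustic features with a recurrent hidden layer (a GRU here; the application equally allows an RNN or LSTM). The dimensions (13-dimensional spectral input, m = 64) and the use of PyTorch are assumptions of this sketch.

import torch
import torch.nn as nn

gru = nn.GRU(input_size=13, hidden_size=64, batch_first=True)  # hidden layer
spectral = torch.randn(1, 200, 13)  # (batch, T frames, spectral feature dim)
H, _ = gru(spectral)                # H: (1, T, m) acoustic feature matrix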
Next, the acquisition of the text feature of the answer text is introduced:
In one optional way, a vector of the answer text can be obtained directly and used as the text feature of the answer text.
The vector of the answer text may be the combination of the word vectors of the text units that make up the answer text, or the result of some computation on those word vectors; for example, a neural network model may extract a hidden-layer feature from a text unit's word vector and use it as the vector result of that text unit. The representation of a text unit's word vector is not restricted; for example, one-hot or embedding methods may be used.
Further, the text units of the answer text can be chosen freely, e.g., word-level, phoneme-level, or root-level text units.
For ease of presentation, the answer text is defined to contain C text units.
A word vector can then be obtained for each text unit in the answer text, and the text feature of the answer text is finally determined from the word vectors of the C text units.
In another optional way, the vector of the answer text can be obtained, and then the hidden-layer feature produced by a hidden layer of a neural network model from the vector can be obtained and used as the text feature.
As above, the neural network model may take various structural forms, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), or a GRU (Gated Recurrent Unit).
Converting the vector of the answer text through the hidden layer of the neural network model applies a deep mapping to it; the resulting hidden-layer feature is at a deeper level than the answer text's vector and better captures the textual characteristics of the answer text, so the hidden-layer feature can be used as the text feature.
The text feature can be expressed in the following matrix form:

$$S = [s_1, s_2, \ldots, s_C]^{\top}$$

where $s_i$ ($i = 1, 2, \ldots, C$) denotes the text feature of the i-th text unit; the dimension of each text unit's text feature is fixed and defined as n.
In another embodiment of this application, the process in step S110 of determining the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech and the text feature of the answer text is described.
In this embodiment, a frame-level attention matrix can be determined based on the acoustic feature of the speech to be evaluated and the text feature of the answer text. The frame-level attention matrix comprises, for each text unit in the answer text, the alignment probability of each frame of the speech to be evaluated to that text unit.
The frame-level attention matrix so determined can serve as the alignment information between the speech to be evaluated and the answer text.
Next, the above alignment probability is illustrated by formula:

$$e_{it} = a(h_t, s_i) = w^{\top}(W s_i + V h_t + b), \qquad a_{it} = \frac{\exp(e_{it})}{\sum_{t'=1}^{T} \exp(e_{it'})}$$

where $e_{it}$ denotes the alignment information between the text feature of the i-th text unit and the acoustic feature of the t-th speech frame; $a_{it}$ denotes, for the i-th text unit, the alignment probability of the t-th speech frame to the i-th text unit; $s_i$ denotes the text feature of the i-th text unit, an n-dimensional vector; $h_t$ denotes the acoustic feature of the t-th speech frame, an m-dimensional vector; and W, V, w, b are four parameters, where W may be a k×n matrix, V a k×m matrix, and w a k-dimensional vector, these three being used for feature mapping, while b is a bias, which may be a k-dimensional vector.
The frame-level attention matrix can then be expressed as $A = (a_{it})$, a C×T matrix whose i-th row contains the alignment probabilities of all T frames to the i-th text unit.
This embodiment provides an optional attention-based implementation in which the frame-level attention matrix is determined by a neural network model, which may specifically include:
processing the acoustic feature and the text feature with a first fully connected layer of the neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
The first fully connected layer of the neural network model can be expressed in the form of the formulas for $e_{it}$ and $a_{it}$ above, with the four parameters W, V, w, b as its parameters. By iteratively training the neural network model, these four parameters are updated iteration by iteration until they are fixed after training.
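A minimal sketch of this first fully connected layer: it computes e_it = w^T(W s_i + V h_t + b) for every (text unit, frame) pair and applies a softmax over frames to obtain the alignment probabilities a_it. The internal dimension k and the use of PyTorch are assumptions of this sketch.

import torch
import torch.nn as nn

class FrameLevelAttention(nn.Module):
    def __init__(self, n, m, k):
        super().__init__()
        self.W = nn.Linear(n, k, bias=False)  # W: maps text features
        self.V = nn.Linear(m, k, bias=True)   # V: maps acoustic features; its bias plays the role of b
        self.w = nn.Linear(k, 1, bias=False)  # w^T: final projection

    def forward(self, S, H):
        # S: (C, n) text features, H: (T, m) acoustic features
        e = self.w(self.W(S)[:, None, :] + self.V(H)[None, :, :]).squeeze(-1)
        return torch.softmax(e, dim=1)  # (C, T): row i holds a_it over all frames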
The frame-level attention matrix determined in this embodiment, used as alignment information, contains the alignment probability of each frame of the speech to be evaluated to each text unit in the answer text; that is, frame-level alignment information of the speech has been obtained. This frame-level attention matrix is related to the matching degree of the speech to be evaluated against the evaluation standard, so the evaluation result of the speech relative to the answer text can subsequently be determined from it.
Further, considering that speaking rates differ between users, the durations of the speech produced by different users expressing the same answer text may differ, and so may the number of frames. The frame-level attention matrix determined as alignment information by the above scheme then differs with the frame count, and the evaluation result based on it may differ as well. In reality, since the different users express the same answer text, the evaluation results ought to be identical. To address this problem, this embodiment provides another scheme for determining the alignment information.
On the basis of the frame-level attention matrix obtained above from the acoustic feature of the speech to be evaluated and the text feature of the answer text, this embodiment adds the following processing steps:
1. A word-level acoustic alignment matrix is determined based on the frame-level attention matrix and the acoustic feature. The word-level acoustic alignment matrix comprises the acoustic information aligned with each text unit in the answer text; the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights.
Specifically, the acoustic information aligned with the i-th text unit in the word-level acoustic alignment matrix is expressed as:

$$c_i = \sum_{t=1}^{T} a_{it} h_t$$

where $a_{it}$ and $h_t$ have the meanings introduced above.
The word-level acoustic alignment matrix can then be expressed as $[c_1, c_2, \ldots, c_C]^{\top}$, where $c_i$ ($i = 1, 2, \ldots, C$) denotes the acoustic alignment information of the i-th text unit, an m-dimensional vector.
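A minimal sketch of this step: with the frame-level attention matrix A (C×T) and the acoustic feature matrix H (T×m), each row c_i of the word-level acoustic alignment matrix is the attention-weighted sum of the frame acoustic features, so the whole matrix is a single matrix product. The shapes are assumptions of this sketch.

import torch

A = torch.softmax(torch.randn(20, 200), dim=1)  # (C, T) frame-level attention
H = torch.randn(200, 64)                        # (T, m) acoustic features
C_matrix = A @ H                                # (C, m): row i equals sum_t a_it * h_t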
2. A word-level attention matrix is determined based on the word-level acoustic alignment matrix and the text feature. The word-level attention matrix comprises, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
The word-level attention matrix determined in this step can serve as the alignment information between the speech to be evaluated and the answer text. Next, it is illustrated by formula:

$$K_{ij} = s_j^{\top}(U c_i), \qquad I_{ij} = \frac{\exp(K_{ij})}{\sum_{j'=1}^{C} \exp(K_{ij'})}$$

where $K_{ij}$ denotes the alignment information between the acoustic information of the i-th text unit and the text feature of the j-th text unit; $I_{ij}$ denotes the alignment probability of the acoustic information of the i-th text unit to the text feature of the j-th text unit; $s_j^{\top}$ is the transpose of $s_j$; $c_i$ denotes the acoustic alignment information of the i-th text unit; $s_j$ denotes the text feature of the j-th text unit; and U is a parameter that maps the word-level acoustic alignment feature to the same dimension as the text feature so that the dot product can be taken.
The word-level attention matrix can be expressed as $I = (I_{ij})$, a C×C matrix.
This embodiment provides an optional attention-based implementation in which the word-level attention matrix is determined by a neural network model, which may specifically include:
processing the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
The second fully connected layer of the neural network model can be expressed in the form of the formulas for $K_{ij}$ and $I_{ij}$ above, with U as its parameter. By iteratively training the neural network model, the parameter U is updated iteration by iteration until it is fixed when training ends.
The word-level attention matrix determined in this embodiment, used as alignment information, contains the alignment probability of the acoustic information of each text unit in the answer text to the text feature of every text unit. This word-level attention matrix is related to the matching degree of the speech to be evaluated against the evaluation standard, so the evaluation result of the speech relative to the answer text can subsequently be determined from it.
Further, since the word-level attention matrix is independent of the number of frames in the speech to be evaluated, i.e., independent of the user's speaking rate, and only accounts for the alignment relation between the text feature and the acoustic feature, it overcomes the aforementioned drawback that users with different speaking rates expressing the same answer text receive different evaluation results. In other words, using the word-level attention matrix of this embodiment as the alignment information yields higher evaluation accuracy.
In another embodiment of this application, the process in step S120 of determining, according to the alignment information, the evaluation result of the speech to be evaluated relative to the answer text is described.
It can be understood that the alignment information relied on in this embodiment may be the frame-level attention matrix above or the word-level attention matrix above. The process of determining the evaluation result according to the alignment information may then include:
1) Determining the matching degree between the speech to be evaluated and the answer text according to the alignment information.
Specifically, the alignment information has been determined above and may be the frame-level attention matrix or the word-level attention matrix. Based on this alignment information, the matching degree between the speech to be evaluated and the answer text can be determined.
In one optional way, a convolution unit of the neural network model can process the alignment information, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
The matrix size of the alignment information fed into the convolution unit of the neural network model can be fixed; it can be determined from the length of a typical answer text. For example, if a typical answer text has no more than 20 words, the matrix size can be 20×20. Missing elements can be filled with 0.
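A minimal sketch of this convolution unit: the alignment matrix is zero-padded (and, if needed, cropped) to the fixed 20×20 size and passed through a small CNN that emits a matching-degree vector. The channel count, kernel size, and pooling are assumptions of this sketch.

import torch
import torch.nn as nn

def pad_to_fixed(align, size=20):
    fixed = torch.zeros(size, size)
    r, c = min(align.shape[0], size), min(align.shape[1], size)
    fixed[:r, :c] = align[:r, :c]  # missing elements remain 0
    return fixed

conv_unit = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
)
align = torch.softmax(torch.randn(15, 18), dim=1)        # a word-level attention matrix
match_vec = conv_unit(pad_to_fixed(align)[None, None])   # (1, 128) matching-degree vector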
2) Determining, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
In one optional way, a third fully connected layer of the neural network model can process the matching degree, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
The third fully connected layer can be expressed as:

$$y = Fx + g$$

where x is the matching degree, y is the regressed evaluation result, which may be numeric, F is the feature mapping matrix, and g is a bias.
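Continuing the sketch above, the third fully connected layer y = Fx + g reduces to a single linear layer regressing the matching-degree vector to a score; the 128-dimensional input is an assumption carried over from the previous sketch.

import torch.nn as nn

third_fc = nn.Linear(in_features=128, out_features=1)  # weight is F, bias is g
score = third_fc(match_vec)  # (1, 1): the regressed evaluation score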
The evaluation result may be a specific regressed score, whose magnitude indicates the quality of the speech to be evaluated, i.e., the matching degree between the speech and the evaluation standard. Alternatively, the evaluation result may be the probability that the speech to be evaluated belongs to some category; several categories can be preset, with different categories indicating different matching degrees between the speech and the evaluation standard, i.e., different quality levels of the speech, for example three categories: excellent, good, poor.
It should be noted that the neural network model referred to in the above embodiments can be a single neural network model whose different hierarchical structures process the respective data. For example, several hidden layers of the neural network model can convert the spectral feature, several other hidden layers can convert the word vectors, a first fully connected layer can generate the frame-level attention matrix, a second fully connected layer can generate the word-level attention matrix, a convolution unit can generate the matching degree between the speech to be evaluated and the answer text, and a third fully connected layer can generate the evaluation result of the speech relative to the answer text. On this basis, speech training data labeled with manual evaluation results, together with answer texts, can be obtained in advance to train the neural network model, with the parameters of the different levels of the model updated iteratively by the back-propagation algorithm and fixed after training.
Taking the case where the evaluation result is an evaluation score as an example, the neural network model can be trained with an objective function built over data pairs, each pair consisting of samples whose manual scores differ by a certain amount, so that the model learns the differences between different scores. Consistent with the description below, the objective function can take the following form:

$$\mathcal{L} = \sum_{i} (y_i - z_i)^2 + \sum_{i} \big[(y_{i+1} - y_i) - (z_{i+1} - z_i)\big]^2$$

where $y_i$ and $y_{i+1}$ are the model-predicted scores of the i-th and (i+1)-th samples in the training data, and $z_i$ and $z_{i+1}$ are their manual scores.
The purpose of this objective function is to minimize the difference between the model-predicted score and the manual score, and to bring the difference between the predicted scores of two adjacent samples closer to the difference between their manual scores, so that the model learns the differences between different scores.
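A minimal sketch of this objective as reconstructed above: one term penalizes the gap between predicted and manual scores, another penalizes the gap between adjacent score differences; the equal weighting of the two terms is an assumption of this sketch.

import torch

def pairwise_objective(y, z):
    # y: model-predicted scores, z: manual scores, both 1-D tensors in data order
    point_term = ((y - z) ** 2).sum()
    diff_term = (((y[1:] - y[:-1]) - (z[1:] - z[:-1])) ** 2).sum()
    return point_term + diff_term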
Referring to Fig. 2 and Fig. 3, schematic flow diagrams of speech evaluation performed by neural network models of two different structures are illustrated.
In Fig. 2, the word-level attention matrix is used as the alignment information, and the evaluation result is determined based on this alignment information. In Fig. 3, the frame-level attention matrix is used as the alignment information, and the evaluation result is determined based on this alignment information.
As shown in Fig. 2, the dashed box is the internal flow of the neural network model. The acoustic feature extracted from the speech to be evaluated and the text feature extracted from the answer text serve as the model's inputs and each pass through an RNN hidden layer, which extracts a deep acoustic feature matrix and a deep text feature matrix respectively. These are fed into the first fully connected layer, which outputs the frame-level attention matrix; the dot product of the frame-level attention matrix and the deep acoustic feature matrix gives the word-level acoustic alignment matrix. The word-level acoustic alignment matrix and the deep text feature matrix are the inputs of the second fully connected layer, which outputs the word-level attention matrix. The word-level attention matrix is fed into a CNN convolution unit to obtain the processed matching-degree vector, which is input to the third fully connected layer, and the third fully connected layer regresses the evaluation score.
This neural network model can be trained by the back-propagation algorithm, iteratively updating the parameters of each hierarchical structure.
In Fig. 3, the dashed box is likewise the internal flow of the neural network model. Compared with Fig. 2, the neural network model illustrated in Fig. 3 lacks the second fully connected layer. In the corresponding flow, the frame-level attention matrix output by the first fully connected layer serves directly as the input of the CNN convolution unit, which outputs the matching-degree vector based on the frame-level attention matrix; the subsequent flow is the same. Compared with the flow of Fig. 2, Fig. 3 omits the step of obtaining the word-level attention matrix through the second fully connected layer.
Similarly, this neural network model can be trained by the back-propagation algorithm, iteratively updating the parameters of each hierarchical structure.
It should be further noted that the neural network model referred to in the above embodiments can also be multiple independent neural network models that cooperate to complete the entire speech evaluation process. For example, the neural network model that converts the spectral feature into the deep acoustic feature can be an independent model: a speech recognition model can serve as this independent neural network model, its hidden layer converts the spectral feature, and the resulting hidden-layer feature serves as the deep acoustic feature.
The speech evaluation apparatus provided in the embodiments of this application is described below; the speech evaluation apparatus described below and the speech evaluation method described above may be referred to each other correspondingly.
Referring to Fig. 4, Fig. 4 is a schematic structural diagram of a speech evaluation apparatus disclosed in an embodiment of this application. As shown in Fig. 4, the apparatus may include:
a data acquisition unit 11, configured to obtain a speech to be evaluated and an answer text serving as the evaluation standard;
an alignment information determination unit 12, configured to determine alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
an evaluation result determination unit 13, configured to determine, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Optionally, the apparatus of this application may further include an acoustic feature acquisition unit, configured to obtain the acoustic feature of the speech to be evaluated. Specifically, the acoustic feature acquisition unit may include:
a first acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated as the acoustic feature;
or,
a second acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated;
a third acoustic feature acquisition subunit, configured to obtain, as the acoustic feature, the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature.
Optionally, the apparatus of this application may further include a text feature acquisition unit, configured to obtain the text feature of the answer text. Specifically, the text feature acquisition unit may include:
a first text feature acquisition subunit, configured to obtain a vector of the answer text as the text feature;
or,
a second text feature acquisition subunit, configured to obtain a vector of the answer text;
a third text feature acquisition subunit, configured to obtain, as the text feature, the hidden-layer feature produced by a hidden layer of a neural network model from the vector.
Optionally, the alignment information determination unit may include:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the frame-level attention matrix comprising, for each text unit in the answer text, the alignment probability of each speech frame in the speech to be evaluated to that text unit.
Optionally, the frame-level attention matrix determination unit may include:
a first fully connected layer processing unit, configured to process the acoustic feature and the text feature with a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
Optionally, the alignment information determination unit may further include:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic feature, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights;
a word-level attention matrix determination unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text feature, the word-level attention matrix comprising, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
Optionally, the word-level attention matrix determination unit may include:
a second fully connected layer processing unit, configured to process the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
Optionally, the evaluation result determination unit may include:
a matching degree determination unit, configured to determine a matching degree between the speech to be evaluated and the answer text according to the alignment information;
a matching degree application unit, configured to determine, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
Optionally, the matching degree determination unit may include:
a convolution unit processing unit, configured to process the alignment information with a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
Optionally, the matching degree application unit may include:
a third fully connected layer processing unit, configured to process the matching degree with a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
The speech evaluation apparatus provided in the embodiments of this application can be applied to a speech evaluation device, such as a PC terminal, a cloud platform, a server, or a server cluster. Optionally, Fig. 5 shows the hardware structure block diagram of the speech evaluation device. Referring to Fig. 5, the hardware structure of the speech evaluation device may include at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In this embodiment of this application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like.
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one magnetic disk memory.
The memory stores a program, and the processor may call the program stored in the memory. The program is used for:
obtaining a speech to be evaluated, and an answer text serving as the evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of this application further provides a readable storage medium, which may store a program suitable for execution by a processor. The program is used for:
obtaining a speech to be evaluated, and an answer text serving as the evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Optionally, for the refined and extended functions of the program, refer to the description above.
Finally, it should be noted that in this document, relational terms such as first and second are merely used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by "including a ..." does not preclude the existence of other identical elements in the process, method, article, or device that includes the element.
Each embodiment in this specification is described in a progressive manner, the highlights of each of the examples are with other
The difference of embodiment, the same or similar parts in each embodiment may refer to each other.
The foregoing description of the disclosed embodiments makes professional and technical personnel in the field can be realized or use the application.
Various modifications to these embodiments will be readily apparent to those skilled in the art, as defined herein
General Principle can be realized in other embodiments without departing from the spirit or scope of the application.Therefore, the application
It is not intended to be limited to the embodiments shown herein, and is to fit to and the principles and novel features disclosed herein phase one
The widest scope of cause.
Claims (16)
1. A speech evaluation method, characterized in that it comprises:
obtaining a speech to be evaluated, and an answer text serving as an evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on acoustic features of the speech to be evaluated and text features of the answer text;
determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
2. The method according to claim 1, wherein the acquisition process of the acoustic features of the speech to be evaluated comprises:
obtaining spectrum features of the speech to be evaluated as the acoustic features;
or,
obtaining spectrum features of the speech to be evaluated; and
obtaining, as the acoustic features, the hidden layer features output by a hidden layer of a neural network model after the spectrum features are converted by the model.
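A minimal sketch of both options in claim 2, for illustration only: torchaudio supplies the spectrum feature, and an LSTM stands in for the unspecified neural network model whose hidden layer supplies the converted feature (the mel configuration, layer sizes, and file path are assumptions).

```python
import torch
import torchaudio

# Load the speech to be evaluated ("response.wav" is a placeholder path).
waveform, sample_rate = torchaudio.load("response.wav")

# Option 1: use the spectrum feature itself as the acoustic feature.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
spectrum = mel(waveform)                                    # (channels, 80, n_frames)

# Option 2: pass the spectrum through a neural network model and take a
# hidden layer's per-frame output as the acoustic feature.
frames = spectrum.squeeze(0).transpose(0, 1).unsqueeze(0)   # (1, n_frames, 80)
encoder = torch.nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
hidden_feature, _ = encoder(frames)                         # (1, n_frames, 256)
```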
3. The method according to claim 1, wherein the acquisition process of the text features of the answer text comprises:
obtaining vectors of the answer text as the text features;
or,
obtaining vectors of the answer text; and
obtaining, as the text features, the hidden layer features output by a hidden layer of a neural network model after the vectors are converted by the model.
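The text-feature side of claim 3 can be sketched the same way, again as an illustration only; the toy vocabulary, embedding size, and LSTM encoder are assumptions.

```python
import torch

# Token ids of the answer text (a toy example; a real system would tokenize
# the text and use a trained embedding table).
answer_ids = torch.tensor([[12, 47, 3, 951]])               # (1, n_words)
embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=128)

# Option 1: the word vectors themselves serve as the text feature.
vectors = embedding(answer_ids)                             # (1, n_words, 128)

# Option 2: a hidden layer's output after the vectors pass through a model.
text_encoder = torch.nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
hidden_text_feature, _ = text_encoder(vectors)              # (1, n_words, 256)
```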
4. The method according to claim 1, wherein determining the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises:
determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, the frame-level attention matrix comprising, for any text unit in the answer text, the alignment probability of each frame of the speech to be evaluated to that text unit.
5. The method according to claim 4, wherein determining the frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises:
processing the acoustic features of the speech to be evaluated and the text features of the answer text using a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic features and the text features so as to generate an internal state representation of the frame-level attention matrix.
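One plausible reading of claims 4 and 5, offered as an illustration only: a bilinear projection stands in for the unspecified fully connected layer, and a softmax over the frame axis makes each text unit's alignment probabilities sum to 1 (all tensor shapes are assumptions).

```python
import torch

n_frames, n_words, d = 200, 12, 256
acoustic = torch.randn(1, n_frames, d)   # per-frame acoustic features
text = torch.randn(1, n_words, d)        # per-word text features

# "First fully connected layer": here realized as a bilinear projection that
# produces unnormalized frame-to-word alignment scores.
first_fc = torch.nn.Linear(d, d, bias=False)
scores = torch.bmm(first_fc(acoustic), text.transpose(1, 2))  # (1, n_frames, n_words)

# Softmax over the frame axis: for each text unit, the alignment
# probabilities of all frames of speech sum to 1.
frame_level_attention = torch.softmax(scores, dim=1)
```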
6. The method according to claim 4, wherein determining the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text further comprises:
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, the acoustic information being the result of a weighted sum of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, the word-level attention matrix comprising, for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
7. The method according to claim 6, wherein determining the word-level attention matrix based on the word-level acoustic alignment matrix and the text features comprises:
processing the word-level acoustic alignment matrix and the text features using a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text features so as to generate an internal state representation of the word-level attention matrix.
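Continuing the shapes from the previous sketch, claims 6 and 7 might look as follows, again for illustration only: the word-level acoustic alignment matrix is a probability-weighted sum of frame features, and the word-level attention matrix comes from another fully connected scoring step (the bilinear form is an assumption).

```python
import torch

n_frames, n_words, d = 200, 12, 256
acoustic = torch.randn(1, n_frames, d)
text = torch.randn(1, n_words, d)
frame_level_attention = torch.softmax(torch.randn(1, n_frames, n_words), dim=1)

# Word-level acoustic alignment matrix: for each text unit, the weighted sum
# of per-frame acoustic features, with the frame-to-unit alignment
# probabilities as the weights.
word_acoustic = torch.bmm(frame_level_attention.transpose(1, 2), acoustic)  # (1, n_words, d)

# "Second fully connected layer": scores each unit's aggregated acoustic
# information against every unit's text feature, then normalizes into the
# word-level attention matrix.
second_fc = torch.nn.Linear(d, d, bias=False)
word_scores = torch.bmm(second_fc(word_acoustic), text.transpose(1, 2))  # (1, n_words, n_words)
word_level_attention = torch.softmax(word_scores, dim=-1)
```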
8. The method according to any one of claims 1-7, wherein determining the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information comprises:
determining a matching degree between the speech to be evaluated and the answer text according to the alignment information;
determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
9. The method according to claim 8, wherein determining the matching degree between the speech to be evaluated and the answer text according to the alignment information comprises:
processing the alignment information using a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information so as to generate an internal state representation of the matching degree between the speech to be evaluated and the answer text.
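Claim 9's convolution unit can be pictured, purely as an illustration, as a small CNN that treats the word-level attention matrix as a one-channel image and pools it into a fixed-size matching-degree representation; the channel count, kernel size, and pooling are assumptions.

```python
import torch

# A word-level attention matrix standing in for the alignment information
# (12 words on each axis; sizes are illustrative).
word_level_attention = torch.softmax(torch.randn(1, 12, 12), dim=-1)

# Convolution unit: treat the attention matrix as a one-channel image and
# pool it into a fixed-size internal state representing the matching degree.
conv_unit = torch.nn.Sequential(
    torch.nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d((4, 4)),
)
matching_degree = conv_unit(word_level_attention.unsqueeze(1)).flatten(1)  # (1, 128)
```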
10. The method according to claim 8, wherein determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree comprises:
processing the matching degree using a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
11. A speech evaluation device, characterized in that it comprises:
a data acquisition unit for obtaining a speech to be evaluated, and an answer text serving as an evaluation standard;
an alignment information determination unit for determining alignment information between the speech to be evaluated and the answer text based on acoustic features of the speech to be evaluated and text features of the answer text;
an evaluation result determination unit for determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
12. The device according to claim 11, wherein the alignment information determination unit comprises:
a frame-level attention matrix determination unit for determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, the frame-level attention matrix comprising, for any text unit in the answer text, the alignment probability of each frame of the speech to be evaluated to that text unit.
13. The device according to claim 12, wherein the alignment information determination unit further comprises:
a word-level acoustic alignment matrix determination unit for determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, the acoustic information being the result of a weighted sum of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
a word-level attention matrix determination unit for determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, the word-level attention matrix comprising, for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
14. The device according to any one of claims 11-13, wherein the evaluation result determination unit comprises:
a matching degree determination unit for determining a matching degree between the speech to be evaluated and the answer text according to the alignment information;
a matching degree applying unit for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
15. A speech evaluation equipment, characterized in that it comprises a memory and a processor;
the memory being configured to store a program;
the processor being configured to execute the program to implement each step of the speech evaluation method according to any one of claims 1-10.
16. A readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, each step of the speech evaluation method according to any one of claims 1-10 is implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811162964.0A CN109215632B (en) | 2018-09-30 | 2018-09-30 | Voice evaluation method, device and equipment and readable storage medium |
JP2018223934A JP6902010B2 (en) | 2018-09-30 | 2018-11-29 | Audio evaluation methods, devices, equipment and readable storage media |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811162964.0A CN109215632B (en) | 2018-09-30 | 2018-09-30 | Voice evaluation method, device and equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109215632A (en) | 2019-01-15 |
CN109215632B CN109215632B (en) | 2021-10-08 |
Family
ID=64982845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811162964.0A Active CN109215632B (en) | 2018-09-30 | 2018-09-30 | Voice evaluation method, device and equipment and readable storage medium |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6902010B2 (en) |
CN (1) | CN109215632B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113707178B (en) * | 2020-05-22 | 2024-02-06 | 苏州声通信息科技有限公司 | Audio evaluation method and device and non-transient storage medium |
CN111862957A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Single track voice keyword low-power consumption real-time detection method |
CN112256841B (en) * | 2020-11-26 | 2024-05-07 | 支付宝(杭州)信息技术有限公司 | Text matching and countermeasure text recognition method, device and equipment |
CN112562724B (en) * | 2020-11-30 | 2024-05-17 | 携程计算机技术(上海)有限公司 | Speech quality assessment model, training assessment method, training assessment system, training assessment equipment and medium |
CN113379234B (en) * | 2021-06-08 | 2024-06-18 | 北京猿力未来科技有限公司 | Evaluation result generation method and device |
CN113707148B (en) * | 2021-08-05 | 2024-04-19 | 中移(杭州)信息技术有限公司 | Method, device, equipment and medium for determining speech recognition accuracy |
CN114155831A (en) * | 2021-12-06 | 2022-03-08 | 科大讯飞股份有限公司 | Voice evaluation method, related equipment and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104347071A (en) * | 2013-08-02 | 2015-02-11 | 安徽科大讯飞信息科技股份有限公司 | Method and system for generating oral test reference answer |
CN104361895A (en) * | 2014-12-04 | 2015-02-18 | 上海流利说信息技术有限公司 | Voice quality evaluation equipment, method and system |
CN104810017A (en) * | 2015-04-08 | 2015-07-29 | 广东外语外贸大学 | Semantic analysis-based oral language evaluating method and system |
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | 腾讯科技(深圳)有限公司 | A kind of voice keyword recognition method, device, terminal and server |
CN107818795A (en) * | 2017-11-15 | 2018-03-20 | 苏州驰声信息科技有限公司 | The assessment method and device of a kind of Oral English Practice |
CN109192224A (en) * | 2018-09-14 | 2019-01-11 | 科大讯飞股份有限公司 | A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05333896A (en) * | 1992-06-01 | 1993-12-17 | Nec Corp | Conversational sentence recognition system |
US8231389B1 (en) * | 2004-04-29 | 2012-07-31 | Wireless Generation, Inc. | Real-time observation assessment with phoneme segment capturing and scoring |
JP2008052178A (en) * | 2006-08-28 | 2008-03-06 | Toyota Motor Corp | Voice recognition device and voice recognition method |
JP5834291B2 (en) * | 2011-07-13 | 2015-12-16 | ハイウエア株式会社 | Voice recognition device, automatic response method, and automatic response program |
JP6217304B2 (en) * | 2013-10-17 | 2017-10-25 | ヤマハ株式会社 | Singing evaluation device and program |
JP6674706B2 (en) * | 2016-09-14 | 2020-04-01 | Kddi株式会社 | Program, apparatus and method for automatically scoring from dictation speech of learner |
CN108154735A (en) * | 2016-12-06 | 2018-06-12 | 爱天教育科技(北京)有限公司 | Oral English Practice assessment method and device |
CN106847260B (en) * | 2016-12-20 | 2020-02-21 | 山东山大鸥玛软件股份有限公司 | Automatic English spoken language scoring method based on feature fusion |
US20190362703A1 (en) * | 2017-02-15 | 2019-11-28 | Nippon Telegraph And Telephone Corporation | Word vectorization model learning device, word vectorization device, speech synthesis device, method thereof, and program |
- 2018-09-30: CN application CN201811162964.0A filed, granted as CN109215632B (active)
- 2018-11-29: JP application JP2018223934A filed, granted as JP6902010B2 (active)
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027794B (en) * | 2019-03-29 | 2023-09-26 | 广东小天才科技有限公司 | Correction method and learning equipment for dictation operation |
CN111027794A (en) * | 2019-03-29 | 2020-04-17 | 广东小天才科技有限公司 | Dictation operation correcting method and learning equipment |
CN109979482B (en) * | 2019-05-21 | 2021-12-07 | 科大讯飞股份有限公司 | Audio evaluation method and device |
CN109979482A (en) * | 2019-05-21 | 2019-07-05 | 科大讯飞股份有限公司 | A kind of evaluating method and device for audio |
CN110223689A (en) * | 2019-06-10 | 2019-09-10 | 秒针信息技术有限公司 | The determination method and device of the optimization ability of voice messaging, storage medium |
CN110600006A (en) * | 2019-10-29 | 2019-12-20 | 福建天晴数码有限公司 | Speech recognition evaluation method and system |
CN110600006B (en) * | 2019-10-29 | 2022-02-11 | 福建天晴数码有限公司 | Speech recognition evaluation method and system |
CN110782917B (en) * | 2019-11-01 | 2022-07-12 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN110782917A (en) * | 2019-11-01 | 2020-02-11 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN111128120A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Text-to-speech method and device |
CN111652165A (en) * | 2020-06-08 | 2020-09-11 | 北京世纪好未来教育科技有限公司 | Mouth shape evaluating method, mouth shape evaluating equipment and computer storage medium |
CN112257407A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Method and device for aligning text in audio, electronic equipment and readable storage medium |
CN112257407B (en) * | 2020-10-20 | 2024-05-14 | 网易(杭州)网络有限公司 | Text alignment method and device in audio, electronic equipment and readable storage medium |
CN112837673A (en) * | 2020-12-31 | 2021-05-25 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, computer device and medium based on artificial intelligence |
CN112837673B (en) * | 2020-12-31 | 2024-05-10 | 平安科技(深圳)有限公司 | Speech synthesis method, device, computer equipment and medium based on artificial intelligence |
CN113506585A (en) * | 2021-09-09 | 2021-10-15 | 深圳市一号互联科技有限公司 | Quality evaluation method and system for voice call |
Also Published As
Publication number | Publication date |
---|---|
JP6902010B2 (en) | 2021-07-14 |
JP2020056982A (en) | 2020-04-09 |
CN109215632B (en) | 2021-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109215632A (en) | A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing | |
CN110782921B (en) | Voice evaluation method and device, storage medium and electronic device | |
CN112508334B (en) | Personalized paper grouping method and system integrating cognition characteristics and test question text information | |
CN107133303A (en) | Method and apparatus for output information | |
CN111833853A (en) | Voice processing method and device, electronic equipment and computer readable storage medium | |
CN114254208A (en) | Identification method of weak knowledge points and planning method and device of learning path | |
CN109273023A (en) | A kind of data evaluating method, device, equipment and readable storage medium storing program for executing | |
CN103594087A (en) | Method and system for improving oral evaluation performance | |
US10283142B1 (en) | Processor-implemented systems and methods for determining sound quality | |
Prabhudesai et al. | Automatic short answer grading using Siamese bidirectional LSTM based regression | |
CN111460101A (en) | Knowledge point type identification method and device and processor | |
CN113314100A (en) | Method, device, equipment and storage medium for evaluating and displaying results of spoken language test | |
CN112015862A (en) | User abnormal comment detection method and system based on hierarchical multichannel attention | |
CN110956142A (en) | Intelligent interactive training system | |
CN114429212A (en) | Intelligent learning knowledge ability tracking method, electronic device and storage medium | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
WO2004053834A2 (en) | Systems and methods for dynamically analyzing temporality in speech | |
KR20210071713A (en) | Speech Skill Feedback System | |
CN108876677B (en) | Teaching effect evaluation method based on big data and artificial intelligence and robot system | |
CN117079504B (en) | Wrong question data management method of big data accurate teaching and reading system | |
CN117151548A (en) | Music online learning method and system based on hand motion judgment | |
CN112116181B (en) | Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device | |
CN107578785A (en) | The continuous emotional feature analysis evaluation method of music based on Gamma distributional analysis | |
Cheng et al. | Towards accurate recognition for children's oral reading fluency | |
CN109979482A (en) | A kind of evaluating method and device for audio |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||