CN109215632A - Speech evaluation method, apparatus, device and readable storage medium - Google Patents
Speech evaluation method, apparatus, device and readable storage medium
- Publication number: CN109215632A (application number CN201811162964.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- voice
- evaluated
- feature
- answer text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
Abstract
This application discloses a speech evaluation method, apparatus, device and readable storage medium. The application obtains a speech to be evaluated and an answer text serving as the evaluation standard. Based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the alignment information between the speech to be evaluated and the answer text can be determined. It can be understood that the alignment information reflects the alignment relation between the speech to be evaluated and the answer text, so an evaluation result of the speech relative to the answer text can be determined automatically according to the alignment information. Since no manual evaluation is needed, the method both avoids interference from human subjectivity in the evaluation result and reduces labor cost.
Description
Technical field
This application relates to the field of speech processing, and more specifically to a speech evaluation method, apparatus, device and readable storage medium.
Background art
With the continuing deepening of educational reform, spoken-language examinations are being carried out throughout the country. Compared with written examinations, a spoken-language examination can assess an examinee's oral proficiency.
In existing spoken-language examinations, professional teachers mostly evaluate an examinee's answer against the correct-answer information for each question. This manual evaluation is highly susceptible to human subjectivity, so the results suffer from human interference, and it also consumes a large amount of labor cost.
Summary of the invention
In view of this, this application provides a speech evaluation method, apparatus, device and readable storage medium to overcome the drawbacks of the existing manual evaluation of spoken-language examinations.
To achieve the above goal, the proposed solutions are as follows:
A speech evaluation method, comprising:
obtaining a speech to be evaluated, and an answer text serving as the evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Preferably, the process of acquiring the acoustic feature of the speech to be evaluated comprises:
obtaining a spectral feature of the speech to be evaluated as the acoustic feature;
or,
obtaining a spectral feature of the speech to be evaluated;
obtaining, as the acoustic feature, the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature.
Preferably, the process of acquiring the text feature of the answer text comprises:
obtaining a vector of the answer text as the text feature;
or,
obtaining a vector of the answer text;
obtaining, as the text feature, the hidden-layer feature produced by a hidden layer of a neural network model from the vector.
Preferably, determining the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech to be evaluated and the text feature of the answer text comprises:
determining a frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the frame-level attention matrix comprising, for each text unit in the answer text, the alignment probability of each speech frame in the speech to be evaluated to that text unit.
Preferably, determining the frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text comprises:
processing the acoustic feature of the speech to be evaluated and the text feature of the answer text with a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
Preferably, determining the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech to be evaluated and the text feature of the answer text further comprises:
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic feature, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text feature, the word-level attention matrix comprising, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
Preferably, determining the word-level attention matrix based on the word-level acoustic alignment matrix and the text feature comprises:
processing the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
Preferably, determining, according to the alignment information, the evaluation result of the speech to be evaluated relative to the answer text comprises:
determining a matching degree between the speech to be evaluated and the answer text according to the alignment information;
determining, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
Preferably, determining the matching degree between the speech to be evaluated and the answer text according to the alignment information comprises:
processing the alignment information with a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
Preferably, determining, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text comprises:
processing the matching degree with a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluation apparatus, comprising:
a data acquisition unit, configured to obtain a speech to be evaluated and an answer text serving as the evaluation standard;
an alignment information determination unit, configured to determine alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
an evaluation result determination unit, configured to determine, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Preferably, the apparatus further comprises an acoustic feature acquisition unit, comprising:
a first acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated as the acoustic feature;
or,
a second acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated;
a third acoustic feature acquisition subunit, configured to obtain, as the acoustic feature, the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature.
Preferably, the apparatus further comprises a text feature acquisition unit, comprising:
a first text feature acquisition subunit, configured to obtain a vector of the answer text as the text feature;
or,
a second text feature acquisition subunit, configured to obtain a vector of the answer text;
a third text feature acquisition subunit, configured to obtain, as the text feature, the hidden-layer feature produced by a hidden layer of a neural network model from the vector.
Preferably, the alignment information determination unit comprises:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the frame-level attention matrix comprising, for each text unit in the answer text, the alignment probability of each speech frame in the speech to be evaluated to that text unit.
Preferably, the frame-level attention matrix determination unit comprises:
a first fully connected layer processing unit, configured to process the acoustic feature and the text feature with a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
Preferably, the alignment information determination unit further comprises:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic feature, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights;
a word-level attention matrix determination unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text feature, the word-level attention matrix comprising, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
Preferably, the word-level attention matrix determination unit comprises:
a second fully connected layer processing unit, configured to process the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
Preferably, the evaluation result determination unit comprises:
a matching degree determination unit, configured to determine a matching degree between the speech to be evaluated and the answer text according to the alignment information;
a matching degree application unit, configured to determine, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
Preferably, the matching degree determination unit comprises:
a convolution unit processing unit, configured to process the alignment information with a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
Preferably, the matching degree application unit comprises:
a third fully connected layer processing unit, configured to process the matching degree with a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluation device, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech evaluation method described above.
A readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the speech evaluation method described above.
It can be seen from the above technical solutions that the speech evaluation method provided in the embodiments of this application obtains a speech to be evaluated and an answer text serving as the evaluation standard, and can determine the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech and the text feature of the answer text. It can be understood that the alignment information reflects the alignment relation between the speech to be evaluated and the answer text, so an evaluation result of the speech relative to the answer text can be determined automatically according to the alignment information. Since no manual evaluation is needed, the method both avoids interference from human subjectivity in the evaluation result and reduces labor cost.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description are merely embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a speech evaluation method disclosed in an embodiment of this application;
Fig. 2 is a schematic flow diagram of speech evaluation performed by one neural network model;
Fig. 3 is a schematic flow diagram of speech evaluation performed by another neural network model;
Fig. 4 is a schematic structural diagram of a speech evaluation apparatus disclosed in an embodiment of this application;
Fig. 5 is a hardware structure block diagram of a speech evaluation device disclosed in an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
To solve the problems that existing spoken-language assessment is performed manually, which subjects the results to human interference and wastes labor cost, the inventors first proposed a solution: recognize the speech to be evaluated with a speech recognition model to obtain a recognized text, extract keywords from the answer text, compute the hit rate of the recognized text against the keywords, and determine the evaluation result of the speech from that hit rate; the higher the hit rate, the higher the evaluation score.
Further study showed, however, that because this solution must first recognize the speech to be evaluated as text, it depends on a speech recognition model. If a general-purpose speech recognition model is used to recognize speech from different examination scenarios, recognition accuracy can be low, making the evaluation result inaccurate. If a separate speech recognition model is trained for each examination scenario, training data must be manually labeled ahead of every examination, which consumes a large amount of labor cost.
On this basis, the inventors studied further and finally arrived at a solution that performs automated speech evaluation from the angle of actively discovering the alignment information between the speech to be evaluated and the answer text. The speech evaluation method of this application can be implemented on an electronic device with data processing capability, such as a smart terminal, a server, or a cloud platform.
The speech evaluation scheme of this application is applicable to spoken-language examination scenarios and to other scenarios that involve evaluating pronunciation level.
Next, the speech evaluation method of this application is described with reference to Fig. 1. The method may include:
Step S100: obtaining a speech to be evaluated, and an answer text serving as the evaluation standard.
Specifically, taking a spoken-language examination scenario as an example, the speech to be evaluated may be a recording of the spoken answer given by an examinee. Correspondingly, the answer text serving as the evaluation standard may be preset in this embodiment. Taking a read-aloud question as an example, the answer text serving as the evaluation standard may be the text extracted from the reading material. For other types of questions, the answer text serving as the evaluation standard may be the answer content corresponding to the question.
In this step, the speech to be evaluated may be captured by a recording device, which may include a microphone, such as a headset microphone.
Step S110: determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text.
The acoustic feature of the speech to be evaluated reflects its acoustic information, and the text feature of the answer text reflects its textual information. There are many possible types of acoustic features, and likewise many possible types of text features.
In this embodiment, the alignment information between the speech to be evaluated and the answer text is actively discovered based on the acoustic feature and the text feature; it reflects the alignment relation between the two. It can be understood that a speech that meets the evaluation standard should align with the answer text very completely, whereas a speech that does not meet the evaluation standard aligns with the answer text very incompletely.
Step S120: determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
As discussed above, the alignment information reflects the alignment relation between the speech to be evaluated and the answer text, which is related to whether, and how well, the speech meets the evaluation standard. Therefore, in this step the evaluation result of the speech relative to the answer text can be determined according to the alignment information.
The speech evaluation method provided in this embodiment can thus automatically determine the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information. Since no manual evaluation is needed, it both avoids interference from human subjectivity in the evaluation result and reduces labor cost.
Further, since this solution determines the evaluation result from the angle of actively discovering the alignment information between the speech to be evaluated and the answer text, it needs neither speech recognition of the speech to be evaluated nor computation of the keyword hit rate between the recognized text and the answer text. This avoids the inaccurate evaluation results caused by inaccurate speech recognition, so the evaluation is more accurate. The scheme is applicable to various speech evaluation scenarios, is more robust, and needs no extra manual scoring to build training data for each scenario, saving labor cost.
In another embodiment of this application, the processes of acquiring the acoustic feature of the speech to be evaluated and the text feature of the answer text mentioned in step S110 are described.
The acquisition of the acoustic feature of the speech to be evaluated is introduced first:
In one optional way, a spectral feature of the speech to be evaluated can be obtained directly and used as the acoustic feature of the speech.
The spectral feature may be a Mel-Frequency Cepstral Coefficient (MFCC) feature, a Perceptual Linear Prediction (PLP) feature, or the like.
For ease of presentation, the speech to be evaluated is defined to contain T frames.
When obtaining the spectral feature of the speech to be evaluated, the speech can first be divided into frames, pre-emphasis can be applied to the framed speech, and the spectral feature of each frame can then be extracted.
In another optional way, the spectral feature of the speech to be evaluated can be obtained, and then the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature can be obtained and used as the acoustic feature.
Here, the neural network model may take various structural forms, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), or a GRU (Gated Recurrent Unit).
Converting the spectral feature through the hidden layer of the neural network model applies a deep mapping to it; the resulting hidden-layer feature is at a deeper level than the spectral feature and better captures the acoustic characteristics of the speech to be evaluated, so the hidden-layer feature can be used as the acoustic feature.
The acoustic feature can be expressed in the following matrix form:

$$H = [h_1, h_2, \ldots, h_T]^{\top}$$

where $h_t$ ($t = 1, 2, \ldots, T$) denotes the acoustic feature of the t-th speech frame; the dimension of each frame's acoustic feature is fixed and defined as m.
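A minimal sketch of obtaining the hidden-layer acoustic features with a recurrent hidden layer (a GRU here; the application equally allows an RNN or LSTM). The dimensions (13-dimensional spectral input, m = 64) and the use of PyTorch are assumptions of this sketch.

import torch
import torch.nn as nn

gru = nn.GRU(input_size=13, hidden_size=64, batch_first=True)  # hidden layer
spectral = torch.randn(1, 200, 13)  # (batch, T frames, spectral feature dim)
H, _ = gru(spectral)                # H: (1, T, m) acoustic feature matrix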
Next, the acquisition of the text feature of the answer text is introduced:
In one optional way, a vector of the answer text can be obtained directly and used as the text feature of the answer text.
The vector of the answer text may be the combination of the word vectors of the text units that make up the answer text, or the result of some computation on those word vectors; for example, a neural network model may extract a hidden-layer feature from a text unit's word vector and use it as the vector result of that text unit. The representation of a text unit's word vector is not restricted; for example, one-hot or embedding methods may be used.
Further, the text units of the answer text can be chosen freely, e.g., word-level, phoneme-level, or root-level text units.
For ease of presentation, the answer text is defined to contain C text units.
A word vector can then be obtained for each text unit in the answer text, and the text feature of the answer text is finally determined from the word vectors of the C text units.
In another optional way, the vector of the answer text can be obtained, and then the hidden-layer feature produced by a hidden layer of a neural network model from the vector can be obtained and used as the text feature.
As above, the neural network model may take various structural forms, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), or a GRU (Gated Recurrent Unit).
Converting the vector of the answer text through the hidden layer of the neural network model applies a deep mapping to it; the resulting hidden-layer feature is at a deeper level than the answer text's vector and better captures the textual characteristics of the answer text, so the hidden-layer feature can be used as the text feature.
The text feature can be expressed in the following matrix form:

$$S = [s_1, s_2, \ldots, s_C]^{\top}$$

where $s_i$ ($i = 1, 2, \ldots, C$) denotes the text feature of the i-th text unit; the dimension of each text unit's text feature is fixed and defined as n.
In another embodiment of this application, the process in step S110 of determining the alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech and the text feature of the answer text is described.
In this embodiment, a frame-level attention matrix can be determined based on the acoustic feature of the speech to be evaluated and the text feature of the answer text. The frame-level attention matrix comprises, for each text unit in the answer text, the alignment probability of each frame of the speech to be evaluated to that text unit.
The frame-level attention matrix so determined can serve as the alignment information between the speech to be evaluated and the answer text.
Next, the above alignment probability is illustrated by formula:

$$e_{it} = a(h_t, s_i) = w^{\top}(W s_i + V h_t + b), \qquad a_{it} = \frac{\exp(e_{it})}{\sum_{t'=1}^{T} \exp(e_{it'})}$$

where $e_{it}$ denotes the alignment information between the text feature of the i-th text unit and the acoustic feature of the t-th speech frame; $a_{it}$ denotes, for the i-th text unit, the alignment probability of the t-th speech frame to the i-th text unit; $s_i$ denotes the text feature of the i-th text unit, an n-dimensional vector; $h_t$ denotes the acoustic feature of the t-th speech frame, an m-dimensional vector; and W, V, w, b are four parameters, where W may be a k×n matrix, V a k×m matrix, and w a k-dimensional vector, these three being used for feature mapping, while b is a bias, which may be a k-dimensional vector.
The frame-level attention matrix can then be expressed as $A = (a_{it})$, a C×T matrix whose i-th row contains the alignment probabilities of all T frames to the i-th text unit.
This embodiment provides an optional attention-based implementation in which the frame-level attention matrix is determined by a neural network model, which may specifically include:
processing the acoustic feature and the text feature with a first fully connected layer of the neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
The first fully connected layer of the neural network model can be expressed in the form of the formulas for $e_{it}$ and $a_{it}$ above, with the four parameters W, V, w, b as its parameters. By iteratively training the neural network model, these four parameters are updated iteration by iteration until they are fixed after training.
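A minimal sketch of this first fully connected layer: it computes e_it = w^T(W s_i + V h_t + b) for every (text unit, frame) pair and applies a softmax over frames to obtain the alignment probabilities a_it. The internal dimension k and the use of PyTorch are assumptions of this sketch.

import torch
import torch.nn as nn

class FrameLevelAttention(nn.Module):
    def __init__(self, n, m, k):
        super().__init__()
        self.W = nn.Linear(n, k, bias=False)  # W: maps text features
        self.V = nn.Linear(m, k, bias=True)   # V: maps acoustic features; its bias plays the role of b
        self.w = nn.Linear(k, 1, bias=False)  # w^T: final projection

    def forward(self, S, H):
        # S: (C, n) text features, H: (T, m) acoustic features
        e = self.w(self.W(S)[:, None, :] + self.V(H)[None, :, :]).squeeze(-1)
        return torch.softmax(e, dim=1)  # (C, T): row i holds a_it over all frames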
The frame-level attention matrix determined in this embodiment, used as alignment information, contains the alignment probability of each frame of the speech to be evaluated to each text unit in the answer text; that is, frame-level alignment information of the speech has been obtained. This frame-level attention matrix is related to the matching degree of the speech to be evaluated against the evaluation standard, so the evaluation result of the speech relative to the answer text can subsequently be determined from it.
Further, considering that speaking rates differ between users, the durations of the speech produced by different users expressing the same answer text may differ, and so may the number of frames. The frame-level attention matrix determined as alignment information by the above scheme then differs with the frame count, and the evaluation result based on it may differ as well. In reality, since the different users express the same answer text, the evaluation results ought to be identical. To address this problem, this embodiment provides another scheme for determining the alignment information.
On the basis of the frame-level attention matrix obtained above from the acoustic feature of the speech to be evaluated and the text feature of the answer text, this embodiment adds the following processing steps:
1. A word-level acoustic alignment matrix is determined based on the frame-level attention matrix and the acoustic feature. The word-level acoustic alignment matrix comprises the acoustic information aligned with each text unit in the answer text; the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights.
Specifically, the acoustic information aligned with the i-th text unit in the word-level acoustic alignment matrix is expressed as:

$$c_i = \sum_{t=1}^{T} a_{it} h_t$$

where $a_{it}$ and $h_t$ have the meanings introduced above.
The word-level acoustic alignment matrix can then be expressed as $[c_1, c_2, \ldots, c_C]^{\top}$, where $c_i$ ($i = 1, 2, \ldots, C$) denotes the acoustic alignment information of the i-th text unit, an m-dimensional vector.
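A minimal sketch of this step: with the frame-level attention matrix A (C×T) and the acoustic feature matrix H (T×m), each row c_i of the word-level acoustic alignment matrix is the attention-weighted sum of the frame acoustic features, so the whole matrix is a single matrix product. The shapes are assumptions of this sketch.

import torch

A = torch.softmax(torch.randn(20, 200), dim=1)  # (C, T) frame-level attention
H = torch.randn(200, 64)                        # (T, m) acoustic features
C_matrix = A @ H                                # (C, m): row i equals sum_t a_it * h_t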
2. A word-level attention matrix is determined based on the word-level acoustic alignment matrix and the text feature. The word-level attention matrix comprises, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
The word-level attention matrix determined in this step can serve as the alignment information between the speech to be evaluated and the answer text. Next, it is illustrated by formula:

$$K_{ij} = s_j^{\top}(U c_i), \qquad I_{ij} = \frac{\exp(K_{ij})}{\sum_{j'=1}^{C} \exp(K_{ij'})}$$

where $K_{ij}$ denotes the alignment information between the acoustic information of the i-th text unit and the text feature of the j-th text unit; $I_{ij}$ denotes the alignment probability of the acoustic information of the i-th text unit to the text feature of the j-th text unit; $s_j^{\top}$ is the transpose of $s_j$; $c_i$ denotes the acoustic alignment information of the i-th text unit; $s_j$ denotes the text feature of the j-th text unit; and U is a parameter that maps the word-level acoustic alignment feature to the same dimension as the text feature so that the dot product can be taken.
The word-level attention matrix can be expressed as $I = (I_{ij})$, a C×C matrix.
This embodiment provides an optional attention-based implementation in which the word-level attention matrix is determined by a neural network model, which may specifically include:
processing the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
The second fully connected layer of the neural network model can be expressed in the form of the formulas for $K_{ij}$ and $I_{ij}$ above, with U as its parameter. By iteratively training the neural network model, the parameter U is updated iteration by iteration until it is fixed when training ends.
The word-level attention matrix determined in this embodiment, used as alignment information, contains the alignment probability of the acoustic information of each text unit in the answer text to the text feature of every text unit. This word-level attention matrix is related to the matching degree of the speech to be evaluated against the evaluation standard, so the evaluation result of the speech relative to the answer text can subsequently be determined from it.
Further, since the word-level attention matrix is independent of the number of frames in the speech to be evaluated, i.e., independent of the user's speaking rate, and only accounts for the alignment relation between the text feature and the acoustic feature, it overcomes the aforementioned drawback that users with different speaking rates expressing the same answer text receive different evaluation results. In other words, using the word-level attention matrix of this embodiment as the alignment information yields higher evaluation accuracy.
In another embodiment of this application, the process in step S120 of determining, according to the alignment information, the evaluation result of the speech to be evaluated relative to the answer text is described.
It can be understood that the alignment information relied on in this embodiment may be the frame-level attention matrix above or the word-level attention matrix above. The process of determining the evaluation result according to the alignment information may then include:
1) Determining the matching degree between the speech to be evaluated and the answer text according to the alignment information.
Specifically, the alignment information has been determined above and may be the frame-level attention matrix or the word-level attention matrix. Based on this alignment information, the matching degree between the speech to be evaluated and the answer text can be determined.
In one optional way, a convolution unit of the neural network model can process the alignment information, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
The matrix size of the alignment information fed into the convolution unit of the neural network model can be fixed; it can be determined from the length of a typical answer text. For example, if a typical answer text has no more than 20 words, the matrix size can be 20×20. Missing elements can be filled with 0.
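A minimal sketch of this convolution unit: the alignment matrix is zero-padded (and, if needed, cropped) to the fixed 20×20 size and passed through a small CNN that emits a matching-degree vector. The channel count, kernel size, and pooling are assumptions of this sketch.

import torch
import torch.nn as nn

def pad_to_fixed(align, size=20):
    fixed = torch.zeros(size, size)
    r, c = min(align.shape[0], size), min(align.shape[1], size)
    fixed[:r, :c] = align[:r, :c]  # missing elements remain 0
    return fixed

conv_unit = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
)
align = torch.softmax(torch.randn(15, 18), dim=1)        # a word-level attention matrix
match_vec = conv_unit(pad_to_fixed(align)[None, None])   # (1, 128) matching-degree vector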
2) Determining, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
In one optional way, a third fully connected layer of the neural network model can process the matching degree, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
The third fully connected layer can be expressed as:

$$y = Fx + g$$

where x is the matching degree, y is the regressed evaluation result, which may be numeric, F is the feature mapping matrix, and g is a bias.
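Continuing the sketch above, the third fully connected layer y = Fx + g reduces to a single linear layer regressing the matching-degree vector to a score; the 128-dimensional input is an assumption carried over from the previous sketch.

import torch.nn as nn

third_fc = nn.Linear(in_features=128, out_features=1)  # weight is F, bias is g
score = third_fc(match_vec)  # (1, 1): the regressed evaluation score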
The evaluation result may be a specific regressed score, whose magnitude indicates the quality of the speech to be evaluated, i.e., the matching degree between the speech and the evaluation standard. Alternatively, the evaluation result may be the probability that the speech to be evaluated belongs to some category; several categories can be preset, with different categories indicating different matching degrees between the speech and the evaluation standard, i.e., different quality levels of the speech, for example three categories: excellent, good, poor.
It should be noted that the neural network model referred to in the above embodiments can be a single neural network model whose different hierarchical structures process the respective data. For example, several hidden layers of the neural network model can convert the spectral feature, several other hidden layers can convert the word vectors, a first fully connected layer can generate the frame-level attention matrix, a second fully connected layer can generate the word-level attention matrix, a convolution unit can generate the matching degree between the speech to be evaluated and the answer text, and a third fully connected layer can generate the evaluation result of the speech relative to the answer text. On this basis, speech training data labeled with manual evaluation results, together with answer texts, can be obtained in advance to train the neural network model, with the parameters of the different levels of the model updated iteratively by the back-propagation algorithm and fixed after training.
Taking the case where the evaluation result is an evaluation score as an example, the neural network model can be trained with an objective function built over data pairs, each pair consisting of samples whose manual scores differ by a certain amount, so that the model learns the differences between different scores. Consistent with the description below, the objective function can take the following form:

$$\mathcal{L} = \sum_{i} (y_i - z_i)^2 + \sum_{i} \big[(y_{i+1} - y_i) - (z_{i+1} - z_i)\big]^2$$

where $y_i$ and $y_{i+1}$ are the model-predicted scores of the i-th and (i+1)-th samples in the training data, and $z_i$ and $z_{i+1}$ are their manual scores.
The purpose of this objective function is to minimize the difference between the model-predicted score and the manual score, and to bring the difference between the predicted scores of two adjacent samples closer to the difference between their manual scores, so that the model learns the differences between different scores.
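A minimal sketch of this objective as reconstructed above: one term penalizes the gap between predicted and manual scores, another penalizes the gap between adjacent score differences; the equal weighting of the two terms is an assumption of this sketch.

import torch

def pairwise_objective(y, z):
    # y: model-predicted scores, z: manual scores, both 1-D tensors in data order
    point_term = ((y - z) ** 2).sum()
    diff_term = (((y[1:] - y[:-1]) - (z[1:] - z[:-1])) ** 2).sum()
    return point_term + diff_term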
Referring to Fig. 2 and Fig. 3, schematic flow diagrams of speech evaluation performed by neural network models of two different structures are illustrated.
In Fig. 2, the word-level attention matrix is used as the alignment information, and the evaluation result is determined based on this alignment information. In Fig. 3, the frame-level attention matrix is used as the alignment information, and the evaluation result is determined based on this alignment information.
As shown in Fig. 2, the dashed box is the internal flow of the neural network model. The acoustic feature extracted from the speech to be evaluated and the text feature extracted from the answer text serve as the model's inputs and each pass through an RNN hidden layer, which extracts a deep acoustic feature matrix and a deep text feature matrix respectively. These are fed into the first fully connected layer, which outputs the frame-level attention matrix; the dot product of the frame-level attention matrix and the deep acoustic feature matrix gives the word-level acoustic alignment matrix. The word-level acoustic alignment matrix and the deep text feature matrix are the inputs of the second fully connected layer, which outputs the word-level attention matrix. The word-level attention matrix is fed into a CNN convolution unit to obtain the processed matching-degree vector, which is input to the third fully connected layer, and the third fully connected layer regresses the evaluation score.
This neural network model can be trained by the back-propagation algorithm, iteratively updating the parameters of each hierarchical structure.
In Fig. 3, the dashed box is likewise the internal flow of the neural network model. Compared with Fig. 2, the neural network model illustrated in Fig. 3 lacks the second fully connected layer. In the corresponding flow, the frame-level attention matrix output by the first fully connected layer serves directly as the input of the CNN convolution unit, which outputs the matching-degree vector based on the frame-level attention matrix; the subsequent flow is the same. Compared with the flow of Fig. 2, Fig. 3 omits the step of obtaining the word-level attention matrix through the second fully connected layer.
Similarly, this neural network model can be trained by the back-propagation algorithm, iteratively updating the parameters of each hierarchical structure.
It should be further noted that the neural network model referred to in the above embodiments can also be multiple independent neural network models that cooperate to complete the entire speech evaluation process. For example, the neural network model that converts the spectral feature into the deep acoustic feature can be an independent model: a speech recognition model can serve as this independent neural network model, its hidden layer converts the spectral feature, and the resulting hidden-layer feature serves as the deep acoustic feature.
The speech evaluation apparatus provided in the embodiments of this application is described below; the speech evaluation apparatus described below and the speech evaluation method described above may be referred to each other correspondingly.
Referring to Fig. 4, Fig. 4 is a schematic structural diagram of a speech evaluation apparatus disclosed in an embodiment of this application. As shown in Fig. 4, the apparatus may include:
a data acquisition unit 11, configured to obtain a speech to be evaluated and an answer text serving as the evaluation standard;
an alignment information determination unit 12, configured to determine alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
an evaluation result determination unit 13, configured to determine, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Optionally, the apparatus of this application may further include an acoustic feature acquisition unit, configured to obtain the acoustic feature of the speech to be evaluated. Specifically, the acoustic feature acquisition unit may include:
a first acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated as the acoustic feature;
or,
a second acoustic feature acquisition subunit, configured to obtain a spectral feature of the speech to be evaluated;
a third acoustic feature acquisition subunit, configured to obtain, as the acoustic feature, the hidden-layer feature produced by a hidden layer of a neural network model from the spectral feature.
Optionally, the apparatus of this application may further include a text feature acquisition unit, configured to obtain the text feature of the answer text. Specifically, the text feature acquisition unit may include:
a first text feature acquisition subunit, configured to obtain a vector of the answer text as the text feature;
or,
a second text feature acquisition subunit, configured to obtain a vector of the answer text;
a third text feature acquisition subunit, configured to obtain, as the text feature, the hidden-layer feature produced by a hidden layer of a neural network model from the vector.
Optionally, the alignment information determination unit may include:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic feature of the speech to be evaluated and the text feature of the answer text, the frame-level attention matrix comprising, for each text unit in the answer text, the alignment probability of each speech frame in the speech to be evaluated to that text unit.
Optionally, the frame-level attention matrix determination unit may include:
a first fully connected layer processing unit, configured to process the acoustic feature and the text feature with a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic feature and the text feature to generate an internal-state representation of the frame-level attention matrix.
Optionally, the alignment information determination unit may further include:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic feature, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of the speech frames, using the alignment probabilities between the text unit and each speech frame as weights;
a word-level attention matrix determination unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text feature, the word-level attention matrix comprising, for the text feature of each text unit in the answer text, the alignment probability of the acoustic information of every text unit in the answer text to it.
Optionally, the word-level attention matrix determination unit may include:
a second fully connected layer processing unit, configured to process the word-level acoustic alignment matrix and the text feature with a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text feature to generate an internal-state representation of the word-level attention matrix.
Optionally, the evaluation result determination unit may include:
a matching degree determination unit, configured to determine a matching degree between the speech to be evaluated and the answer text according to the alignment information;
a matching degree application unit, configured to determine, according to the matching degree, the evaluation result of the speech to be evaluated relative to the answer text.
Optionally, the matching degree determination unit may include:
a convolution unit processing unit, configured to process the alignment information with a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information to generate an internal-state representation of the matching degree between the speech to be evaluated and the answer text.
Optionally, the matching degree application unit may include:
a third fully connected layer processing unit, configured to process the matching degree with a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree to generate an internal-state representation of the evaluation result of the speech to be evaluated relative to the answer text.
The speech evaluation apparatus provided in the embodiments of this application can be applied to a speech evaluation device, such as a PC terminal, a cloud platform, a server, or a server cluster. Optionally, Fig. 5 shows the hardware structure block diagram of the speech evaluation device. Referring to Fig. 5, the hardware structure of the speech evaluation device may include at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In this embodiment of this application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like.
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one magnetic disk memory.
The memory stores a program, and the processor may call the program stored in the memory. The program is used for:
obtaining a speech to be evaluated, and an answer text serving as the evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of this application further provides a readable storage medium, which may store a program suitable for execution by a processor. The program is used for:
obtaining a speech to be evaluated, and an answer text serving as the evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on an acoustic feature of the speech to be evaluated and a text feature of the answer text;
determining, according to the alignment information, an evaluation result of the speech to be evaluated relative to the answer text.
Optionally, for the refined and extended functions of the program, refer to the description above.
Finally, it should be noted that in this document, relational terms such as first and second are merely used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by "including a ..." does not preclude the existence of other identical elements in the process, method, article, or device that includes the element.
Each embodiment in this specification is described in a progressive manner, the highlights of each of the examples are with other
The difference of embodiment, the same or similar parts in each embodiment may refer to each other.
The foregoing description of the disclosed embodiments makes professional and technical personnel in the field can be realized or use the application.
Various modifications to these embodiments will be readily apparent to those skilled in the art, as defined herein
General Principle can be realized in other embodiments without departing from the spirit or scope of the application.Therefore, the application
It is not intended to be limited to the embodiments shown herein, and is to fit to and the principles and novel features disclosed herein phase one
The widest scope of cause.
Claims (16)
1. A speech evaluation method, characterized in that it comprises:
obtaining a speech to be evaluated, and an answer text serving as an evaluation standard;
determining alignment information between the speech to be evaluated and the answer text based on acoustic features of the speech to be evaluated and text features of the answer text;
determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
2. The method according to claim 1, wherein the acquisition process of the acoustic features of the speech to be evaluated comprises:
obtaining spectrum features of the speech to be evaluated as the acoustic features;
or,
obtaining spectrum features of the speech to be evaluated; and
obtaining, as the acoustic features, the hidden layer features output by a hidden layer of a neural network model after the spectrum features are converted by the model.
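A minimal sketch of both options in claim 2, for illustration only: torchaudio supplies the spectrum feature, and an LSTM stands in for the unspecified neural network model whose hidden layer supplies the converted feature (the mel configuration, layer sizes, and file path are assumptions).

```python
import torch
import torchaudio

# Load the speech to be evaluated ("response.wav" is a placeholder path).
waveform, sample_rate = torchaudio.load("response.wav")

# Option 1: use the spectrum feature itself as the acoustic feature.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
spectrum = mel(waveform)                                    # (channels, 80, n_frames)

# Option 2: pass the spectrum through a neural network model and take a
# hidden layer's per-frame output as the acoustic feature.
frames = spectrum.squeeze(0).transpose(0, 1).unsqueeze(0)   # (1, n_frames, 80)
encoder = torch.nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
hidden_feature, _ = encoder(frames)                         # (1, n_frames, 256)
```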
3. The method according to claim 1, wherein the acquisition process of the text features of the answer text comprises:
obtaining vectors of the answer text as the text features;
or,
obtaining vectors of the answer text; and
obtaining, as the text features, the hidden layer features output by a hidden layer of a neural network model after the vectors are converted by the model.
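The text-feature side of claim 3 can be sketched the same way, again as an illustration only; the toy vocabulary, embedding size, and LSTM encoder are assumptions.

```python
import torch

# Token ids of the answer text (a toy example; a real system would tokenize
# the text and use a trained embedding table).
answer_ids = torch.tensor([[12, 47, 3, 951]])               # (1, n_words)
embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=128)

# Option 1: the word vectors themselves serve as the text feature.
vectors = embedding(answer_ids)                             # (1, n_words, 128)

# Option 2: a hidden layer's output after the vectors pass through a model.
text_encoder = torch.nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
hidden_text_feature, _ = text_encoder(vectors)              # (1, n_words, 256)
```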
4. The method according to claim 1, wherein determining the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises:
determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, the frame-level attention matrix comprising, for any text unit in the answer text, the alignment probability of each frame of the speech to be evaluated to that text unit.
5. The method according to claim 4, wherein determining the frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises:
processing the acoustic features of the speech to be evaluated and the text features of the answer text using a first fully connected layer of a neural network model, the first fully connected layer being configured to receive and process the acoustic features and the text features so as to generate an internal state representation of the frame-level attention matrix.
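One plausible reading of claims 4 and 5, offered as an illustration only: a bilinear projection stands in for the unspecified fully connected layer, and a softmax over the frame axis makes each text unit's alignment probabilities sum to 1 (all tensor shapes are assumptions).

```python
import torch

n_frames, n_words, d = 200, 12, 256
acoustic = torch.randn(1, n_frames, d)   # per-frame acoustic features
text = torch.randn(1, n_words, d)        # per-word text features

# "First fully connected layer": here realized as a bilinear projection that
# produces unnormalized frame-to-word alignment scores.
first_fc = torch.nn.Linear(d, d, bias=False)
scores = torch.bmm(first_fc(acoustic), text.transpose(1, 2))  # (1, n_frames, n_words)

# Softmax over the frame axis: for each text unit, the alignment
# probabilities of all frames of speech sum to 1.
frame_level_attention = torch.softmax(scores, dim=1)
```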
6. The method according to claim 4, wherein determining the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text further comprises:
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, the acoustic information being the result of a weighted sum of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, the word-level attention matrix comprising, for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
7. The method according to claim 6, wherein determining the word-level attention matrix based on the word-level acoustic alignment matrix and the text features comprises:
processing the word-level acoustic alignment matrix and the text features using a second fully connected layer of the neural network model, the second fully connected layer being configured to receive and process the word-level acoustic alignment matrix and the text features so as to generate an internal state representation of the word-level attention matrix.
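Continuing the shapes from the previous sketch, claims 6 and 7 might look as follows, again for illustration only: the word-level acoustic alignment matrix is a probability-weighted sum of frame features, and the word-level attention matrix comes from another fully connected scoring step (the bilinear form is an assumption).

```python
import torch

n_frames, n_words, d = 200, 12, 256
acoustic = torch.randn(1, n_frames, d)
text = torch.randn(1, n_words, d)
frame_level_attention = torch.softmax(torch.randn(1, n_frames, n_words), dim=1)

# Word-level acoustic alignment matrix: for each text unit, the weighted sum
# of per-frame acoustic features, with the frame-to-unit alignment
# probabilities as the weights.
word_acoustic = torch.bmm(frame_level_attention.transpose(1, 2), acoustic)  # (1, n_words, d)

# "Second fully connected layer": scores each unit's aggregated acoustic
# information against every unit's text feature, then normalizes into the
# word-level attention matrix.
second_fc = torch.nn.Linear(d, d, bias=False)
word_scores = torch.bmm(second_fc(word_acoustic), text.transpose(1, 2))  # (1, n_words, n_words)
word_level_attention = torch.softmax(word_scores, dim=-1)
```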
8. The method according to any one of claims 1-7, wherein determining the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information comprises:
determining a matching degree between the speech to be evaluated and the answer text according to the alignment information;
determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
9. The method according to claim 8, wherein determining the matching degree between the speech to be evaluated and the answer text according to the alignment information comprises:
processing the alignment information using a convolution unit of the neural network model, the convolution unit being configured to receive and process the alignment information so as to generate an internal state representation of the matching degree between the speech to be evaluated and the answer text.
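Claim 9's convolution unit can be pictured, purely as an illustration, as a small CNN that treats the word-level attention matrix as a one-channel image and pools it into a fixed-size matching-degree representation; the channel count, kernel size, and pooling are assumptions.

```python
import torch

# A word-level attention matrix standing in for the alignment information
# (12 words on each axis; sizes are illustrative).
word_level_attention = torch.softmax(torch.randn(1, 12, 12), dim=-1)

# Convolution unit: treat the attention matrix as a one-channel image and
# pool it into a fixed-size internal state representing the matching degree.
conv_unit = torch.nn.Sequential(
    torch.nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d((4, 4)),
)
matching_degree = conv_unit(word_level_attention.unsqueeze(1)).flatten(1)  # (1, 128)
```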
10. The method according to claim 8, wherein determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree comprises:
processing the matching degree using a third fully connected layer of the neural network model, the third fully connected layer being configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
11. A speech evaluation device, characterized in that it comprises:
a data acquisition unit for obtaining a speech to be evaluated, and an answer text serving as an evaluation standard;
an alignment information determination unit for determining alignment information between the speech to be evaluated and the answer text based on acoustic features of the speech to be evaluated and text features of the answer text;
an evaluation result determination unit for determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
12. The device according to claim 11, wherein the alignment information determination unit comprises:
a frame-level attention matrix determination unit for determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, the frame-level attention matrix comprising, for any text unit in the answer text, the alignment probability of each frame of the speech to be evaluated to that text unit.
13. The device according to claim 12, wherein the alignment information determination unit further comprises:
a word-level acoustic alignment matrix determination unit for determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising the acoustic information aligned with each text unit in the answer text, the acoustic information being the result of a weighted sum of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
a word-level attention matrix determination unit for determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, the word-level attention matrix comprising, for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
14. The device according to any one of claims 11-13, wherein the evaluation result determination unit comprises:
a matching degree determination unit for determining a matching degree between the speech to be evaluated and the answer text according to the alignment information;
a matching degree applying unit for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
15. A speech evaluation equipment, characterized in that it comprises a memory and a processor;
the memory being configured to store a program;
the processor being configured to execute the program to implement each step of the speech evaluation method according to any one of claims 1-10.
16. A readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, each step of the speech evaluation method according to any one of claims 1-10 is implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811162964.0A CN109215632B (en) | 2018-09-30 | 2018-09-30 | Voice evaluation method, device and equipment and readable storage medium |
JP2018223934A JP6902010B2 (en) | 2018-09-30 | 2018-11-29 | Audio evaluation methods, devices, equipment and readable storage media |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811162964.0A CN109215632B (en) | 2018-09-30 | 2018-09-30 | Voice evaluation method, device and equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109215632A (en) | 2019-01-15 |
CN109215632B CN109215632B (en) | 2021-10-08 |
Family
ID=64982845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811162964.0A Active CN109215632B (en) | 2018-09-30 | 2018-09-30 | Voice evaluation method, device and equipment and readable storage medium |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6902010B2 (en) |
CN (1) | CN109215632B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113707178B (en) * | 2020-05-22 | 2024-02-06 | 苏州声通信息科技有限公司 | Audio evaluation method and device and non-transient storage medium |
CN111862957A (en) * | 2020-07-14 | 2020-10-30 | 杭州芯声智能科技有限公司 | Single track voice keyword low-power consumption real-time detection method |
CN112256841B (en) * | 2020-11-26 | 2024-05-07 | 支付宝(杭州)信息技术有限公司 | Text matching and countermeasure text recognition method, device and equipment |
CN112562724B (en) * | 2020-11-30 | 2024-05-17 | 携程计算机技术(上海)有限公司 | Speech quality assessment model, training assessment method, training assessment system, training assessment equipment and medium |
CN113379234B (en) * | 2021-06-08 | 2024-06-18 | 北京猿力未来科技有限公司 | Evaluation result generation method and device |
CN113707148B (en) * | 2021-08-05 | 2024-04-19 | 中移(杭州)信息技术有限公司 | Method, device, equipment and medium for determining speech recognition accuracy |
CN114155831A (en) * | 2021-12-06 | 2022-03-08 | 科大讯飞股份有限公司 | Voice evaluation method, related equipment and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104347071A (en) * | 2013-08-02 | 2015-02-11 | 安徽科大讯飞信息科技股份有限公司 | Method and system for generating oral test reference answer |
CN104361895A (en) * | 2014-12-04 | 2015-02-18 | 上海流利说信息技术有限公司 | Voice quality evaluation equipment, method and system |
CN104810017A (en) * | 2015-04-08 | 2015-07-29 | 广东外语外贸大学 | Semantic analysis-based oral language evaluating method and system |
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | 腾讯科技(深圳)有限公司 | A kind of voice keyword recognition method, device, terminal and server |
CN107818795A (en) * | 2017-11-15 | 2018-03-20 | 苏州驰声信息科技有限公司 | The assessment method and device of a kind of Oral English Practice |
CN109192224A (en) * | 2018-09-14 | 2019-01-11 | 科大讯飞股份有限公司 | A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05333896A (en) * | 1992-06-01 | 1993-12-17 | Nec Corp | Conversational sentence recognition system |
US8231389B1 (en) * | 2004-04-29 | 2012-07-31 | Wireless Generation, Inc. | Real-time observation assessment with phoneme segment capturing and scoring |
JP2008052178A (en) * | 2006-08-28 | 2008-03-06 | Toyota Motor Corp | Voice recognition device and voice recognition method |
JP5834291B2 (en) * | 2011-07-13 | 2015-12-16 | ハイウエア株式会社 | Voice recognition device, automatic response method, and automatic response program |
JP6217304B2 (en) * | 2013-10-17 | 2017-10-25 | ヤマハ株式会社 | Singing evaluation device and program |
JP6674706B2 (en) * | 2016-09-14 | 2020-04-01 | Kddi株式会社 | Program, apparatus and method for automatically scoring from dictation speech of learner |
CN108154735A (en) * | 2016-12-06 | 2018-06-12 | 爱天教育科技(北京)有限公司 | Oral English Practice assessment method and device |
CN106847260B (en) * | 2016-12-20 | 2020-02-21 | 山东山大鸥玛软件股份有限公司 | Automatic English spoken language scoring method based on feature fusion |
US20190362703A1 (en) * | 2017-02-15 | 2019-11-28 | Nippon Telegraph And Telephone Corporation | Word vectorization model learning device, word vectorization device, speech synthesis device, method thereof, and program |
- 2018-09-30: CN application CN201811162964.0A filed, granted as CN109215632B (active)
- 2018-11-29: JP application JP2018223934A filed, granted as JP6902010B2 (active)
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027794B (en) * | 2019-03-29 | 2023-09-26 | 广东小天才科技有限公司 | Correction method and learning equipment for dictation operation |
CN111027794A (en) * | 2019-03-29 | 2020-04-17 | 广东小天才科技有限公司 | Dictation operation correcting method and learning equipment |
CN109979482B (en) * | 2019-05-21 | 2021-12-07 | 科大讯飞股份有限公司 | Audio evaluation method and device |
CN109979482A (en) * | 2019-05-21 | 2019-07-05 | 科大讯飞股份有限公司 | A kind of evaluating method and device for audio |
CN110223689A (en) * | 2019-06-10 | 2019-09-10 | 秒针信息技术有限公司 | The determination method and device of the optimization ability of voice messaging, storage medium |
CN110600006A (en) * | 2019-10-29 | 2019-12-20 | 福建天晴数码有限公司 | Speech recognition evaluation method and system |
CN110600006B (en) * | 2019-10-29 | 2022-02-11 | 福建天晴数码有限公司 | Speech recognition evaluation method and system |
CN110782917B (en) * | 2019-11-01 | 2022-07-12 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN110782917A (en) * | 2019-11-01 | 2020-02-11 | 广州美读信息技术有限公司 | Poetry reciting style classification method and system |
CN111128120A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Text-to-speech method and device |
CN111652165A (en) * | 2020-06-08 | 2020-09-11 | 北京世纪好未来教育科技有限公司 | Mouth shape evaluating method, mouth shape evaluating equipment and computer storage medium |
CN112257407A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Method and device for aligning text in audio, electronic equipment and readable storage medium |
CN112257407B (en) * | 2020-10-20 | 2024-05-14 | 网易(杭州)网络有限公司 | Text alignment method and device in audio, electronic equipment and readable storage medium |
CN112837673A (en) * | 2020-12-31 | 2021-05-25 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, computer device and medium based on artificial intelligence |
CN112837673B (en) * | 2020-12-31 | 2024-05-10 | 平安科技(深圳)有限公司 | Speech synthesis method, device, computer equipment and medium based on artificial intelligence |
CN113506585A (en) * | 2021-09-09 | 2021-10-15 | 深圳市一号互联科技有限公司 | Quality evaluation method and system for voice call |
Also Published As
Publication number | Publication date |
---|---|
JP6902010B2 (en) | 2021-07-14 |
JP2020056982A (en) | 2020-04-09 |
CN109215632B (en) | 2021-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109215632A (en) | A kind of speech evaluating method, device, equipment and readable storage medium storing program for executing | |
CN110782921B (en) | Voice evaluation method and device, storage medium and electronic device | |
CN112508334B (en) | Personalized paper grouping method and system integrating cognition characteristics and test question text information | |
CN107133303A (en) | Method and apparatus for output information | |
CN111833853A (en) | Voice processing method and device, electronic equipment and computer readable storage medium | |
CN114254208A (en) | Identification method of weak knowledge points and planning method and device of learning path | |
CN109273023A (en) | A kind of data evaluating method, device, equipment and readable storage medium storing program for executing | |
CN103594087A (en) | Method and system for improving oral evaluation performance | |
US10283142B1 (en) | Processor-implemented systems and methods for determining sound quality | |
Prabhudesai et al. | Automatic short answer grading using Siamese bidirectional LSTM based regression | |
CN111460101A (en) | Knowledge point type identification method and device and processor | |
CN113314100A (en) | Method, device, equipment and storage medium for evaluating and displaying results of spoken language test | |
CN112015862A (en) | User abnormal comment detection method and system based on hierarchical multichannel attention | |
CN110956142A (en) | Intelligent interactive training system | |
CN114429212A (en) | Intelligent learning knowledge ability tracking method, electronic device and storage medium | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
WO2004053834A2 (en) | Systems and methods for dynamically analyzing temporality in speech | |
KR20210071713A (en) | Speech Skill Feedback System | |
CN108876677B (en) | Teaching effect evaluation method based on big data and artificial intelligence and robot system | |
CN117079504B (en) | Wrong question data management method of big data accurate teaching and reading system | |
CN117151548A (en) | Music online learning method and system based on hand motion judgment | |
CN112116181B (en) | Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device | |
CN107578785A (en) | The continuous emotional feature analysis evaluation method of music based on Gamma distributional analysis | |
Cheng et al. | Towards accurate recognition for children's oral reading fluency | |
CN109979482A (en) | A kind of evaluating method and device for audio |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||