CN109215632B - Voice evaluation method, device and equipment and readable storage medium - Google Patents


Info

Publication number
CN109215632B
CN109215632B CN201811162964.0A
Authority
CN
China
Prior art keywords
text
speech
evaluated
acoustic
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811162964.0A
Other languages
Chinese (zh)
Other versions
CN109215632A (en)
Inventor
金海
吴奎
胡阳
朱群
竺博
魏思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201811162964.0A priority Critical patent/CN109215632B/en
Priority to JP2018223934A priority patent/JP6902010B2/en
Publication of CN109215632A publication Critical patent/CN109215632A/en
Application granted granted Critical
Publication of CN109215632B publication Critical patent/CN109215632B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice evaluation method, device, equipment and readable storage medium. A speech to be evaluated and an answer text serving as the evaluation standard are obtained. Based on the acoustic features of the speech to be evaluated and the text features of the answer text, alignment information between the speech to be evaluated and the answer text can be determined; this alignment information reflects the alignment relationship between the two. An evaluation result of the speech to be evaluated relative to the answer text can then be determined automatically according to the alignment information. Because the evaluation is not carried out manually, interference from human subjectivity in the evaluation result is avoided and labor cost is reduced.

Description

Voice evaluation method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, device, and readable storage medium.
Background
With the continuous deepening of education reform, spoken language examinations are being carried out all over the country. Compared with written examinations, spoken language examinations can better assess an examinee's spoken language level.
In most existing spoken language examinations, professional teachers evaluate examinees' answers against the correct answer information corresponding to each question. This manual evaluation is easily influenced by human subjectivity, so the evaluation result is subject to human interference, and it consumes a large amount of labor cost.
Disclosure of Invention
In view of the above, the present application provides a speech evaluation method, apparatus, device and readable storage medium, to overcome the disadvantages of the existing manual spoken-test evaluation method.
In order to achieve the above object, the following solutions are proposed:
a speech evaluation method comprises the following steps:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Preferably, the process of obtaining the acoustic feature of the speech to be evaluated includes:
acquiring the frequency spectrum characteristic of the speech to be evaluated as an acoustic characteristic;
or, alternatively,
acquiring the frequency spectrum characteristics of the speech to be evaluated;
and acquiring hidden layer characteristics of the hidden layer of the neural network model after the spectrum characteristics are converted, and taking the hidden layer characteristics as acoustic characteristics.
Preferably, the process of acquiring text features of the answer text includes:
obtaining a vector of the answer text as a text feature;
or, alternatively,
obtaining a vector of the answer text;
and acquiring hidden layer characteristics of the hidden layer of the neural network model after vector conversion as text characteristics.
Preferably, the determining the alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text includes:
determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, wherein the frame-level attention matrix comprises: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
Preferably, the determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text includes:
processing the acoustic features of the speech to be evaluated and the text features of the answer text with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
Preferably, the determining the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text further includes:
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of weighted summation of acoustic features of each frame of voice by taking the alignment probability of the text unit and each frame of voice as a weight;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, wherein the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
Preferably, said determining a word-level attention matrix based on said word-level acoustic alignment matrix and said text features comprises:
processing the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
Preferably, the determining, according to the alignment information, an evaluation result of the speech to be evaluated with respect to the answer text includes:
according to the alignment information, determining the matching degree of the speech to be evaluated and the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
Preferably, the determining the matching degree of the speech to be evaluated and the answer text according to the alignment information includes:
and processing the alignment information by utilizing a convolution unit of a neural network model, wherein the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
Preferably, the determining, according to the matching degree, an evaluation result of the speech to be evaluated with respect to the answer text includes:
and processing the matching degree by utilizing a third fully-connected layer of a neural network model, wherein the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluation apparatus comprising:
the data acquisition unit is used for acquiring the speech to be evaluated and answer text serving as an evaluation standard;
the alignment information determining unit is used for determining the alignment information of the speech to be evaluated and the answer text based on the acoustic characteristics of the speech to be evaluated and the text characteristics of the answer text;
and the evaluation result determining unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Preferably, the acoustic feature acquisition unit is further included, and includes:
the first acoustic feature obtaining subunit is configured to obtain a frequency spectrum feature of the speech to be evaluated, as an acoustic feature;
or, alternatively,
the second acoustic feature obtaining subunit is used for obtaining the frequency spectrum feature of the speech to be evaluated;
and the third acoustic feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after the spectrum features are converted, and the hidden layer features are used as the acoustic features.
Preferably, the method further comprises the following steps: a text feature acquisition unit comprising:
the first text feature obtaining subunit is used for obtaining a vector of the answer text as a text feature;
or, alternatively,
the second text feature obtaining subunit is used for obtaining a vector of the answer text;
and the third text feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after vector conversion as text features.
Preferably, the alignment information determining unit includes:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, where the frame-level attention matrix includes: and for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
Preferably, the frame-level attention matrix determining unit includes:
a first fully-connected layer processing unit to process the acoustic features and the text features with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
Preferably, the alignment information determining unit further includes:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, where the word-level acoustic alignment matrix includes: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of weighted summation of acoustic features of each frame of voice by taking the alignment probability of the text unit and each frame of voice as a weight;
a word-level attention matrix determining unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, where the word-level attention matrix includes: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
Preferably, the word-level attention matrix determining unit includes:
a second fully-connected layer processing unit to process the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
Preferably, the evaluation result determination unit includes:
the matching degree determining unit is used for determining the matching degree of the speech to be evaluated and the answer text according to the alignment information;
and the matching degree application unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
Preferably, the matching degree determination unit includes:
and the convolution unit processing unit is used for processing the alignment information by utilizing a convolution unit of a neural network model, and the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
Preferably, the matching degree applying unit includes:
and the third fully-connected layer processing unit is used for processing the matching degree by utilizing a third fully-connected layer of a neural network model, and the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluating apparatus includes a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to realize the steps of the voice evaluation method.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech evaluation method as described above.
According to the above technical scheme, the speech evaluation method provided by the embodiments of the present application obtains the speech to be evaluated and the answer text serving as the evaluation standard, determines the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text, and then automatically determines an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information. Because the evaluation is not carried out manually, interference from human subjectivity in the evaluation result is avoided, and labor cost is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a speech evaluation method disclosed in the embodiments of the present application;
FIG. 2 is a schematic flow chart illustrating speech evaluation by a neural network model;
FIG. 3 is a schematic flow chart illustrating speech evaluation by another neural network model;
FIG. 4 is a schematic structural diagram of a speech evaluation device disclosed in the embodiments of the present application;
fig. 5 is a block diagram of a hardware structure of a speech evaluation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To solve the problems that existing spoken-language evaluation depends on manual work, so that the evaluation result is subject to human interference and labor cost is wasted, the inventors of the present application first proposed a solution: recognize the speech to be evaluated with a speech recognition model to obtain a recognized text, extract keywords from the answer text, then calculate the hit rate of the recognized text against the keywords, and determine the evaluation result of the speech to be evaluated according to the hit rate; the higher the hit rate, the higher the evaluation score.
However, further research shows that this solution needs to recognize the speech to be evaluated as text, which requires a speech recognition model. If a universal speech recognition model is used to recognize speech to be evaluated in different test scenarios, the recognition accuracy is low, and the evaluation result is therefore inaccurate. If separate speech recognition models are trained for different examination scenarios, manual scoring of training data must be arranged in advance for each examination, which consumes a large amount of labor cost.
On the basis, the inventor of the present application further studies, and finally realizes automatic speech evaluation from the perspective of actively searching for the alignment information of the speech to be evaluated and the answer text. The voice evaluation method can be realized based on electronic equipment with data processing capacity, such as an intelligent terminal, a server, a cloud platform and the like.
The voice evaluation scheme can be suitable for the evaluation scene of the spoken language test and other scenes related to evaluation of the pronunciation level.
Next, the speech evaluation method of the present application is described with reference to fig. 1, and the method may include:
and S100, obtaining the voice to be evaluated and answer text serving as an evaluation standard.
Specifically, taking a spoken language test scenario as an example, the speech to be evaluated may be a spoken language answer recording given by an examinee. Correspondingly, in this embodiment, an answer text as an evaluation criterion may be preset. Taking the material reading spoken test questions as an example, the answer text as the evaluation criterion may be text information extracted from the reading material. In addition, for spoken language examinations of other types of questions, the answer text as the evaluation criterion may be the answer content corresponding to the question.
In this step, the obtaining mode of the speech to be evaluated may be receiving through a recording device, and the recording device may include a microphone, such as a head-mounted microphone.
Step S110, based on the acoustic characteristics of the speech to be evaluated and the text characteristics of the answer text, determining the alignment information of the speech to be evaluated and the answer text.
The acoustic characteristics of the speech to be evaluated reflect the acoustic information of the speech to be evaluated. The text features of the answer text reflect the text information of the answer text. The type of acoustic features may be varied, and similarly, the type of text features may be varied.
In this embodiment, based on the acoustic features and the text features, the alignment information of the speech to be evaluated and the answer text is actively searched, and the alignment information reflects the alignment relationship between the speech to be evaluated and the answer text. It can be understood that the integrity of the alignment between the speech to be evaluated and the answer text is high for the speech to be evaluated meeting the evaluation criterion, and the integrity of the alignment between the speech to be evaluated and the answer text is low for the speech to be evaluated not meeting the evaluation criterion.
And step S120, determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
According to the above discussion, the alignment information reflects the alignment relationship between the speech to be evaluated and the answer text, and is related to whether the speech to be evaluated meets the evaluation standard or not and the degree of the speech to be evaluated meets the evaluation standard, so that in this step, the evaluation result of the speech to be evaluated relative to the answer text can be determined according to the alignment information.
The speech evaluation method provided by the embodiment of the application can automatically determine the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information. Because the evaluation is not carried out manually, the interference of subjective influence of people on the evaluation result is avoided, and the consumption of labor cost is reduced.
Furthermore, the evaluation result is determined from the perspective of actively searching for the alignment information between the speech to be evaluated and the answer text. There is no need to run speech recognition on the speech to be evaluated and calculate the hit rate of the recognized text against the keywords of the answer text, so the problem of inaccurate evaluation caused by inaccurate speech recognition is avoided and the evaluation result is more accurate. The scheme is therefore applicable to various speech evaluation scenarios with better robustness, and no additional manpower is needed to score training data for different scenarios, which saves labor cost.
In another embodiment of the present application, the process of obtaining the acoustic features of the speech to be evaluated and the text features of the answer text mentioned in step S110 is described.
Firstly, introducing an acquisition process of acoustic features of speech to be evaluated:
In an optional mode, the spectral features of the speech to be evaluated can be directly obtained and used as the acoustic features of the speech to be evaluated.
The spectral features may include Mel-frequency cepstral coefficient (MFCC) features or perceptual linear prediction (PLP) features.
For convenience of description, the speech to be evaluated is defined to include T frames.
When obtaining the spectrum feature of the speech to be evaluated, the speech to be evaluated may be subjected to framing processing, and the framed speech to be evaluated may be subjected to pre-emphasis, so as to extract the spectrum feature of each frame of speech.
Another optional mode may be to obtain a spectral feature of the speech to be evaluated, and further, to obtain a hidden layer feature of the neural network model after converting the spectral feature, as an acoustic feature.
Here, the neural network model may take various structural forms, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, a GRU (Gated Recurrent Unit), and the like.
Converting the spectral features through the hidden layer of the neural network model performs a deep mapping on them; the resulting hidden-layer features are at a deeper level than the spectral features and better reflect the acoustic characteristics of the speech to be evaluated, so the hidden-layer features can be used as the acoustic features.
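As an illustration of this conversion, the following is a minimal PyTorch sketch, not the patent's reference implementation: the class name AcousticEncoder, the choice of a GRU hidden layer, and all dimensions (39-dimensional spectral features, 128-dimensional hidden features) are assumptions.

```python
import torch
import torch.nn as nn


class AcousticEncoder(nn.Module):
    """Converts frame-level spectral features into deep acoustic features.

    A minimal sketch: a GRU hidden layer maps the spectral features, and its
    per-frame outputs are taken as the acoustic features h_1..h_T.
    """

    def __init__(self, spec_dim=39, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=spec_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, spec_feats):
        # spec_feats: (batch, T, spec_dim) spectral features (e.g. MFCC/PLP) per frame
        h, _ = self.rnn(spec_feats)   # (batch, T, hidden_dim): one m-dimensional h_t per frame
        return h


# usage sketch: 200 frames of 39-dimensional spectral features (all sizes are assumptions)
spec = torch.randn(1, 200, 39)
H = AcousticEncoder()(spec)           # (1, 200, 128)
```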
The acoustic features can be represented in the form of a matrix as follows:
$$H = [h_1, h_2, \ldots, h_T]$$
where h_t (t = 1, 2, …, T) represents the acoustic features of the t-th frame of speech; the dimension of the acoustic features of each frame remains unchanged and is defined as m.
Next, the process of obtaining the text features of the answer text is introduced:
in an alternative mode, a vector of the answer text may be directly obtained and used as a text feature of the answer text.
The vector of the answer text may be a combination of word vectors of text units constituting the answer text, or a vector result obtained by subjecting the word vectors of the text units to a certain operation. For example, hidden layer features are extracted for word vectors of text units using a neural network model as vector results for the text units. The word vector of the text unit may be represented by a one-hot method or an embedding method, for example.
Further, the text unit of the answer text may be freely set, such as using a word-level, phoneme-level or root-level text unit.
For convenience of presentation, the answer text is defined to contain C text units.
Then, a word vector of each text unit in the answer text can be obtained, and finally, the text features of the answer text are determined according to the word vectors of the C text units.
Another optional mode may be to obtain a vector of the answer text, and further obtain hidden layer features of the hidden layer of the neural network model after converting the vector as text features.
As above, the Neural Network model may adopt various structural forms, such as RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and the like.
The hidden layer features obtained are deeper than the vector level of the answer text and can reflect the text characteristics of the answer text better, so that the hidden layer features can be used as text features.
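As with the acoustic side, a hedged PyTorch sketch of extracting deep text features from the answer text is given below; the class name TextEncoder, the embedding-plus-GRU structure, and all sizes are assumptions used only for illustration.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Maps the C text units of the answer text to deep text features s_1..s_C.

    A minimal sketch: an embedding layer produces word vectors and a GRU hidden
    layer converts them; its per-unit outputs are taken as the text features.
    """

    def __init__(self, vocab_size=10000, emb_dim=64, hidden_dim=96):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(input_size=emb_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, C) indices of the answer-text units
        emb = self.embedding(token_ids)   # (batch, C, emb_dim) word vectors
        s, _ = self.rnn(emb)              # (batch, C, hidden_dim): one n-dimensional s_i per unit
        return s


# usage sketch: an answer text of 20 units (assumed)
ids = torch.randint(0, 10000, (1, 20))
S = TextEncoder()(ids)                    # (1, 20, 96)
```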
Text features can be represented in the form of a matrix as follows:
$$S = [s_1, s_2, \ldots, s_C]$$
where s_i (i = 1, 2, …, C) represents the text feature of the i-th text unit; the dimension of the text feature of each text unit remains unchanged and is defined as n.
In another embodiment of the present application, a process of determining alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text in step S110 is described.
In this embodiment, a frame-level attention matrix may be determined based on the acoustic features of the speech to be evaluated and the text features of the answer text, where the frame-level attention matrix includes: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
The determined frame-level attention matrix can be used as the alignment information between the speech to be evaluated and the answer text. Next, the above alignment probability is described by the following formulas:
$$e_{it} = a(h_t, s_i) = w^{\top}(W s_i + V h_t + b)$$
$$a_{it} = \frac{\exp(e_{it})}{\sum_{t'=1}^{T} \exp(e_{it'})}$$
where e_it represents the alignment information between the text feature of the i-th text unit and the acoustic feature of the t-th frame of speech; a_it represents the alignment probability of the t-th frame of speech to the i-th text unit; s_i, the text feature of the i-th text unit, is an n-dimensional vector; h_t, the acoustic feature of the t-th frame of speech, is an m-dimensional vector; W, V, w and b are four parameters, where W may be a k × n matrix, V a k × m matrix, and w a k-dimensional vector (these three parameters are used for feature mapping), and b is a bias, which may be a k-dimensional vector.
The above frame-level attention matrix can be expressed in the form:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1T} \\ a_{21} & a_{22} & \cdots & a_{2T} \\ \vdots & \vdots & & \vdots \\ a_{C1} & a_{C2} & \cdots & a_{CT} \end{bmatrix}$$
i.e., a C × T matrix whose (i, t) entry is a_it.
in this embodiment, an optional implementation of determining a frame-level attention matrix through a neural network model based on an attention mechanism is provided, which may specifically include:
processing the acoustic features and the textual features with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the textual features to generate an internal state representation of a frame-level attention matrix.
The first fully-connected layer of the neural network model can be represented in the form of the e_it and a_it formulas above, with the four parameters W, V, w and b as its parameters. These four parameters can be updated iteratively through the iterative training of the neural network model until they are fixed once model training is finished.
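A hedged PyTorch sketch of such a first fully-connected layer is shown below; it follows the e_it and a_it formulas above, but the mapping dimension k, all feature sizes, and the class name FrameLevelAttention are assumptions.

```python
import torch
import torch.nn as nn


class FrameLevelAttention(nn.Module):
    """First fully-connected layer: frame-level attention matrix A of size C x T.

    Sketch of e_it = w^T (W s_i + V h_t + b) followed by a softmax over the T
    frames; the mapping dimension k and all sizes are assumptions.
    """

    def __init__(self, n_text=96, m_acoustic=128, k=64):
        super().__init__()
        self.W = nn.Linear(n_text, k, bias=False)     # W: k x n
        self.V = nn.Linear(m_acoustic, k, bias=True)  # V: k x m, with bias b (k-dimensional)
        self.w = nn.Linear(k, 1, bias=False)          # w: k-dimensional vector

    def forward(self, S, H):
        # S: (batch, C, n) text features; H: (batch, T, m) acoustic features
        Ws = self.W(S).unsqueeze(2)        # (batch, C, 1, k)
        Vh = self.V(H).unsqueeze(1)        # (batch, 1, T, k), bias b added here
        e = self.w(Ws + Vh).squeeze(-1)    # (batch, C, T): e_it
        return torch.softmax(e, dim=-1)    # a_it: alignment probability over frames


# usage sketch with the assumed shapes from the encoder sketches above
S = torch.randn(1, 20, 96)
H = torch.randn(1, 200, 128)
A = FrameLevelAttention()(S, H)            # (1, 20, 200)
```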
The frame-level attention matrix determined in this embodiment as the alignment information includes the alignment probability of each frame of speech in the speech to be evaluated to any text unit in the answer text, that is, frame-level alignment information of the speech to be evaluated is obtained. The frame-level attention matrix is related to the degree to which the speech to be evaluated conforms to the evaluation standard, so the evaluation result of the speech to be evaluated with respect to the answer text can subsequently be determined based on it.
Further, because different users speak at different rates, the speech they produce when expressing the same answer text may differ in duration, and therefore in the number of frames it contains. With the above scheme, different frame counts yield different frame-level attention matrices as the alignment information, and thus different evaluation results determined from them. In practice, since the different users express the same answer text, the evaluation results should be the same. To address this problem, the present embodiment provides another scheme for determining alignment information.
On the basis of obtaining the frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text introduced in the above embodiment, the following processing steps are further added in the embodiment:
1. Determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising: the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight.
Specifically, the acoustic information aligned with the ith text unit in the word-level acoustic alignment matrix is expressed as follows:
$$c_i = \sum_{t=1}^{T} a_{it} h_t$$
where a_it and h_t have the meanings given above.
The above word-level acoustic alignment matrix may be represented as:
$$[c_1, c_2, \ldots, c_C]$$
where c_i (i = 1, 2, …, C) represents the acoustic alignment information of the i-th text unit, and each c_i is m-dimensional.
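In tensor form this weighted sum is a single batched matrix product; a small self-contained sketch, using the shapes assumed in the earlier snippets, is:

```python
import torch

# assumed shapes from the sketches above
A = torch.softmax(torch.randn(1, 20, 200), dim=-1)   # frame-level attention matrix (batch, C, T)
H = torch.randn(1, 200, 128)                          # acoustic features (batch, T, m)

# c_i = sum_t a_it * h_t: an attention-weighted sum of frame features per text unit
C_align = torch.bmm(A, H)                             # (batch, C, m) word-level acoustic alignment matrix
```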
2. Determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, where the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
The word-level attention matrix determined in this step can be used as the alignment information between the speech to be evaluated and the answer text. Next, the word-level attention matrix is illustrated by the following formulas:
$$K_{ij} = s_j^{\top} U c_i$$
$$I_{ij} = \frac{\exp(K_{ij})}{\sum_{i'=1}^{C} \exp(K_{i'j})}$$
where K_ij represents the alignment information between the acoustic alignment information of the i-th text unit and the text feature of the j-th text unit; I_ij represents the alignment probability of the acoustic information of the i-th text unit to the text feature of the j-th text unit; s_j^T is the transpose of s_j; c_i represents the acoustic alignment information of the i-th text unit; and U is a parameter used to map the word-level acoustic alignment features to the same dimension as the text features so that the dot-product operation can be performed.
The word-level attention matrix may be represented in the form:
$$I = \begin{bmatrix} I_{11} & I_{12} & \cdots & I_{1C} \\ I_{21} & I_{22} & \cdots & I_{2C} \\ \vdots & \vdots & & \vdots \\ I_{C1} & I_{C2} & \cdots & I_{CC} \end{bmatrix}$$
i.e., a C × C matrix whose (i, j) entry is I_ij.
in this embodiment, an optional implementation of determining a word-level attention matrix through a neural network model based on an attention mechanism is provided, which may specifically include:
processing the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
The second fully-connected layer of the neural network model can be represented in the form of the K_ij and I_ij formulas above, with U as its parameter. The parameter U can be updated iteratively through the iterative training of the neural network model until it is fixed once model training is finished.
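A hedged PyTorch sketch of such a second fully-connected layer follows; it implements K_ij = s_j^T U c_i with a softmax, where the softmax axis, the class name WordLevelAttention, and all sizes are assumptions.

```python
import torch
import torch.nn as nn


class WordLevelAttention(nn.Module):
    """Second fully-connected layer: word-level attention matrix of size C x C.

    Sketch of K_ij = s_j^T U c_i followed by a softmax turning K into alignment
    probabilities; the softmax axis and all sizes are assumptions.
    """

    def __init__(self, m_acoustic=128, n_text=96):
        super().__init__()
        self.U = nn.Linear(m_acoustic, n_text, bias=False)   # U maps c_i into the text-feature space

    def forward(self, C_align, S):
        # C_align: (batch, C, m) word-level acoustic alignment; S: (batch, C, n) text features
        Uc = self.U(C_align)                       # (batch, C, n)
        K = torch.bmm(Uc, S.transpose(1, 2))       # K[:, i, j] = (U c_i) . s_j
        return torch.softmax(K, dim=1)             # I_ij: probability over acoustic units i for each j


# usage sketch with the shapes assumed above
C_align = torch.randn(1, 20, 128)
S = torch.randn(1, 20, 96)
I = WordLevelAttention()(C_align, S)               # (1, 20, 20)
```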
The word-level attention matrix determined in this embodiment as the alignment information includes, for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit, that is, word-level alignment information is obtained. The word-level attention matrix is related to the degree to which the speech to be evaluated conforms to the evaluation criterion, so the evaluation result of the speech to be evaluated with respect to the answer text can subsequently be determined based on it.
Furthermore, the word-level attention matrix is not related to the number of frames contained in the speech to be evaluated, that is, it is not related to the user's speech rate; only the alignment relationship between the text features and the acoustic features is considered. This overcomes the defect that users with different speech rates expressing the same answer text obtain different evaluation results; using the word-level attention matrix of this embodiment as the alignment information therefore yields higher evaluation accuracy.
In another embodiment of the present application, a process of determining an evaluation result of the speech to be evaluated with respect to the answer text according to the alignment information in step S120 is introduced.
It is understood that the alignment information used in this embodiment may be the frame-level attention matrix or the word-level attention matrix. Then, the process of determining the evaluation result according to the alignment information may include:
1) and determining the matching degree of the speech to be evaluated and the answer text according to the alignment information.
In particular, the foregoing has determined alignment information, which may be a frame-level attention matrix, or a word-level attention matrix. Based on the alignment information, the matching degree between the speech to be evaluated and the answer text can be determined.
In an alternative manner, the alignment information may be processed by a convolution unit of a neural network model, and the convolution unit is configured to receive and process the alignment information to generate an internal state representation of the matching degree between the speech to be evaluated and the answer text.
The matrix size of the alignment information input into the convolution unit of the neural network model may be fixed and may be determined according to the typical length of answer texts; for example, if answer texts generally do not exceed 20 words, the matrix size may be 20 × 20. Missing elements may be filled with 0.
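A hedged sketch of such a convolution unit is shown below: the alignment matrix is zero-padded (or cropped) to a fixed 20 × 20 size and convolved into a matching-degree vector. The class name MatchingConv, the channel count, the kernel size, and the pooling are assumptions.

```python
import torch
import torch.nn as nn


class MatchingConv(nn.Module):
    """Convolution unit: maps the alignment matrix to a matching-degree vector.

    Sketch: the alignment matrix is padded/cropped to a fixed 20 x 20 size
    (missing elements filled with 0), convolved, and flattened.
    """

    def __init__(self, fixed_size=20):
        super().__init__()
        self.fixed_size = fixed_size
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, align):
        # align: (batch, rows, cols) frame-level or word-level attention matrix
        b, r, c = align.shape
        fs = self.fixed_size
        padded = torch.zeros(b, fs, fs, device=align.device)
        padded[:, :min(r, fs), :min(c, fs)] = align[:, :fs, :fs]   # zero-fill missing elements
        x = torch.relu(self.conv(padded.unsqueeze(1)))              # (batch, 8, 20, 20)
        x = self.pool(x)                                            # (batch, 8, 10, 10)
        return x.flatten(1)                                         # matching-degree vector (batch, 800)


# usage sketch with a 20 x 20 word-level attention matrix
match_vec = MatchingConv()(torch.rand(1, 20, 20))                   # (1, 800)
```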
2) And determining an evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
In an alternative manner, the matching degree may be processed by using a third fully-connected layer of the neural network model, and the third fully-connected layer is configured to receive and process the matching degree to generate an internal state representation of the evaluation result of the speech to be evaluated with respect to the answer text.
Wherein the third fully connected layer may be represented as:
y=Fx+g
where x is the matching degree, y is the regressed evaluation result (which can be a numerical value), F is the feature mapping matrix, and g is the bias.
The evaluation result can be a specific score obtained by regression, where the size of the score represents the quality of the speech to be evaluated, namely its degree of conformity with the evaluation standard. Alternatively, the evaluation result may be the probability that the speech to be evaluated belongs to a certain class, where several classes may be preset and different classes represent different degrees of conformity with the evaluation criteria, that is, how good or bad the speech is; for example, there may be three classes: excellent, good, and poor.
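As a hedged sketch, the third fully-connected layer can be a single linear mapping implementing y = Fx + g; a classification variant over three assumed classes is also shown. The input size of 800 follows from the convolution sketch above and is an assumption.

```python
import torch
import torch.nn as nn

# regression head implementing y = F x + g (F: 1 x 800 mapping, g: bias)
score_head = nn.Linear(800, 1)

# alternative: a classification head over three assumed quality classes
class_head = nn.Sequential(nn.Linear(800, 3), nn.Softmax(dim=-1))

match_vec = torch.rand(1, 800)          # matching-degree vector from the convolution unit
score = score_head(match_vec)           # regressed evaluation score
probs = class_head(match_vec)           # class probabilities (e.g. excellent / good / poor)
```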
It should be noted that the neural network models mentioned in the above embodiments may be one and the same neural network model, that is, different hierarchical structures of a single model process their respective data. For example, several hidden layers of the model may convert the spectral features, several other hidden layers may convert the word vectors, a first fully-connected layer generates the frame-level attention matrix, a second fully-connected layer generates the word-level attention matrix, a convolution unit generates the matching degree between the speech to be evaluated and the answer text, and a third fully-connected layer generates the evaluation result of the speech to be evaluated with respect to the answer text. On this basis, speech training data annotated with manual evaluation results, together with the corresponding answer texts, can be obtained in advance to train the neural network model; the parameters of the different levels of the model are updated iteratively through a back-propagation algorithm and are fixed once training is finished.
The evaluation result is explained here by taking a score as an example. When training the neural network model, a pairwise objective function can be adopted: each data pair is constructed so that the two samples have a certain difference in their manual scores, so that the model learns the difference between different scores. One objective function consistent with this description is, for example:
$$L = \sum_{i}\Big[(y_i - z_i)^2 + \big((y_{i+1} - y_i) - (z_{i+1} - z_i)\big)^2\Big]$$
where y_i and y_{i+1} are the model prediction scores of the i-th and (i+1)-th samples in the training data, and z_i and z_{i+1} are the manual scores of the i-th and (i+1)-th samples.
The goal of the objective function is to minimize the difference between the model prediction score and the manual score, and to make the difference between the model prediction scores of two adjacent samples close to the difference between the manual scores of those samples, so that the model learns the difference between different scores.
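A hedged code sketch of a pairwise objective of this kind follows; the function pairwise_score_loss and its exact form are assumptions consistent with the description, not the patent's original formula.

```python
import torch


def pairwise_score_loss(y, z):
    """One possible objective consistent with the description above.

    It penalizes the gap between predicted scores y and manual scores z, and
    makes the predicted difference between adjacent samples track the manual
    difference, so the model learns the distinction between different scores.
    """
    point = torch.mean((y - z) ** 2)                                # prediction vs. manual score
    pair = torch.mean(((y[1:] - y[:-1]) - (z[1:] - z[:-1])) ** 2)   # adjacent-pair differences
    return point + pair


# usage sketch with illustrative scores
y_pred = torch.tensor([78.0, 85.0, 90.0])
z_true = torch.tensor([80.0, 84.0, 92.0])
loss = pairwise_score_loss(y_pred, z_true)
```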
Referring to fig. 2 and 3, there are illustrated schematic flow charts of speech evaluation of two neural network models with different structures.
In fig. 2, a word-level attention matrix is used as the alignment information, and the evaluation result is determined based on it.
In fig. 3, a frame-level attention matrix is used as alignment information, and an evaluation result is determined based on the alignment information.
As shown in fig. 2, the dashed-frame part is the internal processing flow of the neural network model. As can be seen from fig. 2, acoustic features are extracted from the speech to be evaluated and text features from the answer text, and both serve as inputs to the neural network model. RNN hidden layers extract a deep acoustic feature matrix and a deep text feature matrix, which are fed into the first fully-connected layer, and the first fully-connected layer outputs the frame-level attention matrix. The frame-level attention matrix and the deep acoustic feature matrix are combined by dot multiplication to obtain the word-level acoustic alignment matrix. The word-level acoustic alignment matrix and the deep text feature matrix serve as the input of the second fully-connected layer, which outputs the word-level attention matrix. The word-level attention matrix is input into the CNN convolution unit to obtain the processed matching-degree vector, which is input to the third fully-connected layer, and the third fully-connected layer regresses the evaluation score.
The neural network model can be trained through a back propagation algorithm, and parameters of each hierarchical structure are updated iteratively.
The dashed-box portion in fig. 3 is the internal processing flow of the neural network model; compared to fig. 2, the neural network model illustrated in fig. 3 lacks the second fully-connected layer. In the corresponding processing flow, the frame-level attention matrix output by the first fully-connected layer is directly used as the input of the CNN convolution unit, which outputs the matching-degree vector based on the frame-level attention matrix; the subsequent flow is the same. In contrast to the flow of fig. 2, the process of obtaining the word-level attention matrix through the second fully-connected layer is omitted from fig. 3.
Similarly, the neural network model can be trained by a back propagation algorithm, and parameters of each hierarchical structure in the neural network model are updated iteratively.
It should be further noted that the neural network models mentioned in the above embodiments may also be a plurality of independent neural network models, and the plurality of independent neural network models cooperate with each other to complete the whole speech evaluation process. For example, the neural network model for converting the spectral feature to obtain the deep acoustic feature may be an independent model, for example, a speech recognition model is used as the independent neural network model, and the spectral feature is converted by using the hidden layer of the speech recognition model to obtain the converted hidden layer feature as the deep acoustic feature.
The following describes the speech evaluation device provided in the embodiment of the present application, and the speech evaluation device described below and the speech evaluation method described above may be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech evaluation device disclosed in the embodiment of the present application. As shown in fig. 4, the apparatus may include:
the data acquisition unit 11 is used for acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
an alignment information determining unit 12, configured to determine alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech to be evaluated and the text feature of the answer text;
and an evaluation result determining unit 13, configured to determine an evaluation result of the speech to be evaluated with respect to the answer text according to the alignment information.
Optionally, the apparatus of the present application may further include: and the acoustic feature acquisition unit is used for acquiring the acoustic features of the speech to be evaluated. Specifically, the acoustic feature acquisition unit may include:
the first acoustic feature obtaining subunit is configured to obtain a frequency spectrum feature of the speech to be evaluated, as an acoustic feature;
or, alternatively,
the second acoustic feature obtaining subunit is used for obtaining the frequency spectrum feature of the speech to be evaluated;
and the third acoustic feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after the spectrum features are converted, and the hidden layer features are used as the acoustic features.
Optionally, the apparatus of the present application may further include: and the text characteristic acquisition unit is used for acquiring the text characteristics of the answer text. Specifically, the text feature acquisition unit may include:
the first text feature obtaining subunit is used for obtaining a vector of the answer text as a text feature;
or, alternatively,
the second text feature obtaining subunit is used for obtaining a vector of the answer text;
and the third text feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after vector conversion as text features.
Optionally, the alignment information determining unit may include:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, where the frame-level attention matrix includes: and for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
Optionally, the frame-level attention matrix determining unit may include:
a first fully-connected layer processing unit to process the acoustic features and the text features with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
Optionally, the alignment information determining unit may further include:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, where the word-level acoustic alignment matrix includes: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of weighted summation of acoustic features of each frame of voice by taking the alignment probability of the text unit and each frame of voice as a weight;
a word-level attention matrix determining unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, where the word-level attention matrix includes: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
Optionally, the word-level attention matrix determining unit may include:
a second fully-connected layer processing unit to process the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
Optionally, the evaluation result determining unit may include:
the matching degree determining unit is used for determining the matching degree of the speech to be evaluated and the answer text according to the alignment information;
and the matching degree application unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
Optionally, the matching degree determining unit may include:
and the convolution unit processing unit is used for processing the alignment information by utilizing a convolution unit of a neural network model, and the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
Optionally, the matching degree application unit may include:
and the third fully-connected layer processing unit is used for processing the matching degree by utilizing a third fully-connected layer of a neural network model, and the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
The voice evaluation device provided by the embodiment of the application can be applied to voice evaluation equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 5 shows a block diagram of a hardware structure of the speech evaluation device, and referring to fig. 5, the hardware structure of the speech evaluation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM memory and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A speech evaluation method, comprising:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard, wherein the answer text is answer content corresponding to a question in an evaluation scenario;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text, wherein the alignment information reflects the alignment relationship between the speech to be evaluated and the answer text;
determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information;
the determining the alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises the following steps:
determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text;
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features;
wherein the determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features comprises:
processing the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
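By way of illustration only, the alignment chain of claim 1 (a first fully-connected layer producing a frame-level attention matrix, a weighted summation producing a word-level acoustic alignment matrix, and a second fully-connected layer producing a word-level attention matrix) might be sketched as follows in PyTorch; the dimensions, the concatenation-based scoring and the softmax axes are assumptions, and the layers are untrained placeholders.

```python
# Illustrative sketch of the alignment chain in claim 1 with untrained PyTorch layers.
import torch
import torch.nn as nn

T, N, A, D = 200, 12, 80, 64          # frames, text units, acoustic dim, text dim (assumed)
acoustic = torch.randn(T, A)          # acoustic features of the speech to be evaluated
text = torch.randn(N, D)              # text features of the answer text

first_fc = nn.Linear(A + D, 1)        # stand-in for the "first fully-connected layer"
second_fc = nn.Linear(A + D, 1)       # stand-in for the "second fully-connected layer"

# Frame-level attention matrix (T x N): alignment probability of every frame of
# speech to every text unit, normalised over the frame axis.
pairs = torch.cat([acoustic.unsqueeze(1).expand(T, N, A),
                   text.unsqueeze(0).expand(T, N, D)], dim=-1)
frame_attn = torch.softmax(first_fc(pairs).squeeze(-1), dim=0)

# Word-level acoustic alignment matrix (N x A): per text unit, the weighted sum of
# frame acoustic features with the alignment probabilities as weights.
word_acoustic = frame_attn.transpose(0, 1) @ acoustic

# Word-level attention matrix (N x N) from the aligned acoustic information and the
# text features, via the second fully-connected layer.
pairs2 = torch.cat([word_acoustic.unsqueeze(1).expand(N, N, A),
                    text.unsqueeze(0).expand(N, N, D)], dim=-1)
word_attn = torch.softmax(second_fc(pairs2).squeeze(-1), dim=0)
print(frame_attn.shape, word_acoustic.shape, word_attn.shape)
```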
2. The method according to claim 1, wherein the process of obtaining the acoustic features of the speech to be evaluated comprises:
acquiring spectral features of the speech to be evaluated as the acoustic features;
or, alternatively,
acquiring spectral features of the speech to be evaluated;
and acquiring hidden-layer features obtained by converting the spectral features through a hidden layer of a neural network model, and taking the hidden-layer features as the acoustic features.
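As a non-limiting illustration of the spectral-feature variant in claim 2, the sketch below computes one log-power spectrum per frame of speech; the frame length, hop size and windowing are assumed values, and the hidden-layer variant would additionally pass such frames through a trained neural network.

```python
# Minimal example of frame-level spectral features (log power spectrum per frame);
# frame and hop sizes are illustrative assumptions, not values from the embodiments.
import numpy as np

def spectral_features(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    frames = [waveform[i:i + frame_len] * window
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    spectrum = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return np.log(spectrum + 1e-10)      # one feature vector per frame of speech

features = spectral_features(np.random.randn(16000))
print(features.shape)                    # (number of frames, number of frequency bins)
```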
3. The method according to claim 1, wherein the obtaining of the text feature of the answer text comprises:
obtaining a vector of the answer text as the text feature;
or, alternatively,
obtaining a vector of the answer text;
and acquiring hidden-layer features obtained by converting the vector through a hidden layer of a neural network model as the text features.
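As a non-limiting illustration of claim 3, the sketch below obtains one embedding vector per text unit and, optionally, hidden-layer features from a recurrent layer; the toy vocabulary, the dimensions and the use of a GRU are assumptions chosen only for illustration.

```python
# Illustrative text features: an embedding vector per text unit, optionally passed
# through a recurrent hidden layer; vocabulary and layer choices are assumptions.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}       # toy vocabulary
answer_text = ["the", "cat", "sat"]
ids = torch.tensor([vocab.get(w, 0) for w in answer_text])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)
vectors = embedding(ids)                                  # option 1: vectors as text features

gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
hidden_states, _ = gru(vectors.unsqueeze(0))              # option 2: hidden-layer features
text_features = hidden_states.squeeze(0)                  # one feature vector per text unit
print(vectors.shape, text_features.shape)
```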
4. The method of claim 1, wherein the frame-level attention matrix comprises: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
5. The method according to claim 4, wherein the determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises:
processing the acoustic features of the speech to be evaluated and the text features of the answer text with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
6. The method of claim 4, wherein the word-level acoustic alignment matrix comprises: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of a weighted summation of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to the text feature.
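A tiny worked example of the weighted summation recited in claim 6, using made-up numbers: three frames of two-dimensional acoustic features and a single text unit whose alignment probabilities serve as the weights.

```python
# Worked example of the weighted summation in claim 6 (all values are made up).
import numpy as np

frame_features = np.array([[1.0, 0.0],     # acoustic features of frame 1
                           [0.0, 2.0],     # acoustic features of frame 2
                           [4.0, 4.0]])    # acoustic features of frame 3
align_prob = np.array([0.2, 0.3, 0.5])     # alignment probabilities to one text unit

aligned_acoustic = align_prob @ frame_features
print(aligned_acoustic)                    # [2.2 2.6]: acoustic information aligned with the text unit
```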
7. The method according to any one of claims 1 to 6, wherein the determining an evaluation result of the speech to be evaluated with respect to the answer text according to the alignment information includes:
according to the alignment information, determining the matching degree of the speech to be evaluated and the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
8. The method according to claim 7, wherein the determining the matching degree of the speech to be evaluated and the answer text according to the alignment information comprises:
and processing the alignment information by utilizing a convolution unit of a neural network model, wherein the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
9. The method according to claim 7, wherein the determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree comprises:
and processing the matching degree by utilizing a third fully-connected layer of a neural network model, wherein the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
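As a non-limiting illustration of claims 8 and 9, the sketch below reduces alignment information to a matching degree with a small convolution unit and maps it to a score with a final fully-connected layer; the layer configuration, the pooling and the sigmoid output range are assumptions, and all layers are untrained placeholders.

```python
# Illustrative sketch of claims 8-9: convolution unit -> matching degree -> score.
import torch
import torch.nn as nn

N = 12                                                 # number of text units (assumed)
alignment = torch.softmax(torch.randn(N, N), dim=0)    # stand-in alignment information

conv_unit = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),         # stand-in "convolution unit"
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                           # collapse to a compact matching representation
)
third_fc = nn.Linear(8, 1)                             # stand-in "third fully-connected layer"

matching = conv_unit(alignment.unsqueeze(0).unsqueeze(0)).flatten(1)  # (1, 8) matching degree
score = torch.sigmoid(third_fc(matching))              # evaluation result, here mapped to [0, 1]
print(score.item())
```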
10. A speech evaluation apparatus, comprising:
the data acquisition unit is used for acquiring a speech to be evaluated and an answer text serving as an evaluation standard, wherein the answer text is answer content corresponding to a question in an evaluation scenario;
the alignment information determining unit is used for determining the alignment information of the speech to be evaluated and the answer text based on the acoustic characteristics of the speech to be evaluated and the text characteristics of the answer text, and the alignment information reflects the alignment relation between the speech to be evaluated and the answer text;
the evaluation result determining unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information;
the alignment information determination unit includes:
the frame level attention matrix determining unit is used for determining a frame level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text;
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features;
the word-level attention matrix determining unit is used for determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features;
the word-level attention matrix determining unit includes:
a second fully-connected layer processing unit, configured to process the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
11. The apparatus of claim 10, wherein the frame-level attention matrix comprises: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
12. The apparatus of claim 11, wherein the word-level acoustic alignment matrix comprises: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of a weighted summation of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to the text feature.
13. The apparatus according to any one of claims 10-12, wherein the evaluation result determination unit comprises:
the matching degree determining unit is used for determining the matching degree of the speech to be evaluated and the answer text according to the alignment information;
and the matching degree application unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
14. A speech evaluation device, characterized by comprising a memory and a processor;
the memory is used for storing a program;
the processor is configured to execute the program to implement the steps of the speech evaluation method according to any one of claims 1-9.
15. A readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the speech evaluation method according to any one of claims 1 to 9.
CN201811162964.0A 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium Active CN109215632B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811162964.0A CN109215632B (en) 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium
JP2018223934A JP6902010B2 (en) 2018-09-30 2018-11-29 Audio evaluation methods, devices, equipment and readable storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811162964.0A CN109215632B (en) 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN109215632A CN109215632A (en) 2019-01-15
CN109215632B true CN109215632B (en) 2021-10-08

Family

ID=64982845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811162964.0A Active CN109215632B (en) 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium

Country Status (2)

Country Link
JP (1) JP6902010B2 (en)
CN (1) CN109215632B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100704542B1 (en) * 2006-09-12 2007-04-09 주식회사 보경이엔지건축사사무소 Auxiliary door of vestibule be more in air-conditioning and heating for apartment house
CN111027794B (en) * 2019-03-29 2023-09-26 广东小天才科技有限公司 Correction method and learning equipment for dictation operation
CN109979482B (en) * 2019-05-21 2021-12-07 科大讯飞股份有限公司 Audio evaluation method and device
CN110223689A (en) * 2019-06-10 2019-09-10 秒针信息技术有限公司 The determination method and device of the optimization ability of voice messaging, storage medium
CN110600006B (en) * 2019-10-29 2022-02-11 福建天晴数码有限公司 Speech recognition evaluation method and system
CN110782917B (en) * 2019-11-01 2022-07-12 广州美读信息技术有限公司 Poetry reciting style classification method and system
CN111128120B (en) * 2019-12-31 2022-05-10 思必驰科技股份有限公司 Text-to-speech method and device
CN113707178B (en) * 2020-05-22 2024-02-06 苏州声通信息科技有限公司 Audio evaluation method and device and non-transient storage medium
CN111652165B (en) * 2020-06-08 2022-05-17 北京世纪好未来教育科技有限公司 Mouth shape evaluating method, mouth shape evaluating equipment and computer storage medium
CN111862957A (en) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 Single track voice keyword low-power consumption real-time detection method
CN112256841B (en) * 2020-11-26 2024-05-07 支付宝(杭州)信息技术有限公司 Text matching and countermeasure text recognition method, device and equipment
CN113379234A (en) * 2021-06-08 2021-09-10 北京猿力未来科技有限公司 Evaluation result generation method and device
CN113707148B (en) * 2021-08-05 2024-04-19 中移(杭州)信息技术有限公司 Method, device, equipment and medium for determining speech recognition accuracy
CN113506585A (en) * 2021-09-09 2021-10-15 深圳市一号互联科技有限公司 Quality evaluation method and system for voice call

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05333896A (en) * 1992-06-01 1993-12-17 Nec Corp Conversational sentence recognition system
US8231389B1 (en) * 2004-04-29 2012-07-31 Wireless Generation, Inc. Real-time observation assessment with phoneme segment capturing and scoring
JP2008052178A (en) * 2006-08-28 2008-03-06 Toyota Motor Corp Voice recognition device and voice recognition method
JP5834291B2 (en) * 2011-07-13 2015-12-16 ハイウエア株式会社 Voice recognition device, automatic response method, and automatic response program
CN104347071B (en) * 2013-08-02 2020-02-07 科大讯飞股份有限公司 Method and system for generating reference answers of spoken language test
JP6217304B2 (en) * 2013-10-17 2017-10-25 ヤマハ株式会社 Singing evaluation device and program
CN104361895B (en) * 2014-12-04 2018-12-18 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104810017B (en) * 2015-04-08 2018-07-17 广东外语外贸大学 Oral evaluation method and system based on semantic analysis
JP6674706B2 (en) * 2016-09-14 2020-04-01 Kddi株式会社 Program, apparatus and method for automatically scoring from dictation speech of learner
CN108154735A (en) * 2016-12-06 2018-06-12 爱天教育科技(北京)有限公司 Oral English Practice assessment method and device
CN106847260B (en) * 2016-12-20 2020-02-21 山东山大鸥玛软件股份有限公司 Automatic English spoken language scoring method based on feature fusion
US20190362703A1 (en) * 2017-02-15 2019-11-28 Nippon Telegraph And Telephone Corporation Word vectorization model learning device, word vectorization device, speech synthesis device, method thereof, and program
CN110444199B (en) * 2017-05-27 2022-01-07 腾讯科技(深圳)有限公司 Voice keyword recognition method and device, terminal and server
CN107818795B (en) * 2017-11-15 2020-11-17 苏州驰声信息科技有限公司 Method and device for evaluating oral English
CN109192224B (en) * 2018-09-14 2021-08-17 科大讯飞股份有限公司 Voice evaluation method, device and equipment and readable storage medium

Also Published As

Publication number Publication date
CN109215632A (en) 2019-01-15
JP6902010B2 (en) 2021-07-14
JP2020056982A (en) 2020-04-09

Similar Documents

Publication Publication Date Title
CN109215632B (en) Voice evaluation method, device and equipment and readable storage medium
Venkataramanan et al. Emotion recognition from speech
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
US10157619B2 (en) Method and device for searching according to speech based on artificial intelligence
CN108766415B (en) Voice evaluation method
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
US11409964B2 (en) Method, apparatus, device and storage medium for evaluating quality of answer
CN115329779B (en) Multi-person dialogue emotion recognition method
CN112686048B (en) Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112017694B (en) Voice data evaluation method and device, storage medium and electronic device
CN109192224B (en) Voice evaluation method, device and equipment and readable storage medium
CN111694940A (en) User report generation method and terminal equipment
CN116011457A (en) Emotion intelligent recognition method based on data enhancement and cross-modal feature fusion
WO2022127042A1 (en) Examination cheating recognition method and apparatus based on speech recognition, and computer device
US20230070000A1 (en) Speech recognition method and apparatus, device, storage medium, and program product
CN115878832B (en) Ocean remote sensing image audio retrieval method based on fine pair Ji Panbie hash
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN114328817A (en) Text processing method and device
KR20210071713A (en) Speech Skill Feedback System
CN109727091B (en) Product recommendation method, device, medium and server based on conversation robot
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
Shen et al. Optimized prediction of fluency of L2 English based on interpretable network using quantity of phonation and quality of pronunciation
WO1996013830A1 (en) Decision tree classifier designed using hidden markov models
CN115391534A (en) Text emotion reason identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant