CN109215632B - Voice evaluation method, device and equipment and readable storage medium - Google Patents


Info

Publication number
CN109215632B
CN109215632B CN201811162964.0A
Authority
CN
China
Prior art keywords
text
speech
evaluated
acoustic
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811162964.0A
Other languages
Chinese (zh)
Other versions
CN109215632A (en)
Inventor
金海
吴奎
胡阳
朱群
竺博
魏思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201811162964.0A priority Critical patent/CN109215632B/en
Priority to JP2018223934A priority patent/JP6902010B2/en
Publication of CN109215632A publication Critical patent/CN109215632A/en
Application granted granted Critical
Publication of CN109215632B publication Critical patent/CN109215632B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice evaluation method, device, equipment and readable storage medium. A speech to be evaluated and an answer text serving as the evaluation standard are obtained. Based on the acoustic features of the speech to be evaluated and the text features of the answer text, alignment information between the speech to be evaluated and the answer text can be determined; this alignment information reflects the alignment relationship between the two. An evaluation result of the speech to be evaluated relative to the answer text can then be determined automatically according to the alignment information. Because the evaluation is not carried out manually, interference from human subjectivity in the evaluation result is avoided and labor cost is reduced.

Description

Voice evaluation method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, device, and readable storage medium.
Background
With the continuous deepening of education reform, spoken language examinations are being carried out all over the country. Compared with written examinations, spoken language examinations can better assess an examinee's spoken language level.
In most existing spoken language examinations, professional teachers evaluate examinees' answers against the correct answer information corresponding to each question. This manual evaluation is easily influenced by human subjectivity, so the evaluation result is subject to human interference, and it consumes a large amount of labor cost.
Disclosure of Invention
In view of the above, the present application provides a speech evaluation method, apparatus, device and readable storage medium, to overcome the disadvantages of the existing manual spoken-test evaluation method.
In order to achieve the above object, the following solutions are proposed:
a speech evaluation method comprises the following steps:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Preferably, the process of obtaining the acoustic feature of the speech to be evaluated includes:
acquiring the frequency spectrum characteristic of the speech to be evaluated as an acoustic characteristic;
or, alternatively,
acquiring the frequency spectrum characteristics of the speech to be evaluated;
and acquiring hidden layer characteristics of the hidden layer of the neural network model after the spectrum characteristics are converted, and taking the hidden layer characteristics as acoustic characteristics.
Preferably, the process of acquiring text features of the answer text includes:
obtaining a vector of the answer text as a text feature;
or, alternatively,
obtaining a vector of the answer text;
and acquiring hidden layer characteristics of the hidden layer of the neural network model after vector conversion as text characteristics.
Preferably, the determining the alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text includes:
determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, wherein the frame-level attention matrix comprises: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
Preferably, the determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text includes:
processing the acoustic features of the speech to be evaluated and the text features of the answer text with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
Preferably, the determining the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text further includes:
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of weighted summation of acoustic features of each frame of voice by taking the alignment probability of the text unit and each frame of voice as a weight;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, wherein the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
Preferably, said determining a word-level attention matrix based on said word-level acoustic alignment matrix and said text features comprises:
processing the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
Preferably, the determining, according to the alignment information, an evaluation result of the speech to be evaluated with respect to the answer text includes:
according to the alignment information, determining the matching degree of the speech to be evaluated and the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
Preferably, the determining the matching degree of the speech to be evaluated and the answer text according to the alignment information includes:
and processing the alignment information by utilizing a convolution unit of a neural network model, wherein the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
Preferably, the determining, according to the matching degree, an evaluation result of the speech to be evaluated with respect to the answer text includes:
and processing the matching degree by utilizing a third fully-connected layer of a neural network model, wherein the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluation apparatus comprising:
the data acquisition unit is used for acquiring the speech to be evaluated and answer text serving as an evaluation standard;
the alignment information determining unit is used for determining the alignment information of the speech to be evaluated and the answer text based on the acoustic characteristics of the speech to be evaluated and the text characteristics of the answer text;
and the evaluation result determining unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Preferably, the acoustic feature acquisition unit is further included, and includes:
the first acoustic feature obtaining subunit is configured to obtain a frequency spectrum feature of the speech to be evaluated, as an acoustic feature;
or, alternatively,
the second acoustic feature obtaining subunit is used for obtaining the frequency spectrum feature of the speech to be evaluated;
and the third acoustic feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after the spectrum features are converted, and the hidden layer features are used as the acoustic features.
Preferably, the method further comprises the following steps: a text feature acquisition unit comprising:
the first text feature obtaining subunit is used for obtaining a vector of the answer text as a text feature;
or, alternatively,
the second text feature obtaining subunit is used for obtaining a vector of the answer text;
and the third text feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after vector conversion as text features.
Preferably, the alignment information determining unit includes:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, where the frame-level attention matrix includes: and for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
Preferably, the frame-level attention matrix determining unit includes:
a first fully-connected layer processing unit to process the acoustic features and the text features with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
Preferably, the alignment information determining unit further includes:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, where the word-level acoustic alignment matrix includes: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of weighted summation of acoustic features of each frame of voice by taking the alignment probability of the text unit and each frame of voice as a weight;
a word-level attention matrix determining unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, where the word-level attention matrix includes: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
Preferably, the word-level attention matrix determining unit includes:
a second fully-connected layer processing unit to process the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
Preferably, the evaluation result determination unit includes:
the matching degree determining unit is used for determining the matching degree of the speech to be evaluated and the answer text according to the alignment information;
and the matching degree application unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
Preferably, the matching degree determination unit includes:
and the convolution unit processing unit is used for processing the alignment information by utilizing a convolution unit of a neural network model, and the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
Preferably, the matching degree applying unit includes:
and the third fully-connected layer processing unit is used for processing the matching degree by utilizing a third fully-connected layer of a neural network model, and the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
A speech evaluating apparatus includes a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to realize the steps of the voice evaluation method.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech evaluation method as described above.
According to the above technical scheme, the speech evaluation method provided by the embodiments of the present application obtains the speech to be evaluated and the answer text serving as the evaluation standard, determines the alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text, and then automatically determines an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information. Because the evaluation is not carried out manually, interference from human subjectivity in the evaluation result is avoided, and labor cost is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a speech evaluation method disclosed in the embodiments of the present application;
FIG. 2 is a schematic flow chart illustrating speech evaluation by a neural network model;
FIG. 3 is a schematic flow chart illustrating speech evaluation by another neural network model;
FIG. 4 is a schematic structural diagram of a speech evaluation device disclosed in the embodiments of the present application;
fig. 5 is a block diagram of a hardware structure of a speech evaluation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To solve the problems that existing spoken-language evaluation depends on manual work, so that the evaluation result is subject to human interference and labor cost is wasted, the inventors of the present application first proposed a solution: recognize the speech to be evaluated with a speech recognition model to obtain a recognized text, extract keywords from the answer text, then calculate the hit rate of the recognized text against the keywords, and determine the evaluation result of the speech to be evaluated according to the hit rate; the higher the hit rate, the higher the evaluation score.
However, further research shows that this solution needs to recognize the speech to be evaluated as text, which requires a speech recognition model. If a universal speech recognition model is used to recognize speech to be evaluated in different test scenarios, the recognition accuracy is low, and the evaluation result is therefore inaccurate. If separate speech recognition models are trained for different examination scenarios, manual scoring of training data must be arranged in advance for each examination, which consumes a large amount of labor cost.
On the basis, the inventor of the present application further studies, and finally realizes automatic speech evaluation from the perspective of actively searching for the alignment information of the speech to be evaluated and the answer text. The voice evaluation method can be realized based on electronic equipment with data processing capacity, such as an intelligent terminal, a server, a cloud platform and the like.
The voice evaluation scheme can be suitable for the evaluation scene of the spoken language test and other scenes related to evaluation of the pronunciation level.
Next, the speech evaluation method of the present application is described with reference to fig. 1, and the method may include:
and S100, obtaining the voice to be evaluated and answer text serving as an evaluation standard.
Specifically, taking a spoken language test scenario as an example, the speech to be evaluated may be a spoken language answer recording given by an examinee. Correspondingly, in this embodiment, an answer text as an evaluation criterion may be preset. Taking the material reading spoken test questions as an example, the answer text as the evaluation criterion may be text information extracted from the reading material. In addition, for spoken language examinations of other types of questions, the answer text as the evaluation criterion may be the answer content corresponding to the question.
In this step, the obtaining mode of the speech to be evaluated may be receiving through a recording device, and the recording device may include a microphone, such as a head-mounted microphone.
Step S110, based on the acoustic characteristics of the speech to be evaluated and the text characteristics of the answer text, determining the alignment information of the speech to be evaluated and the answer text.
The acoustic characteristics of the speech to be evaluated reflect the acoustic information of the speech to be evaluated. The text features of the answer text reflect the text information of the answer text. The type of acoustic features may be varied, and similarly, the type of text features may be varied.
In this embodiment, based on the acoustic features and the text features, the alignment information of the speech to be evaluated and the answer text is actively searched, and the alignment information reflects the alignment relationship between the speech to be evaluated and the answer text. It can be understood that the integrity of the alignment between the speech to be evaluated and the answer text is high for the speech to be evaluated meeting the evaluation criterion, and the integrity of the alignment between the speech to be evaluated and the answer text is low for the speech to be evaluated not meeting the evaluation criterion.
And step S120, determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
According to the above discussion, the alignment information reflects the alignment relationship between the speech to be evaluated and the answer text, and is related to whether the speech to be evaluated meets the evaluation standard or not and the degree of the speech to be evaluated meets the evaluation standard, so that in this step, the evaluation result of the speech to be evaluated relative to the answer text can be determined according to the alignment information.
The speech evaluation method provided by the embodiment of the application can automatically determine the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information. Because the evaluation is not carried out manually, the interference of subjective influence of people on the evaluation result is avoided, and the consumption of labor cost is reduced.
Furthermore, the evaluation result is determined from the perspective of actively searching for the alignment information between the speech to be evaluated and the answer text. There is no need to run speech recognition on the speech to be evaluated and calculate the hit rate of the recognized text against the keywords of the answer text, so the problem of inaccurate evaluation caused by inaccurate speech recognition is avoided and the evaluation result is more accurate. The scheme is therefore applicable to various speech evaluation scenarios with better robustness, and no additional manpower is needed to score training data for different scenarios, which saves labor cost.
In another embodiment of the present application, the process of obtaining the acoustic features of the speech to be evaluated and the text features of the answer text mentioned in step S110 is described.
Firstly, introducing an acquisition process of acoustic features of speech to be evaluated:
In an optional mode, the spectral features of the speech to be evaluated can be directly obtained and used as the acoustic features of the speech to be evaluated.
The spectral features may include Mel-frequency cepstral coefficient (MFCC) features or perceptual linear prediction (PLP) features.
For convenience of description, the speech to be evaluated is defined to include T frames.
When obtaining the spectrum feature of the speech to be evaluated, the speech to be evaluated may be subjected to framing processing, and the framed speech to be evaluated may be subjected to pre-emphasis, so as to extract the spectrum feature of each frame of speech.
Another optional mode may be to obtain a spectral feature of the speech to be evaluated, and further, to obtain a hidden layer feature of the neural network model after converting the spectral feature, as an acoustic feature.
Here, the neural network model may take various structural forms, such as an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, a GRU (Gated Recurrent Unit), and the like.
Converting the spectral features through the hidden layer of the neural network model performs a deep mapping on them; the resulting hidden-layer features are at a deeper level than the spectral features and better reflect the acoustic characteristics of the speech to be evaluated, so the hidden-layer features can be used as the acoustic features.
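As an illustration of this conversion, the following is a minimal PyTorch sketch, not the patent's reference implementation: the class name AcousticEncoder, the choice of a GRU hidden layer, and all dimensions (39-dimensional spectral features, 128-dimensional hidden features) are assumptions.

```python
import torch
import torch.nn as nn


class AcousticEncoder(nn.Module):
    """Converts frame-level spectral features into deep acoustic features.

    A minimal sketch: a GRU hidden layer maps the spectral features, and its
    per-frame outputs are taken as the acoustic features h_1..h_T.
    """

    def __init__(self, spec_dim=39, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=spec_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, spec_feats):
        # spec_feats: (batch, T, spec_dim) spectral features (e.g. MFCC/PLP) per frame
        h, _ = self.rnn(spec_feats)   # (batch, T, hidden_dim): one m-dimensional h_t per frame
        return h


# usage sketch: 200 frames of 39-dimensional spectral features (all sizes are assumptions)
spec = torch.randn(1, 200, 39)
H = AcousticEncoder()(spec)           # (1, 200, 128)
```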
The acoustic features can be represented in the form of a matrix as follows:
$$H = [h_1, h_2, \ldots, h_T]$$
where h_t (t = 1, 2, …, T) represents the acoustic features of the t-th frame of speech; the dimension of the acoustic features of each frame remains unchanged and is defined as m.
Next, the process of obtaining the text features of the answer text is introduced:
in an alternative mode, a vector of the answer text may be directly obtained and used as a text feature of the answer text.
The vector of the answer text may be a combination of word vectors of text units constituting the answer text, or a vector result obtained by subjecting the word vectors of the text units to a certain operation. For example, hidden layer features are extracted for word vectors of text units using a neural network model as vector results for the text units. The word vector of the text unit may be represented by a one-hot method or an embedding method, for example.
Further, the text unit of the answer text may be freely set, such as using a word-level, phoneme-level or root-level text unit.
For convenience of presentation, the answer text is defined to contain C text units.
Then, a word vector of each text unit in the answer text can be obtained, and finally, the text features of the answer text are determined according to the word vectors of the C text units.
Another optional mode may be to obtain a vector of the answer text, and further obtain hidden layer features of the hidden layer of the neural network model after converting the vector as text features.
As above, the Neural Network model may adopt various structural forms, such as RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and the like.
The hidden layer features obtained are deeper than the vector level of the answer text and can reflect the text characteristics of the answer text better, so that the hidden layer features can be used as text features.
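As with the acoustic side, a hedged PyTorch sketch of extracting deep text features from the answer text is given below; the class name TextEncoder, the embedding-plus-GRU structure, and all sizes are assumptions used only for illustration.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Maps the C text units of the answer text to deep text features s_1..s_C.

    A minimal sketch: an embedding layer produces word vectors and a GRU hidden
    layer converts them; its per-unit outputs are taken as the text features.
    """

    def __init__(self, vocab_size=10000, emb_dim=64, hidden_dim=96):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(input_size=emb_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, C) indices of the answer-text units
        emb = self.embedding(token_ids)   # (batch, C, emb_dim) word vectors
        s, _ = self.rnn(emb)              # (batch, C, hidden_dim): one n-dimensional s_i per unit
        return s


# usage sketch: an answer text of 20 units (assumed)
ids = torch.randint(0, 10000, (1, 20))
S = TextEncoder()(ids)                    # (1, 20, 96)
```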
Text features can be represented in the form of a matrix as follows:
$$S = [s_1, s_2, \ldots, s_C]$$
where s_i (i = 1, 2, …, C) represents the text feature of the i-th text unit; the dimension of the text feature of each text unit remains unchanged and is defined as n.
In another embodiment of the present application, a process of determining alignment information between the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text in step S110 is described.
In this embodiment, a frame-level attention matrix may be determined based on the acoustic features of the speech to be evaluated and the text features of the answer text, where the frame-level attention matrix includes: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
The determined frame-level attention matrix can be used as the alignment information between the speech to be evaluated and the answer text. Next, the above alignment probability is described by the following formulas:
$$e_{it} = a(h_t, s_i) = w^{\top}(W s_i + V h_t + b)$$
$$a_{it} = \frac{\exp(e_{it})}{\sum_{t'=1}^{T} \exp(e_{it'})}$$
where e_it represents the alignment information between the text feature of the i-th text unit and the acoustic feature of the t-th frame of speech; a_it represents the alignment probability of the t-th frame of speech to the i-th text unit; s_i, the text feature of the i-th text unit, is an n-dimensional vector; h_t, the acoustic feature of the t-th frame of speech, is an m-dimensional vector; W, V, w and b are four parameters, where W may be a k × n matrix, V a k × m matrix, and w a k-dimensional vector (these three parameters are used for feature mapping), and b is a bias, which may be a k-dimensional vector.
The above frame-level attention matrix can be expressed in the form:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1T} \\ a_{21} & a_{22} & \cdots & a_{2T} \\ \vdots & \vdots & & \vdots \\ a_{C1} & a_{C2} & \cdots & a_{CT} \end{bmatrix}$$
i.e., a C × T matrix whose (i, t) entry is a_it.
in this embodiment, an optional implementation of determining a frame-level attention matrix through a neural network model based on an attention mechanism is provided, which may specifically include:
processing the acoustic features and the textual features with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the textual features to generate an internal state representation of a frame-level attention matrix.
The first fully-connected layer of the neural network model can be represented in the form of the e_it and a_it formulas above, with the four parameters W, V, w and b as its parameters. These four parameters can be updated iteratively through the iterative training of the neural network model until they are fixed once model training is finished.
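A hedged PyTorch sketch of such a first fully-connected layer is shown below; it follows the e_it and a_it formulas above, but the mapping dimension k, all feature sizes, and the class name FrameLevelAttention are assumptions.

```python
import torch
import torch.nn as nn


class FrameLevelAttention(nn.Module):
    """First fully-connected layer: frame-level attention matrix A of size C x T.

    Sketch of e_it = w^T (W s_i + V h_t + b) followed by a softmax over the T
    frames; the mapping dimension k and all sizes are assumptions.
    """

    def __init__(self, n_text=96, m_acoustic=128, k=64):
        super().__init__()
        self.W = nn.Linear(n_text, k, bias=False)     # W: k x n
        self.V = nn.Linear(m_acoustic, k, bias=True)  # V: k x m, with bias b (k-dimensional)
        self.w = nn.Linear(k, 1, bias=False)          # w: k-dimensional vector

    def forward(self, S, H):
        # S: (batch, C, n) text features; H: (batch, T, m) acoustic features
        Ws = self.W(S).unsqueeze(2)        # (batch, C, 1, k)
        Vh = self.V(H).unsqueeze(1)        # (batch, 1, T, k), bias b added here
        e = self.w(Ws + Vh).squeeze(-1)    # (batch, C, T): e_it
        return torch.softmax(e, dim=-1)    # a_it: alignment probability over frames


# usage sketch with the assumed shapes from the encoder sketches above
S = torch.randn(1, 20, 96)
H = torch.randn(1, 200, 128)
A = FrameLevelAttention()(S, H)            # (1, 20, 200)
```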
The frame-level attention matrix determined in this embodiment as the alignment information includes the alignment probability of each frame of speech in the speech to be evaluated to any text unit in the answer text, that is, frame-level alignment information of the speech to be evaluated is obtained. The frame-level attention matrix is related to the degree to which the speech to be evaluated conforms to the evaluation standard, so the evaluation result of the speech to be evaluated with respect to the answer text can subsequently be determined based on it.
Further, because different users speak at different rates, the speech they produce when expressing the same answer text may differ in duration, and therefore in the number of frames it contains. With the above scheme, different frame counts yield different frame-level attention matrices as the alignment information, and thus different evaluation results determined from them. In practice, since the different users express the same answer text, the evaluation results should be the same. To address this problem, the present embodiment provides another scheme for determining alignment information.
On the basis of obtaining the frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text introduced in the above embodiment, the following processing steps are further added in the embodiment:
1. Determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, the word-level acoustic alignment matrix comprising: the acoustic information aligned with each text unit in the answer text, where the acoustic information is the weighted sum of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight.
Specifically, the acoustic information aligned with the ith text unit in the word-level acoustic alignment matrix is expressed as follows:
$$c_i = \sum_{t=1}^{T} a_{it} h_t$$
where a_it and h_t have the meanings given above.
The above word-level acoustic alignment matrix may be represented as:
$$[c_1, c_2, \ldots, c_C]$$
where c_i (i = 1, 2, …, C) represents the acoustic alignment information of the i-th text unit, and each c_i is m-dimensional.
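In tensor form this weighted sum is a single batched matrix product; a small self-contained sketch, using the shapes assumed in the earlier snippets, is:

```python
import torch

# assumed shapes from the sketches above
A = torch.softmax(torch.randn(1, 20, 200), dim=-1)   # frame-level attention matrix (batch, C, T)
H = torch.randn(1, 200, 128)                          # acoustic features (batch, T, m)

# c_i = sum_t a_it * h_t: an attention-weighted sum of frame features per text unit
C_align = torch.bmm(A, H)                             # (batch, C, m) word-level acoustic alignment matrix
```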
2. Determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, where the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
The word-level attention matrix determined in this step can be used as the alignment information between the speech to be evaluated and the answer text. Next, the word-level attention matrix is illustrated by the following formulas:
$$K_{ij} = s_j^{\top} U c_i$$
$$I_{ij} = \frac{\exp(K_{ij})}{\sum_{i'=1}^{C} \exp(K_{i'j})}$$
where K_ij represents the alignment information between the acoustic alignment information of the i-th text unit and the text feature of the j-th text unit; I_ij represents the alignment probability of the acoustic information of the i-th text unit to the text feature of the j-th text unit; s_j^T is the transpose of s_j; c_i represents the acoustic alignment information of the i-th text unit; and U is a parameter used to map the word-level acoustic alignment features to the same dimension as the text features so that the dot-product operation can be performed.
The word-level attention matrix may be represented in the form:
$$I = \begin{bmatrix} I_{11} & I_{12} & \cdots & I_{1C} \\ I_{21} & I_{22} & \cdots & I_{2C} \\ \vdots & \vdots & & \vdots \\ I_{C1} & I_{C2} & \cdots & I_{CC} \end{bmatrix}$$
i.e., a C × C matrix whose (i, j) entry is I_ij.
in this embodiment, an optional implementation of determining a word-level attention matrix through a neural network model based on an attention mechanism is provided, which may specifically include:
processing the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
The second fully-connected layer of the neural network model can be represented in the form of the K_ij and I_ij formulas above, with U as its parameter. The parameter U can be updated iteratively through the iterative training of the neural network model until it is fixed once model training is finished.
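A hedged PyTorch sketch of such a second fully-connected layer follows; it implements K_ij = s_j^T U c_i with a softmax, where the softmax axis, the class name WordLevelAttention, and all sizes are assumptions.

```python
import torch
import torch.nn as nn


class WordLevelAttention(nn.Module):
    """Second fully-connected layer: word-level attention matrix of size C x C.

    Sketch of K_ij = s_j^T U c_i followed by a softmax turning K into alignment
    probabilities; the softmax axis and all sizes are assumptions.
    """

    def __init__(self, m_acoustic=128, n_text=96):
        super().__init__()
        self.U = nn.Linear(m_acoustic, n_text, bias=False)   # U maps c_i into the text-feature space

    def forward(self, C_align, S):
        # C_align: (batch, C, m) word-level acoustic alignment; S: (batch, C, n) text features
        Uc = self.U(C_align)                       # (batch, C, n)
        K = torch.bmm(Uc, S.transpose(1, 2))       # K[:, i, j] = (U c_i) . s_j
        return torch.softmax(K, dim=1)             # I_ij: probability over acoustic units i for each j


# usage sketch with the shapes assumed above
C_align = torch.randn(1, 20, 128)
S = torch.randn(1, 20, 96)
I = WordLevelAttention()(C_align, S)               # (1, 20, 20)
```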
The word-level attention matrix determined in this embodiment as the alignment information includes, for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit, that is, word-level alignment information is obtained. The word-level attention matrix is related to the degree to which the speech to be evaluated conforms to the evaluation criterion, so the evaluation result of the speech to be evaluated with respect to the answer text can subsequently be determined based on it.
Furthermore, the word-level attention matrix is not related to the number of frames contained in the speech to be evaluated, that is, it is not related to the user's speech rate; only the alignment relationship between the text features and the acoustic features is considered. This overcomes the defect that users with different speech rates expressing the same answer text obtain different evaluation results; using the word-level attention matrix of this embodiment as the alignment information therefore yields higher evaluation accuracy.
In another embodiment of the present application, a process of determining an evaluation result of the speech to be evaluated with respect to the answer text according to the alignment information in step S120 is introduced.
It is understood that the alignment information used in this embodiment may be the frame-level attention matrix or the word-level attention matrix. Then, the process of determining the evaluation result according to the alignment information may include:
1) and determining the matching degree of the speech to be evaluated and the answer text according to the alignment information.
In particular, the foregoing has determined alignment information, which may be a frame-level attention matrix, or a word-level attention matrix. Based on the alignment information, the matching degree between the speech to be evaluated and the answer text can be determined.
In an alternative manner, the alignment information may be processed by a convolution unit of a neural network model, and the convolution unit is configured to receive and process the alignment information to generate an internal state representation of the matching degree between the speech to be evaluated and the answer text.
The matrix size of the alignment information input into the convolution unit of the neural network model may be fixed and may be determined according to the typical length of answer texts; for example, if answer texts generally do not exceed 20 words, the matrix size may be 20 × 20. Missing elements may be filled with 0.
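A hedged sketch of such a convolution unit is shown below: the alignment matrix is zero-padded (or cropped) to a fixed 20 × 20 size and convolved into a matching-degree vector. The class name MatchingConv, the channel count, the kernel size, and the pooling are assumptions.

```python
import torch
import torch.nn as nn


class MatchingConv(nn.Module):
    """Convolution unit: maps the alignment matrix to a matching-degree vector.

    Sketch: the alignment matrix is padded/cropped to a fixed 20 x 20 size
    (missing elements filled with 0), convolved, and flattened.
    """

    def __init__(self, fixed_size=20):
        super().__init__()
        self.fixed_size = fixed_size
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, align):
        # align: (batch, rows, cols) frame-level or word-level attention matrix
        b, r, c = align.shape
        fs = self.fixed_size
        padded = torch.zeros(b, fs, fs, device=align.device)
        padded[:, :min(r, fs), :min(c, fs)] = align[:, :fs, :fs]   # zero-fill missing elements
        x = torch.relu(self.conv(padded.unsqueeze(1)))              # (batch, 8, 20, 20)
        x = self.pool(x)                                            # (batch, 8, 10, 10)
        return x.flatten(1)                                         # matching-degree vector (batch, 800)


# usage sketch with a 20 x 20 word-level attention matrix
match_vec = MatchingConv()(torch.rand(1, 20, 20))                   # (1, 800)
```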
2) And determining an evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
In an alternative manner, the matching degree may be processed by using a third fully-connected layer of the neural network model, and the third fully-connected layer is configured to receive and process the matching degree to generate an internal state representation of the evaluation result of the speech to be evaluated with respect to the answer text.
Wherein the third fully connected layer may be represented as:
y=Fx+g
where x is the matching degree, y is the regressed evaluation result (which can be a numerical value), F is the feature mapping matrix, and g is the bias.
The evaluation result can be a specific score obtained by regression, where the size of the score represents the quality of the speech to be evaluated, namely its degree of conformity with the evaluation standard. Alternatively, the evaluation result may be the probability that the speech to be evaluated belongs to a certain class, where several classes may be preset and different classes represent different degrees of conformity with the evaluation criteria, that is, how good or bad the speech is; for example, there may be three classes: excellent, good, and poor.
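As a hedged sketch, the third fully-connected layer can be a single linear mapping implementing y = Fx + g; a classification variant over three assumed classes is also shown. The input size of 800 follows from the convolution sketch above and is an assumption.

```python
import torch
import torch.nn as nn

# regression head implementing y = F x + g (F: 1 x 800 mapping, g: bias)
score_head = nn.Linear(800, 1)

# alternative: a classification head over three assumed quality classes
class_head = nn.Sequential(nn.Linear(800, 3), nn.Softmax(dim=-1))

match_vec = torch.rand(1, 800)          # matching-degree vector from the convolution unit
score = score_head(match_vec)           # regressed evaluation score
probs = class_head(match_vec)           # class probabilities (e.g. excellent / good / poor)
```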
It should be noted that the neural network models mentioned in the above embodiments may be one and the same neural network model, that is, different hierarchical structures of a single model process their respective data. For example, several hidden layers of the model may convert the spectral features, several other hidden layers may convert the word vectors, a first fully-connected layer generates the frame-level attention matrix, a second fully-connected layer generates the word-level attention matrix, a convolution unit generates the matching degree between the speech to be evaluated and the answer text, and a third fully-connected layer generates the evaluation result of the speech to be evaluated with respect to the answer text. On this basis, speech training data annotated with manual evaluation results, together with the corresponding answer texts, can be obtained in advance to train the neural network model; the parameters of the different levels of the model are updated iteratively through a back-propagation algorithm and are fixed once training is finished.
The evaluation result is explained here by taking a score as an example. When training the neural network model, a pairwise objective function can be adopted: each data pair is constructed so that the two samples have a certain difference in their manual scores, so that the model learns the difference between different scores. One objective function consistent with this description is, for example:
$$L = \sum_{i}\Big[(y_i - z_i)^2 + \big((y_{i+1} - y_i) - (z_{i+1} - z_i)\big)^2\Big]$$
where y_i and y_{i+1} are the model prediction scores of the i-th and (i+1)-th samples in the training data, and z_i and z_{i+1} are the manual scores of the i-th and (i+1)-th samples.
The goal of the objective function is to minimize the difference between the model prediction score and the manual score, and to make the difference between the model prediction scores of two adjacent samples close to the difference between the manual scores of those samples, so that the model learns the difference between different scores.
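A hedged code sketch of a pairwise objective of this kind follows; the function pairwise_score_loss and its exact form are assumptions consistent with the description, not the patent's original formula.

```python
import torch


def pairwise_score_loss(y, z):
    """One possible objective consistent with the description above.

    It penalizes the gap between predicted scores y and manual scores z, and
    makes the predicted difference between adjacent samples track the manual
    difference, so the model learns the distinction between different scores.
    """
    point = torch.mean((y - z) ** 2)                                # prediction vs. manual score
    pair = torch.mean(((y[1:] - y[:-1]) - (z[1:] - z[:-1])) ** 2)   # adjacent-pair differences
    return point + pair


# usage sketch with illustrative scores
y_pred = torch.tensor([78.0, 85.0, 90.0])
z_true = torch.tensor([80.0, 84.0, 92.0])
loss = pairwise_score_loss(y_pred, z_true)
```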
Referring to fig. 2 and 3, there are illustrated schematic flow charts of speech evaluation of two neural network models with different structures.
In fig. 2, a word-level attention matrix is used as the alignment information, and the evaluation result is determined based on it.
In fig. 3, a frame-level attention matrix is used as alignment information, and an evaluation result is determined based on the alignment information.
As shown in fig. 2, the dashed-frame part is the internal processing flow of the neural network model. As can be seen from fig. 2, acoustic features are extracted from the speech to be evaluated and text features from the answer text, and both serve as inputs to the neural network model. RNN hidden layers extract a deep acoustic feature matrix and a deep text feature matrix, which are fed into the first fully-connected layer, and the first fully-connected layer outputs the frame-level attention matrix. The frame-level attention matrix and the deep acoustic feature matrix are combined by dot multiplication to obtain the word-level acoustic alignment matrix. The word-level acoustic alignment matrix and the deep text feature matrix serve as the input of the second fully-connected layer, which outputs the word-level attention matrix. The word-level attention matrix is input into the CNN convolution unit to obtain the processed matching-degree vector, which is input to the third fully-connected layer, and the third fully-connected layer regresses the evaluation score.
The neural network model can be trained through a back propagation algorithm, and parameters of each hierarchical structure are updated iteratively.
The dashed-box portion in fig. 3 is the internal processing flow of the neural network model; compared to fig. 2, the neural network model illustrated in fig. 3 lacks the second fully-connected layer. In the corresponding processing flow, the frame-level attention matrix output by the first fully-connected layer is directly used as the input of the CNN convolution unit, which outputs the matching-degree vector based on the frame-level attention matrix; the subsequent flow is the same. In contrast to the flow of fig. 2, the process of obtaining the word-level attention matrix through the second fully-connected layer is omitted from fig. 3.
Similarly, the neural network model can be trained by a back propagation algorithm, and parameters of each hierarchical structure in the neural network model are updated iteratively.
It should be further noted that the neural network models mentioned in the above embodiments may also be a plurality of independent neural network models, and the plurality of independent neural network models cooperate with each other to complete the whole speech evaluation process. For example, the neural network model for converting the spectral feature to obtain the deep acoustic feature may be an independent model, for example, a speech recognition model is used as the independent neural network model, and the spectral feature is converted by using the hidden layer of the speech recognition model to obtain the converted hidden layer feature as the deep acoustic feature.
The following describes the speech evaluation device provided in the embodiment of the present application, and the speech evaluation device described below and the speech evaluation method described above may be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech evaluation device disclosed in the embodiment of the present application. As shown in fig. 4, the apparatus may include:
the data acquisition unit 11 is used for acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
an alignment information determining unit 12, configured to determine alignment information between the speech to be evaluated and the answer text based on the acoustic feature of the speech to be evaluated and the text feature of the answer text;
and an evaluation result determining unit 13, configured to determine an evaluation result of the speech to be evaluated with respect to the answer text according to the alignment information.
Optionally, the apparatus of the present application may further include: and the acoustic feature acquisition unit is used for acquiring the acoustic features of the speech to be evaluated. Specifically, the acoustic feature acquisition unit may include:
the first acoustic feature obtaining subunit is configured to obtain a frequency spectrum feature of the speech to be evaluated, as an acoustic feature;
or, alternatively,
the second acoustic feature obtaining subunit is used for obtaining the frequency spectrum feature of the speech to be evaluated;
and the third acoustic feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after the spectrum features are converted, and the hidden layer features are used as the acoustic features.
Optionally, the apparatus of the present application may further include: and the text characteristic acquisition unit is used for acquiring the text characteristics of the answer text. Specifically, the text feature acquisition unit may include:
the first text feature obtaining subunit is used for obtaining a vector of the answer text as a text feature;
or, alternatively,
the second text feature obtaining subunit is used for obtaining a vector of the answer text;
and the third text feature acquisition subunit is used for acquiring hidden layer features of the hidden layer of the neural network model after vector conversion as text features.
Optionally, the alignment information determining unit may include:
a frame-level attention matrix determination unit, configured to determine a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text, where the frame-level attention matrix includes: and for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
Optionally, the frame-level attention matrix determining unit may include:
a first fully-connected layer processing unit to process the acoustic features and the text features with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
Optionally, the alignment information determining unit may further include:
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features, where the word-level acoustic alignment matrix includes: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of weighted summation of acoustic features of each frame of voice by taking the alignment probability of the text unit and each frame of voice as a weight;
a word-level attention matrix determining unit, configured to determine a word-level attention matrix based on the word-level acoustic alignment matrix and the text features, where the word-level attention matrix includes: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to that text feature.
Optionally, the word-level attention matrix determining unit may include:
a second fully-connected layer processing unit to process the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
Optionally, the evaluation result determining unit may include:
the matching degree determining unit is used for determining the matching degree of the speech to be evaluated and the answer text according to the alignment information;
and the matching degree application unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
Optionally, the matching degree determining unit may include:
and the convolution unit processing unit is used for processing the alignment information by utilizing a convolution unit of a neural network model, and the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
Optionally, the matching degree application unit may include:
and the third fully-connected layer processing unit is used for processing the matching degree by utilizing a third fully-connected layer of a neural network model, and the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
The voice evaluation device provided by the embodiment of the application can be applied to voice evaluation equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 5 shows a block diagram of a hardware structure of the speech evaluation device, and referring to fig. 5, the hardware structure of the speech evaluation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM memory and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A speech evaluation method, comprising:
acquiring a speech to be evaluated and an answer text serving as an evaluation standard, wherein the answer text is answer content corresponding to a question in an evaluation scenario;
determining alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text, wherein the alignment information reflects the alignment relationship between the speech to be evaluated and the answer text;
determining an evaluation result of the speech to be evaluated relative to the answer text according to the alignment information;
the determining the alignment information of the speech to be evaluated and the answer text based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises the following steps:
determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text;
determining a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features;
determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features;
wherein the determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features comprises:
processing the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
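By way of illustration only, the alignment chain of claim 1 (a first fully-connected layer producing a frame-level attention matrix, a weighted summation producing a word-level acoustic alignment matrix, and a second fully-connected layer producing a word-level attention matrix) might be sketched as follows in PyTorch; the dimensions, the concatenation-based scoring and the softmax axes are assumptions, and the layers are untrained placeholders.

```python
# Illustrative sketch of the alignment chain in claim 1 with untrained PyTorch layers.
import torch
import torch.nn as nn

T, N, A, D = 200, 12, 80, 64          # frames, text units, acoustic dim, text dim (assumed)
acoustic = torch.randn(T, A)          # acoustic features of the speech to be evaluated
text = torch.randn(N, D)              # text features of the answer text

first_fc = nn.Linear(A + D, 1)        # stand-in for the "first fully-connected layer"
second_fc = nn.Linear(A + D, 1)       # stand-in for the "second fully-connected layer"

# Frame-level attention matrix (T x N): alignment probability of every frame of
# speech to every text unit, normalised over the frame axis.
pairs = torch.cat([acoustic.unsqueeze(1).expand(T, N, A),
                   text.unsqueeze(0).expand(T, N, D)], dim=-1)
frame_attn = torch.softmax(first_fc(pairs).squeeze(-1), dim=0)

# Word-level acoustic alignment matrix (N x A): per text unit, the weighted sum of
# frame acoustic features with the alignment probabilities as weights.
word_acoustic = frame_attn.transpose(0, 1) @ acoustic

# Word-level attention matrix (N x N) from the aligned acoustic information and the
# text features, via the second fully-connected layer.
pairs2 = torch.cat([word_acoustic.unsqueeze(1).expand(N, N, A),
                    text.unsqueeze(0).expand(N, N, D)], dim=-1)
word_attn = torch.softmax(second_fc(pairs2).squeeze(-1), dim=0)
print(frame_attn.shape, word_acoustic.shape, word_attn.shape)
```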
2. The method according to claim 1, wherein the process of obtaining the acoustic features of the speech to be evaluated comprises:
acquiring spectral features of the speech to be evaluated as the acoustic features;
or, alternatively,
acquiring spectral features of the speech to be evaluated;
and acquiring hidden-layer features obtained by converting the spectral features through a hidden layer of a neural network model, and taking the hidden-layer features as the acoustic features.
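As a non-limiting illustration of the spectral-feature variant in claim 2, the sketch below computes one log-power spectrum per frame of speech; the frame length, hop size and windowing are assumed values, and the hidden-layer variant would additionally pass such frames through a trained neural network.

```python
# Minimal example of frame-level spectral features (log power spectrum per frame);
# frame and hop sizes are illustrative assumptions, not values from the embodiments.
import numpy as np

def spectral_features(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    frames = [waveform[i:i + frame_len] * window
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    spectrum = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return np.log(spectrum + 1e-10)      # one feature vector per frame of speech

features = spectral_features(np.random.randn(16000))
print(features.shape)                    # (number of frames, number of frequency bins)
```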
3. The method according to claim 1, wherein the obtaining of the text feature of the answer text comprises:
obtaining a vector of the answer text as the text feature;
or, alternatively,
obtaining a vector of the answer text;
and acquiring hidden-layer features obtained by converting the vector through a hidden layer of a neural network model as the text features.
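As a non-limiting illustration of claim 3, the sketch below obtains one embedding vector per text unit and, optionally, hidden-layer features from a recurrent layer; the toy vocabulary, the dimensions and the use of a GRU are assumptions chosen only for illustration.

```python
# Illustrative text features: an embedding vector per text unit, optionally passed
# through a recurrent hidden layer; vocabulary and layer choices are assumptions.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}       # toy vocabulary
answer_text = ["the", "cat", "sat"]
ids = torch.tensor([vocab.get(w, 0) for w in answer_text])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)
vectors = embedding(ids)                                  # option 1: vectors as text features

gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
hidden_states, _ = gru(vectors.unsqueeze(0))              # option 2: hidden-layer features
text_features = hidden_states.squeeze(0)                  # one feature vector per text unit
print(vectors.shape, text_features.shape)
```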
4. The method of claim 1, wherein the frame-level attention matrix comprises: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
5. The method according to claim 4, wherein the determining a frame-level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text comprises:
processing the acoustic features of the speech to be evaluated and the text features of the answer text with a first fully-connected layer of a neural network model, the first fully-connected layer configured to receive and process the acoustic features and the text features to generate an internal state representation of a frame-level attention matrix.
6. The method of claim 4, wherein the word-level acoustic alignment matrix comprises: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of a weighted summation of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to the text feature.
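A tiny worked example of the weighted summation recited in claim 6, using made-up numbers: three frames of two-dimensional acoustic features and a single text unit whose alignment probabilities serve as the weights.

```python
# Worked example of the weighted summation in claim 6 (all values are made up).
import numpy as np

frame_features = np.array([[1.0, 0.0],     # acoustic features of frame 1
                           [0.0, 2.0],     # acoustic features of frame 2
                           [4.0, 4.0]])    # acoustic features of frame 3
align_prob = np.array([0.2, 0.3, 0.5])     # alignment probabilities to one text unit

aligned_acoustic = align_prob @ frame_features
print(aligned_acoustic)                    # [2.2 2.6]: acoustic information aligned with the text unit
```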
7. The method according to any one of claims 1 to 6, wherein the determining an evaluation result of the speech to be evaluated with respect to the answer text according to the alignment information includes:
according to the alignment information, determining the matching degree of the speech to be evaluated and the answer text;
and determining an evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
8. The method according to claim 7, wherein the determining the matching degree of the speech to be evaluated and the answer text according to the alignment information comprises:
and processing the alignment information by utilizing a convolution unit of a neural network model, wherein the convolution unit is configured to receive and process the alignment information so as to generate an internal state representation of the matching degree of the speech to be evaluated and the answer text.
9. The method according to claim 7, wherein the determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree comprises:
and processing the matching degree by utilizing a third fully-connected layer of a neural network model, wherein the third fully-connected layer is configured to receive and process the matching degree so as to generate an internal state representation of the evaluation result of the speech to be evaluated relative to the answer text.
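As a non-limiting illustration of claims 8 and 9, the sketch below reduces alignment information to a matching degree with a small convolution unit and maps it to a score with a final fully-connected layer; the layer configuration, the pooling and the sigmoid output range are assumptions, and all layers are untrained placeholders.

```python
# Illustrative sketch of claims 8-9: convolution unit -> matching degree -> score.
import torch
import torch.nn as nn

N = 12                                                 # number of text units (assumed)
alignment = torch.softmax(torch.randn(N, N), dim=0)    # stand-in alignment information

conv_unit = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),         # stand-in "convolution unit"
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                           # collapse to a compact matching representation
)
third_fc = nn.Linear(8, 1)                             # stand-in "third fully-connected layer"

matching = conv_unit(alignment.unsqueeze(0).unsqueeze(0)).flatten(1)  # (1, 8) matching degree
score = torch.sigmoid(third_fc(matching))              # evaluation result, here mapped to [0, 1]
print(score.item())
```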
10. A speech evaluation apparatus, comprising:
the data acquisition unit is used for acquiring a speech to be evaluated and an answer text serving as an evaluation standard, wherein the answer text is answer content corresponding to a question in an evaluation scenario;
the alignment information determining unit is used for determining the alignment information of the speech to be evaluated and the answer text based on the acoustic characteristics of the speech to be evaluated and the text characteristics of the answer text, and the alignment information reflects the alignment relation between the speech to be evaluated and the answer text;
the evaluation result determining unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the alignment information;
the alignment information determination unit includes:
the frame level attention matrix determining unit is used for determining a frame level attention matrix based on the acoustic features of the speech to be evaluated and the text features of the answer text;
a word-level acoustic alignment matrix determination unit, configured to determine a word-level acoustic alignment matrix based on the frame-level attention matrix and the acoustic features;
the word-level attention matrix determining unit is used for determining a word-level attention matrix based on the word-level acoustic alignment matrix and the text features;
the word-level attention matrix determining unit includes:
a second fully-connected layer processing unit, configured to process the word-level acoustic alignment matrix and the text features with a second fully-connected layer of a neural network model, the second fully-connected layer configured to receive and process the word-level acoustic alignment matrix and the text features to generate an internal state representation of a word-level attention matrix.
11. The apparatus of claim 10, wherein the frame-level attention matrix comprises: for any text unit in the answer text, the alignment probability of each frame of speech in the speech to be evaluated to the text unit.
12. The apparatus of claim 11, wherein the word-level acoustic alignment matrix comprises: acoustic information aligned with each text unit in the answer text, wherein the acoustic information comprises a result of a weighted summation of the acoustic features of each frame of speech, with the alignment probability between the text unit and each frame of speech as the weight;
the word-level attention matrix comprises: for the text feature of any text unit in the answer text, the alignment probability of the acoustic information of each text unit in the answer text to the text feature.
13. The apparatus according to any one of claims 10-12, wherein the evaluation result determination unit comprises:
the matching degree determining unit is used for determining the matching degree of the speech to be evaluated and the answer text according to the alignment information;
and the matching degree application unit is used for determining the evaluation result of the speech to be evaluated relative to the answer text according to the matching degree.
14. A speech evaluation device, characterized by comprising a memory and a processor;
the memory is used for storing a program;
the processor is configured to execute the program to implement the steps of the speech evaluation method according to any one of claims 1-9.
15. A readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the speech evaluation method according to any one of claims 1 to 9.
CN201811162964.0A 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium Active CN109215632B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811162964.0A CN109215632B (en) 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium
JP2018223934A JP6902010B2 (en) 2018-09-30 2018-11-29 Audio evaluation methods, devices, equipment and readable storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811162964.0A CN109215632B (en) 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN109215632A CN109215632A (en) 2019-01-15
CN109215632B true CN109215632B (en) 2021-10-08

Family

ID=64982845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811162964.0A Active CN109215632B (en) 2018-09-30 2018-09-30 Voice evaluation method, device and equipment and readable storage medium

Country Status (2)

Country Link
JP (1) JP6902010B2 (en)
CN (1) CN109215632B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100704542B1 (en) * 2006-09-12 2007-04-09 주식회사 보경이엔지건축사사무소 Auxiliary door of vestibule be more in air-conditioning and heating for apartment house
CN111027794B (en) * 2019-03-29 2023-09-26 广东小天才科技有限公司 Correction method and learning equipment for dictation operation
CN109979482B (en) * 2019-05-21 2021-12-07 科大讯飞股份有限公司 Audio evaluation method and device
CN110223689A (en) * 2019-06-10 2019-09-10 秒针信息技术有限公司 The determination method and device of the optimization ability of voice messaging, storage medium
CN110600006B (en) * 2019-10-29 2022-02-11 福建天晴数码有限公司 Speech recognition evaluation method and system
CN110782917B (en) * 2019-11-01 2022-07-12 广州美读信息技术有限公司 Poetry reciting style classification method and system
CN111128120B (en) * 2019-12-31 2022-05-10 思必驰科技股份有限公司 Text-to-speech method and device
CN113707178B (en) * 2020-05-22 2024-02-06 苏州声通信息科技有限公司 Audio evaluation method and device and non-transient storage medium
CN111652165B (en) * 2020-06-08 2022-05-17 北京世纪好未来教育科技有限公司 Mouth shape evaluating method, mouth shape evaluating equipment and computer storage medium
CN111862957A (en) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 Single track voice keyword low-power consumption real-time detection method
CN112256841B (en) * 2020-11-26 2024-05-07 支付宝(杭州)信息技术有限公司 Text matching and countermeasure text recognition method, device and equipment
CN113379234A (en) * 2021-06-08 2021-09-10 北京猿力未来科技有限公司 Evaluation result generation method and device
CN113707148B (en) * 2021-08-05 2024-04-19 中移(杭州)信息技术有限公司 Method, device, equipment and medium for determining speech recognition accuracy
CN113506585A (en) * 2021-09-09 2021-10-15 深圳市一号互联科技有限公司 Quality evaluation method and system for voice call

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05333896A (en) * 1992-06-01 1993-12-17 Nec Corp Conversational sentence recognition system
US8231389B1 (en) * 2004-04-29 2012-07-31 Wireless Generation, Inc. Real-time observation assessment with phoneme segment capturing and scoring
JP2008052178A (en) * 2006-08-28 2008-03-06 Toyota Motor Corp Voice recognition device and voice recognition method
JP5834291B2 (en) * 2011-07-13 2015-12-16 ハイウエア株式会社 Voice recognition device, automatic response method, and automatic response program
CN104347071B (en) * 2013-08-02 2020-02-07 科大讯飞股份有限公司 Method and system for generating reference answers of spoken language test
JP6217304B2 (en) * 2013-10-17 2017-10-25 ヤマハ株式会社 Singing evaluation device and program
CN104361895B (en) * 2014-12-04 2018-12-18 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104810017B (en) * 2015-04-08 2018-07-17 广东外语外贸大学 Oral evaluation method and system based on semantic analysis
JP6674706B2 (en) * 2016-09-14 2020-04-01 Kddi株式会社 Program, apparatus and method for automatically scoring from dictation speech of learner
CN108154735A (en) * 2016-12-06 2018-06-12 爱天教育科技(北京)有限公司 Oral English Practice assessment method and device
CN106847260B (en) * 2016-12-20 2020-02-21 山东山大鸥玛软件股份有限公司 Automatic English spoken language scoring method based on feature fusion
US20190362703A1 (en) * 2017-02-15 2019-11-28 Nippon Telegraph And Telephone Corporation Word vectorization model learning device, word vectorization device, speech synthesis device, method thereof, and program
CN110444199B (en) * 2017-05-27 2022-01-07 腾讯科技(深圳)有限公司 Voice keyword recognition method and device, terminal and server
CN107818795B (en) * 2017-11-15 2020-11-17 苏州驰声信息科技有限公司 Method and device for evaluating oral English
CN109192224B (en) * 2018-09-14 2021-08-17 科大讯飞股份有限公司 Voice evaluation method, device and equipment and readable storage medium

Also Published As

Publication number Publication date
CN109215632A (en) 2019-01-15
JP6902010B2 (en) 2021-07-14
JP2020056982A (en) 2020-04-09

Similar Documents

Publication Publication Date Title
CN109215632B (en) Voice evaluation method, device and equipment and readable storage medium
Venkataramanan et al. Emotion recognition from speech
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
US10157619B2 (en) Method and device for searching according to speech based on artificial intelligence
CN108766415B (en) Voice evaluation method
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
US11409964B2 (en) Method, apparatus, device and storage medium for evaluating quality of answer
CN115329779B (en) Multi-person dialogue emotion recognition method
CN112686048B (en) Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112017694B (en) Voice data evaluation method and device, storage medium and electronic device
CN109192224B (en) Voice evaluation method, device and equipment and readable storage medium
CN111694940A (en) User report generation method and terminal equipment
CN116011457A (en) Emotion intelligent recognition method based on data enhancement and cross-modal feature fusion
WO2022127042A1 (en) Examination cheating recognition method and apparatus based on speech recognition, and computer device
US20230070000A1 (en) Speech recognition method and apparatus, device, storage medium, and program product
CN115878832B (en) Ocean remote sensing image audio retrieval method based on fine pair Ji Panbie hash
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN114328817A (en) Text processing method and device
KR20210071713A (en) Speech Skill Feedback System
CN109727091B (en) Product recommendation method, device, medium and server based on conversation robot
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
Shen et al. Optimized prediction of fluency of L2 English based on interpretable network using quantity of phonation and quality of pronunciation
WO1996013830A1 (en) Decision tree classifier designed using hidden markov models
CN115391534A (en) Text emotion reason identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant