CN114743565A - Voice evaluation method, device, equipment and storage medium - Google Patents

Voice evaluation method, device, equipment and storage medium

Info

Publication number
CN114743565A
CN114743565A (application CN202110023912.0A)
Authority
CN
China
Prior art keywords
phoneme
pronunciation
gop
determining
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110023912.0A
Other languages
Chinese (zh)
Inventor
叶珑
雷延强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shikun Electronic Technology Co Ltd
Original Assignee
Guangzhou Shikun Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shikun Electronic Technology Co Ltd
Priority to CN202110023912.0A
Publication of CN114743565A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a voice evaluation method, device, equipment and storage medium. The voice evaluation method comprises the following steps: determining the pronunciation condition of the phoneme to be tested, wherein the pronunciation condition comprises correct pronunciation and wrong pronunciation; and for each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting the goodness of pronunciation (GOP) calculation scheme corresponding to that pronunciation condition. The technical scheme of the embodiments distinguishes different pronunciation conditions and adopts different GOP calculation schemes for them, so that the GOP value is not affected when the pronunciation is correct and an accurate GOP score can still be given when the pronunciation is wrong, which reduces the influence of wrong pronunciation on the voice evaluation result and improves the accuracy of the voice evaluation result.

Description

Voice evaluation method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to a data processing technology, in particular to a voice evaluation method, a voice evaluation device, voice evaluation equipment and a storage medium.
Background
The pronunciation quality evaluation technology is a subfield of Computer-Assisted Language Learning (CALL). It is required to point out learners' pronunciation errors efficiently and accurately, to give an objective evaluation at the phoneme level, and to help learners correct their pronunciation errors.
Goodness Of Pronunciation (GOP) is a feature commonly used in pronunciation evaluation; it yields a pronunciation evaluation result at the phoneme level. Existing evaluation models usually adopt triphones as modeling units. When a certain phoneme in a sentence is mispronounced, the triphone structure causes the error of the current phoneme to affect the GOP values of the two phonemes before and after it, which lowers the voice evaluation result.
Disclosure of Invention
The embodiment of the invention provides a voice evaluation method, a voice evaluation device, voice evaluation equipment and a storage medium, which can reduce the influence of wrong pronunciation on a voice evaluation result and improve the accuracy of the voice evaluation result.
In a first aspect, an embodiment of the present invention provides a speech evaluation method, including:
determining pronunciation conditions of the phonemes to be tested, wherein the pronunciation conditions comprise pronunciation correctness and pronunciation errors;
and aiming at each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition.
In one embodiment, determining the pronunciation condition of the phoneme to be tested comprises:
judging whether the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be detected or not frame by frame within the phoneme duration;
if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be tested, determining that the phoneme to be tested is correct in pronunciation;
and if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is different from the phoneme to be detected, determining that the phoneme to be detected is in pronunciation error.
In one embodiment, determining the pronunciation condition of the phoneme to be tested includes:
determining the accumulated state posterior probability corresponding to each phoneme state within the phoneme duration;
determining a first phoneme corresponding to the maximum accumulated state posterior probability in all phoneme states;
if the first phoneme is the same as the phoneme to be detected, determining that the phoneme to be detected is a pronunciation correct phoneme;
and if the first phoneme is different from the phoneme to be detected, determining that the phoneme to be detected is a pronunciation error phoneme.
In one embodiment, for each pronunciation case, determining a GOP value corresponding to the phoneme by using a pronunciation goodness GOP calculation scheme corresponding to the pronunciation case includes:
when the pronunciation condition is correct, determining a GOP value corresponding to the phoneme to be detected by taking the sum of the state posterior probabilities of all triphones with the phoneme to be detected as the center;
and when the pronunciation condition is pronunciation error, determining the GOP value corresponding to the phoneme to be detected according to the posterior probability value of the state of the single triphone taking the phoneme to be detected as the center.
In one embodiment, for each pronunciation case, determining a GOP value corresponding to the phoneme by using a pronunciation goodness GOP calculation scheme corresponding to the pronunciation case includes:
when the pronunciation condition is correct, calculating a GOP value corresponding to the phoneme to be detected through a first GOP formula;
wherein the first GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{\sum_{s_p} P(s_p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, s_p is a triphone state whose center phoneme is the phoneme to be tested p, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(s_p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model.
In one embodiment, for each pronunciation case, determining a GOP value corresponding to the phoneme by using a pronunciation goodness GOP calculation scheme corresponding to the pronunciation case includes:
when the pronunciation condition is pronunciation error, calculating a GOP value corresponding to the phoneme to be detected through a second GOP formula;
wherein the second GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{P(p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model.
In one embodiment, after determining the GOP value corresponding to the phoneme to be tested by using the pronunciation goodness GOP calculation scheme corresponding to the pronunciation situation, the method further includes:
when the GOP value corresponding to the phoneme to be detected is larger than or equal to a preset GOP threshold value, determining that the pronunciation of the phoneme to be detected is qualified;
and when the GOP value corresponding to the phoneme to be detected is smaller than a preset GOP threshold value, determining that the pronunciation of the phoneme to be detected is unqualified.
In a second aspect, an embodiment of the present invention further provides a speech evaluation device, including:
the pronunciation condition determining module is used for determining the pronunciation condition of the phoneme to be tested, wherein the pronunciation condition comprises pronunciation correctness and pronunciation error;
and the GOP determining module is used for determining the GOP value corresponding to the phoneme to be detected by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition aiming at each pronunciation condition.
In a third aspect, an embodiment of the present invention further provides a voice evaluation device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs are executable by the one or more processors to cause the one or more processors to implement a speech assessment method as provided in the first aspect above.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which one or more computer programs are stored, which when executed by a processor, implement the speech assessment method as provided in the first aspect above.
In the voice evaluation method, apparatus, device, and storage medium provided in the foregoing embodiments, the voice evaluation includes: determining the pronunciation condition of the phoneme to be tested, wherein the pronunciation condition comprises correct pronunciation and wrong pronunciation; and for each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting the goodness of pronunciation (GOP) calculation scheme corresponding to that pronunciation condition. The technical scheme of these embodiments distinguishes different pronunciation conditions and adopts different GOP calculation schemes for them, so that the GOP value is not affected when the pronunciation is correct and an accurate GOP score can still be given when the pronunciation is wrong, which reduces the influence of wrong pronunciation on the voice evaluation result and improves the accuracy of the voice evaluation result.
Drawings
FIG. 1 is a schematic diagram of the triphone of the word park;
FIG. 2a is a diagram illustrating an application scenario provided by an embodiment of the present invention;
FIG. 2b is a diagram illustrating another exemplary application scenario provided by an embodiment of the present invention;
FIG. 3 is a flow chart of a voice assessment method provided by an embodiment of the invention;
FIG. 4 is a diagram illustrating a phoneme corresponding to a maximum posterior probability of a frame state and a phoneme to be tested;
fig. 5 is a schematic structural diagram of a voice evaluation device according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The existing speech evaluation technology usually adopts triphones as modeling units. When a phoneme in a sentence is mispronounced, the triphone structure causes the error of the current phoneme to affect the GOP values of the two neighboring phonemes. For example, as illustrated in FIG. 1, the word "park" corresponds to the phonemes p aa r k and to the triphone sequence sil-p+aa, p-aa+r, aa-r+k, r-k+sil, where sil represents the silence phoneme. If aa is pronounced as er, the posterior probability of the triphone p-aa+r becomes lower, the posterior probabilities of sil-p+aa and aa-r+k also become lower, and the posterior probabilities of sil-p+er and er-r+k become higher, so the forced alignment result is affected.
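By way of a non-limiting illustration, the following minimal Python sketch expands a monophone sequence into the triphone sequence described above for the word "park"; the function name and the silence symbol "sil" are chosen for illustration only and are not part of the present disclosure.

def to_triphones(phones, sil="sil"):
    """Expand a monophone sequence into left-center+right triphones.

    For the word "park" (phones p aa r k) this yields
    sil-p+aa, p-aa+r, aa-r+k, r-k+sil, matching the example above.
    """
    padded = [sil] + list(phones) + [sil]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["p", "aa", "r", "k"]))
# ['sil-p+aa', 'p-aa+r', 'aa-r+k', 'r-k+sil']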
Therefore, in view of the above problems, embodiments of the present invention provide a speech evaluation method, apparatus, device, and storage medium that adopt different GOP calculation schemes for correctly pronounced phonemes and mispronounced phonemes, ensuring that the GOP value is not affected when the pronunciation is correct and that an accurate GOP score can still be given when the pronunciation is wrong, which reduces the influence of wrong pronunciation on the voice evaluation result and improves the accuracy of the voice evaluation result.
The scheme can be used for pronunciation error detection and diagnosis in the field of voice evaluation, for example in an online or offline voice evaluation system, to provide pronunciation error detection for language learners and to improve the accuracy of learners' pronunciation evaluation results, for example for a user whose native language is Chinese and who is learning a foreign language.
Fig. 2a is a diagram illustrating an application scenario provided in an embodiment of the present invention. As shown in fig. 2a, the server 102 is configured to execute the speech assessment method according to any method embodiment of the present application, the client 101 receives a speech uttered by a user through an input device, the server 102 interacts with the client 101 to obtain the speech, and after the server 102 executes the speech assessment method, the server outputs a pronunciation assessment result to the client 101, and an output device of the client 101 notifies a learner. Further, the client 101 provides the learner with the correct pronunciation to help him correct the pronunciation.
The input device may be an input device built in a computer device, such as: built-in voice input devices, etc.; or an external input device connected to the computer device through a communication line, such as: a microphone, etc. Further, the output device may be an output device built in the computer device, such as: touch display screens and the like; or an external output device connected to the computer device through a communication line, such as: a projector, a digital TV, etc.
In this embodiment, the client 101 is described by taking a computer as an example, and the computer device may specifically be a computer device including a processor, a memory, an input device, and an output device. Such as notebook computers, desktop computers, tablet computers, intelligent terminals, learning machines, early education machines, intelligent wearable computers, etc.
Alternatively, when the client 101 has certain data processing capability, that is, when the client 101 has a processor and a memory, the client 101 may be used alone as the execution subject of the pronunciation assessment method according to any embodiment of the present application, as illustrated in fig. 2b. In fig. 2b, when the learner presses the microphone button, the built-in voice collecting device of the mobile phone collects the voice uttered by the user in real time, and after the voice evaluation method is executed, the evaluation result of the voice to be evaluated is displayed to the learner on the display screen.
The pronunciation assessment method provided by the present invention will be explained below with reference to specific examples.
Fig. 3 is a flowchart of a speech evaluation method according to an embodiment of the present invention, the method is suitable for detecting whether a learner pronounces correctly, the speech evaluation method may be performed by a speech evaluation device, and the speech evaluation device may be implemented by hardware and/or software. The voice evaluation device can be formed by two or more physical entities or can be formed by one physical entity and is generally integrated in computer equipment.
As shown in fig. 3, the voice evaluation method provided by the embodiment of the present invention mainly includes the following steps:
and S21, determining the pronunciation condition of the phoneme to be tested, wherein the pronunciation condition comprises pronunciation correctness and pronunciation error.
A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, and one action forms one phoneme. Generally speaking, phonemes are finer-grained units than a word or a Chinese character: for example, the English word "one" corresponds to the three phonemes w ah n, and the Chinese character "一" corresponds to the two phonemes yh i_1 (each phoneme can serve as a pronunciation unit on its own). This can be simply understood as follows: the phonemes of Chinese are the initials and finals of each character's pronunciation, and the pronunciation of an English word can be understood in terms of its phonetic symbols. In speech recognition, a phoneme is trained from a large amount of speech data; in the simplest understanding, recordings of many people pronouncing w are collected, features are extracted from these audio samples, and the phoneme model of w is trained. A phoneme is made up of a number of states; for example, a phoneme may be composed of three states, each of which is assigned at least one frame, and the duration over which the phoneme is read is greater than three frames. The phoneme to be tested can be understood as one or more of the phonemes into which the speech to be tested is divided.
The voice to be tested can be understood as the learner's voice collected by the client when the learner reads the pronunciation text. It may be voice collected directly through a microphone as the learner speaks, or a recording of the learner obtained in advance. The learner refers to a person who needs to train his or her pronunciation according to the pronunciation text.
In practical applications, when a learner reads a pronunciation text, a voice corresponding to the pronunciation text is generated. The electronic equipment firstly acquires the voice and determines whether the pronunciation of the learner is correct or not by evaluating the voice signal. Illustratively, the pronunciation text may be embodied as at least one word or even at least one phoneme.
Further, taking a learning machine as an example for explanation, when a learner reads a pronunciation text on a display interface of the learning machine, the learning machine acquires a voice signal through a microphone and other sound pickup devices to acquire a voice to be tested, and divides the voice to be tested into one or more phonemes through a phoneme division mode, so as to evaluate and score each phoneme.
In this embodiment, a method for determining the phoneme to be tested is provided. Based on the pronunciation text and the voice to be tested, the voice to be tested is decomposed to obtain the phonemes it contains and their boundary information, and the states corresponding to these phonemes form a phoneme state sequence. That is, the phoneme state sequence includes the phonemes to be tested corresponding to the speech to be tested.
In one embodiment, the pronunciation text and the speech to be tested are aligned. An acoustic model trained in advance is used to calculate the acoustic score of each frame in the voice to be tested, and the Viterbi algorithm is then used to search for the optimal path in the alignment network, so that the phoneme state sequence and the boundary information of the voice to be tested are obtained.
The Viterbi algorithm is a dynamic programming algorithm widely used in machine learning. It searches for the Viterbi path, that is, the hidden state sequence most likely to have generated the observed event sequence, especially in the context of Markov information sources and hidden Markov models. The term "Viterbi algorithm" is also applied to related dynamic programming algorithms that find the most likely explanation of a sequence of observations. In the embodiment of the invention, the Viterbi algorithm is used to search for the optimal path in the alignment network to obtain the phoneme state sequence.
The acoustic model can be constructed as a Deep Neural Network-Hidden Markov Model (DNN-HMM), that is, the acoustic model is a DNN-HMM acoustic model. The speech signal to be tested is input into the DNN-HMM acoustic model frame by frame, the state posterior probabilities corresponding to each frame are output and converted into acoustic scores, and the Viterbi algorithm is used to search for the optimal path, yielding the phoneme state sequence and the boundary information. The purpose of the Viterbi path search is to find, in the WFST alignment network, the optimal path matching the speech feature sequence.
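As a non-limiting illustration of how frame-level state posteriors may be turned into acoustic scores before the Viterbi search, the following Python sketch applies the conversion commonly used in hybrid DNN-HMM systems (dividing posteriors by state priors and taking the logarithm); the array shapes, the source of the state priors and the function name are assumptions made for illustration.

import numpy as np

def posteriors_to_acoustic_scores(posteriors, state_priors, eps=1e-10):
    """Convert frame-by-state posteriors P(s | o_t) into scaled
    log-likelihood scores proportional to log(P(s | o_t) / P(s)),
    the form typically consumed by a Viterbi aligner.

    posteriors:   (num_frames, num_states) array output by the acoustic model
    state_priors: (num_states,) array of state prior probabilities
    """
    posteriors = np.clip(posteriors, eps, 1.0)
    priors = np.clip(state_priors, eps, 1.0)
    return np.log(posteriors) - np.log(priors)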
Further, after the phoneme to be tested is determined according to the method, the pronunciation condition of the phoneme to be tested is determined, that is, whether the phoneme to be tested is correct in pronunciation or wrong in pronunciation is determined.
In one embodiment, within the duration of any phoneme to be tested, whether the phoneme corresponding to the maximum state posterior probability output by the acoustic model is the same as the phoneme to be tested is judged frame by frame. If they are the same, the phoneme to be tested is determined to be pronounced correctly; if they are not the same, it is determined to be mispronounced.
In another embodiment, for each phoneme state, the state posterior probabilities output within the duration of the phoneme to be tested are accumulated, and the phoneme corresponding to the maximum accumulated output probability is found. If that phoneme is the same as the phoneme to be tested, the phoneme to be tested is determined to be correctly pronounced; if it is not, the phoneme to be tested is determined to be mispronounced.
It should be noted that there are various ways and criteria for determining the pronunciation condition of the phoneme to be detected; the manners above are only exemplary and not limiting.
And S22, determining a GOP value corresponding to the phoneme to be tested by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation situation according to each pronunciation situation.
The GOP is a feature commonly used in pronunciation evaluation, and the pronunciation goodness is a pronunciation evaluation result at a phoneme level.
In this embodiment, when the phoneme to be detected is pronounced correctly, the GOP value corresponding to the phoneme to be detected is determined in a first manner, and when the phoneme to be detected is mispronounced, the GOP value corresponding to the phoneme to be detected is determined in a second manner. This solves the problem in the prior art that the pronunciation evaluation accuracy is not high because the same GOP calculation mode is adopted regardless of whether the phoneme to be tested is pronounced correctly.
The voice evaluation method provided by this embodiment comprises the following steps: determining the pronunciation condition of the phoneme to be tested, wherein the pronunciation condition comprises correct pronunciation and wrong pronunciation; and for each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting the goodness of pronunciation (GOP) calculation scheme corresponding to that pronunciation condition. The technical scheme of this embodiment distinguishes different pronunciation conditions and adopts different GOP calculation schemes for them, so that the GOP value is not affected when the pronunciation is correct and an accurate GOP score can still be given when the pronunciation is wrong, which reduces the influence of wrong pronunciation on the voice evaluation result and improves the accuracy of the voice evaluation result.
Further, after determining the GOP value corresponding to the phoneme to be tested by using the pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition, the method further includes: when the GOP value corresponding to the phoneme to be detected is larger than or equal to a preset GOP threshold value, determining that the pronunciation of the phoneme to be detected is qualified; and when the GOP value corresponding to the phoneme to be detected is smaller than a preset GOP threshold value, determining that the phoneme to be detected is unqualified in pronunciation.
Specifically, the preset GOP threshold may be set according to actual conditions, and in this embodiment, the preset threshold is not specifically limited.
In one embodiment, when the GOP value corresponding to the phoneme to be tested is greater than or equal to a first threshold, determining that the evaluation result of the phoneme to be tested is excellent; when the GOP value corresponding to the phoneme to be tested is smaller than a first threshold value and larger than or equal to a second threshold value, determining that the evaluation result of the phoneme to be tested is good; when the GOP value corresponding to the phoneme to be tested is smaller than a second threshold value and is larger than or equal to a third threshold value, determining that the evaluation result of the phoneme to be tested is middle; and when the GOP value corresponding to the phoneme to be tested is smaller than a third threshold value, determining that the evaluation result of the phoneme speech to be tested is poor.
The first threshold is greater than the second threshold and the second threshold is greater than the third threshold, and the first threshold, the second threshold and the third threshold may be set according to an actual situation.
In the above embodiments, the single threshold and the three thresholds are merely used as examples and are described only briefly. In a specific application, different thresholds may be set for different situations, and the GOP value may be divided into multiple levels, for example star-level grading; the specific grading method is not described in this embodiment.
In one embodiment, the GOP value can be directly converted into a score or a percentage value, that is, the evaluation result is displayed in a score and percentage manner, so that the evaluation result is divided more finely, and a learner can clearly know the pronunciation condition of the learner.
Specifically, the highest GOP value corresponds to a score of 100 and the lowest GOP value to a score of 0; the GOP values in between are then divided in order into 99 intervals corresponding to the scores 1 to 99, and finally the correspondence between GOP values and scores is stored.
Further, after determining the GOP value of the phoneme to be tested, inquiring in the corresponding relation between the GOP value and the score, and determining the evaluation result of the phoneme to be tested according to the score corresponding to the GOP value of the phoneme to be tested. The specific query manner is not described in detail in this embodiment.
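As a non-limiting illustration of the threshold check and of the GOP-to-score mapping described above, the following Python sketch may be used; the threshold value and the assumed GOP range used for the linear mapping are placeholders, not values specified by the present disclosure.

def is_pronunciation_qualified(gop, gop_threshold=-2.0):
    """Qualified when the GOP value reaches the preset GOP threshold
    (the threshold value here is a placeholder)."""
    return gop >= gop_threshold

def gop_to_score(gop, gop_min=-10.0, gop_max=0.0):
    """Linearly map a GOP value onto a 0-100 score so that the lowest
    expected GOP maps to 0 and the highest to 100; the gop_min/gop_max
    calibration range is assumed for illustration."""
    gop = min(max(gop, gop_min), gop_max)
    return round(100.0 * (gop - gop_min) / (gop_max - gop_min))

print(is_pronunciation_qualified(-1.2), gop_to_score(-1.2))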
It should be noted that after the evaluation result of the phoneme to be tested is determined, the evaluation result needs to be sent to an output device for display, so that the learner can correctly know his or her own pronunciation. The output device may be a display device built into the computer equipment, such as a touch display screen, or an external display device connected to the computer equipment through a communication line, such as a projector or a digital TV; it may also be an audio playback device built into the computer equipment, such as a built-in loudspeaker, or an external playing device connected to the computer equipment through a communication line, such as earphones or an external sound box.
Furthermore, when the evaluation result is unqualified, or after a template-voice playing instruction input by the learner is received, the template voice is played through a playing apparatus or playing device, so that the correct pronunciation is provided for the learner to imitate and practice.
On the basis of the foregoing embodiments, a method for determining a pronunciation condition of a phoneme to be tested is provided, where determining the pronunciation condition of the phoneme to be tested includes: judging whether the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be detected or not frame by frame within the phoneme duration; if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be tested, determining that the pronunciation of the phoneme to be tested is correct; and if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is different from the phoneme to be detected, determining that the phoneme to be detected is in pronunciation error.
In this embodiment, the phoneme duration may be understood as the time over which one phoneme is read; in general, this time is longer than three frames. The acoustic model can be constructed as a DNN-HMM, that is, the acoustic model is a DNN-HMM acoustic model. The speech signal to be tested is input into the DNN-HMM acoustic model frame by frame, and the state posterior probabilities corresponding to each frame are output.
The posterior probability is one of the basic concepts of information theory: the probability, re-evaluated by the receiver after a message is received, that the message was sent is called the posterior probability. The posterior probability is calculated on the basis of the prior probability; according to the Bayes formula, it can be computed from the prior probability and the likelihood function. The way the posterior probability is calculated is not limited in this embodiment.
Further, in Bayesian statistics, the maximum a posteriori probability is the maximum of the posterior probability distribution. It can be used to obtain point estimates of quantities that cannot be directly observed in the experimental data. It is closely related to the classical method of maximum likelihood estimation, but uses an extended optimization objective that additionally takes the prior probability distribution of the estimated quantity into account. Therefore, the maximum a posteriori estimate can be regarded as a regularized maximum likelihood estimate.
One phoneme generally consists of three states. Taking the English word "park" as an example, the word corresponds to the phonemes p aa r k and to the triphone sequence sil-p+aa, p-aa+r, aa-r+k, r-k+sil. Within the segment of the triphone p-aa+r, the phoneme corresponding to the maximum state posterior probability is determined frame by frame through the acoustic model. If, over the frames spanned by the phoneme, none of the phonemes corresponding to the frame-wise maximum posterior probability is aa while the phoneme to be tested is aa, the phoneme to be tested is determined to be mispronounced. If aa does appear among the phonemes corresponding to the frame-wise maximum state posterior probability within those frames and the phoneme to be tested is aa, the phoneme to be tested is determined to be pronounced correctly.
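A minimal Python sketch of this frame-by-frame check is given below, assuming that the state posteriors for the frames spanned by the phoneme and a mapping from state index to center phoneme are already available; the names and the any-frame criterion follow the "park"/aa example above and are illustrative only.

import numpy as np

def pronunciation_correct_framewise(posteriors, state_to_phone, target_phone):
    """Frame-by-frame check within the phoneme's duration.

    posteriors:     (num_frames, num_states) posteriors for the frames
                    spanned by the phoneme to be tested
    state_to_phone: list mapping each state index to its center phoneme
    target_phone:   the phoneme that should have been pronounced

    The pronunciation is treated as correct if the center phoneme of the
    best-scoring state equals the target in at least one frame of the span.
    """
    best_states = np.argmax(posteriors, axis=1)
    best_phones = [state_to_phone[s] for s in best_states]
    return target_phone in best_phones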
On the basis of the foregoing embodiments, another method for determining the pronunciation condition of the phoneme to be tested is provided, which includes: determining the accumulated state posterior probability corresponding to each phoneme state within the phoneme duration; determining the first phoneme, that is, the phoneme corresponding to the maximum accumulated state posterior probability among all phoneme states; if the first phoneme is the same as the phoneme to be tested, determining that the phoneme to be tested is a correctly pronounced phoneme; and if the first phoneme is different from the phoneme to be tested, determining that the phoneme to be tested is a mispronounced phoneme.
The accumulated state posterior probability can be understood as the sum of the state posterior probabilities corresponding to a certain phoneme in the triphone sequence obtained from a word. For example, the English word "park" corresponds to the phonemes p aa r k and to the triphone sequence sil-p+aa, p-aa+r, aa-r+k, r-k+sil. Taking p-aa+r as an example, the state posterior probabilities of the states A-aa+B0, A-aa+B1 and A-aa+B2 of all triphones A-aa+B whose center phoneme is aa are calculated, and the sum of all these state posterior probabilities is taken as the accumulated state posterior probability of the phoneme aa.
The posterior probability of each phoneme state may be determined by the acoustic model or calculated by a bayesian formula, and the specific calculation method is not described in this embodiment.
Further, after the accumulated state posterior probability corresponding to each phoneme state is determined, the phoneme with the maximum accumulated state posterior probability is determined as the first phoneme. If the first phoneme is the same as the phoneme to be tested, the phoneme to be tested is determined to be a correctly pronounced phoneme; if not, it is determined to be a mispronounced phoneme.
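A corresponding Python sketch of the accumulated-posterior check is given below, under the same assumptions about the posterior array and the state-to-phoneme mapping; it is an illustration, not the only possible implementation.

import numpy as np

def pronunciation_correct_accumulated(posteriors, state_to_phone, target_phone):
    """Accumulated-posterior check within the phoneme's duration.

    The posteriors of all states sharing the same center phoneme are summed
    over the span; the phoneme with the largest accumulated posterior (the
    "first phoneme") is then compared with the phoneme to be tested.
    """
    frame_sums = posteriors.sum(axis=0)              # accumulate over frames
    totals = {}
    for state, phone in enumerate(state_to_phone):
        totals[phone] = totals.get(phone, 0.0) + float(frame_sums[state])
    first_phone = max(totals, key=totals.get)
    return first_phone == target_phone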
In one embodiment, a method for determining GOP values is provided, wherein for each pronunciation case, a GOP calculation scheme corresponding to the pronunciation case is used to determine GOP values corresponding to phonemes, and the method comprises the following steps: when the pronunciation condition is correct, determining a GOP value corresponding to the phoneme to be detected by taking the sum of the state posterior probabilities of all triphones with the phoneme to be detected as the center; and when the pronunciation condition is pronunciation error, determining the GOP value corresponding to the phoneme to be detected according to the posterior probability value of the state of the single triphone taking the phoneme to be detected as the center.
In this embodiment, when the phoneme to be tested is pronounced correctly, the GOP value of the phoneme is calculated with the sum of the state posterior probabilities of all triphones centered on the phoneme to be tested as the numerator and, for each frame, the maximum state posterior probability of that frame as the denominator.
Specifically, when the pronunciation condition is correct, calculating a GOP value corresponding to the phoneme to be detected through a following first GOP formula;
wherein the first GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{\sum_{s_p} P(s_p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, s_p is a triphone state whose center phoneme is the phoneme to be tested p, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(s_p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model. Here, max_{s \in S} P(s | o_t) refers to the frame-maximum state posterior probability corresponding to the phoneme; its position among the DNN output states is shown in fig. 4.
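The first GOP formula can be sketched in Python as follows, assuming the state posteriors for the frames t_s to t_e and the indices of the states s_p centered on the phoneme to be tested are available; the normalization by the number of frames and all names are illustrative assumptions.

import numpy as np

def gop_correct(posteriors, center_phone_states, eps=1e-10):
    """First GOP formula (pronunciation correct): per frame, the numerator
    is the summed posterior of all triphone states centered on the phoneme
    to be tested and the denominator is the frame-maximum state posterior;
    the log ratio is averaged over the phoneme's duration.

    posteriors:          (num_frames, num_states) for frames t_s .. t_e
    center_phone_states: indices of the states s_p centered on phoneme p
    """
    numer = posteriors[:, center_phone_states].sum(axis=1)  # sum over s_p of P(s_p | o_t)
    denom = posteriors.max(axis=1)                          # max over s in S of P(s | o_t)
    return float(np.mean(np.log(np.clip(numer, eps, None))
                         - np.log(np.clip(denom, eps, None))))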
When the phoneme to be tested is mispronounced, the GOP value of the phoneme is calculated with the state posterior probability of the single triphone centered on the phoneme to be tested (the forced-aligned state) as the numerator and, for each frame, the maximum state posterior probability of that frame as the denominator.
Specifically, when the pronunciation condition is pronunciation error, a GOP value corresponding to the phoneme to be detected is calculated through a following second GOP formula;
wherein the second GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{P(p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model. Here, P(p | o_t) refers to the forced-aligned state posterior probability corresponding to the phoneme to be tested; its position among the DNN output states is shown in fig. 4.
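The second GOP formula admits an analogous sketch, assuming the forced-aligned state index of each frame within the phoneme's duration is known; again, the names and the per-frame averaging are illustrative assumptions.

import numpy as np

def gop_error(posteriors, aligned_states, eps=1e-10):
    """Second GOP formula (pronunciation error): per frame, the numerator is
    the posterior of the single forced-aligned state of the triphone centered
    on the phoneme to be tested and the denominator is the frame-maximum
    state posterior; the log ratio is averaged over the phoneme's duration.

    posteriors:     (num_frames, num_states) for frames t_s .. t_e
    aligned_states: per-frame forced-aligned state index, length num_frames
    """
    frames = np.arange(posteriors.shape[0])
    numer = posteriors[frames, aligned_states]  # P(p | o_t)
    denom = posteriors.max(axis=1)              # max over s in S of P(s | o_t)
    return float(np.mean(np.log(np.clip(numer, eps, None))
                         - np.log(np.clip(denom, eps, None))))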
In this embodiment, different GOP calculation schemes are adopted for correctly pronounced phonemes and mispronounced phonemes, so that the GOP value is not affected when the pronunciation is correct and an accurate GOP score can still be given when the pronunciation is wrong, which reduces the influence of wrong pronunciation on the voice evaluation result and improves the accuracy of the voice evaluation result.
Fig. 5 is a schematic structural diagram of a speech evaluation device according to an embodiment of the present invention, which is adapted to detect whether a learner pronounces correctly, and the speech evaluation device may be implemented by hardware and/or software. The voice evaluation device can be formed by two or more physical entities or can be formed by one physical entity and is generally integrated in computer equipment.
As shown in fig. 5, the speech evaluation device according to the embodiment of the present invention mainly includes a pronunciation situation determination module 51 and a GOP determination module 52.
The pronunciation condition determining module 51 is configured to determine a pronunciation condition of a phoneme to be tested, where the pronunciation condition includes pronunciation correctness and pronunciation error;
and the GOP determining module 52 is configured to determine, for each pronunciation situation, a GOP value corresponding to the phoneme to be detected by using a pronunciation goodness GOP calculation scheme corresponding to the pronunciation situation.
The voice evaluation device provided by the embodiment of the invention performs the following operations: determining the pronunciation condition of the phoneme to be detected, wherein the pronunciation condition comprises correct pronunciation and wrong pronunciation; and for each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting the goodness of pronunciation (GOP) calculation scheme corresponding to that pronunciation condition. The technical scheme of this embodiment distinguishes different pronunciation conditions and adopts different GOP calculation schemes for them, so that the GOP value is not affected when the pronunciation is correct and an accurate GOP score can still be given when the pronunciation is wrong, which reduces the influence of wrong pronunciation on the voice evaluation result and improves the accuracy of the voice evaluation result.
In one embodiment, the pronunciation situation determination module 51 includes:
the first judgment unit is used for judging whether the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be detected or not frame by frame within the phoneme duration;
the pronunciation correctness determining unit is used for determining that the pronunciation of the phoneme to be tested is correct if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be tested;
and the pronunciation error determining unit is used for determining the pronunciation error of the phoneme to be tested if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is different from the phoneme to be tested.
In one embodiment, the pronunciation situation determination module 51 includes:
the accumulated state posterior probability determining unit is used for determining the accumulated state posterior probability corresponding to each phoneme state within the phoneme duration;
the first phoneme determining unit is used for determining a first phoneme corresponding to the maximum accumulated state posterior probability in all phoneme states;
the pronunciation correct determining unit is further used for determining that the phoneme to be detected is a pronunciation correct phoneme if the first phoneme is the same as the phoneme to be detected;
and the pronunciation error determining unit is further used for determining that the phoneme in the audio to be tested is the pronunciation error phoneme if the first phoneme is not the same as the phoneme to be tested.
In one embodiment, the GOP determination module 52 includes:
a first GOP determining unit, configured to determine a GOP value corresponding to the phoneme to be detected, based on a sum of state posterior probabilities of all triphones centering on the phoneme to be detected when the pronunciation condition is a correct pronunciation;
and the second GOP determining unit is used for determining the GOP value corresponding to the phoneme to be detected by using the state posterior probability value of the single triphone taking the phoneme to be detected as the center when the pronunciation condition is pronunciation error.
In an embodiment, the first GOP determining unit is specifically configured to, when the pronunciation condition is that the pronunciation is correct, calculate a GOP value corresponding to the phoneme to be detected by using a first GOP formula as follows;
wherein the first GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{\sum_{s_p} P(s_p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, s_p is a triphone state whose center phoneme is the phoneme to be tested p, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(s_p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model.
In an embodiment, the second GOP determining unit is specifically configured to, when the pronunciation condition is pronunciation error, calculate a GOP value corresponding to the phoneme to be detected by using a second GOP formula as follows;
wherein the second GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{P(p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model.
In one embodiment, the apparatus further comprises:
the pronunciation qualification determining module is used for determining that the pronunciation of the phoneme to be detected is qualified when the GOP value corresponding to the phoneme to be detected is greater than or equal to a preset GOP threshold value; and when the GOP value corresponding to the phoneme to be detected is smaller than a preset GOP threshold value, determining that the pronunciation of the phoneme to be detected is unqualified.
Fig. 6 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes a processor 601, a memory 602, an input device 603, and an output device 604; the number of processors 601 in the device may be one or more, and one processor 601 is taken as an example in fig. 6; the processor 601, the memory 602, the input device 603 and the output device 604 in the apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 6.
The memory 602 is used as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the speech evaluation method in the embodiment of the present invention (for example, the modules in the speech evaluation device shown in fig. 5 include the pronunciation situation determination module 51 and the GOP determination module 52). The processor 601 executes various functional applications and data processing of the device by running software programs, instructions and modules stored in the memory 602, that is, the voice evaluation method described above is implemented.
The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 602 may further include memory located remotely from the processor 601, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
And, when the one or more programs included in the above-described apparatus are executed by the one or more processors 601, the programs perform the following operations:
determining pronunciation conditions of the phonemes to be tested, wherein the pronunciation conditions comprise pronunciation correctness and pronunciation errors;
and aiming at each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition.
The input device 603 may be used to receive input numeric or character information and generate key signal inputs relating to user settings and function controls of the apparatus. The output device 604 may include a display device such as a display screen.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processing apparatus, implements a voice evaluation method provided in an embodiment of the present invention, and the method includes:
determining pronunciation conditions of the phonemes to be tested, wherein the pronunciation conditions comprise pronunciation correctness and pronunciation errors;
and aiming at each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the voice evaluation method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the voice evaluation device, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail by the above embodiments, the invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the invention, and the scope of the invention is determined by the scope of the appended claims.

Claims (10)

1. A speech assessment method, comprising:
determining pronunciation conditions of the phonemes to be tested, wherein the pronunciation conditions comprise pronunciation correctness and pronunciation errors;
and aiming at each pronunciation condition, determining a GOP value corresponding to the phoneme to be detected by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition.
2. The method of claim 1, wherein determining the pronunciation of the phoneme to be tested comprises:
judging whether the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be detected or not frame by frame within the phoneme duration;
if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is the same as the phoneme to be tested, determining that the pronunciation of the phoneme to be tested is correct;
and if the phoneme corresponding to the maximum posterior probability of the state of the acoustic model is different from the phoneme to be detected, determining that the phoneme to be detected is in pronunciation error.
3. The method of claim 1, wherein determining the pronunciation of the phoneme to be tested comprises:
determining the accumulated state posterior probability corresponding to each phoneme state within the phoneme duration;
determining a first phoneme corresponding to the maximum accumulated state posterior probability in all phoneme states;
if the first phoneme is the same as the phoneme to be detected, determining that the phoneme to be detected is a pronunciation correct phoneme;
and if the first phoneme is different from the phoneme to be detected, determining that the phoneme in the audio to be detected is a pronunciation error phoneme.
4. The method according to claim 1, wherein for each pronunciation case, determining a GOP value corresponding to the phoneme using a GOP calculation scheme corresponding to the pronunciation case comprises:
when the pronunciation condition is correct, determining a GOP value corresponding to the phoneme to be detected by taking the sum of the state posterior probabilities of all triphones with the phoneme to be detected as the center;
and when the pronunciation condition is pronunciation error, determining the GOP value corresponding to the phoneme to be tested by using the posterior probability value of the state of the single triphone taking the phoneme to be tested as the center.
5. The method according to claim 1, wherein for each pronunciation case, determining a GOP value corresponding to the phoneme using a GOP calculation scheme corresponding to the pronunciation case comprises:
when the pronunciation condition is correct, calculating a GOP value corresponding to the phoneme to be detected through a first GOP formula;
wherein the first GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{\sum_{s_p} P(s_p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, s_p is a triphone state whose center phoneme is the phoneme to be tested p, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(s_p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model.
6. The method according to claim 1, wherein for each pronunciation case, determining a GOP value corresponding to the phoneme by using a pronunciation goodness GOP calculation scheme corresponding to the pronunciation case comprises:
when the pronunciation condition is pronunciation error, calculating a GOP value corresponding to the phoneme to be detected through a second GOP formula;
wherein the second GOP formula is:
GOP(p) = \frac{1}{t_e - t_s} \sum_{t=t_s}^{t_e} \log \frac{P(p \mid o_t)}{\max_{s \in S} P(s \mid o_t)}
where t_s is the start time of the phoneme to be tested, t_e is the end time of the phoneme to be tested, p is the phoneme to be tested, S is the set of triphone states, s is a triphone state belonging to S, o_t is the speech feature of the t-th frame, and P(p | o_t), P(s | o_t) are posterior probabilities output by the acoustic model.
7. The method according to claim 1, wherein after determining the GOP value corresponding to the phoneme to be tested by using the pronunciation goodness GOP calculation scheme corresponding to the pronunciation situation, the method further comprises:
when the GOP value corresponding to the phoneme to be detected is larger than or equal to a preset GOP threshold value, determining that the pronunciation of the phoneme to be detected is qualified;
and when the GOP value corresponding to the phoneme to be detected is smaller than a preset GOP threshold value, determining that the pronunciation of the phoneme to be detected is unqualified.
8. A speech evaluation device characterized by comprising:
the pronunciation condition determining module is used for determining the pronunciation condition of the phoneme to be tested, wherein the pronunciation condition comprises pronunciation correctness and pronunciation error;
and the GOP determining module is used for determining the GOP value corresponding to the phoneme to be detected by adopting a pronunciation goodness GOP calculation scheme corresponding to the pronunciation condition aiming at each pronunciation condition.
9. A speech evaluation apparatus characterized by comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs being executable by the one or more processors to cause the one or more processors to implement the speech assessment method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the speech assessment method according to any one of claims 1 to 7.
CN202110023912.0A 2021-01-08 2021-01-08 Voice evaluation method, device, equipment and storage medium Pending CN114743565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110023912.0A CN114743565A (en) 2021-01-08 2021-01-08 Voice evaluation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110023912.0A CN114743565A (en) 2021-01-08 2021-01-08 Voice evaluation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114743565A true CN114743565A (en) 2022-07-12

Family

ID=82273817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110023912.0A Pending CN114743565A (en) 2021-01-08 2021-01-08 Voice evaluation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114743565A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination