CN114743566A

CN114743566A - Voice evaluation method, device, equipment and storage medium

Info

Publication number: CN114743566A
Application number: CN202110023913.5A
Authority: CN
Inventors: 叶珑; 雷延强; 佘爽
Original assignee: Guangzhou Shikun Electronic Technology Co Ltd
Current assignee: Guangzhou Shikun Electronic Technology Co Ltd
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2022-07-12

Abstract

The embodiment of the invention discloses a voice evaluation method, device, equipment and storage medium. The voice evaluation method comprises: determining a first goodness of pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice; determining the relative entropy (KL divergence) between the first GOP sequence and the second GOP sequence; and determining an evaluation result of the voice to be tested based on the KL divergence. The technical scheme of the embodiment exploits the fact that GOP features form a probability distribution and uses the KL divergence to measure the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, thereby improving the accuracy of the evaluation result.

Description

Voice evaluation method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to a data processing technology, in particular to a voice evaluation method, a voice evaluation device, voice evaluation equipment and a storage medium.

Background

Pronunciation quality assessment (PQA) is a subfield of computer-assisted language learning (CALL). It is required to indicate learners' pronunciation errors efficiently and accurately, give objective phoneme-level evaluations, and help learners correct their pronunciation errors.

Goodness of Pronunciation (GOP) is a feature commonly used in pronunciation evaluation, but it yields evaluation results only at the phoneme level. For direct evaluation of a whole sentence, existing methods mainly use a weighted combination of phoneme-level GOP values as the sentence-level GOP, so the sentence-level result is obtained only indirectly.

This whole-sentence evaluation method does not address the differences in GOP value distribution between phonemes, which can make the whole-sentence evaluation inaccurate.

Disclosure of Invention

The embodiment of the invention provides a voice evaluation method, a voice evaluation device, voice evaluation equipment and a storage medium, and improves the accuracy of a whole-sentence voice evaluation result.

In a first aspect, an embodiment of the present invention provides a speech evaluation method, including:

determining a first pronunciation goodness GOP sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice;

determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence;

and determining the evaluation result of the voice to be tested based on the KL divergence.

In an embodiment, determining a first GOP sequence corresponding to a speech to be detected and a second GOP sequence corresponding to a template speech includes:

acquiring a first phoneme state sequence and first boundary information of the voice to be detected;

determining a first GOP sequence corresponding to the voice to be detected based on the first phoneme state sequence and the first boundary information;

acquiring a second phoneme state sequence and second boundary information of the template voice;

and determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and the second boundary information.

In an embodiment, determining a first GOP sequence corresponding to a to-be-detected speech based on the first phoneme state sequence and first boundary information, and determining a second GOP sequence corresponding to a template speech based on the second phoneme state sequence and second boundary information includes:

determining a first GOP value corresponding to each non-silent phoneme in the first phoneme state sequence based on the first phoneme state sequence and first boundary information;

arranging first GOP values corresponding to the multiple non-silent phonemes according to a first preset sequence to obtain a first GOP sequence;

determining a second GOP value corresponding to each non-mute phoneme in the second phoneme state sequence based on the second phoneme state sequence and second boundary information;

and arranging second GOP values corresponding to the plurality of non-mute phonemes according to a second preset sequence to obtain a second GOP sequence.

In an embodiment, after determining the second GOP sequence corresponding to the template speech based on the second phoneme state sequence and the second boundary information, the method further includes:

for each phoneme state, comparing GOP values in the first GOP sequence and GOP values in the second GOP sequence;

and if the GOP value in the second GOP sequence is smaller than that of the first GOP sequence, replacing the GOP value in the second GOP sequence with the GOP value in the first GOP sequence to obtain a new second GOP sequence.

In one embodiment, determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence includes:

calculating KL divergence between the first GOP sequence and the second GOP sequence by a KL divergence formula.

In one embodiment, the KL divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

where $D_{KL}(P \,\|\, Q)$ is the KL divergence between the first GOP sequence and the second GOP sequence, $P(i)$ is the i-th GOP value in the second GOP sequence, and $Q(i)$ is the i-th GOP value in the first GOP sequence.

In an embodiment, determining the evaluation result of the voice to be tested based on the KL divergence includes:

when the KL divergence is larger than or equal to a preset threshold value, determining that the evaluation result of the voice to be tested is correct pronunciation;

and when the KL divergence is smaller than the preset threshold value, determining that the evaluation result of the voice to be tested is pronunciation error.

In a second aspect, an embodiment of the present invention further provides a speech evaluation device, which includes:

a GOP sequence determining module for determining a first goodness of pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice;

a KL divergence determining module for determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence;

and the evaluation result determining module is used for determining the evaluation result of the voice to be tested based on the KL divergence.

In a third aspect, an embodiment of the present invention further provides a voice evaluation device, including:

one or more processors;

a memory for storing one or more programs;

the one or more programs are executable by the one or more processors to cause the one or more processors to implement a speech assessment method as provided in the first aspect above.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which one or more computer programs are stored, which when executed by a processor, implement the speech assessment method as provided in the first aspect above.

In the voice evaluation method, apparatus, device and storage medium provided in the above embodiments, the voice evaluation method includes: determining a first goodness of pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice; determining the relative entropy (KL divergence) between the first GOP sequence and the second GOP sequence; and determining an evaluation result of the voice to be tested based on the KL divergence. The technical scheme of the embodiment exploits the fact that GOP features form a probability distribution and uses the KL divergence to measure the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, thereby improving the accuracy of the evaluation result.

Drawings

Fig. 1a is an exemplary diagram of an application scenario provided in an embodiment of the present invention;

FIG. 1b is a diagram of another exemplary application scenario provided in the embodiment of the present invention;

FIG. 2 is a flow chart of a voice evaluation method provided by an embodiment of the present invention;

FIG. 3 is a flow chart of another speech assessment method according to an embodiment of the present invention;

FIG. 4 is a flow chart of another speech assessment method according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a voice evaluation device according to an embodiment of the present invention;

fig. 6 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

The existing speech evaluation technical research can only obtain pronunciation evaluation results at a phoneme level generally, and for the direct evaluation of the whole sentence, the existing evaluation method mainly uses a weighted phoneme level GOP value as a sentence level GOP value generally, so that the pronunciation evaluation results at the sentence level are indirectly obtained. The whole-sentence speech evaluation method does not solve the influence caused by different phoneme GOP value distribution differences, and can cause inaccurate whole-sentence speech evaluation.

Therefore, in view of the above problems, embodiments of the present invention provide a speech evaluation method, apparatus, device and storage medium, which obtain a sentence-level pronunciation evaluation result by calculating the KL divergence between the phoneme GOP sequence of a template pronunciation and the phoneme GOP sequence of the speech to be evaluated. Multiple pronunciation grades are then delimited according to the KL divergence to realize sentence-level pronunciation evaluation.

The scheme can be used for pronunciation error detection and diagnosis in the field of speech evaluation, for example in an online or offline speech evaluation system. It provides whole-sentence pronunciation error detection for language learners and realizes pronunciation evaluation at the sentence level. An example user is a native Chinese speaker learning a foreign language.

Fig. 1a is an exemplary diagram of an application scenario provided in the embodiment of the present invention. As shown in fig. 1a, the server 102 is configured to execute the pronunciation assessment method according to any method embodiment of the present application. The client 101 receives, through an input device, the voice to be tested uttered by the user, and the server 102 interacts with the client 101 to obtain the voice to be tested. After executing the pronunciation assessment method, the server 102 outputs the pronunciation assessment result to the client 101, and an output device of the client 101 presents the result to the learner. Further, the client 101 provides the learner with the correct pronunciation to help the learner correct his or her pronunciation.

The input device may be an input device built in a computer device, such as: built-in voice input devices, etc.; or an external input device connected to the computer device through a communication line, such as: a microphone, etc. Further, the output device may be an output device built in the computer device, such as: touch display screens and the like; or an external output device connected with the computer device through a communication line, such as: a projector, a digital TV, etc.

It should be noted that, in this embodiment, the client 101 is described by taking a computer as an example, and the computer device may specifically be a computer device including a processor, a memory, an input device, and an output device. Such as notebook computers, desktop computers, tablet computers, intelligent terminals, learning machines, early education machines, intelligent wearable computers, etc.

Alternatively, when the client 101 has sufficient data processing capability, that is, when the client 101 has a processor and a memory, the client 101 alone may serve as the execution subject of the pronunciation assessment method according to any embodiment of the present application, as illustrated in fig. 1b. In fig. 1b, the learner presses the microphone, the built-in voice collecting device of the mobile phone collects the voice to be tested uttered by the learner in real time, and after the pronunciation evaluation method is executed, the evaluation result of the voice to be tested is displayed to the learner on the display screen.

The pronunciation assessment method provided by the present invention will be explained below with reference to specific examples.

Fig. 2 is a flowchart of a speech evaluation method according to an embodiment of the present invention, the method is suitable for detecting whether a learner pronounces correctly, the speech evaluation method may be performed by a speech evaluation device, and the speech evaluation device may be implemented by hardware and/or software. The voice evaluation device can be formed by two or more physical entities or can be formed by one physical entity and is generally integrated in computer equipment.

As shown in fig. 2, the voice evaluation method provided by the embodiment of the present invention mainly includes the following steps:

s11, determining a first pronunciation goodness GOP sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice.

In this embodiment, the voice to be tested is the voice uttered by the learner for a pronunciation text. It may be captured directly through a microphone, or it may be a prerecorded voice of the learner reading the pronunciation text. The template voice is the standard pronunciation of the speech the learner is practicing and is recorded according to the standard pronunciation of the language concerned: for example, Chinese is recorded according to standard Mandarin, and English may be recorded according to standard American or British pronunciation. The template voice can also be understood as the correct pronunciation; it is used to help learners correct wrong pronunciation and to prompt the correct pronunciation.

It should be noted that the speech to be detected and the template speech are the speech of the same pronunciation text.

In practical applications, when a learner reads a text, a speech signal corresponding to the text is generated. The electronic equipment firstly acquires the voice signal and determines whether the pronunciation of the learner is correct or not by detecting the voice signal.

Illustratively, the text may be at least one word, or even at least one phoneme. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action forming one phoneme. A phoneme in turn consists of several states. For example, if a phoneme is composed of three states and each state lasts at least one frame, the phoneme lasts at least three frames. The text is the pronunciation text described in the embodiment of the present application, and the speech signal is the voice to be tested.

Taking the learning machine as an example, when the learner reads the pronunciation text on the display interface of the learning machine, the learning machine acquires the voice signal through the pickup equipment such as the microphone to acquire the voice to be detected.

In this embodiment, the first GOP sequence refers to analyzing the speech to be detected to obtain a series of GOP values, and the second GOP sequence refers to analyzing the template speech to obtain a series of GOP values.

Further, the sequence of the GOP values in the first GOP sequence is determined according to the first phoneme state sequence of the speech to be detected. The arrangement order of the GOP values in the second GOP sequence is determined by the second phoneme state sequence of the template voice. Wherein the length of the first GOP sequence is equal to the length of the second GOP sequence.

In one embodiment, a method of determining a first sequence of GOPs is provided. Specifically, the speech to be detected is decomposed based on the pronunciation text and the speech to be detected, phonemes and boundary information included in the speech to be detected are obtained, and states corresponding to the phonemes form a first phoneme state sequence. And calculating GOP values corresponding to the phonemes, and sequentially arranging the GOP values corresponding to the phonemes in the first phoneme state sequence according to the sequence of the phonemes to obtain a first GOP sequence.

In one embodiment, a method of determining a second sequence of GOPs is provided. Specifically, the standard speech is decomposed based on the pronunciation text and the standard speech to obtain phonemes and boundary information included in the standard speech, and states corresponding to the phonemes form a second phoneme state sequence. And calculating GOP values corresponding to the phonemes, and sequentially arranging the GOP values corresponding to the phonemes in the second phoneme state sequence according to the sequence of the phonemes to obtain a second GOP sequence.

It should be noted that, for the first GOP sequence corresponding to the speech to be detected, since the learner may have different speeches issued for the same pronunciation text each time, the scheme for determining the first GOP sequence needs to be executed after the speech signal to be detected is acquired each time.

In an embodiment, for the second GOP sequence corresponding to the template voice, after the voice signal to be detected is acquired each time, a scheme of acquiring the template voice corresponding to the voice to be detected and determining the second GOP sequence corresponding to the template voice may be performed.

In another embodiment, because the standard pronunciation of a given pronunciation text does not change, the correspondence between the second GOP sequence and the pronunciation text is stored once the second GOP sequence corresponding to the template voice has been determined. After the voice to be tested for that pronunciation text and its first GOP sequence are obtained, the second GOP sequence corresponding to the pronunciation text is read directly from the memory or storage unit and used as the second GOP sequence corresponding to the template voice.
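The stored correspondence between pronunciation text and second GOP sequence can be sketched as a small in-memory cache. This is only an illustration: the class name `TemplateGopCache` and the callable `compute_template_gop` are hypothetical, and a real system might use persistent storage instead of a dictionary.

```python
class TemplateGopCache:
    """Caches the second (template) GOP sequence per pronunciation text."""

    def __init__(self, compute_template_gop):
        # compute_template_gop is a hypothetical callable that maps a
        # pronunciation text to the template GOP sequence (a list of floats).
        self._compute = compute_template_gop
        self._store = {}

    def get(self, pronunciation_text):
        # Compute the template GOP sequence only on the first request,
        # then reuse the stored result for later evaluations.
        if pronunciation_text not in self._store:
            self._store[pronunciation_text] = list(self._compute(pronunciation_text))
        return self._store[pronunciation_text]
```

With such a cache, only the first GOP sequence has to be recomputed for each new recording of the same text.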

S12, determining the relative entropy KL divergence between the first GOP sequence and the second GOP sequence.

The relative entropy, also called the Kullback-Leibler (KL) divergence or information divergence, is an asymmetric measure of the difference between two probability distributions. It measures the degree of difference between two probability distributions P and Q, where P represents the true or observed distribution of the data and Q represents a theoretical distribution, an estimate, or an approximating model distribution. The KL divergence, also known as the KL distance, is non-negative.

Specifically, the KL divergence is used to measure the degree of difference between the first GOP sequence and the second GOP sequence. This embodiment does not limit how the KL divergence is computed: the GOP values in the first GOP sequence and the second GOP sequence can be substituted directly into a KL divergence formula, or the KL divergence of the two sequences can be obtained by mathematical derivation. The KL divergence is described here as one way of measuring the degree of difference between two sequences; other measures, such as the Euclidean distance, may also be used.

In one embodiment, determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence comprises: calculating KL divergence between the first GOP sequence and the second GOP sequence by a KL divergence formula.

In this embodiment, the KL divergence formula may be a KL conclusion formula or a KL derivation formula, which is not limited in this embodiment.
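A minimal sketch of the direct-substitution option follows. It assumes the GOP values have already been mapped to non-negative numbers and normalizes each sequence to sum to 1; the text treats GOP features as a probability distribution but does not spell out this step, so the normalization is an assumption. The function name `kl_divergence` and the `eps` guard against division by zero are illustrative choices.

```python
import math


def kl_divergence(template_gop, test_gop, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)).

    P is built from the second (template) GOP sequence and Q from the
    first (test) GOP sequence, both normalized to sum to 1 (assumption).
    """
    if len(template_gop) != len(test_gop):
        raise ValueError("GOP sequences must have the same length")
    if any(v < 0 for v in template_gop) or any(v < 0 for v in test_gop):
        raise ValueError("GOP values must be non-negative for this sketch")
    p_total = sum(template_gop)
    q_total = sum(test_gop)
    if p_total == 0 or q_total == 0:
        raise ValueError("GOP sequences must not sum to zero")
    p = [v / p_total for v in template_gop]
    q = [v / q_total for v in test_gop]
    # Terms with P(i) == 0 contribute nothing; eps keeps log finite if Q(i) == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

Because the measure is asymmetric, swapping the two arguments generally gives a different value, so the roles of template and test sequences should be kept fixed.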

And S13, determining the evaluation result of the voice to be tested based on the KL divergence.

In this embodiment, the evaluation result may be a correct pronunciation or an incorrect pronunciation, or the evaluation result may be a division of multiple grades of good or medium, or a score or a percentage value, which is not limited in this embodiment.

In one embodiment, determining the evaluation result of the voice to be tested based on the KL divergence includes: when the KL divergence is larger than or equal to a preset threshold value, determining that the evaluation result of the voice to be tested is correct pronunciation; and when the KL divergence is smaller than the preset threshold value, determining that the evaluation result of the voice to be tested is pronunciation error.

Specifically, the preset threshold may be set according to an actual situation, and in this embodiment, the preset threshold is not specifically limited.

In one embodiment, determining the evaluation result of the voice to be tested based on the KL divergence includes: when the KL divergence is larger than or equal to a first threshold value, determining that the evaluation result of the voice to be tested is excellent; when the KL divergence is smaller than the first threshold and larger than or equal to a second threshold, determining that the evaluation result is good; when the KL divergence is smaller than the second threshold and larger than or equal to a third threshold, determining that the evaluation result is medium; and when the KL divergence is smaller than the third threshold value, determining that the evaluation result is poor.

The first threshold is greater than the second threshold and the second threshold is greater than the third threshold, and the first threshold, the second threshold and the third threshold may be set according to an actual situation.

The single-threshold and three-threshold schemes in the above embodiments are only brief examples. In a specific application, different thresholds can be set according to the situation to divide the evaluation result into multiple grades; the specific grading method is not described further in this embodiment.
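The threshold-based grading can be sketched with one function that covers both the single-threshold and the multi-threshold case. The comparison direction (greater than or equal to a threshold selects the better grade) follows the embodiments as written; the function name, the threshold values, and the labels used in the usage lines are hypothetical.

```python
def grade_by_thresholds(kl, thresholds, labels):
    """Map a KL divergence to a grade label.

    thresholds must be strictly descending; labels has one more entry
    than thresholds, ordered from the best grade to the worst.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    if any(a <= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly descending")
    for threshold, label in zip(thresholds, labels):
        if kl >= threshold:
            return label
    return labels[-1]


# Single preset threshold: correct pronunciation versus pronunciation error.
binary_result = grade_by_thresholds(0.3, [0.3], ["correct", "error"])

# Three thresholds: excellent, good, medium, poor.
graded_result = grade_by_thresholds(
    0.5, [0.8, 0.5, 0.2], ["excellent", "good", "medium", "poor"]
)
```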

In one embodiment, the KL divergence can be directly converted into a score or a percentage value, that is, the evaluation result is displayed in a score and percentage manner, so that the evaluation result is divided more finely, and the learner can clearly know the pronunciation condition of the learner.

Specifically, when the phonemes of the voice to be tested and the standard voice are completely the same, the KL divergence corresponds to 100 points, and when they are completely different, the KL divergence corresponds to 0 points. The range of KL divergence between these two cases is then divided in order into 99 parts corresponding to 1 to 99 points, and finally the correspondence between KL divergence and score is stored.

Further, after the KL divergence of the voice to be tested is determined, the corresponding relation between the KL divergence and the score is inquired, and the score corresponding to the KL divergence of the voice to be tested is determined as the evaluation result of the voice to be tested. The specific query method is not described in detail in this embodiment.
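One way to realize the correspondence between KL divergence and score is a clamped linear mapping between the two endpoint values. This is a simplification of the stored lookup table described above: the endpoints `kl_at_100` and `kl_at_0` are hypothetical parameters that would have to be calibrated, and the function computes the score directly instead of querying a stored table.

```python
def kl_to_score(kl, kl_at_100, kl_at_0):
    """Linearly map a KL divergence to an integer score in [0, 100].

    kl_at_100 is the KL value assigned 100 points and kl_at_0 the value
    assigned 0 points; both are assumed, calibrated endpoints.
    """
    if kl_at_100 == kl_at_0:
        raise ValueError("endpoints must differ")
    fraction = (kl - kl_at_0) / (kl_at_100 - kl_at_0)
    # Clamp values outside the calibrated range to the nearest endpoint.
    fraction = min(1.0, max(0.0, fraction))
    return round(100 * fraction)
```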

It should be noted that after the evaluation result of the voice to be tested is determined, the evaluation result needs to be sent to an output device for presentation, so that the learner can accurately understand his or her own pronunciation. The output device may be a display device built into the computer device, such as a touch display screen; an external display device connected to the computer device through a communication line, such as a projector or a digital TV; an audio playback device built into the computer device, such as a built-in speaker; or an external playback device connected to the computer device through a communication line, such as earphones or an external speaker.

Furthermore, when the evaluation result is unqualified, or after a template voice playing instruction input by the learner is received, the template voice is played through a playing device, so that the learner is given the correct pronunciation to imitate in training.

The voice evaluation method provided by the embodiment of the invention comprises the following steps: determining a first goodness of pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice; determining the relative entropy (KL divergence) between the first GOP sequence and the second GOP sequence; and determining an evaluation result of the voice to be tested based on the KL divergence. The technical scheme of the embodiment exploits the fact that GOP features form a probability distribution and uses the KL divergence to measure the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, thereby improving the accuracy of the evaluation result.

Fig. 3 is another speech evaluation method provided in the embodiment of the present invention, and as shown in fig. 3, the speech evaluation method provided in the embodiment of the present invention mainly includes the following steps:

s21, acquiring a first phoneme state sequence and first boundary information of the voice to be detected.

In this embodiment, the pronunciation text and the speech to be tested are aligned. Further, an acoustic score of each frame in the speech to be detected is calculated by using a pre-trained acoustic model, and then an optimal path is searched in the alignment network by using a Viterbi algorithm to obtain a first phoneme state sequence and first boundary information of the speech to be detected.

The Viterbi algorithm is a dynamic programming algorithm widely used in machine learning. It finds the Viterbi path, that is, the hidden state sequence most likely to have produced an observed event sequence, particularly in the context of Markov information sources and hidden Markov models. The terms "Viterbi path" and "Viterbi algorithm" are also applied to related dynamic programming algorithms that find the most likely explanation for a set of observations. In the embodiment of the invention, the Viterbi algorithm is used to search for an optimal path in the alignment network to obtain the first phoneme state sequence.

The acoustic model can be built from deep neural networks (DNN) and hidden Markov models (HMM), that is, the acoustic model is a DNN-HMM acoustic model. The voice signal to be tested is input into the DNN-HMM acoustic model frame by frame, the model outputs the state posterior probabilities for each frame, the state posterior probabilities are converted into acoustic scores, and the Viterbi algorithm searches for the optimal path to obtain the first phoneme state sequence and the boundary information. The purpose of the Viterbi search is to find, in the WFST alignment network, the optimal path that matches the speech feature sequence.
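The alignment search can be illustrated with a simplified Viterbi pass over a strictly left-to-right state sequence. This sketch is not the WFST-based search described above: it assumes the expected state sequence is already known from the pronunciation text, that each state may only repeat or advance to the next state, and that `log_scores[t][s]` holds the acoustic log score of frame `t` for the `s`-th expected state. All names are illustrative.

```python
def forced_align(log_scores):
    """Return the best state index for every frame (monotone path)."""
    num_frames = len(log_scores)
    num_states = len(log_scores[0])
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    best = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    best[0][0] = log_scores[0][0]  # the path must start in the first state
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = best[t - 1][s]
            advance = best[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best[t][s] = advance + log_scores[t][s]
                back[t][s] = s - 1
            else:
                best[t][s] = stay + log_scores[t][s]
                back[t][s] = s
    # Backtrace from the last state at the last frame.
    path = [num_states - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def boundaries(path):
    """Convert a frame-level path into (state, start_frame, end_frame) segments."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((path[start], start, t))
            start = t
    return segments
```

The segments returned by `boundaries` play the role of the boundary information: each one gives the frame range (end exclusive) occupied by one state.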

And S22, determining a first GOP sequence corresponding to the voice to be tested based on the first phoneme state sequence and the first boundary information.

In the present embodiment, the first phoneme state sequence and the first boundary information are known from the alignment result, and a GOP score is calculated for each phoneme segment using the GOP calculation formula. The numerator of the GOP formula is the phoneme sequence likelihood obtained by forced alignment, and the denominator is the sequence likelihood obtained by free phoneme decoding, where free decoding refers to a decoding process based on a phoneme loop network.

The GOP calculation formula is as follows:

$$GOP(p) = \frac{1}{T} \log \frac{P(O \mid p)\,P(p)}{\sum_{q \in Q} P(O \mid q)\,P(q)}$$

where T is the phoneme duration, O represents the speech feature sequence corresponding to the speech signal within the duration of phoneme p, Q is the set of phonemes, P(O | p) is the observation probability of phoneme p, and P(p) is the prior probability of phoneme p. GOP values are calculated only for non-silent phonemes, which yields the GOP score of each non-silent phoneme in the phoneme sequence.
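A minimal sketch of a per-phoneme GOP computation, assuming the commonly used duration-normalized log-posterior form, in which the joint score of the target phoneme is divided by the sum of joint scores over all phonemes in the set. The function name and the dictionary-based inputs are hypothetical; the log-likelihoods are taken as given for the aligned segment.

```python
import math


def gop_value(target, log_likelihoods, priors, num_frames):
    """Duration-normalized log posterior of the target phoneme.

    log_likelihoods[q] is log P(O | q) for the aligned segment O,
    priors[q] is P(q), and num_frames is the phoneme duration T.
    """
    log_joint = {
        q: log_likelihoods[q] + math.log(priors[q]) for q in log_likelihoods
    }
    # Log-sum-exp over all phonemes, shifted by the peak for numerical stability.
    peak = max(log_joint.values())
    log_denominator = peak + math.log(
        sum(math.exp(v - peak) for v in log_joint.values())
    )
    return (log_joint[target] - log_denominator) / num_frames
```

Under this form the value is never positive, and it approaches zero as the target phoneme dominates the competing phonemes.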

In one embodiment, a first GOP value corresponding to each non-silent phoneme in the first phoneme state sequence is determined based on the first phoneme state sequence and first boundary information; arranging first GOP values corresponding to the multiple non-mute phonemes according to a first preset sequence to obtain a first GOP sequence.

The first preset sequence refers to the sequence of each phoneme in the speech to be detected in the first phoneme state sequence.
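Assembling the GOP sequence then amounts to keeping the per-phoneme values in alignment order and dropping silence. The silence labels below are hypothetical; the actual labels depend on the phoneme inventory of the acoustic model.

```python
# Hypothetical silence labels; replace with those of the phoneme set in use.
SILENCE_LABELS = {"sil", "sp"}


def build_gop_sequence(phoneme_gops):
    """Keep GOP values of non-silent phonemes, preserving alignment order.

    phoneme_gops is a list of (phoneme_label, gop_value) pairs in the
    order the phonemes occur in the phoneme state sequence.
    """
    return [gop for phoneme, gop in phoneme_gops if phoneme not in SILENCE_LABELS]
```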

And S23, acquiring a second phoneme state sequence and second boundary information of the template voice.

And S24, determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and the second boundary information.

In this embodiment, the specific method for acquiring the second phoneme state sequence and the second boundary information of the template voice, and for determining the second GOP sequence corresponding to the template voice based on them, is the same as in steps S21 and S22. Reference may be made to the description in the foregoing embodiment, which is not repeated here.

In one embodiment, a second GOP value corresponding to each non-silent phoneme in the second phoneme state sequence is determined based on the second phoneme state sequence and the second boundary information; and arranging second GOP values corresponding to the plurality of non-mute phonemes according to a second preset sequence to obtain a second GOP sequence.

Wherein the second preset order refers to an order of each phoneme in the template speech in the second phoneme state sequence.

S25, comparing the GOP value in the first GOP sequence and the GOP value in the second GOP sequence for each phoneme state.

S26, if the GOP value in the second GOP sequence is smaller than that of the first GOP sequence, replacing the GOP value in the second GOP sequence with the GOP value in the first GOP sequence to obtain a new second GOP sequence.

In this embodiment, the second GOP sequence corresponding to the template speech and the first GOP sequence corresponding to the speech to be tested have already been obtained in S21-S24. Because the output probabilities of the acoustic model may be inaccurate, the two sequences are compared phoneme by phoneme. If the GOP value of a phoneme in the second GOP sequence is smaller than the GOP value of the corresponding phoneme in the first GOP sequence, the value in the second GOP sequence is temporarily replaced with the value from the first GOP sequence, and the value in the first GOP sequence is kept unchanged; otherwise, no operation is performed. This ensures the stability and reliability of the scoring.

Illustratively, the second GOP sequence corresponding to the template speech sequentially includes: the GOP value a1 corresponding to phoneme A, the GOP value b1 corresponding to phoneme B, and the GOP value c1 corresponding to phoneme C. The first GOP sequence corresponding to the speech to be tested sequentially includes: the GOP value a2 corresponding to phoneme A, the GOP value b2 corresponding to phoneme B, and the GOP value c2 corresponding to phoneme C. The phoneme-by-phoneme comparison compares a1 with a2, b1 with b2, and c1 with c2. If a1 is less than a2, a1 in the second GOP sequence is replaced with a2. If a1 is greater than or equal to a2, no operation is performed on the GOP value in the second GOP sequence.

Further, suppose the second GOP sequence is a1, b1, c1 and the first GOP sequence is a2, b2, c2. If a1 is less than a2, b1 is not less than b2, and c1 is not less than c2, then after the comparison the second GOP sequence is a2, b1, c1, and the first GOP sequence remains a2, b2, c2.
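The comparison and replacement of steps S25 and S26 amount to taking, position by position, the larger of the two GOP values as the new template value. The sketch below assumes both sequences are already aligned phoneme by phoneme and of equal length; the function name `update_template_gop` and the numbers are hypothetical.

```python
from typing import List


def update_template_gop(first: List[float], second: List[float]) -> List[float]:
    """Return a new second (template) GOP sequence.

    Wherever the template value is smaller than the value of the speech
    to be tested, the template value is replaced by the tested value.
    The first sequence is never modified.
    """
    if len(first) != len(second):
        raise ValueError("GOP sequences must have the same length")
    return [p if q < p else q for p, q in zip(first, second)]


# Mirrors the a/b/c example: a1 < a2, b1 >= b2, c1 >= c2.
first_seq = [0.9, 0.5, 0.4]    # a2, b2, c2 (speech to be tested)
second_seq = [0.7, 0.8, 0.6]   # a1, b1, c1 (template speech)
new_second_seq = update_template_gop(first_seq, second_seq)
print(new_second_seq)  # [0.9, 0.8, 0.6], i.e. a2, b1, c1
```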

S27, calculating KL divergence between the first GOP sequence and the second GOP sequence through a KL divergence formula.

In this embodiment, the KL divergence is calculated for the updated GOP sequences of equal length. The KL divergence, also called relative entropy, measures the degree of difference between two probability distributions P and Q, where P represents the true or observed distribution of the data, and Q represents the theoretical distribution, an estimate, or an approximating model distribution of the data.

Wherein the KL divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

wherein D_{KL}(P \| Q) is the KL divergence between the first GOP sequence and the second GOP sequence, P(i) is the i-th GOP value in the first GOP sequence, and Q(i) is the i-th GOP value in the second GOP sequence.
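As a rough sketch, the KL divergence between two equal-length GOP sequences can be computed as below. The natural logarithm, the requirement that every value of the second sequence be strictly positive, and the convention that a zero P(i) contributes nothing are assumptions of this sketch; the function name `kl_divergence` is hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of P(i) * log(P(i) / Q(i)) over aligned positions i."""
    if len(p) != len(q):
        raise ValueError("GOP sequences must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i < 0 or q_i <= 0:
            raise ValueError("this sketch expects P(i) >= 0 and Q(i) > 0")
        if p_i == 0:
            continue  # assumed convention: 0 * log(0 / q) counts as 0
        total += p_i * math.log(p_i / q_i)
    return total


first_seq = [0.9, 0.5, 0.4]        # speech to be tested
new_second_seq = [0.9, 0.8, 0.6]   # template after the replacement step
print(kl_divergence(first_seq, new_second_seq))
```

Because the replacement step leaves every Q(i) at least as large as P(i), each term in this sum is zero or negative for positive GOP values, so under these assumptions a result closer to zero indicates closer agreement with the template.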

S28, determining the evaluation result of the speech to be tested based on the KL divergence.

This embodiment uses a sentence-level speech evaluation method based on KL divergence. Because the GOP distributions of different phonemes differ markedly, evaluating the degree of similarity between the sample to be tested and the template, rather than simply taking a weighted average of the phoneme-level evaluation results, effectively avoids the influence of differences in GOP values between phonemes. The method exploits the fact that GOP features are probability distributions and measures the similarity to the template pronunciation GOP sequence by KL divergence, thereby improving the accuracy of the evaluation result.

In an application example, fig. 4 is a flowchart of another speech evaluation method provided in the embodiment of the present invention, and as shown in fig. 4, the speech evaluation method provided in the embodiment mainly includes:

S31, performing forced alignment of the text and the audio to acquire a phoneme state sequence and boundary information.

In this embodiment, the audio includes a template audio and an audio to be tested. Correspondingly, the phoneme state sequence comprises a first phoneme state sequence corresponding to the voice to be detected and a second phoneme state sequence corresponding to the template voice. The boundary information comprises first boundary information corresponding to the voice to be detected and second boundary information corresponding to the template voice.

S32, calculating the GOP value of the non-silence phoneme.

In this embodiment, the template speech and the speech to be tested undergo the same calculation process, yielding a template GOP sequence and a GOP sequence to be tested. For the specific calculation process, refer to the description in the above embodiment, which is not repeated here.

S33, judging, phoneme by phoneme, whether the template GOP value is smaller than the GOP value to be tested; if so, executing step S34; if not, executing step S35.

S34, replacing the template GOP value with the GOP value to be tested, and then executing step S35.

S35, calculating the KL divergence between the template GOP sequence and the GOP sequence to be tested.

For the specific method of calculating the KL divergence, refer to the description in the above embodiment, which is not repeated here.

S36, judging whether the KL divergence is larger than a preset threshold value, if so, executing S37, and otherwise, executing S38.

S37, determining that the sentence is pronounced correctly.

S38, determining that the sentence is mispronounced.

In the embodiment, the characteristic that the GOP characteristics are probability distribution is utilized, and the similarity with the template pronunciation GOP sequence is measured by using KL divergence, so that the accuracy of an evaluation result is ensured.
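Steps S33 to S38 can be put together in a short sketch. It assumes the two GOP sequences have already been obtained by steps S31 and S32 and contain positive values only; the function name `evaluate_sentence`, the natural logarithm, and the threshold values are hypothetical choices for illustration, since the text does not state a concrete threshold.

```python
import math
from typing import Sequence


def evaluate_sentence(test_gop: Sequence[float],
                      template_gop: Sequence[float],
                      threshold: float) -> bool:
    """Return True if the sentence is judged correctly pronounced."""
    if len(test_gop) != len(template_gop):
        raise ValueError("GOP sequences must have the same length")
    # S33/S34: raise each template value to the tested value if it is smaller.
    adjusted = [p if q < p else q for p, q in zip(test_gop, template_gop)]
    # S35: KL divergence between the tested and the adjusted template sequence.
    kl = sum(p * math.log(p / q) for p, q in zip(test_gop, adjusted))
    # S36-S38: above the threshold means correct, otherwise mispronounced.
    return kl > threshold


test_gop = [0.9, 0.5, 0.4]
template_gop = [0.7, 0.8, 0.6]
print(evaluate_sentence(test_gop, template_gop, threshold=-0.5))  # True
print(evaluate_sentence(test_gop, template_gop, threshold=-0.1))  # False
```

For these made-up values the divergence is about -0.397, so the sentence passes a threshold of -0.5 and fails a threshold of -0.1.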

Fig. 5 is a schematic structural diagram of a speech evaluation device according to an embodiment of the present invention, the device is adapted to detect whether a pronunciation of a learner is correct, and the speech evaluation device may be implemented by hardware and/or software. The voice evaluation device can be formed by two or more physical entities or can be formed by one physical entity and is generally integrated in computer equipment.

As shown in fig. 5, the speech evaluation device according to the embodiment of the present invention mainly includes a GOP sequence determining module 51, a KL divergence determining module 52, and an evaluation result determining module 53.

The GOP sequence determining module 51 is configured to determine a first pronunciation goodness GOP sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice;

a KL divergence determination module 52 for determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence;

and the evaluation result determining module 53 is configured to determine an evaluation result of the to-be-tested voice based on the KL divergence.

The voice evaluation device provided by the embodiment of the invention is used for executing the following operations: determining a first pronunciation goodness GOP sequence corresponding to the speech to be tested and a second GOP sequence corresponding to the template speech, determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence, and determining an evaluation result of the speech to be tested based on the KL divergence. The technical scheme of the embodiment exploits the fact that GOP features are probability distributions, and measures the similarity between the GOP sequence of the speech to be tested and the template pronunciation GOP sequence by KL divergence, so that the accuracy of the evaluation result is improved.

Further, the GOP sequence determining module 51 includes: a first GOP sequence determining unit and a second GOP sequence determining unit;

the first GOP sequence determining unit is used for acquiring a first phoneme state sequence and first boundary information of the voice to be detected; determining a first GOP sequence corresponding to the voice to be detected based on the first phoneme state sequence and the first boundary information;

a second GOP sequence determining unit, configured to acquire a second phoneme state sequence and second boundary information of the template speech; and determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and the second boundary information.

Further, the first GOP sequence determining unit is specifically configured to determine, based on the first phoneme state sequence and the first boundary information, a first GOP value corresponding to each non-silent phoneme in the first phoneme state sequence; arranging first GOP values corresponding to the multiple non-silent phonemes according to a first preset sequence to obtain a first GOP sequence;

a second GOP sequence determining unit, configured to determine, based on the second phoneme state sequence and the second boundary information, a second GOP value corresponding to each non-silent phoneme in the second phoneme state sequence; and arranging second GOP values corresponding to the plurality of non-silent phonemes according to a second preset sequence to obtain a second GOP sequence.

Further, the second GOP sequence determining unit is further specifically configured to compare, for each phoneme state, a GOP value in the first GOP sequence with a GOP value of the second GOP sequence;

and if the GOP value in the second GOP sequence is smaller than that of the first GOP sequence, replacing the GOP value of the second GOP sequence with the GOP value in the first GOP sequence to obtain a new second GOP sequence.

Further, the KL divergence determining module 52 is specifically configured to determine a relative entropy KL divergence between the first GOP sequence and the second GOP sequence, and includes: calculating KL divergence between the first GOP sequence and the second GOP sequence by a KL divergence formula.

Wherein the KL divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

Further, the evaluation result determining module 53 is specifically configured to determine that the evaluation result of the to-be-tested speech is pronunciation correct when the KL divergence is greater than or equal to a preset threshold; and when the KL divergence is smaller than the preset threshold value, determining that the evaluation result of the voice to be tested is pronunciation error.

The voice evaluation device provided by the embodiment of the invention can execute the voice evaluation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Fig. 6 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes a processor 601, a memory 602, an input device 603, and an output device 604; the number of processors 601 in the device may be one or more, and one processor 601 is taken as an example in fig. 6; the processor 601, the memory 602, the input device 603 and the output device 604 of the apparatus may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.

The memory 602, as a computer-readable storage medium, is used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice evaluation method in the embodiment of the present invention (for example, the GOP sequence determining module 51, the KL divergence determining module 52, and the evaluation result determining module 53 of the voice evaluation apparatus shown in fig. 5). The processor 601 executes various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 602, that is, it implements the voice evaluation method described above.

The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 602 may further include memory located remotely from the processor 601, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Moreover, when the one or more programs included in the above-described apparatus are executed by the one or more processors 601, the programs perform the operations of the voice evaluation method described above.

The input device 603 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus. The output device 604 may include a display device such as a display screen.

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processing apparatus, implements the voice evaluation method provided in the embodiments of the present invention.

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the voice evaluation method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the voice evaluation device, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will appreciate that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions will now be apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A speech assessment method, comprising:

2. The method of claim 1, wherein determining a first GOP sequence for the speech to be tested and a second GOP sequence for the template speech comprises:

3. The method of claim 2, wherein determining a first GOP sequence corresponding to the speech to be tested based on the first phoneme state sequence and the first boundary information, and determining a second GOP sequence corresponding to the template speech based on the second phoneme state sequence and the second boundary information comprises:

determining a second GOP value corresponding to each non-silent phoneme in the second phoneme state sequence based on the second phoneme state sequence and the second boundary information;

4. The method of claim 2, wherein after determining a second GOP sequence corresponding to the template speech based on the second phoneme state sequence and the second boundary information, further comprising:

comparing, for each phoneme state, a GOP value in the first GOP sequence and a GOP value of the second GOP sequence;

5. The method according to claim 1, wherein determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence comprises:

6. The method according to claim 5, wherein the KL divergence formula is:

7. The method according to claim 1, wherein determining the evaluation result of the speech to be tested based on the KL divergence comprises: when the KL divergence is greater than or equal to a preset threshold value, determining that the evaluation result of the speech to be tested is correct pronunciation; and when the KL divergence is smaller than the preset threshold value, determining that the evaluation result of the speech to be tested is pronunciation error.

8. A speech evaluation device characterized by comprising:

the GOP sequence determining module is used for determining a first pronunciation goodness GOP sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice;

9. A voice evaluation apparatus characterized by comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs are executable by the one or more processors to cause the one or more processors to implement the speech assessment method of any of claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the speech assessment method according to any one of claims 1 to 7.