CN111816210B - Voice scoring method and device


Info

Publication number: CN111816210B (application CN202010583611.9A; earlier publication: CN111816210A)
Language: Chinese (zh)
Prior art keywords: phoneme, sounding, target, score, probability
Inventors: 胡月志, 杨占磊, 肖龙帅
Assignee (current and original): Huawei Technologies Co., Ltd.
Legal status: Active (application granted)


Classifications

    • G10L25/60 (Physics; Acoustics): speech or voice analysis techniques specially adapted for measuring the quality of voice signals
    • G06N3/08 (Physics; Computing): computing arrangements based on biological models; neural networks; learning methods
    • G10L25/30 (Physics; Acoustics): speech or voice analysis techniques characterised by the use of neural networks


Abstract

The application relates to the technical field of artificial intelligence and discloses a voice scoring method and device. The posterior probability that each vocalized phoneme included in a piece of audio is each first phoneme is acquired, where the first phonemes traverse all phonemes included in a phoneme set. Whether the vocalized phoneme deviates from the target phoneme is then determined from these posterior probabilities. When it does not deviate, a predetermined probability score of the vocalized phoneme is determined as the target probability score, and the target probability score is converted into a score in the M-scale. When it deviates from the target phoneme, the probability score is first reduced and then determined as the target probability score, from which the corresponding M-scale score is determined. By identifying whether the user's vocalized phonemes deviate from the target phonemes, it can be determined whether the user's pronunciation is accurate, which improves the accuracy of speech recognition. When the pronunciation is inaccurate, the user's score is reduced so that the user can clearly see how accurate the pronunciation is.

Description

Voice scoring method and device
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a voice scoring method and device.
Background
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that reacts in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.
Speech evaluation technology involves acoustics, linguistics, digital signal processing, computer science, and other fields. Its main tasks are automatically evaluating pronunciation level, correcting pronunciation errors, and locating and analyzing pronunciation defects, for example, evaluating an evaluator's spoken English.
During speech evaluation, the evaluator's pronunciation may or may not be accurate. How to determine whether the evaluator's pronunciation is accurate, and how to feed reasonable pronunciation feedback back to the evaluator, is the technical problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a voice scoring method and device, which are used to determine whether an evaluator's pronunciation is accurate and to feed reasonable pronunciation feedback back to the evaluator.
In a first aspect, a voice scoring method is provided. First, a piece of audio and the text information corresponding to the audio are input into a pre-trained acoustic model to obtain an acoustic measure of each vocalized phoneme included in the audio. The acoustic measure of a vocalized phoneme includes the posterior probability that the vocalized phoneme is each first phoneme; the first phonemes are the phonemes included in a phoneme set, the phoneme set being that of the language type corresponding to the text information. Second, whether the vocalized phoneme deviates from a target phoneme is determined from these posterior probabilities, the target phoneme being a phoneme obtained by decomposing the text information. Next, when the vocalized phoneme is determined not to deviate from the target phoneme, the predetermined probability score of the vocalized phoneme may be determined as its target probability score; when the vocalized phoneme is determined to deviate from the target phoneme, the predetermined probability score may be reduced and then determined as the target probability score. Further, the score in the M-scale corresponding to the target probability score of the vocalized phoneme may be determined according to a pre-trained scoring model. Specifically, the target probability score may be input into the pre-trained scoring model, and the scoring model outputs the corresponding score in the M-scale. M is generally 100 or 10, i.e., a percentile or ten-point scale.
Determining from the posterior probabilities whether the vocalized phoneme deviates from the target phoneme makes it possible to determine whether the user's pronunciation is accurate, which improves the accuracy of speech recognition. The pronunciation situation is fed back to the user through the score in the M-scale. When the vocalized phoneme deviates from the target phoneme, its probability score is reduced, and the corresponding score in the M-scale is reduced accordingly. In this way, by suppressing the scores of non-target phonemes, the final M-scale score reflects the user's pronunciation more reasonably.
In a possible implementation, when determining whether the vocalized phoneme deviates from the target phoneme according to the posterior probabilities, the maximum of the posterior probabilities of the vocalized phoneme being each first phoneme may be determined. When this maximum is larger than (alternatively, larger than or equal to) a set threshold and larger than the posterior probability that the vocalized phoneme is the target phoneme, the vocalized phoneme is determined to deviate from the target phoneme. When the maximum is not larger than (alternatively, smaller than) the set threshold, and/or the maximum is not larger than the posterior probability that the vocalized phoneme is the target phoneme, the vocalized phoneme is determined not to deviate from the target phoneme. That is, when the posterior probability that the vocalized phoneme is some non-target phoneme is large and exceeds the posterior probability that it is the target phoneme, the vocalized phoneme is considered to deviate from the target phoneme; otherwise it does not.
In one possible implementation, the acoustic measure of the vocalized phoneme further includes its utterance duration value. After the target probability score of the vocalized phoneme is determined, it can be determined whether the utterance duration value of the vocalized phoneme satisfies the Gaussian distribution rule corresponding to the target phoneme. When it does, the target probability score of the vocalized phoneme is kept unchanged; when it does not, the target probability score is reduced and updated. The score in the M-scale corresponding to the target probability score of the vocalized phoneme is then determined.
Starting from both the posterior probability and the utterance duration value of the vocalized phoneme, whether the user's pronunciation is accurate can be determined more precisely. When the utterance duration value does not satisfy the Gaussian distribution rule, the probability score is reduced again, and the corresponding score in the M-scale is reduced accordingly. In this way, by further suppressing the scores of non-target phonemes, the final M-scale score reflects the user's pronunciation more reasonably.
In one possible implementation, the Gaussian distribution rule is the 3σ rule of the normal distribution. That is, the 3σ rule is satisfied when the utterance duration value lies within (μ - 3σ, μ + 3σ); otherwise it is not satisfied. A corpus includes multiple audios, an audio includes multiple phonemes, and different audios may include the same phonemes. For each phoneme, the mean utterance duration μ and the standard deviation σ of the utterance durations are determined from the multiple utterance duration values corresponding to that phoneme, which fixes the 3σ rule.
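A minimal sketch of this duration check, assuming per-phoneme duration statistics have been collected from a corpus; the durations and function names below are illustrative only:

```python
import math

def duration_stats(durations):
    """Mean and standard deviation of one phoneme's utterance durations."""
    n = len(durations)
    mu = sum(durations) / n
    sigma = math.sqrt(sum((d - mu) ** 2 for d in durations) / n)
    return mu, sigma

def satisfies_3_sigma(duration, mu, sigma):
    """True if the observed duration falls inside (mu - 3*sigma, mu + 3*sigma)."""
    return mu - 3 * sigma < duration < mu + 3 * sigma

# Durations (seconds) of one phoneme gathered from many audios in the corpus.
mu, sigma = duration_stats([0.08, 0.10, 0.09, 0.11, 0.10, 0.12, 0.09])
print(satisfies_3_sigma(0.10, mu, sigma))  # True: keep the target probability score
print(satisfies_3_sigma(0.50, mu, sigma))  # False: reduce the target probability score
```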
In one possible implementation, the score in the M-scale corresponding to the target probability score of the vocalized phoneme is determined according to the formula Y = M/(1 + e^x), where Y is the score in the M-scale and x is determined from a first parameter that includes the target probability score of the vocalized phoneme. Further, the acoustic measure of the vocalized phoneme may also include the phoneme energy and/or the pitch frequency, and the first parameter may further include at least one of: the utterance duration value, the phoneme energy, and the pitch frequency. Determining the M-scale score from the multiple dimensions of target probability score, utterance duration value, phoneme energy, and pitch frequency makes the determined score more accurate.
In one possible implementation, x is determined from the first parameter according to the formula x = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b, where w1, w2, w3, w4 and b are all constants, x1 is the target probability score of the vocalized phoneme, x2 is its utterance duration value, x3 is its phoneme energy, and x4 is its pitch frequency.
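A sketch of this mapping, assuming the weights and bias have already been obtained by training the scoring model; the numeric values below are placeholders, not values from the patent:

```python
import math

def m_scale_score(prob_score, duration, energy, pitch, weights, bias, M=100):
    """Y = M / (1 + e^x) with x = w1*x1 + w2*x2 + w3*x3 + w4*x4 + b."""
    w1, w2, w3, w4 = weights
    x = w1 * prob_score + w2 * duration + w3 * energy + w4 * pitch + bias
    return M / (1.0 + math.exp(x))

# Placeholder weights; in practice w1..w4 and b come from the trained model.
y = m_scale_score(prob_score=-1.2, duration=0.10, energy=0.6, pitch=180.0,
                  weights=(1.0, 0.5, -0.2, -0.001), bias=0.3)
print(round(y, 1))  # a score between 0 and M
```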
In a possible implementation, the scoring model may be trained using the target probability scores of the phonemes of different categories of audio and the M-scale scores corresponding to those target probability scores, where different categories of audio correspond to different scoring intervals in the M-scale. The categories of audio may be, for example: native speech, non-native speech, and mislabeled speech. Training the scoring model on different categories of audio makes the M-scale scores it outputs more accurate.
In one possible implementation, the probability score of the vocalized phoneme may be determined as follows. The posterior probability that the vocalized phoneme is the target phoneme is divided by the prior probability of the target phoneme to obtain a first quotient. The maximum of the posterior probabilities of the vocalized phoneme being each first phoneme in the phoneme set is divided by the prior probability of the target phoneme to obtain a second quotient. The first quotient is then divided by the second quotient to obtain a third quotient, and the logarithm of the third quotient is taken. The absolute value of the logarithm is divided by the utterance duration value to obtain the probability score of the vocalized phoneme.
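A sketch of this computation under the reading above (both quotients use the target phoneme's prior, which therefore cancels); all probability values are illustrative:

```python
import math

def probability_score(post_target, prior_target, max_post, duration):
    """Probability score of one vocalized phoneme.
    post_target: posterior that the vocalized phoneme is the target phoneme
    prior_target: prior probability of the target phoneme
    max_post: maximum posterior over all first phonemes in the phoneme set
    duration: utterance duration value of the phoneme (seconds)"""
    first = post_target / prior_target   # first quotient
    second = max_post / prior_target     # second quotient
    third = first / second               # third quotient, at most 1
    # |log(third)| is 0 for a perfect match and grows as the utterance
    # drifts away from the target phoneme.
    return abs(math.log(third)) / duration

print(probability_score(post_target=0.8, prior_target=0.05,
                        max_post=0.8, duration=0.1))  # 0.0: exact match
```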
In a second aspect, a voice scoring method is provided. A piece of audio and the text information corresponding to the audio are input into a pre-trained acoustic model to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes the utterance duration value of the vocalized phoneme. Second, when the utterance duration value of the vocalized phoneme is determined to satisfy the Gaussian distribution rule corresponding to a target phoneme, the predetermined probability score of the vocalized phoneme is determined as its target probability score; when the utterance duration value is determined not to satisfy that rule, the predetermined probability score is reduced and then determined as the target probability score. Further, the score in the M-scale corresponding to the target probability score may be determined according to a pre-trained scoring model: the target probability score is input into the scoring model, which outputs the corresponding M-scale score. The M-scale score of the audio is then determined from the M-scale scores of the individual vocalized phonemes. M is generally 100 or 10, i.e., a percentile or ten-point scale.
Determining whether the utterance duration value of a vocalized phoneme in the audio satisfies the Gaussian distribution rule of the target phoneme's utterance duration makes it possible to determine whether the user's pronunciation is accurate, improving the accuracy of speech recognition. The pronunciation situation is fed back to the user through the M-scale score. When the utterance duration value of the vocalized phoneme does not satisfy the rule, the probability score of the vocalized phoneme is reduced, and the corresponding M-scale score is reduced accordingly. In this way, by suppressing the scores of non-target phonemes, the final M-scale score reflects the user's pronunciation more reasonably.
In a possible implementation, the acoustic measure includes the posterior probability that the vocalized phoneme is each first phoneme, the first phonemes being the phonemes included in a phoneme set, the phoneme set being that of the language type corresponding to the text information. After the target probability score of the vocalized phoneme is determined, whether the vocalized phoneme deviates from a target phoneme can be determined from these posterior probabilities, the target phoneme being a phoneme obtained by decomposing the text information. When the vocalized phoneme is determined not to deviate from the target phoneme, its target probability score is kept unchanged; when it is determined to deviate, the target probability score is reduced and updated. The M-scale score corresponding to the target probability score of the vocalized phoneme is then determined.
Starting from both the posterior probability and the utterance duration value of the vocalized phoneme, whether the user's pronunciation is accurate can be determined more precisely. When the vocalized phoneme deviates from the target phoneme, its probability score is reduced again, so the corresponding M-scale score is reduced accordingly. In this way, by further suppressing the scores of non-target phonemes, the final M-scale score reflects the user's pronunciation more reasonably.
The second aspect differs from the first aspect as follows. In the first aspect, the target probability score of a vocalized phoneme is determined by first deciding from the posterior probabilities whether the vocalized phoneme deviates from the target phoneme; the resulting target probability score is then updated according to whether the utterance duration value satisfies the Gaussian distribution rule. In the second aspect, the target probability score is first determined according to whether the utterance duration value satisfies the Gaussian distribution rule, and is then updated by deciding from the posterior probabilities whether the vocalized phoneme deviates from the target phoneme.
The other possible implementations of the second aspect are the same as those of the first aspect, with the same technical effects; repeated description is omitted.
In a third aspect, a voice scoring device is provided, the device having the functionality of the first aspect and of any possible implementation of the first aspect. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more functional modules corresponding to the above functions.
In one possible implementation, the apparatus includes:
an acquisition module, configured to input a piece of audio and the text information corresponding to the audio into a pre-trained acoustic model to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes the posterior probability that the vocalized phoneme is each first phoneme, the first phonemes being the phonemes included in a phoneme set, the phoneme set being that of the language type corresponding to the text information;
a verification module, configured to determine, from the posterior probabilities of the vocalized phoneme being each first phoneme, whether the vocalized phoneme deviates from a target phoneme, the target phoneme being a phoneme obtained by decomposing the text information;
a probability scoring module, configured to determine the predetermined probability score of the vocalized phoneme as its target probability score when the vocalized phoneme is determined not to deviate from the target phoneme, and to reduce the predetermined probability score and then determine it as the target probability score when the vocalized phoneme is determined to deviate from the target phoneme;
and a scoring module, configured to determine, according to a pre-trained scoring model, the score in the M-scale corresponding to the target probability score of the vocalized phoneme.
In a possible implementation, the verification module, when determining whether the vocalized phoneme deviates from the target phoneme according to the posterior probabilities of the vocalized phoneme being each first phoneme, is specifically configured to:
determine that the vocalized phoneme deviates from the target phoneme when the maximum of the posterior probabilities of the vocalized phoneme being each first phoneme is greater than a set threshold and greater than the posterior probability that the vocalized phoneme is the target phoneme; and determine that the vocalized phoneme does not deviate from the target phoneme when that maximum is not greater than the set threshold and/or is not greater than the posterior probability that the vocalized phoneme is the target phoneme.
In one possible implementation, the acoustic measure further includes the utterance duration value of the vocalized phoneme;
the probability scoring module is further configured, after determining the target probability score of the vocalized phoneme and before the M-scale score corresponding to it is determined, to: keep the target probability score unchanged when the utterance duration value of the vocalized phoneme is determined to satisfy the Gaussian distribution rule corresponding to the target phoneme; or reduce and update the target probability score when the utterance duration value is determined not to satisfy that rule.
In one possible implementation, the Gaussian distribution rule is the 3σ rule.
In a possible implementation, the scoring module, when determining the M-scale score corresponding to the target probability score of the vocalized phoneme according to the pre-trained scoring model, is specifically configured to:
determine the M-scale score according to the formula Y = M/(1 + e^x), where Y is the score in the M-scale and x is determined from a first parameter that includes the target probability score of the vocalized phoneme.
In one possible implementation, the acoustic measure further includes the phoneme energy and/or the pitch frequency;
the first parameter further includes at least one of: the utterance duration value, the phoneme energy, and the pitch frequency.
In one possible implementation, the scoring module is further configured to determine x according to the formula x = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b, where w1, w2, w3, w4 and b are constants, x1 is the target probability score of the vocalized phoneme, x2 is its utterance duration value, x3 is its phoneme energy, and x4 is its pitch frequency.
In a possible implementation, the scoring module is further configured to train the scoring model using the target probability scores of the phonemes of different categories of audio and the M-scale scores corresponding to those target probability scores, where different categories of audio correspond to different scoring intervals in the M-scale.
In a fourth aspect, a voice scoring device is provided, the device having the functionality of the second aspect and of any possible implementation of the second aspect. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more functional modules corresponding to the functions described above.
In one possible implementation, the apparatus includes:
an acquisition module, configured to input a piece of audio and the text information corresponding to the audio into a pre-trained acoustic model to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes the utterance duration value of the vocalized phoneme;
a probability scoring module, configured to determine the predetermined probability score of the vocalized phoneme as its target probability score when the utterance duration value of the vocalized phoneme is determined to satisfy the Gaussian distribution rule corresponding to the target phoneme, and to reduce the predetermined probability score and then determine it as the target probability score when the utterance duration value is determined not to satisfy that rule;
and a scoring module, configured to determine, according to a pre-trained scoring model, the score in the M-scale corresponding to the target probability score of the vocalized phoneme.
In a possible implementation, the acoustic measure includes the posterior probability that the vocalized phoneme is each first phoneme, the first phonemes being the phonemes included in a phoneme set, the phoneme set being that of the language type corresponding to the text information;
the device further comprises:
a verification module, configured to determine, from the posterior probabilities of the vocalized phoneme being each first phoneme, whether the vocalized phoneme deviates from a target phoneme, the target phoneme being obtained by decomposing the text information;
the probability scoring module is further configured to keep the target probability score of the vocalized phoneme unchanged when the vocalized phoneme is determined not to deviate from the target phoneme, and to reduce and update the target probability score when the vocalized phoneme is determined to deviate from the target phoneme.
In a fifth aspect, a computer program product is provided, comprising computer program code which, when run on a computer, causes the computer to perform a method according to the first aspect or any possible implementation of the first aspect, or a method according to the second aspect or any possible implementation of the second aspect.
In a sixth aspect, the present application provides a voice scoring device comprising a processor and a memory that are electrically coupled; the memory is configured to store computer program instructions; the processor is configured to execute some or all of the computer program instructions in the memory and, when they are executed, to implement the functions of the method according to the first aspect or any possible implementation of the first aspect, or of the method according to the second aspect or any possible implementation of the second aspect.
In one possible design, the apparatus may further include a transceiver configured to transmit a signal processed by the processor or receive a signal input to the processor.
Drawings
Fig. 1 is a schematic diagram of an evaluation system architecture provided in an embodiment of the present application;
Fig. 2a is a schematic diagram of an evaluation process provided in an embodiment of the present application;
Fig. 2b is a schematic diagram of a process for training an acoustic model provided in an embodiment of the present application;
Fig. 2c is a schematic diagram of text alignment provided in an embodiment of the present application;
Fig. 3a is a schematic diagram of the audio distribution of native English provided in an embodiment of the present application;
Fig. 3b is a schematic diagram of the audio distribution of English spoken by Chinese speakers provided in an embodiment of the present application;
Fig. 3c is a schematic diagram of the audio distribution of mislabeled English provided in an embodiment of the present application;
Fig. 3d is a schematic diagram of a score distribution provided in an embodiment of the present application;
Fig. 4 is a diagram illustrating an example of a voice scoring process provided in an embodiment of the present application;
Fig. 5 is a block diagram of a voice scoring device provided in an embodiment of the present application;
Fig. 6 is a block diagram of a voice scoring device provided in an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
To facilitate understanding of the embodiments of the present application, some terms used in the embodiments are explained below for those skilled in the art.
1) Speech evaluation (GOP, Goodness of Pronunciation): judging the quality of spoken pronunciation. When an evaluator speaks, the evaluation system gives a score in the M-scale; generally, the higher the score, the better the pronunciation.
2) Speech recognition (Automatic Speech Recognition): the process of converting speech into text; the input is audio and the output is text information.
3) Phoneme (phone): the smallest phonetic unit, divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable; one action forms one phoneme. Phonemes fall into two main categories, vowels and consonants. For example, the Chinese syllable a has only one phoneme, ai has two phonemes, dai has three phonemes, and so on.
4) Posterior probability of a phoneme: describes the probability that the user's utterance (speech) is a certain phoneme; the higher the probability value, the more the utterance sounds like that phoneme.
5) When a sounding body produces sound by vibration, the sound can generally be decomposed into many simple sine waves; that is, all natural sounds are basically composed of sine waves of different frequencies. The sine wave with the lowest frequency is the fundamental tone, and its frequency, called the fundamental frequency, is denoted F0; the other, higher-frequency sine waves are overtones. The pitch (fundamental) frequency is an acoustic feature.
6) Mel-frequency cepstral coefficients (MFCC): besides the speed of vibration (the fundamental frequency F0), the envelope of the waveform is an important measurable characteristic of sound, and MFCCs are the features that represent it. MFCCs are an acoustic feature.
7) Gaussian mixture model plus hidden Markov model (GMM-HMM): in acoustic model training, this model aligns audio with text. Alignment means that the audio is divided into slices by word or phoneme, each slice representing a certain word or phoneme. The labeled data {acoustic features, corresponding text content} are then input into a neural network for classification training to obtain the acoustic model.
8) Logistic regression: a classification model that maps its input into the interval 0-1. In this application, the speech evaluation system maps the user's pronunciation to a score from 0 to M. M is generally 100 or 10, i.e., a percentile or ten-point scale.
9) Acoustic measures: indexes for measuring speech, such as the posterior probability of a phoneme, the utterance duration value of a phoneme, the phoneme energy, and the pitch frequency of a phoneme.
"and/or" in the present application, describing an association relationship of associated objects, means that there may be three relationships, for example, a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The plural in the present application means two or more.
In the description of the present application, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance, nor order.
In addition, in the embodiments of the present application, the word "exemplary" is used to mean serving as an example, instance, or illustration. Any embodiment or implementation described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or implementations. Rather, use of the word is intended to present concepts in a concrete fashion.
To facilitate understanding of the embodiments of the present application, an application scenario is introduced next. The service scenario described in the embodiments is intended to explain the technical solution more clearly and does not limit the technical solution provided in the embodiments.
Fig. 1 is a schematic diagram of an evaluation system to which the present application applies. In the present application, the execution device for speech evaluation performs the following steps: a piece of audio and the text information corresponding to the audio are input into a pre-trained acoustic model to obtain an acoustic measure of each vocalized phoneme included in the audio. The acoustic measure includes the posterior probability that the vocalized phoneme is each first phoneme, where the first phonemes traverse all phonemes included in the phoneme set; it may further include the utterance duration value, the pitch frequency, and the phoneme energy of each vocalized phoneme. The target probability score of the vocalized phoneme is then obtained from the posterior probabilities and the utterance duration value. The score in the M-scale corresponding to the target probability score is then determined according to a pre-trained scoring model. M is generally 100 or 10, i.e., a percentile or ten-point scale.
In one implementation, the execution device for speech evaluation may be implemented by one or more servers 103. Optionally, the server 103 may also cooperate with other computing devices, such as data storage, routers, and load balancers, to complete the evaluation process. The server 103 may be deployed at one physical site or distributed over multiple physical sites. The server 103 may use data in the data storage system, or call program code in the data storage system, to implement the speech evaluation process that the execution device needs to perform.
The user may operate respective user devices (e.g., local device 101 and local device 102) to interact with the server 103. For example, the captured audio and the corresponding text information are sent to the server 103 so that the server 103 scores the audio. For another example, the scoring result of the audio is obtained from the server 103 and fed back to the user.
The user device may obtain audio in the following ways. For example, a display screen and a microphone may be provided in the user equipment: a piece of text information, such as a passage of English, is displayed on the screen, the user reads it aloud, and the microphone in the user equipment captures a piece of audio. Alternatively, the user equipment may be provided with a loudspeaker: the equipment plays a piece of audio through the loudspeaker, and the microphone captures a piece of audio as the user repeats the played content. Still alternatively, the user equipment may receive a piece of audio transmitted by another device; for example, the local device 101 receives a piece of audio transmitted by the local device 102.
The user equipment can feed back the scoring result in the following way: the scoring result may be displayed to the user, for example, via a display screen, or played to the user via a speaker.
Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart speaker, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, robot, and so forth.
Each user's local device may interact with the server via a communication network of any communication mechanism/communication standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof.
It should be noted that all the functions of the server described above can also be implemented by the local device, that is, the execution device is a local device. For example, the local device 101 implements the functions of the server 103 and provides services to its own user, or provides services to a user of the local device 102.
In a real speech evaluation scenario, the user's pronunciation may be accurate or inaccurate, there may be human-voice interference around the user, or the user may cough. In such scenarios, the present application determines whether a vocalized phoneme in the audio deviates from the target phoneme from several angles, such as the posterior probability of the phoneme and the distribution of phoneme utterance durations, and suppresses the scores of non-target phonemes, so that the M-scale score of the vocalized phoneme is more reasonable and noise speech is effectively suppressed. M is generally 100 or 10, i.e., a percentile or ten-point scale.
Next, as shown in Fig. 2a, the speech evaluation process is described. The whole process is divided into 3 stages. Stage one: training the acoustic model. Stage two: training the scoring model. Stage three: online evaluation.
First, the above-mentioned stage three, the online evaluation stage, is introduced.
The evaluator speaks a piece of audio; the audio and its corresponding text information are input into the trained acoustic model, which provides various acoustic measures for the online evaluation stage, such as the posterior probability of each phoneme, the utterance duration value of each phoneme, the pitch frequency, and the phoneme energy. The online evaluation stage also determines the probability score of each phoneme from the posterior probability and/or the utterance duration value. The trained scoring model then converts the probability score into a score in the M-scale, i.e., the GOP score, such as 80 or 90 points on a percentile scale.
Stage one, the training phase of the acoustic model, is described next.
First, a training corpus is provided, which includes a large amount of spoken audio, for example English speech: English spoken by Chinese speakers from multiple provinces, or English spoken by foreigners. The corpus consists of text information and the audio corresponding to it. The audio is generally consistent with the content of the text information, i.e., audio captured when the user's pronunciation is accurate.
As shown in Fig. 2b, various acoustic features, such as MFCC and F0, are first extracted from the audio in the training corpus. Acoustic features may be extracted for each speech frame, where one speech frame is typically 20-30 ms. The training-related acoustic features may be extracted from the training corpus using the Kaldi training platform.
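The patent names Kaldi for feature extraction; purely as an illustrative stand-in, the same per-frame features can be sketched with the librosa Python library (the file name, frame sizes, and F0 search range below are assumptions, not values from the patent):

```python
import librosa

# Hypothetical recording; any 16 kHz mono utterance works.
y, sr = librosa.load("utterance.wav", sr=16000)

# 25 ms windows with a 10 ms hop, i.e. one feature vector per speech frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=400, hop_length=160)

# Per-frame fundamental frequency (F0) via the YIN estimator.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                 frame_length=1024, hop_length=160)
```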
Then a GMM-HMM model is trained from the audio and the corresponding text information. The GMM-HMM model aligns the acoustic features with the text information, i.e., determines which text content corresponds to which portion of the audio.
Then a neural network model is trained using the extracted acoustic features and the text labels obtained for them from the GMM-HMM model, yielding the acoustic model. The acoustic model can determine the position in the audio to which each vocalized phoneme belongs, i.e., the start and end positions of a vocalized phoneme in the audio; equivalently, it determines which speech frames of the audio belong to which phoneme, for example, that frames 1 to 10 belong to the phoneme G. As shown in Fig. 2c, the text information corresponding to the audio is "governments have made policy decisions", and the audio segment includes the phonemes "G AH1 V ER0 M AH0 N T HH AE1 V M EY1 D P AA1 L AH0 S IY0 D IH0 S IH1 ZH AH0 N Z". The acoustic model can output a series of acoustic measures of the vocalized phonemes, such as the posterior probability, the utterance duration value, the phoneme energy, and the pitch frequency. The neural network (DNN) model used in training the acoustic model can be the nnet3 TDNN-F model in Kaldi, which has good speech modeling capability.
The general process of training the acoustic model is described above; for the specific process, reference may be made to existing acoustic model training procedures, and details are not repeated.
The above-mentioned stage two, the training phase of the scoring model, is described next.
A scoring corpus is provided, which stores, for each vocalized phoneme of the audio, the score in the M-scale together with parameters under that score such as the probability score (note that the probability score is not the posterior probability output by the acoustic model), the utterance duration value, the pitch frequency, and the phoneme energy.
The scoring model is trained using multiple sets of data (score in the M-scale, probability score, utterance duration value, pitch frequency, phoneme energy, etc.) as its input. The trained scoring model can then give the M-scale score of a vocalized phoneme from parameters such as its probability score, utterance duration value, pitch frequency, and phoneme energy. When M is 100 (a percentile scale), the score may be, for example, 80 or 90.
In an example, when the scoring model is trained, it may be trained using the target probability scores of the phonemes of different categories of audio and the M-scale scores corresponding to those target probability scores, where different categories of audio correspond to different scoring intervals in the M-scale. The categories of audio may be, for example, native speech, non-native speech, and mislabeled speech. For English, this may include three categories: English spoken by native English speakers (native English), English spoken by Chinese speakers, and English with disordered labels (text and speech do not correspond). Training the scoring model on different categories of audio makes the scores it outputs more accurate.
In model training, audio of multiple pronunciation levels may be selected, where a pronunciation level refers to a grade. For example, when training with audio of 3 levels, the 3 levels may be audio of a higher grade, audio of an ordinary grade, and audio of a poorer grade.
The process of training the scoring model is described below taking native English, English spoken by Chinese speakers, and mislabeled English as examples; these three categories of audio simulate the 3 levels of audio.
First, a score interval in the M-scale is pre-assigned to each of the three categories of speech (native English, English spoken by Chinese speakers, and mislabeled English). The score of native English is usually high; for example, on a percentile scale its interval is 90-100 or 85-100. The score of English spoken by Chinese speakers is generally ordinary, for example 45-95 or 50-90 on a percentile scale. The score of mislabeled English is usually very low, for example 0-50 or 0-45 on a percentile scale. Fig. 3a shows the audio distribution of native English; Fig. 3b shows the audio distribution of English spoken by Chinese speakers; Fig. 3c shows the audio distribution of mislabeled English. The horizontal axis in Figs. 3a, 3b and 3c is the target probability score of the vocalized phoneme; the closer to the origin 0, the higher the posterior probability of the vocalized phoneme, the more standard the pronunciation, and the higher the corresponding pre-assigned score. As the figures show, the three categories of audio differ markedly in the distribution of probability scores: the probability scores of native English lie basically near 0, indicating good pronunciation; those of English spoken by Chinese speakers lie roughly between 0 and -8; those of mislabeled audio lie between 0 and -20, with large variance between audios.
Then the probability scores of the vocalized phonemes are evenly distributed over the pre-assigned score interval in the M-scale, establishing a direct mapping between probability scores and M-scale scores. The same is done for each category of audio.
For example, for mislabeled audio, different audios may include the same phonemes, so for each phoneme in the phoneme set multiple probability scores may be obtained. The probability scores of a phoneme are sorted, for example from small to large, and the sorted probability scores are evenly distributed over the corresponding score interval. For example, the percentile interval 0-50 of the mislabeled audio is divided into N equal parts, and the phoneme's probability scores are likewise divided into N equal parts. Within each corresponding part, the probability scores are mapped to percentile scores. As shown in Fig. 3d, the probability scores are divided into 4 parts, i.e., N = 4, and the corresponding score interval (0-50) is also divided into 4 parts: 0-12.5, 12.5-25, 25-37.5, and 37.5-50. The closer a probability score is to 0, the more accurate the pronunciation and the higher the corresponding percentile score.
Within each part, the probability scores are linearly mapped between the lower and upper limits of that 1/N of the scoring interval, giving the M-scale score corresponding to each probability score and thus forming data tuples such as {phoneme, probability score, score in the M-scale}. Further, tuples {phoneme, probability score, utterance duration value, phoneme energy, pitch frequency, score in the M-scale} may be obtained. When there are multiple categories of audio, each phoneme in the phoneme set may yield multiple such tuples for each category of audio.
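A minimal sketch of this sort-split-and-map step, assuming N equal parts and a pre-assigned interval such as 0-50; the sample probability scores are illustrative:

```python
import numpy as np

def map_scores_to_interval(prob_scores, lo, hi, n_parts=4):
    """Sort one phoneme's probability scores, split them into n_parts equal
    portions, and linearly map each portion onto its 1/n_parts slice of the
    pre-assigned score interval [lo, hi] (e.g. 0-50 for mislabeled audio)."""
    scores = np.sort(np.asarray(prob_scores, dtype=float))  # ascending: worst first
    step = (hi - lo) / n_parts
    mapped = []
    for i, part in enumerate(np.array_split(scores, n_parts)):
        p_lo, p_hi = lo + i * step, lo + (i + 1) * step     # slice of [lo, hi]
        span = (part.max() - part.min()) or 1.0             # guard identical scores
        for s in part:
            mapped.append((s, p_lo + (s - part.min()) / span * (p_hi - p_lo)))
    return mapped  # (probability score, corresponding M-scale score) pairs

pairs = map_scores_to_interval(
    [-19.5, -15.2, -12.0, -9.8, -6.1, -4.4, -2.0, -0.3], lo=0, hi=50)
```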
Next, for each phoneme in the phoneme set, a scoring model corresponding to that phoneme may be trained from the data tuples corresponding to it (including data extracted from audio of all categories). One phoneme trains one scoring model.
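One way to fit such a per-phoneme model is to estimate w1..w4 and b of Y = M/(1 + e^x) directly by gradient descent on squared error. This is only a sketch under that assumption (the patent does not specify the training algorithm), and in practice the features would be normalized first:

```python
import numpy as np

def train_scoring_model(features, targets, M=100, lr=1e-3, epochs=5000):
    """Fit w and b of Y = M / (1 + e^x), x = w.f + b, for one phoneme.
    features: (n, 4) array of (probability score, duration, energy, pitch)
    targets:  (n,) array of M-scale scores from the scoring corpus."""
    f = np.asarray(features, dtype=float)
    t = np.asarray(targets, dtype=float)
    w, b = np.zeros(f.shape[1]), 0.0
    for _ in range(epochs):
        x = f @ w + b
        y = M / (1.0 + np.exp(x))
        # dY/dx = -y * (1 - y / M); chain rule through the squared error
        grad_x = 2.0 * (y - t) * (-y * (1.0 - y / M)) / len(t)
        w -= lr * (f.T @ grad_x)
        b -= lr * grad_x.sum()
    return w, b
```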
The core content of the scheme has been described above; the scheme is now described in detail with reference to the drawings. Features or content marked with broken lines in the drawings can be understood as optional operations or optional structures of the embodiments of the present application.
As shown in Fig. 4, an example of the voice scoring process is provided. The executing device of the process may be the user equipment (local device 101, local device 102) or the server 103 mentioned in Fig. 1.
Step 401: a piece of audio and the text information corresponding to the audio are input into a pre-trained acoustic model to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes the posterior probability that the vocalized phoneme is each first phoneme; the first phonemes are the phonemes included in a phoneme set, the phoneme set being that of the language type corresponding to the text information. The acoustic measure may further include the utterance duration value, the pitch frequency, and the phoneme energy of the vocalized phoneme. The language type may be, for example, Chinese, English, German, and so on.
When a user reads Chinese aloud, a piece of audio is generated whose corresponding text information is Chinese characters. When a user reads English aloud, a piece of audio is generated whose corresponding text information may be, for example, English words.
For example, as shown in Fig. 2c, the text information corresponding to the audio is "governments have made policy decisions", and the language type corresponding to that text information is English. As another example, when the user reads aloud the Chinese sentence "what is the weather like today" (今天天气怎么样), a piece of audio is generated whose corresponding text information is those Chinese characters. The text information has correct phonemes corresponding to it, which may be called target phonemes; that is, a target phoneme is a phoneme obtained by decomposing the text information. The target phonemes corresponding to the text information may be determined, for example, through a pronunciation dictionary in which the target phonemes of each text are recorded. For example, the target phonemes corresponding to the text "governments have made policy decisions" are "G AH1 V ER0 M AH0 N T HH AE1 V M EY1 D P AA1 L AH0 S IY0 D IH0 S IH1 ZH AH0 N Z". As another example, the Chinese text above corresponds to the target phonemes "j", "in", "t", "i", "an", and so on. The phonemes included in a piece of captured audio are called vocalized phonemes.
The training process of the acoustic model has been described above and is not repeated here. After a piece of audio is captured, the audio and its corresponding text information can be input into the pre-trained acoustic model to obtain the acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes the posterior probability that the vocalized phoneme is each first phoneme; the first phonemes traverse all phonemes included in the phoneme set.
Taking Chinese as the language type of the text information, the first phonemes included in the phoneme set are: the 21 initials b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s; and the 24 finals a, o, e, i, u, v, ai, ei, ui, ao, ou, iu, ie, ve, er, an, en, in, un, vn, ang, eng, ing, ong.
For example, for the vocalized phoneme whose target phoneme is "j", the posterior probabilities of it being each first phoneme include: the posterior probability of "b", of "p", of "m", of "f", of "d", of "t", of "n", and so on, traversing every phoneme included in the phoneme set.
For example, the posterior probability of "b" is 20%, of "p" 30%, of "m" 40%, of "f" 35%, and so on.
Optionally, step 402: whether the vocalized phoneme deviates from a target phoneme is determined from the posterior probabilities that the vocalized phoneme is each first phoneme, the target phoneme being a phoneme obtained by decomposing the text information. Then step 402a or step 402b is executed.
Step 402a: when the vocalized phoneme is determined not to deviate from the target phoneme, the predetermined probability score of the vocalized phoneme is determined as its target probability score.
Step 402b: when the vocalized phoneme is determined to deviate from the target phoneme, the predetermined probability score of the vocalized phoneme is reduced and then determined as its target probability score.
When the sounding phoneme does not deviate from the target phoneme, it may be considered a correct phoneme; when it deviates from the target phoneme, it may be considered noise, and its score in the M system is reduced, so as to achieve the effect of suppressing non-target speech. M is generally 100 or 10, i.e., a 100-point (percentile) or 10-point scale.
In one example, a judgment criterion based on the posterior probability is set: when P(x|Ot) - P(q|Ot) < 0, the sounding phoneme deviates from the target phoneme. Here, x is a target phoneme included in the text information, and P(x|Ot) is the posterior probability that the sounding phoneme is the target phoneme; q is the first phoneme corresponding to the maximum posterior probability, and P(q|Ot) is the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme. If P(x|Ot) - P(q|Ot) < 0, the posterior probability that the sounding phoneme is the target phoneme is smaller than its posterior probability for some phoneme in the phoneme set other than the target phoneme, and the sounding phoneme may be considered to deviate from the target phoneme. In other words, the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme is determined; when the maximum value is larger than the posterior probability that the sounding phoneme is the target phoneme, the sounding phoneme is determined to deviate from the target phoneme. When the maximum value is equal to that posterior probability (it cannot be smaller), the sounding phoneme is determined not to deviate from the target phoneme.
For example, suppose the target phoneme corresponding to a sounding phoneme is "l", and the acoustic model outputs a posterior probability of 80% for "l", 65% for "n", 20% for "b", 30% for "p", 40% for "m", 35% for "f", … . The maximum value among these posterior probabilities is 80%, so the acoustic model regards "l" as the user's most probable pronunciation. Since this maximum of 80% is equal to the posterior probability (80%) that the sounding phoneme is the target phoneme, the sounding phoneme does not deviate from the target phoneme "l".
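As a minimal sketch (in Python, using the illustrative probabilities above), this first criterion can be written as:

```python
def deviates(posteriors, target):
    """First criterion: the phoneme deviates from the target when
    P(x|Ot) - P(q|Ot) < 0, i.e. the target's posterior is strictly
    smaller than the maximum posterior over the phoneme set."""
    return posteriors[target] - max(posteriors.values()) < 0

# Posteriors from the example above: the maximum (0.80, for "l") equals
# the target's own posterior, so the phoneme does not deviate.
posteriors = {"l": 0.80, "n": 0.65, "b": 0.20, "p": 0.30, "m": 0.40, "f": 0.35}
print(deviates(posteriors, "l"))  # False
```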
In another example, a posterior-probability judgment condition for deviation from the target phoneme is set: P(x|Ot) - P(q|Ot) < 0 && exp(P(q|Ot)) > β. Here, x is a target phoneme included in the text information, and P(x|Ot) is the posterior probability that the sounding phoneme is the target phoneme; q is the first phoneme corresponding to the maximum posterior probability, and P(q|Ot) is the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme. P(x|Ot) - P(q|Ot) < 0 means that the posterior probability that the sounding phoneme is the target phoneme is smaller than its posterior probability for some phoneme other than the target phoneme, and exp(P(q|Ot)) > β means that the sounding phoneme has a high probability of being the phoneme q. That is, when the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme is greater than (alternatively, greater than or equal to) a set threshold β, and the maximum value is greater than the posterior probability that the sounding phoneme is the target phoneme, the sounding phoneme is determined to deviate from the target phoneme. When the maximum value is not greater than the set threshold and/or the maximum value is not greater than the posterior probability that the sounding phoneme is the target phoneme, the sounding phoneme is determined not to deviate from the target phoneme. The set threshold β may be, for example, 85%, 90%, or 83%. When the sounding phoneme deviates from the target phoneme, the acoustic model considers the user's pronunciation inaccurate.
For example, suppose the threshold β is set to 82%, the target phoneme corresponding to a sounding phoneme is "l", and the acoustic model outputs a posterior probability of 80% for "l", 85% for "n", 20% for "b", 30% for "p", 40% for "m", 35% for "f", … . The maximum value among these posterior probabilities is 85%, so the acoustic model regards "n" as the user's most probable pronunciation. The maximum of 85% is greater than the set threshold β of 82% and greater than the posterior probability (80%) that the sounding phoneme is the target phoneme; therefore, the sounding phoneme deviates from the target phoneme "l".
For another example, suppose the threshold β is set to 82%, the target phoneme corresponding to a sounding phoneme is "l", and the acoustic model outputs a posterior probability of 80% for "l", 65% for "n", 20% for "b", 30% for "p", 40% for "m", 35% for "f", … . The maximum value among these posterior probabilities is 80%, so the acoustic model regards "l" as the user's most probable pronunciation; although this maximum of 80% is equal to the posterior probability (80%) that the sounding phoneme is the target phoneme, the maximum of 80% is smaller than the set threshold β of 82%, and in this case the sounding phoneme is treated as deviating from the target phoneme "l".
If the threshold β is instead set to 80% or 75%, the maximum of 80% is no longer smaller than the set threshold β, and it is equal to the posterior probability (80%) that the sounding phoneme is the target phoneme; in this case, the sounding phoneme does not deviate from the target phoneme "l".
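A sketch of the threshold-gated criterion as formally stated above (deviation requires the maximum posterior to exceed both the threshold β and the target's posterior) follows. The description writes exp(P(q|Ot)) > β, which suggests P(q|Ot) may be a log-posterior; plain probabilities are assumed here, and since the second worked example above also treats a maximum below β as a deviation, the exact gating around β should be taken as an open design choice:

```python
def deviates_with_threshold(posteriors, target, beta=0.82):
    """Threshold-gated criterion: deviation requires the maximum posterior
    to be greater than beta AND greater than the target's posterior."""
    p_max = max(posteriors.values())
    return p_max > beta and p_max > posteriors[target]

# First example above (beta = 82%): 0.85 > 0.82 and 0.85 > 0.80 -> deviates.
posteriors = {"l": 0.80, "n": 0.85, "b": 0.20, "p": 0.30, "m": 0.40, "f": 0.35}
print(deviates_with_threshold(posteriors, "l"))  # True
```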
Optionally, step 403: determine whether the sounding duration value of the sounding phoneme satisfies the Gaussian distribution principle corresponding to the target phoneme. Then step 403a or step 403b is performed.
Step 403 a: and when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to the target phoneme, determining the predetermined probability score of the sounding phoneme as the target probability score of the sounding phoneme.
Step 403 b: and when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing the probability score of the sounding phoneme which is determined in advance, and determining the probability score as the target probability score of the sounding phoneme.
When the Gaussian distribution principle is satisfied, the sounding phoneme can be considered a correct phoneme; when it is not satisfied, the sounding phoneme can be considered noise, and its score in the M system is reduced, so as to achieve the effect of suppressing non-target speech.
The Gaussian distribution principle may be the 3σ, 2σ, or 1σ principle of the normal distribution. Taking the 3σ principle as an example: when the sounding duration value of a sounding phoneme lies within (μ - 3σ, μ + 3σ), the sounding phoneme satisfies the 3σ principle; otherwise, it does not. The Gaussian distribution principle thus defines a value range, which may include both the upper and lower limits, exclude both, include only the upper limit, or include only the lower limit.
For example, if the mean duration value of the target phoneme "n" is 0.101 s and the standard deviation of the duration is 0.002 s, the 3σ value range of the phoneme "n" is (0.095, 0.107). If the sounding duration value of the target phoneme "n" in the audio is 0.103 s, the sounding duration value lies within the range (0.095, 0.107) and satisfies the Gaussian distribution principle corresponding to the target phoneme. If the sounding duration value of the target phoneme "n" in the audio is 0.109 s, it lies outside the range (0.095, 0.107) and does not satisfy the Gaussian distribution principle corresponding to the target phoneme.
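A minimal sketch of this check, assuming an open interval since the description leaves the bounds unspecified:

```python
def satisfies_gaussian_principle(duration, mu, sigma, k=3):
    """k-sigma check: True when the sounding duration lies in
    (mu - k*sigma, mu + k*sigma)."""
    return mu - k * sigma < duration < mu + k * sigma

# Values from the example above: mu = 0.101 s, sigma = 0.002 s -> (0.095, 0.107)
print(satisfies_gaussian_principle(0.103, 0.101, 0.002))  # True
print(satisfies_gaussian_principle(0.109, 0.101, 0.002))  # False
```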
The following describes how the Gaussian distribution principle for the duration value of a target phoneme is determined. A corpus includes multiple audios; the text information corresponding to one audio may correspond to multiple target phonemes, and different audios may correspond to the same target phoneme. For example, the text information of both "How is the weather today" and "Will it rain today" includes "today". Whether a certain sounding phoneme deviates from the target phoneme can be determined through the posterior probabilities output by the trained acoustic model. If a certain sounding phoneme does not deviate from the target phoneme, its sounding duration value can be used as one piece of reference data for determining the Gaussian distribution principle of the duration value of the target phoneme. In this way, multiple reference duration values can be obtained for one target phoneme, from which the mean duration value μ and the duration standard deviation σ of the target phoneme can be determined, and the 3σ principle is then established.
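A sketch of this parameter estimation; the reference durations below are illustrative, and the use of the population standard deviation is an assumption:

```python
import statistics

def gaussian_params(reference_durations):
    """Estimate the mean duration mu and standard deviation sigma of a
    target phoneme from the durations of non-deviating sounding phonemes."""
    mu = statistics.mean(reference_durations)
    sigma = statistics.pstdev(reference_durations)  # population std. deviation
    return mu, sigma

# Illustrative reference durations (in seconds) for the target phoneme "n":
mu, sigma = gaussian_params([0.099, 0.101, 0.103, 0.100, 0.102])
print(mu, sigma)  # these define the 3-sigma range (mu - 3*sigma, mu + 3*sigma)
```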
The process of determining the probability score of a sounding phoneme is as follows. First, the posterior probability that the sounding phoneme is the target phoneme is divided by the prior probability of the target phoneme to obtain a first quotient; and the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme in the phoneme set is divided by the corresponding prior probability to obtain a second quotient. Then, the first quotient is divided by the second quotient to obtain a third quotient, and the logarithm of the third quotient is taken. Finally, the absolute value of the logarithm is divided by the sounding duration value to obtain the probability score of the sounding phoneme.
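A minimal sketch of this probability-score computation (in the style of a goodness-of-pronunciation measure) follows. The priors and posteriors are illustrative; the description is ambiguous about which prior the second quotient uses (the prior of the highest-posterior phoneme is assumed here), and since the later examples use negative probability scores such as -0.5, the sign convention implied by the absolute value is also left open:

```python
import math

def probability_score(posteriors, priors, target, duration):
    """Probability score of a sounding phoneme:
    first  = P(x|Ot) / prior(x)        (first quotient)
    second = max_q P(q|Ot) / prior(q)  (second quotient, prior of argmax assumed)
    score  = |log(first / second)| / duration
    """
    first = posteriors[target] / priors[target]
    q = max(posteriors, key=posteriors.get)  # highest-posterior phoneme
    second = posteriors[q] / priors[q]
    return abs(math.log(first / second)) / duration

# Illustrative inputs; the patent gives no concrete priors here.
posteriors = {"l": 0.80, "n": 0.85}
priors = {"l": 0.05, "n": 0.06}
print(probability_score(posteriors, priors, "l", duration=0.1))
```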
When the probability score is reduced, it may be reduced by a set step value, or it may be reduced according to the degree of deviation from the target phoneme or the degree of deviation from the Gaussian distribution principle of the duration value.
In one example of the present application, step 403 (and thus steps 403a and 403b) may be omitted. In another example, step 402 (and thus steps 402a and 402b) may be omitted; that is, only one of step 402 and step 403 may be performed. In yet another example, both step 402 and step 403 may be performed, in which case the order of step 402 and step 403 is not limited. By starting from both the posterior probability and the sounding duration value of the sounding phoneme, whether the user's pronunciation is accurate can be determined more accurately.
In one example, step 402 is performed before step 403 is performed. Specifically, determining whether the sounding phoneme deviates from the target phoneme according to the posterior probability that the sounding phoneme is respectively a first phoneme; when the sounding phoneme is determined not to deviate from the target phoneme, determining a predetermined probability score of the sounding phoneme as a target probability score of the sounding phoneme; and when the fact that the sounding phoneme deviates from the target phoneme is determined, the probability score of the sounding phoneme which is determined in advance is subjected to reduction processing, and then the target probability score of the sounding phoneme is determined. Then determining whether the sounding duration value of the sounding phoneme meets the Gaussian distribution principle corresponding to the target phoneme, and keeping the target probability score of the sounding phoneme unchanged when determining that the sounding duration value of the sounding phoneme meets the Gaussian distribution principle corresponding to the target phoneme; or when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing and updating the target probability score of the sounding phoneme.
For example, suppose the probability score of the sounding phoneme is -0.5. If the sounding phoneme does not deviate from the target phoneme and its sounding duration value satisfies the Gaussian distribution principle corresponding to the target phoneme, the target probability score remains -0.5, and the score in the percentile system corresponding to the target probability score of -0.5 is 90. If the sounding phoneme deviates from the target phoneme but its sounding duration value satisfies the Gaussian distribution principle of the target phoneme, the probability score of -0.5 is reduced, for example to -1, and the percentile score corresponding to the target probability score of -1 is 80. If the sounding phoneme does not deviate from the target phoneme but its sounding duration value does not satisfy the Gaussian distribution principle of the target phoneme, the probability score of -0.5 is reduced, for example to -1.2, and the percentile score corresponding to the target probability score of -1.2 is 75. If the sounding phoneme deviates from the target phoneme and its sounding duration value does not satisfy the Gaussian distribution principle of the target phoneme, the probability score of -0.5 is reduced, for example to -1.7, and the percentile score corresponding to the target probability score of -1.7 is 65.
The percentile score of 90 is the highest, obtained when the sounding phoneme does not deviate from the target phoneme and its sounding duration value satisfies the Gaussian distribution principle of the target phoneme. The percentile score of 65 is the lowest, obtained when the sounding phoneme deviates from the target phoneme and its sounding duration value does not satisfy the Gaussian distribution principle of the target phoneme. Judging deviation from the target phoneme through both the posterior probability and the sounding duration value makes it possible to determine more accurately whether the user's pronunciation is accurate.
In another example, step 403 is performed before step 402 is performed. Specifically, whether the sounding duration value of the sounding phoneme meets a gaussian distribution principle corresponding to a target phoneme is determined, and when the sounding duration value of the sounding phoneme meets the gaussian distribution principle corresponding to the target phoneme, the predetermined probability score of the sounding phoneme is determined as the target probability score of the sounding phoneme; and when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing the probability score of the sounding phoneme which is determined in advance, and determining the probability score as the target probability score of the sounding phoneme. Then, according to the posterior probability of the sounding phoneme being each first phoneme, determining whether the sounding phoneme deviates from the target phoneme; when the vocalized phoneme is determined not to deviate from the target phoneme, keeping the target probability score of the vocalized phoneme unchanged; and when the sounding phoneme is determined to deviate from the target phoneme, performing reduction updating on the target probability score of the sounding phoneme.
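Whichever order is used, the two optional reductions accumulate. A minimal sketch with fixed step values chosen to reproduce the numbers in the example above (-0.5 reduced to -1, -1.2, or -1.7); the step values and function name are illustrative assumptions:

```python
def target_probability_score(prob_score, deviates, duration_ok,
                             deviation_step=0.5, duration_step=0.7):
    """Apply the reductions of steps 402b and 403b; with fixed steps the
    order of the two checks does not change the accumulated result."""
    if deviates:          # step 402b: the phoneme deviates from the target
        prob_score -= deviation_step
    if not duration_ok:   # step 403b: the duration violates the Gaussian rule
        prob_score -= duration_step
    return prob_score

print(target_probability_score(-0.5, False, True))   # -0.5 (no reduction)
print(target_probability_score(-0.5, True,  True))   # -1.0
print(target_probability_score(-0.5, False, False))  # -1.2
print(target_probability_score(-0.5, True,  False))  # -1.7
```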
Step 404: and determining the score in the M system corresponding to the target probability score of the sounding phoneme according to a pre-trained score model.
For example, the scoring model is Y = M/(1 + e^x); that is, the score in the M system corresponding to the target probability score of the sounding phoneme is determined according to the formula Y = M/(1 + e^x), where Y is the score in the M system and x is determined according to the first parameter. The first parameter includes one or more of the target probability score, the sounding duration value, the phoneme energy, and the pitch frequency. M is generally 100 or 10, i.e., a 100-point (percentile) or 10-point scale.
In one example, the first parameter includes the target probability score of the sounding phoneme. Then x = w1·x1 + b, where w1 and b are constants and x1 is the target probability score of the sounding phoneme.
In another example, the first parameter includes the target probability score and the sounding duration value of the sounding phoneme. Then x = w1·x1 + w2·x2 + b, where w1, w2, and b are constants, x1 is the target probability score of the sounding phoneme, and x2 is the sounding duration value of the sounding phoneme.
In another example, x is determined according to the formula x = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b, where w1, w2, w3, w4, and b are all constants, x1 is the target probability score of the sounding phoneme, x2 is the sounding duration value of the sounding phoneme, x3 is the phoneme energy of the sounding phoneme, and x4 is the pitch frequency of the sounding phoneme. For example, x1 = 0.015, x2 = 0.008356104, x3 = 0.1, and x4 = 5.05. Substituting the data of these 4 dimensions into the scoring model yields the score of the sounding phoneme in the M system. Determining the score in the M system from the multiple dimensions of the target probability score, sounding duration value, phoneme energy, and pitch frequency makes the determined score more accurate.
Still further, the score in the M system corresponding to the audio may be determined according to the score in the M system corresponding to each sounding phoneme. Still further, the score in the M system corresponding to each sounding phoneme and/or the score in the M system corresponding to the audio may be output.
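A sketch of the scoring model and the audio-level aggregation follows. The weights w, the bias b, and the mean aggregation are illustrative assumptions; in the patent the weights come from training the scoring model, and the audio-level aggregation method is not specified:

```python
import math

def m_scale_score(x, M=100):
    """Scoring model Y = M / (1 + e^x)."""
    return M / (1 + math.exp(x))

def phoneme_score(prob_score, duration, energy, pitch,
                  w=(-1.0, 0.5, 0.3, 0.02), b=-1.0, M=100):
    """x = w1*x1 + w2*x2 + w3*x3 + w4*x4 + b over the four first parameters."""
    x = w[0] * prob_score + w[1] * duration + w[2] * energy + w[3] * pitch + b
    return m_scale_score(x, M)

def audio_score(phoneme_scores):
    """One possible audio-level score: the mean of the phoneme scores."""
    return sum(phoneme_scores) / len(phoneme_scores)

# The 4 dimensions from the example above: x1, x2, x3, x4.
print(phoneme_score(0.015, 0.008356104, 0.1, 5.05))
```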
The method for speech scoring in the embodiment of the present application has been described above, and the apparatus for speech scoring in the embodiment of the present application is described below. The method and the apparatus are based on the same technical concept; since their principles for solving the problem are similar, the implementations of the apparatus and the method may refer to each other, and repeated details are not described again.
Based on the same technical concept as the method for scoring a voice as described above, as shown in fig. 5, there is provided a device 500 for scoring a voice, wherein the device 500 can perform the steps performed in fig. 4 as described above. The apparatus 500 may be a user equipment, a chip applied to the user equipment, a server, or a chip applied to the server. The apparatus 500 comprises: an acquisition module 510, a verification module 520, a probability score module 530, and a scoring module 540.
In a possible implementation, the obtaining module 510 is configured to input a segment of audio and text information corresponding to the audio into a pre-trained acoustic model, to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes a posterior probability that each vocalized phoneme is a first phoneme, the first phoneme is each phoneme included in a phoneme set, and the phoneme set is a phoneme set of a language type corresponding to the text information;
the verification module 520 is configured to determine whether the vocalized phoneme deviates from a target phoneme according to the posterior probability that the vocalized phoneme is each of the first phonemes, where the target phoneme is a phoneme obtained after the text information is decomposed;
the probability score module 530 is configured to determine a predetermined probability score of the vocalized phoneme as a target probability score of the vocalized phoneme when it is determined that the vocalized phoneme does not deviate from the target phoneme; when the fact that the sounding phoneme deviates from the target phoneme is determined, the probability score of the sounding phoneme which is determined in advance is subjected to reduction processing, and then the target probability score of the sounding phoneme is determined;
the scoring module 540 is configured to determine, according to a pre-trained scoring model, a score in the M-score corresponding to the target probability score of the vocalized phoneme.
In a possible implementation, the verification module 520, when configured to determine whether the vocalized phoneme deviates from the target phoneme according to the posterior probability that the vocalized phoneme is each of the first phonemes, is specifically configured to: determine the maximum value among the posterior probabilities that the vocalized phoneme is each first phoneme; when the maximum value is larger than a set threshold value and the maximum value is larger than the posterior probability that the vocalized phoneme is the target phoneme, determine that the vocalized phoneme deviates from the target phoneme; and when the maximum value is not larger than the set threshold value and/or the maximum value is not larger than the posterior probability that the vocalized phoneme is the target phoneme, determine that the vocalized phoneme does not deviate from the target phoneme.
In one possible implementation, the acoustic measure further includes an utterance duration value of the uttered phoneme; after determining the target probability score of the sounding phoneme, and before determining the score in the M-score corresponding to the target probability score of the sounding phoneme, the probability score module 530 is further configured to: when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to the target phoneme, keeping the target probability score of the sounding phoneme unchanged; or when the sound-making duration value of the sound-making phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing and updating the target probability score of the sound-making phoneme.
In a possible implementation, the scoring module 540, when configured to determine the score in the M system corresponding to the target probability score of the vocalized phoneme according to a pre-trained scoring model, is specifically configured to: according to the formula Y = M/(1 + e^x), determine the score in the M system corresponding to the target probability score of the vocalized phoneme, wherein Y is the score in the M system, and x is determined according to a first parameter, wherein the first parameter comprises the target probability score of the vocalized phoneme.
In one possible implementation, the acoustic measure further comprises: the phoneme energy and/or the pitch frequency; the first parameter further comprises at least one of: the sounding duration value, the phoneme energy, and the pitch frequency. The scoring module 540 is further configured to: determine x according to the formula x = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b, wherein w1, w2, w3, w4 and b are constants, x1 is the target probability score of the voiced phoneme, x2 is the voiced duration value of the voiced phoneme, x3 is the phoneme energy of the voiced phoneme, and x4 is the pitch frequency of the voiced phoneme.
In a possible implementation, the scoring module 540 is further configured to train the scoring model by using the target probability scores of the phonemes of different types of audio and the scores in the M system corresponding to the target probability scores of the phonemes, where the scoring intervals in the M system corresponding to different types of audio are different.
In a possible implementation, the obtaining module 510 is configured to input a section of audio and text information corresponding to the audio into a pre-trained acoustic model, so as to obtain an acoustic measure of each sounding phoneme included in the audio, where the acoustic measure includes a sounding duration value of the sounding phoneme;
the probability score module 530 is configured to determine a predetermined probability score of the sounding phoneme as the target probability score of the sounding phoneme when it is determined that the sounding duration value of the sounding phoneme meets the Gaussian distribution principle corresponding to a target phoneme; and, when it is determined that the sounding duration value of the sounding phoneme does not meet the Gaussian distribution principle corresponding to the target phoneme, to perform reduction processing on the predetermined probability score of the sounding phoneme and determine the result as the target probability score of the sounding phoneme;
the scoring module 540 is configured to determine, according to a pre-trained scoring model, a score in the M-score corresponding to the target probability score of the vocalized phoneme.
In one possible implementation, the acoustic measure includes a posterior probability that the vocalized phoneme is each a first phoneme, the first phoneme traversing all phonemes comprised in the set of phonemes;
the device further comprises:
a verification module 520, configured to determine whether the vocalized phoneme deviates from the target phoneme according to the posterior probability that the vocalized phoneme is each of the first phonemes;
the probability score module 530 is further configured to keep the target probability score of the sounding phoneme unchanged when it is determined that the sounding phoneme does not deviate from the target phoneme; and to perform reduction updating on the target probability score of the sounding phoneme when it is determined that the sounding phoneme deviates from the target phoneme.
Fig. 6 is a schematic block diagram of a speech scoring apparatus 600 according to an embodiment of the present application. It should be understood that the apparatus 600 is capable of performing the various steps in the method of fig. 4 described above. The apparatus 600 comprises a processor 610 and, optionally, a transceiver 620 and a memory 630. The transceiver may be configured to receive program instructions and transmit them to the processor, or the transceiver may be configured to handle communication interaction between the apparatus and other communication devices, such as exchanging control signaling and/or service data. The transceiver may be a circuit for reading and writing code and/or data, or a circuit for transmitting signals between the processor and other communication devices. The processor 610 and the memory 630 are electrically coupled.
Illustratively, the memory 630 is configured to store a computer program, and the processor 610 may be configured to invoke the computer program or instructions stored in the memory to perform the above-described voice scoring method, or to perform the above-described voice scoring method via the transceiver 620.
For example, the processor 610 is configured to input a segment of audio and text information corresponding to the audio into a pre-trained acoustic model, to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes a posterior probability that each vocalized phoneme is a first phoneme, the first phoneme is each phoneme included in a phoneme set, and the phoneme set is a phoneme set of a language type corresponding to the text information; determining whether the sounding phoneme deviates from a target phoneme according to the posterior probability that the sounding phoneme is respectively the first phoneme, wherein the target phoneme is a phoneme obtained after the text information is decomposed; when the sounding phoneme is determined not to deviate from the target phoneme, determining a predetermined probability score of the sounding phoneme as a target probability score of the sounding phoneme; when the fact that the sounding phoneme deviates from the target phoneme is determined, the probability score of the sounding phoneme which is determined in advance is subjected to reduction processing, and then the target probability score of the sounding phoneme is determined;
and determining the score in the M system corresponding to the target probability score of the sounding phoneme according to a pre-trained score model.
In one example, the processor 610, when configured to determine whether the vocalized phoneme deviates from the target phoneme according to the posterior probability of the vocalized phoneme being each of the first phonemes, is specifically configured to: when the maximum value of the posterior probabilities of the sounding phonemes, which are respectively the first phonemes, is larger than a set threshold value, and the maximum value is larger than the posterior probability that the sounding phoneme is the target phoneme, determining that the sounding phoneme deviates from the target phoneme; and when the maximum value of the posterior probabilities of the sounding phonemes, which are respectively every first phoneme, is not more than a set threshold value and/or the maximum value is not more than the posterior probability of the sounding phoneme being the target phoneme, determining that the sounding phoneme does not deviate from the target phoneme.
In one example, the processor 610 is further configured to, after determining the target probability score of the vocalized phoneme, before determining the score in the M-score corresponding to the target probability score of the vocalized phoneme, when determining that the vocalized duration value of the vocalized phoneme satisfies the gaussian distribution rule corresponding to the target phoneme, keep the target probability score of the vocalized phoneme unchanged; or when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing and updating the target probability score of the sounding phoneme.
In one example, the processor 610, when configured to determine the score in the M system corresponding to the target probability score of the vocalized phoneme according to a pre-trained scoring model, is specifically configured to: according to the formula Y = M/(1 + e^x), determine the score in the M system corresponding to the target probability score of the vocalized phoneme, wherein Y is the score in the M system, and x is determined according to a first parameter, the first parameter comprising the target probability score of the vocalized phoneme.
In one example, the processor 610 may be further configured to: determine x according to the formula x = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b, wherein w1, w2, w3, w4 and b are constants, x1 is the target probability score of the voiced phoneme, x2 is the voiced duration value of the voiced phoneme, x3 is the phoneme energy of the voiced phoneme, and x4 is the pitch frequency of the voiced phoneme.
In an example, the processor 610 may be further configured to train a scoring model using the target probability scores of the phonemes of the different types of audio and the scores in the M-scale corresponding to the target probability scores of the phonemes, where scoring intervals in the M-scale corresponding to the different types of audio are different.
In an example, the processor 610 may be further configured to input a piece of audio and text information corresponding to the audio into a pre-trained acoustic model, to obtain an acoustic measure of each vocalized phoneme included in the audio, where the acoustic measure includes a vocalization duration value of the vocalized phoneme; when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to a target phoneme, determining the predetermined probability score of the sounding phoneme as the target probability score of the sounding phoneme; when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, the probability score of the sounding phoneme which is determined in advance is determined as the target probability score of the sounding phoneme after being subjected to reduction processing; and determining the score in the M system corresponding to the target probability score of the sounding phoneme according to a pre-trained scoring model.
In an example, the processor 610 may be further configured to, after determining the target probability score of the vocalized phoneme, before determining a score in the M-score corresponding to the target probability score of the vocalized phoneme, determine whether the vocalized phoneme deviates from the target phoneme according to the posterior probability of the vocalized phoneme for each first phoneme; when the vocalized phoneme is determined not to deviate from the target phoneme, keeping the target probability score of the vocalized phoneme unchanged;
and when the sounding phoneme is determined to deviate from the target phoneme, performing reduction updating on the target probability score of the sounding phoneme.
The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP. The processor may further include a hardware chip or other general purpose processor. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The aforementioned PLDs may be Complex Programmable Logic Devices (CPLDs), field-programmable gate arrays (FPGAs), General Array Logic (GAL) and other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., or any combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory referred to in the embodiments herein may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The transceiver, the interface circuit, or the transceiver according to the embodiments of the present application may include a separate transmitter and/or a separate receiver, or may be an integrated transmitter and receiver. The transceiver device, interface circuit, or transceiver may operate under the direction of a corresponding processor. Alternatively, the sender may correspond to a transmitter in the physical device, and the receiver may correspond to a receiver in the physical device.
The embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the computer program can be used to enable the computer to execute the above-mentioned voice scoring method.
Embodiments of the present application also provide a computer program product containing instructions that, when executed on a computer, enable the computer to perform the method for speech scoring provided above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (22)

1. A method of speech scoring, the method comprising:
inputting a section of audio and text information corresponding to the audio into a pre-trained acoustic model to obtain an acoustic measure of each sounding phoneme included in the audio, wherein the acoustic measure includes a posterior probability that each sounding phoneme is a first phoneme, the first phoneme is each phoneme included in a phoneme set, and the phoneme set is a phoneme set of a language type corresponding to the text information;
determining whether the sounding phonemes deviate from target phonemes according to the posterior probability that the sounding phonemes are the first phonemes respectively, wherein the target phonemes are phonemes obtained after the text information is decomposed;
when the sounding phoneme is determined not to deviate from the target phoneme, determining a predetermined probability score of the sounding phoneme as a target probability score of the sounding phoneme;
when the fact that the sounding phoneme deviates from the target phoneme is determined, the probability score of the sounding phoneme which is determined in advance is subjected to reduction processing, and then the target probability score of the sounding phoneme is determined;
and determining the score in the M system corresponding to the target probability score according to a pre-trained scoring model.
2. The method of claim 1 wherein said determining whether said vocalized phoneme deviates from said target phoneme based on a posterior probability that said vocalized phoneme is a respective first phoneme comprises:
determining the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme;
when the maximum value is larger than a set threshold value and the maximum value is larger than the posterior probability that the sounding phoneme is the target phoneme, determining that the sounding phoneme deviates from the target phoneme;
and when the maximum value is not larger than a set threshold value and/or the maximum value is not larger than the posterior probability that the sounding phoneme is the target phoneme, determining that the sounding phoneme does not deviate from the target phoneme.
3. The method of claim 1 or 2, wherein the acoustic measure further comprises a voicing duration value for the voicing phoneme;
after determining the target probability score of the sounding phoneme, before determining the score in the M system corresponding to the target probability score of the sounding phoneme, the method further includes:
when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to the target phoneme, keeping the target probability score of the sounding phoneme unchanged; or,
and when the sound-making duration value of the sound-making phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing and updating the target probability score of the sound-making phoneme.
4. A method according to claim 3, characterized in that said gaussian distribution principle is the 3 σ principle.
5. The method of any one of claims 1-4, wherein said determining a score in the M-score corresponding to the target probability score of the vocalized phoneme according to a pre-trained scoring model comprises:
according to the formula: m/(1+ e) for Y x ) And determining a score in the M system corresponding to the target probability score of the sounding phoneme, wherein Y is the score in the M system, and x is determined according to a first parameter, wherein the first parameter comprises the target probability score of the sounding phoneme.
6. The method of claim 5, wherein the acoustic measurement further comprises: the energy and/or pitch frequency of the phoneme;
the first parameter further comprises at least one of: utterance duration value, phoneme energy, pitch frequency.
7. The method of claim 6, wherein determining x from the first parameter comprises:
according to the formula: determining x, wherein w1, w2, w3, w4 and b are constants, x1 is a target probability score of the voiced phoneme, x2 is a voiced duration value of the voiced phoneme, x3 is a phoneme energy of the voiced phoneme, and x4 is a pitch frequency of the voiced phoneme.
8. The method of any one of claims 1-7, wherein the process of pre-training the scoring model comprises:
and training a scoring model by adopting the target probability scores of the phonemes of the audios of different types and the scores in the M systems corresponding to the target probability scores of the phonemes, wherein the scoring intervals in the M systems corresponding to the audios of different types are different.
9. A method of speech scoring, the method comprising:
inputting a section of audio and text information corresponding to the audio into a pre-trained acoustic model to obtain an acoustic measure of each sounding phoneme included in the audio, wherein the acoustic measure includes a sounding duration value of the sounding phoneme;
when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to a target phoneme, determining the predetermined probability score of the sounding phoneme as the target probability score of the sounding phoneme;
when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, the predetermined probability score of the sounding phoneme is determined as the target probability score of the sounding phoneme after reduction processing is performed on the probability score of the sounding phoneme;
and determining the score in the M system corresponding to the target probability score of the sounding phoneme according to a pre-trained scoring model.
10. The method of claim 9 wherein said acoustic measure comprises a posterior probability that said vocalized phone is a respective first phone, said first phone traversing all phones comprised in a set of phones;
after determining the target probability score of the sounding phoneme, before determining the score in the M system corresponding to the target probability score of the sounding phoneme, the method further includes:
determining whether the sounding phoneme deviates from the target phoneme according to the posterior probability that the sounding phoneme is respectively each first phoneme;
when the vocalized phoneme is determined not to deviate from the target phoneme, keeping the target probability score of the vocalized phoneme unchanged;
and when the sounding phoneme is determined to deviate from the target phoneme, performing reduction updating on the target probability score of the sounding phoneme.
11. An apparatus for speech scoring, the apparatus comprising:
the acquisition module is used for inputting a section of audio and text information corresponding to the audio into a pre-trained acoustic model to obtain an acoustic measure of each sounding phoneme included in the audio, wherein the acoustic measure includes a posterior probability that each sounding phoneme is a first phoneme, the first phoneme is each phoneme included in a phoneme set, and the phoneme set is a phoneme set of a language type corresponding to the text information;
the verification module is used for determining whether the sounding phonemes deviate from target phonemes according to the posterior probability that the sounding phonemes are respectively every first phoneme, and the target phonemes are obtained after the text information is decomposed;
a probability division module, configured to determine a predetermined probability score of the sounding phoneme as a target probability score of the sounding phoneme when it is determined that the sounding phoneme does not deviate from the target phoneme; when the fact that the sounding phoneme deviates from the target phoneme is determined, after the predetermined probability score of the sounding phoneme is subjected to reduction processing, the target probability score of the sounding phoneme is determined;
and the scoring module is used for determining the score in the M system corresponding to the target probability score of the sounding phoneme according to a pre-trained scoring model.
12. The apparatus of claim 11, wherein said verification module, when configured to determine whether said vocalized phoneme deviates from said target phoneme based on a posterior probability that said vocalized phoneme is a respective first phoneme, is specifically configured to:
determining the maximum value among the posterior probabilities that the sounding phoneme is each first phoneme; when the maximum value is larger than a set threshold value and the maximum value is larger than the posterior probability that the sounding phoneme is the target phoneme, determining that the sounding phoneme deviates from the target phoneme; when the maximum value is not larger than the set threshold value and/or the maximum value is not larger than the posterior probability that the sounding phoneme is the target phoneme, determining that the sounding phoneme does not deviate from the target phoneme.
13. The apparatus of claim 11 or 12, wherein the acoustic measure further comprises a voicing duration value for the voicing phoneme;
the probability score module is configured to, after determining the target probability score of the sounding phoneme, and before determining a score in the M-score corresponding to the target probability score of the sounding phoneme, further: when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to the target phoneme, keeping the target probability score of the sounding phoneme unchanged; or when the sound-making duration value of the sound-making phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, reducing and updating the target probability score of the sound-making phoneme.
14. The apparatus of claim 13, wherein the gaussian distribution principle is a 3 σ principle.
15. The apparatus according to any of the claims 11-14, wherein the scoring module, when configured to determine the score in the M-score corresponding to the target probability score of the vocalized phoneme according to the pre-trained scoring model, is specifically configured to:
according to the formula: m/(1+ e) for Y x ) Determining M-scores corresponding to the target probability scores of the vocalized phonemesWherein Y is a score in the M score, and x is determined according to a first parameter, the first parameter including a target probability score of the vocalized phoneme.
16. The apparatus of claim 15, wherein the acoustic measurement further comprises: the phoneme energy and/or pitch frequency;
the first parameter further comprises at least one of: utterance duration value, phoneme energy, pitch frequency.
17. The apparatus of claim 16, wherein the scoring module is further configured to: determine x according to the formula x = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b, wherein w1, w2, w3, w4 and b are constants, x1 is a target probability score of the voiced phoneme, x2 is a voiced duration value of the voiced phoneme, x3 is a phoneme energy of the voiced phoneme, and x4 is a pitch frequency of the voiced phoneme.
18. The apparatus according to any of claims 11-17, wherein the scoring module is further configured to train the scoring model using the target probability scores of the phonemes for different types of audio and the scores in the M system corresponding to the target probability scores of the phonemes, where the scoring intervals in the M system corresponding to different types of audio are different.
19. An apparatus for speech scoring, the apparatus comprising:
the acquisition module is used for inputting a section of audio and text information corresponding to the audio into a pre-trained acoustic model to obtain the acoustic measure of each sounding phoneme included in the audio, wherein the acoustic measure includes the sounding duration value of the sounding phoneme;
the probability division module is used for determining the predetermined probability score of the sounding phoneme as the target probability score of the sounding phoneme when the sounding duration value of the sounding phoneme is determined to meet the Gaussian distribution principle corresponding to the target phoneme; when the sounding duration value of the sounding phoneme is determined not to meet the Gaussian distribution principle corresponding to the target phoneme, the probability score of the sounding phoneme which is determined in advance is determined as the target probability score of the sounding phoneme after being subjected to reduction processing;
and the scoring module is used for determining the score in the M system corresponding to the target probability score of the sounding phoneme according to a pre-trained scoring model.
20. The apparatus of claim 19, wherein said acoustic measure comprises a posterior probability that said vocalized phone is a respective first phone, said first phone traversing all phones comprised in a set of phones;
the device further comprises:
the verification module is used for determining whether the sounding phoneme deviates from the target phoneme according to the posterior probability that the sounding phoneme is respectively the first phoneme;
the probability division module is further used for keeping the target probability score of the sounding phoneme unchanged when the sounding phoneme is determined not to deviate from the target phoneme; and when the sounding phoneme is determined to deviate from the target phoneme, performing reduction updating on the target probability score of the sounding phoneme.
21. An apparatus for speech scoring, the apparatus comprising: a processor and a memory;
the memory to store computer program instructions;
the processor is configured to execute part or all of the computer program instructions in the memory, and when the part or all of the computer program instructions are executed, to implement the method according to any one of claims 1 to 8, or to implement the method according to claim 9 or 10.
22. A computer-readable storage medium, in which a computer program is stored which, when executed by a computer, causes the computer to carry out the method of any one of claims 1 to 8, or the method of claim 9 or 10.
CN202010583611.9A 2020-06-23 2020-06-23 Voice scoring method and device Active CN111816210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010583611.9A CN111816210B (en) 2020-06-23 2020-06-23 Voice scoring method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010583611.9A CN111816210B (en) 2020-06-23 2020-06-23 Voice scoring method and device

Publications (2)

Publication Number Publication Date
CN111816210A CN111816210A (en) 2020-10-23
CN111816210B true CN111816210B (en) 2022-08-19

Family

ID=72845519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010583611.9A Active CN111816210B (en) 2020-06-23 2020-06-23 Voice scoring method and device

Country Status (1)

Country Link
CN (1) CN111816210B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112614510B (en) * 2020-12-23 2024-04-30 北京猿力未来科技有限公司 Audio quality assessment method and device
CN112634874B (en) * 2020-12-24 2022-09-23 江西台德智慧科技有限公司 Automatic tuning terminal equipment based on artificial intelligence
CN112750445B (en) * 2020-12-30 2024-04-12 标贝(青岛)科技有限公司 Voice conversion method, device and system and storage medium
CN112908361B (en) * 2021-02-02 2022-12-16 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation system based on small granularity

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8744856B1 (en) * 2011-02-22 2014-06-03 Carnegie Speech Company Computer implemented system and method and computer program product for evaluating pronunciation of phonemes in a language
CN108648766A (en) * 2018-08-01 2018-10-12 云知声(上海)智能科技有限公司 Speech evaluating method and system
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluating method, device, electronic equipment and storage medium
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 A kind of speech analysis method, device and storage medium
CN109741734A (en) * 2019-03-08 2019-05-10 北京猎户星空科技有限公司 A kind of speech evaluating method, device and readable medium
CN110782921A (en) * 2019-09-19 2020-02-11 腾讯科技(深圳)有限公司 Voice evaluation method and device, storage medium and electronic device
CN110797044A (en) * 2019-08-22 2020-02-14 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN110797049A (en) * 2019-10-17 2020-02-14 科大讯飞股份有限公司 Voice evaluation method and related device

Also Published As

Publication number Publication date
CN111816210A (en) 2020-10-23


Legal Events

Date Code Title Description
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant