CN111951825A - Pronunciation evaluation method, medium, device and computing equipment - Google Patents


Publication number
CN111951825A
CN111951825A (application number CN201910405363.6A)
Authority
CN
China
Prior art keywords
phoneme
pronunciation
corrected
feature sequence
data segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910405363.6A
Other languages
Chinese (zh)
Inventor
杨晓飞
蒋成林
刘晨晨
沈欣尧
张欣
王治民
邓雅惠
高慧朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Liulishuo Information Technology Co ltd
Original Assignee
Shanghai Liulishuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Liulishuo Information Technology Co ltd filed Critical Shanghai Liulishuo Information Technology Co ltd
Priority to CN201910405363.6A priority Critical patent/CN111951825A/en
Publication of CN111951825A publication Critical patent/CN111951825A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Abstract

The embodiment of the invention provides a pronunciation evaluation method, device, medium, and computing device. The method comprises the following steps: extracting at least one audio data segment from the pronunciation audio produced by the user for the content to be evaluated; acquiring the time boundary corresponding to the at least one audio data segment and the acoustic likelihood within that boundary; acquiring the phoneme feature sequence to be tested corresponding to the at least one audio data segment; identifying, from the phoneme feature sequence to be tested, phonemes to be corrected that are inconsistent with the standard phoneme feature sequence of the content to be evaluated, based on the time boundary, the confusion phoneme table, and the thresholds corresponding to the confusion phonemes; and, if phonemes to be corrected exist within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, adjusting the corresponding score based on the phonemes to be corrected. The method can greatly improve the recognition rate of confusable phonemes in pronunciation audio, provide more targeted pronunciation evaluation feedback to the user, and improve the user experience.

Description

Pronunciation evaluation method, medium, device and computing equipment
Technical Field
The embodiment of the invention relates to the field of software, in particular to a pronunciation assessment method, a medium, a device and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Learning correct spoken pronunciation is a very important part of language learning. In earlier years, spoken language could only be learned offline by following a teacher; with the development of technology, online spoken-language learning has become a trend. In recent years, the evaluation and correction of spoken pronunciation have mainly been built on representations of speech features.
However, most existing pronunciation evaluation schemes use the classic GOP (Goodness of Pronunciation) algorithm, proposed by Silke Witt of Cambridge University in her doctoral thesis, or other schemes derived from it. Most of these schemes use a neural network model trained with the CE (Cross Entropy) criterion, or an older GMM model, to calculate a likelihood score for the user's pronunciation; the CE model has low accuracy in phoneme recognition and cannot recognize and correct phonemes that the user easily confuses or mispronounces.
Disclosure of Invention
Because existing pronunciation evaluation schemes use the CE model to calculate the likelihood score of the user's pronunciation, and the CE model has low accuracy in phoneme recognition, these schemes cannot recognize and correct phonemes that the user easily confuses or mispronounces. An improved pronunciation evaluation method is therefore needed to improve the accuracy of phoneme recognition and solve the above technical problems.
In this context, embodiments of the present invention are intended to provide a pronunciation assessment method, apparatus, medium, and computing device.
In a first aspect of embodiments of the present invention, there is provided a pronunciation evaluation method comprising: extracting at least one audio data segment from the pronunciation audio produced by the user for the content to be evaluated; acquiring the time boundary corresponding to the at least one audio data segment and the acoustic likelihood within that boundary; acquiring the phoneme feature sequence to be tested corresponding to the at least one audio data segment; identifying, from the phoneme feature sequence to be tested, phonemes to be corrected that are inconsistent with the standard phoneme feature sequence of the content to be evaluated, based on the time boundary, the confusion phoneme table, and the thresholds corresponding to the confusion phonemes; and, if phonemes to be corrected exist within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, adjusting the corresponding score based on the phonemes to be corrected.
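As a rough sketch of the score-adjustment step described above (not the patent's actual implementation), the following shows how flagged phonemes might reduce a score only within time boundaries whose acoustic likelihood exceeds the preset threshold. The data structures, the fixed penalty of 5 points, and the threshold value are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # time boundary start (seconds)
    end: float            # time boundary end (seconds)
    likelihood: float     # acoustic likelihood within the boundary
    phonemes: list        # recognized phoneme sequence for this segment

def adjust_score(base_score: float, segments: list,
                 to_correct: set, likelihood_threshold: float) -> float:
    """Deduct a fixed penalty per flagged phoneme, but only in segments whose
    acoustic likelihood exceeds the preset threshold, i.e. segments where the
    recognition itself is trusted. The 5-point penalty is an assumption."""
    score = base_score
    for seg in segments:
        if seg.likelihood > likelihood_threshold:
            score -= 5.0 * sum(1 for p in seg.phonemes if p in to_correct)
    return max(score, 0.0)

segments = [
    Segment(0.0, 0.4, likelihood=0.9, phonemes=["ih", "t"]),   # trusted segment
    Segment(0.4, 0.8, likelihood=0.2, phonemes=["iy"]),        # below threshold, skipped
]
score = adjust_score(100.0, segments, to_correct={"ih"}, likelihood_threshold=0.5)
# → 95.0
```

Only the first segment is penalized; the second falls below the likelihood threshold and is left untouched, matching the condition stated in the claim.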
In still another embodiment of the present invention, the pronunciation assessment method further includes: and determining pronunciation error correction content pushed to the user based on the phonemes to be corrected and/or the adjusted scores, wherein the pronunciation error correction content is used for instructing the user to perform improved exercises on the phonemes to be corrected.
In another embodiment of the present invention, a phoneme recognition network is used to obtain a phoneme feature sequence to be tested corresponding to at least one audio data segment.
In yet another embodiment of the present invention, a phoneme recognition network is constructed from at least one word in the content of the assessment, a pronunciation dictionary, and a confusing phoneme table; and the word graph weight of each network path in the phoneme recognition network is adjusted according to the pre-input development set.
In another embodiment of the present invention, identifying a phoneme to be corrected, which is inconsistent with the standard phoneme feature sequence of the content to be evaluated, from the phoneme feature sequence to be evaluated based on the time boundary, the confusion phoneme table and the threshold corresponding to the confusion phoneme includes:
acquiring a standard phoneme characteristic sequence generated based on the evaluation content;
according to the time boundary, aligning by edit distance the phoneme feature sequence to be tested corresponding to each word in the at least one audio data segment with the standard phoneme feature sequence corresponding to that word, to obtain the distinguishing phoneme information;
and determining the phonemes to be corrected corresponding to the different phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes.
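As an illustration of the Bayesian judgment step above, the sketch below flags a mismatched phoneme only when it belongs to the expected phoneme's confusion set and its posterior probability exceeds that pair's tuned threshold. The table contents, threshold values, and posterior are hypothetical, not values from the patent.

```python
def decide_confusion(observed: str, expected: str,
                     confusion_table: dict,
                     thresholds: dict,
                     posterior: float) -> bool:
    """Decide whether a mismatched phoneme should be flagged for correction.
    A phoneme is flagged only if it appears in the expected phoneme's
    confusion set AND its posterior exceeds that pair's tuned threshold.
    (Illustrative sketch; the default threshold 0.5 is an assumption.)"""
    if observed not in confusion_table.get(expected, set()):
        return False
    return posterior > thresholds.get((expected, observed), 0.5)

# Hypothetical confusion set: /iy/ is often confused with /ih/
table = {"iy": {"ih"}}
thresholds = {("iy", "ih"): 0.6}
flag = decide_confusion("ih", "iy", table, thresholds, posterior=0.8)  # True
```

A per-pair threshold lets the system be stricter on pairs that native speakers of a given language rarely confuse, and more lenient on notoriously confusable pairs.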
In yet another embodiment of the present invention, the distinguishing phoneme information includes the positions, within the phoneme feature sequence to be tested, of the phonemes that are inconsistent with the standard phoneme feature sequence.
In yet another embodiment of the present invention, the phonemes to be corrected include confusing phonemes that are acoustically similar to the standard phonemes in the content under evaluation.
In yet another embodiment of the present invention, a cross entropy criterion CE model is employed to obtain time boundaries corresponding to at least one audio data segment and acoustic likelihoods within the corresponding time boundaries.
In a second aspect of an embodiment of the present invention, there is provided a pronunciation assessment apparatus including:
the extraction module is configured to extract at least one audio data segment from the pronunciation audio to be tested of the user aiming at the content to be evaluated;
the first evaluation module is configured to acquire a time boundary corresponding to at least one audio data segment and acoustic likelihood in the corresponding time boundary;
the second evaluation module is configured to acquire a phoneme characteristic sequence to be tested corresponding to at least one audio data segment; identifying a phoneme to be corrected which is inconsistent with the standard phoneme feature sequence of the content to be evaluated from the phoneme feature sequence to be evaluated based on the time boundary, the confusion phoneme table and a threshold value corresponding to the confusion phoneme;
and the adjusting module is configured to adjust the corresponding score based on the phonemes to be corrected if the phonemes to be corrected exist within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold.
In a further embodiment of the present invention, the pronunciation assessment apparatus further comprises a determination module configured to determine pronunciation error correction content to be pushed to the user based on the phonemes to be corrected and/or the adjusted scores, wherein the pronunciation error correction content is used to instruct the user to perform improved exercises on the phonemes to be corrected.
In another embodiment of the present invention, the second evaluation module is further provided with a phoneme recognition network, and the phoneme recognition network is specifically configured to obtain a phoneme feature sequence to be tested corresponding to at least one audio data segment.
In yet another embodiment of the present invention, the phoneme recognition network is constructed from at least one word in the evaluation content, a pronunciation dictionary, and a confusion phoneme table; further, the word-graph weight of each network path in the phoneme recognition network is adjusted according to a pre-recorded development set.
In another embodiment of the present invention, the second evaluation module, when identifying a phoneme to be corrected that is inconsistent with the standard phoneme feature sequence of the evaluation content from the phoneme feature sequence to be detected based on the time boundary, the confusion phoneme table, and the threshold corresponding to the confusion phoneme, is specifically configured to: acquiring a standard phoneme characteristic sequence generated based on the evaluation content; according to the time boundary, the phoneme feature sequence to be detected corresponding to each word in at least one audio data segment is aligned with the standard phoneme feature sequence corresponding to the word in editing distance to obtain distinguished phoneme information; and determining the phonemes to be corrected corresponding to the distinguished phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes.
In yet another embodiment of the present invention, the distinguishing phoneme information includes the positions, within the phoneme feature sequence to be tested, of the phonemes that are inconsistent with the standard phoneme feature sequence.
In yet another embodiment of the present invention, the phonemes to be corrected include confusing phonemes that are acoustically similar to the standard phonemes in the content under evaluation.
In a further embodiment of the present invention, the first evaluation module is provided with a cross entropy criterion CE model, and the CE model is specifically configured to obtain a time boundary corresponding to the at least one audio data segment and an acoustic likelihood within the corresponding time boundary.
In a third aspect of embodiments of the present invention, there is provided a medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of the first aspect.
In a fourth aspect of embodiments of the present invention, there is provided a computing device comprising a processing unit, a memory, and an input/output (In/Out, I/O) interface; a memory for storing programs or instructions for execution by the processing unit; a processing unit for performing the method of any of the embodiments of the first aspect in accordance with a program or instructions stored by the memory; an I/O interface for receiving or transmitting data under control of the processing unit.
The technical scheme provided by the embodiment of the invention can identify the phonemes to be corrected, which are inconsistent with the standard phoneme characteristic sequence, from the pronunciation audio of the user, so that the corresponding score of the pronunciation audio of the user is adjusted, the identification rate of the confused phonemes in the pronunciation audio is greatly improved, more targeted pronunciation evaluation feedback is provided for the user, and the user experience is improved.
Drawings
The foregoing and other objects, features and advantages of exemplary embodiments of the present invention will be readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 is a schematic structural diagram illustrating a pronunciation assessment scene according to an embodiment of the present invention;
FIG. 2 schematically shows a flow diagram of a pronunciation assessment method according to an embodiment of the invention;
FIG. 3 is a schematic diagram illustrating another pronunciation assessment scenario according to an embodiment of the invention;
FIG. 4A is a schematic diagram illustrating the structure of a phoneme sequence according to an embodiment of the present invention;
FIG. 4B schematically illustrates a structural schematic of a confusion tone data set, according to an embodiment of the invention;
fig. 5 schematically shows a structural schematic view of a pronunciation assessment apparatus according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given only for the purpose of enabling those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the invention, a pronunciation assessment method, a medium, a device and a computing device are provided.
In this document, it is to be understood that the number of any element in the figures is intended to be illustrative rather than restrictive, and that any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present invention are explained in detail below with reference to several exemplary embodiments thereof.
Summary of The Invention
The inventor finds that the conventional pronunciation evaluation scheme mostly adopts the classic GOP algorithm or other schemes derived from the classic GOP algorithm. These existing pronunciation evaluation schemes use a CE model to calculate a likelihood score of a user pronunciation, but the CE model has a low accuracy in phoneme recognition, and cannot recognize and correct phonemes which are easy to be confused or mispronounced when the user pronounces.
To overcome the problems in the prior art, the invention provides a pronunciation evaluation method, device, medium, and computing device. The method comprises the following steps: extracting at least one audio data segment from the pronunciation audio produced by the user for the content to be evaluated; acquiring the time boundary corresponding to the at least one audio data segment and the acoustic likelihood within that boundary; acquiring the phoneme feature sequence to be tested corresponding to the at least one audio data segment; identifying, from the phoneme feature sequence to be tested, phonemes to be corrected that are inconsistent with the standard phoneme feature sequence of the content to be evaluated, based on the time boundary, the confusion phoneme table, and the thresholds corresponding to the confusion phonemes; and, if phonemes to be corrected exist within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, adjusting the corresponding score based on the phonemes to be corrected. By identifying phonemes to be corrected that are inconsistent with the standard phoneme feature sequence in the user's pronunciation audio and adjusting the corresponding score accordingly, the method greatly improves the recognition rate of confusable phonemes in pronunciation audio, provides more targeted pronunciation evaluation feedback to the user, and improves the user experience.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Referring first to fig. 1, fig. 1 is a schematic view of an application scenario of the pronunciation evaluation method of the present invention. In fig. 1, a user can perform pronunciation evaluation through terminal device A. The device can display on its screen the evaluation content to be pronounced by the user (such as words, sentences, or articles), and can also capture video and/or audio of the user pronouncing the evaluation content through data capture devices such as a camera (image capture device) and/or a microphone (audio capture device), so as to evaluate the user's pronunciation by the pronunciation evaluation method.
It is understood that the pronunciation evaluation content may be obtained by terminal A in advance or in real time, for example by downloading it from a server. Either terminal A or the server may analyze and process the data collected by terminal A (i.e., execute the pronunciation evaluation method). In practice, the server may be multi-tiered: one server receives the video and/or audio data sent by the terminal device and forwards it to a processing server, and the processing server processes the received data according to the pronunciation evaluation method of the present invention, obtains the user's pronunciation evaluation result, and feeds it back to terminal device A for display.
Exemplary method
In the following, in conjunction with the application scenario of fig. 1, a pronunciation assessment method according to an exemplary embodiment of the present invention is described with reference to fig. 2. It should be noted that the above application scenarios are merely illustrative for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 2 is a flowchart of an example of a pronunciation evaluation method according to the first aspect of the embodiments of the present invention. Although the present invention provides method operation steps or apparatus structures as shown in the following embodiments or figures, the method or apparatus may, through conventional or non-inventive labor, include more or fewer operation steps or module units, or partial combinations thereof. For steps or structures that have no logically necessary relationship, the execution order of the steps or the module structure of the apparatus is not limited to the execution order or module structure shown in the embodiments or drawings. When the described method or module structure is applied in an actual device, server, or end product, it may be executed sequentially or in parallel according to the embodiments or the illustrated method or module structure (for example, in a parallel-processor or multi-threaded environment, or even in a distributed-processing and server-cluster environment).
For clarity, the following embodiments are described in a specific implementation scenario in which a user performs pronunciation assessment via a mobile terminal. The mobile terminal can comprise a mobile phone, a tablet computer or other general or special equipment with a video shooting function and a data communication function. The mobile terminal and the server may be deployed with corresponding application modules, such as a spoken language learning APP (application) installed in the mobile terminal, to implement corresponding data processing. However, those skilled in the art can understand that the spirit of the present solution can be applied to other implementation scenarios of pronunciation assessment, for example, referring to fig. 3, after the mobile terminal collects data, the collected data is sent to the server for processing, and is fed back to the user through the mobile terminal.
In a specific embodiment, as shown in fig. 2, in an embodiment of a pronunciation assessment method provided by the present invention, the method may include:
s201, extracting at least one audio data segment from the pronunciation audio to be tested of the user aiming at the content to be evaluated;
in the present embodiment, before evaluating the pronunciation of the user, the pronunciation audio of the user or the pronunciation audio extracted from the pronunciation video is acquired. Optionally, the pronunciation audio is fed back by the user for the evaluation content. In one embodiment, the mobile terminal collects audio from the user as he speaks through an integrated microphone. In another embodiment of the present invention, the mobile terminal collects a video of the user during pronunciation through the integrated front-facing camera, and obtains pronunciation audio from the video. It will be appreciated that the pronunciation audio may not be captured in real time, such as local audio stored in the mobile terminal, or pronunciation audio received from other mobile terminals/servers.
The content of the assessment includes, but is not limited to, words, phrases, sentences, or articles. Optionally, the evaluation content may be carried in course text pushed to the user, or in a course video or other pushed content played for the user. In one embodiment, the evaluation content may be read-after text presented to the user on the mobile terminal's interface.
After the user's pronunciation audio is acquired, one specific implementation of S201 extracts the at least one audio data segment by removing ineffective audio (audio that does not contain the user's pronunciation, i.e. silence before/after the pronunciation, as well as non-speech sounds uttered by the user such as breathing or coughing) and background noise. Specifically, this includes: acquiring the audio signal of the user's pronunciation audio; cutting the pronunciation audio based on the fluctuation of the audio signal to remove invalid audio and/or environmental noise; and segmenting the cut pronunciation audio to obtain the at least one audio data segment.
In practice, an audio data segment may also be understood as an audio data frame or audio frame; for example, an audio data segment may be an audio frame 10 milliseconds (ms) long. In one embodiment, a neural network model trained with the CE (cross entropy) criterion extracts at least one audio data segment from the user's pronunciation audio. In this embodiment, whether the current audio is effective audio can be determined from the fluctuation of the audio signal: the smaller the fluctuation of the signal, the lower the probability that the audio contains user pronunciation, so whether the current audio contains an effective part of the user's pronunciation can be determined by setting a reasonable threshold or filter.
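As a rough illustration of the framing and threshold-based filtering described above, the following sketch splits a signal into 10 ms frames and drops low-energy frames. The energy-ratio threshold, frame length, and test signal are assumptions for illustration, not values from the patent.

```python
import numpy as np

def frame_audio(signal: np.ndarray, sample_rate: int, frame_ms: int = 10) -> np.ndarray:
    """Split a 1-D audio signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def keep_active_frames(frames: np.ndarray, energy_ratio: float = 0.1) -> np.ndarray:
    """Drop frames whose energy falls below a fraction of the mean frame energy,
    treating low-fluctuation audio as silence or noise rather than speech."""
    energies = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = energy_ratio * energies.mean()
    return frames[energies > threshold]

# Example: 1 second of 16 kHz audio whose first half is silent
sr = 16000
signal = np.concatenate([np.zeros(sr // 2), np.sin(np.linspace(0, 1000, sr // 2))])
frames = frame_audio(signal, sr)      # 100 frames, 160 samples (10 ms) each
active = keep_active_frames(frames)   # the 50 silent frames are dropped
```

A real system would typically use overlapping frames and a trained model rather than raw energy, but the threshold-versus-fluctuation idea is the same.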
After at least one audio data segment is extracted, S202 is executed to acquire the time boundary corresponding to the at least one audio data segment and the acoustic likelihood within that boundary;
In one embodiment of this step, the time boundary corresponding to at least one audio data segment and the acoustic likelihood within that boundary may be obtained through the CE model. Specifically, an acoustic model (such as a CE model) performs Forced Alignment between the standard pronunciation audio obtained from the evaluation content and the at least one audio data segment extracted from the user's pronunciation audio, yielding the time boundary and acoustic likelihood of each audio data segment, where the acoustic likelihood measures the acoustic similarity between the user's pronunciation audio and the standard pronunciation. Here, the acoustic likelihood may be determined as follows: calculate the average likelihood L1 of an audio data segment based on the time boundary; determine the average likelihood L2 for the segment by recognizing it freely through a phoneme-level recognition network; then determine the acoustic likelihood from the difference between L1 and L2.
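The L1/L2 difference above is the core of a GOP-style score. A minimal sketch, with hypothetical per-frame log-likelihoods standing in for real acoustic-model outputs:

```python
def gop_score(forced_loglik: list, free_loglik: list) -> float:
    """GOP-style score: mean forced-alignment log-likelihood (L1) minus the
    mean log-likelihood from unconstrained phoneme recognition (L2).
    Values near 0 mean the expected phoneme matched the best free decode;
    strongly negative values suggest a likely mispronunciation."""
    l1 = sum(forced_loglik) / len(forced_loglik)
    l2 = sum(free_loglik) / len(free_loglik)
    return l1 - l2

# Frame log-likelihoods over one phoneme's time boundary (hypothetical values)
forced = [-4.2, -3.9, -4.0]   # aligned against the expected phoneme
free = [-3.1, -2.8, -3.0]     # best score found by the free phone loop
score = gop_score(forced, free)   # ≈ -1.07
```

Because L2 is the score of the best freely recognized phoneme, the difference is at most 0; the larger the gap, the more the acoustics favored some other phoneme over the expected one.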
It should be noted that, in addition to the CE model described above, the present invention is not limited to using other algorithms or models to obtain the acoustic likelihood of the corresponding time boundary and the corresponding time boundary of at least one audio data segment.
After extracting at least one audio data segment, S203 may be executed to obtain a phoneme feature sequence to be tested corresponding to the at least one audio data segment;
in this step, a phoneme recognition network is used to obtain a feature sequence of the phoneme to be detected corresponding to at least one audio data segment. In an embodiment of the Phoneme recognition network (i.e. the Phoneme recognition Model), the Phoneme recognition network is a discriminative acoustic Model, such as a TDNN Chain Model (Chain Model), which can implement audio recognition at a smaller granularity, such as a Phoneme level, compared to a conventional acoustic Model at a word level, because no extra decoding is required to generate a grid (lattice) at the word level, such that a PER (Phoneme recognition Error rate) is greatly reduced, and performance of the Phoneme recognition network is improved. In addition to the TDNN chain model, other networks or models may be employed, such as a phoneme recognition network trained by MPE (Minimum phoneme Error) criteria.
After the phoneme feature sequence to be tested corresponding to the at least one audio data segment is obtained, S204 may be executed to identify, from that sequence, phonemes to be corrected that are inconsistent with the standard phoneme feature sequence of the evaluated content, based on the time boundary, the confusion phoneme table, and the thresholds corresponding to the confusion phonemes;
in the embodiment of the invention, the different phoneme information comprises a plurality of types aiming at different classesThe pronunciation of the pattern is wrong, and there may be a difference in distinguishing the type of the phoneme information. Taking one of the information as an example, that is, when the phoneme information is distinguished as the position information of the phoneme inconsistent with the standard phoneme feature sequence in the phoneme feature sequence to be tested, taking the phoneme sequence shown in fig. 4A as an example, assuming that the upper phoneme sequence is the pre-recorded standard phoneme feature sequence and the lower phoneme sequence is the phoneme feature sequence to be tested, four double-headed arrows are respectively used for representing three types of pronunciation errors, wherein the first double-headed arrow from left to right is used for representing that the user pronunciation at the position has a replacement error, that is, the phoneme at this position in the standard phoneme feature sequence is the phoneme at this position, and the phoneme corresponding to the actual pronunciation at this position in the phoneme feature sequence to be tested is the phoneme at this position in the standard phoneme feature sequence
Figure BDA0002061031230000101
Similarly, the type of error characterized by the third bi-directional arrow is also a replacement error; the error type represented by the second bidirectional arrow is a deletion error, namely that the user misses out the phoneme n; the type of error characterized by the fourth double arrow is an insertion error, i.e. the user has pronounced the phoneme e.
The phonemes to be corrected include, but are not limited to, confusing phonemes that are acoustically similar to the standard phonemes in the content under evaluation. The phonemes to be corrected included in the user audio can be determined through the step S204, so that the actual errors included in the user pronunciation audio can be pointed out in a targeted manner, and further, more targeted pronunciation evaluation feedback can be provided for the user. Further, one possible embodiment of S204 comprises the following steps:
the first substep: acquiring a standard phoneme characteristic sequence generated based on the evaluation content;
the standard phoneme feature sequence may be a test data set recorded in advance, or may be extracted and generated from a standard pronunciation audio corresponding to the content to be evaluated through a deep learning network.
And a second substep: according to the time boundary, aligning the phoneme feature sequence to be tested corresponding to each word in the at least one audio data segment with the standard phoneme feature sequence corresponding to that word by edit distance, to obtain distinguished phoneme information;
specifically, in this step, the phoneme feature sequence to be tested corresponding to each word in the at least one audio data segment is aligned by edit distance with the standard phoneme feature sequence corresponding to that word, based on the word time boundary, for example in the alignment manner shown in Fig. 4A. A position where the current pronunciation is inconsistent is then detected by comparing the edit-distance-aligned phoneme feature sequence to be tested with the standard phoneme feature sequence, and the phoneme at that position is located, for example the position indicated by the first double-headed arrow from left to right in Fig. 4A.
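The edit-distance alignment described above can be illustrated with a minimal sketch. This is not the patent's implementation; the function name and the example phoneme strings are assumptions, and the sketch only shows how substitution, deletion and insertion errors are located from a standard Levenshtein dynamic-programming table with traceback:

```python
# Minimal sketch: align a recognized phoneme sequence against a standard
# one by edit distance and report where the two sequences disagree.

def align_phonemes(standard, test):
    """Return a list of (op, position, std_phone, test_phone) edits."""
    m, n = len(standard), len(test)
    # dp[i][j] = edit distance between standard[:i] and test[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if standard[i - 1] == test[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion (phone missed)
                           dp[i][j - 1] + 1,        # insertion (extra phone)
                           dp[i - 1][j - 1] + cost) # match / substitution
    # Trace back to recover the error positions.
    edits, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and standard[i - 1] == test[j - 1]):
            i, j = i - 1, j - 1                     # exact match, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            edits.append(("substitution", i - 1, standard[i - 1], test[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            edits.append(("deletion", i - 1, standard[i - 1], None))
            i -= 1
        else:
            edits.append(("insertion", i, None, test[j - 1]))
            j -= 1
    return list(reversed(edits))

# Example: "bit" pronounced with a lengthened vowel and an extra phone.
print(align_phonemes(["b", "i", "t"], ["b", "i:", "t", "e"]))
# -> [('substitution', 1, 'i', 'i:'), ('insertion', 3, None, 'e')]
```

Each reported position is then a candidate location for the phoneme to be corrected, analogous to the arrow positions in Fig. 4A.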
And a third substep: determining the phonemes to be corrected corresponding to the distinguished phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes;
in this step, the confusion phoneme table is traversed to select at least two confusing phonemes with the smallest difference from the user's pronounced phoneme at the position, and the user's pronounced phoneme at the position is determined as the phoneme to be corrected through the Bayesian decision model, according to the at least two confusing phonemes and the optimal threshold range corresponding to them. The confusion phoneme table includes, but is not limited to, at least one confusing phoneme set, where each confusing phoneme set includes at least two confusing phonemes and the corresponding standard pronunciation audio or user pronunciation audio collected from big data. For example, the phoneme pair shown in Fig. 4A forms one confusing phoneme set, the phonemes i: and i form one confusing phoneme set, and the phonemes a: and a form one confusing phoneme set.
Also taking the alignment shown in Fig. 4A as an example: in the third substep, for the position indicated by the first double-headed arrow from left to right, the confusion phoneme table is traversed and the two confusing phonemes with the smallest difference from the user's pronounced phoneme at that position are selected (the specific phoneme symbols appear as images in the original and are shown in Fig. 4A). According to the pronunciation audio corresponding to these two confusing phonemes and the optimal threshold range corresponding to them, the phoneme at that position in the user's pronunciation audio is determined through the Bayesian decision model, and that phoneme is determined as the phoneme to be corrected.
It should be noted that steps S203 and S204 can both be implemented by the phoneme recognition network. In order to improve the performance of the phoneme recognition network and make its phoneme recognition more accurate, the phoneme recognition network may be constructed from at least one word in the content to be evaluated, a pronunciation dictionary and the confusion phoneme table, and the word graph of each network path in the phoneme recognition network is adjusted according to a pre-entered development set. The development set referred to here includes, but is not limited to, the content to be evaluated; the standard pronunciation audio corresponding to the content to be evaluated, or user pronunciation audio collected from big data for at least part of the content to be evaluated; and the phoneme sequence corresponding to at least part of the content to be evaluated. The phoneme sequence here may be pre-entered, or may be output by a deep learning network; the embodiment of the present invention is not limited in this respect.
Specifically, in the construction of the phoneme recognition network, each phoneme has a set of minimal pairs (the confusion phoneme table). The confusion phoneme table may be prepared and entered in advance based on teaching and research experience, or may be formed after a neural network learns from a large amount of user pronunciation data. For any phoneme, the optimal threshold corresponding to its confusing phonemes can be searched for according to the confusion phoneme table and the development set; this optimal threshold is the prior factor used by the phoneme recognition network when recognizing the phoneme. See the following formula:
h* = argmax over h in {h_i, h_j} of p(h | o)

where h denotes a phoneme and o denotes the acoustic signal: of p(h_i|o) and p(h_j|o), the phoneme h_i or h_j with the larger conditional probability value is taken as the phoneme that the audio o actually corresponds to. By adding a prior factor α_ij tuned on the development set, the output detection result is computed by comparing the magnitudes of p(h_i|o) and α_ij·p(h_j|o); α_ij thus acts as a prior probability. Through this principle, the phoneme recognition network in the embodiment of the invention is made more flexible, and confusing phonemes can be recognized more accurately without increasing the size of the word dictionary compared with the prior art.
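The prior-weighted decision between two confusable phonemes can be sketched as follows. This is a minimal illustration of the rule described above, not the patent's implementation; the function name, the posterior values and the α value are assumptions:

```python
# Minimal sketch of the prior-weighted decision between two confusable
# phonemes: without a prior factor, the larger posterior wins; with a
# prior factor alpha_ij tuned on a development set, p(h_i|o) is compared
# against alpha_ij * p(h_j|o) instead.

def decide_phoneme(p_hi, p_hj, alpha_ij=1.0):
    """Return 'h_i' or 'h_j' for acoustic signal o, given the posteriors
    p(h_i|o) and p(h_j|o) and the prior factor alpha_ij."""
    return "h_i" if p_hi >= alpha_ij * p_hj else "h_j"

# Without a prior, 0.55 > 0.45 picks h_i ...
print(decide_phoneme(0.55, 0.45))                 # -> h_i
# ... but a prior factor learned on the development set can shift the
# decision boundary toward the phoneme users more often intend here.
print(decide_phoneme(0.55, 0.45, alpha_ij=1.5))   # -> h_j
```

Tuning α_ij on the development set rather than retraining the acoustic model is what keeps the word dictionary unchanged while still sharpening the i/i:-style distinctions.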
Taking the confusion data set shown in Fig. 4B as an example, which includes 7 groups of acoustic likelihoods: to determine the threshold between the phonemes i and i:, a batch of audio containing the phoneme i needs to be collected or recorded, together with the phoneme sequences corresponding to the audio. The audio is divided into two categories, pronunciations close to i and pronunciations close to i:, labelled 1 (pronunciation close to i) and 0 (pronunciation close to i:). In Fig. 4B, a denotes the acoustic likelihood of a pronunciation close to i and b denotes the acoustic likelihood of a pronunciation close to i:. The optimal threshold range separating pronunciations close to i from pronunciations close to i: is then determined according to the ratio of the a and b groups, i.e. the threshold corresponding to the phoneme i is determined.
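The threshold search over the labelled likelihood groups can be illustrated with a minimal sketch. The scores and labels below are made-up stand-ins for the 7 groups in Fig. 4B, and the exhaustive accuracy scan is only one simple way to pick a separating threshold; the patent does not specify the search procedure:

```python
# Minimal sketch: scan candidate thresholds over labelled acoustic
# likelihoods (1 = pronunciation close to i, 0 = pronunciation close
# to i:) and keep the threshold that classifies the most samples
# correctly when score >= threshold is read as label 1.

def best_threshold(scores, labels):
    best_t, best_correct = None, -1
    for t in sorted(set(scores)):
        correct = sum((s >= t) == (y == 1) for s, y in zip(scores, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

# Illustrative stand-in for the 7 likelihood groups of Fig. 4B.
scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2, 0.6]
labels = [1,   1,   1,    0,   0,   0,   1]
print(best_threshold(scores, labels))  # -> 0.6 (all 7 classified correctly)
```

In practice the resulting threshold (or threshold range) is what the Bayesian decision step consults when deciding whether a pronounced vowel should count as i or as the confusing phoneme i:.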
After the phoneme to be corrected is determined, S205 may be executed: if a phoneme to be corrected exists within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, the corresponding score is adjusted based on the phoneme to be corrected. In one embodiment, the acoustic likelihood may be used to determine a score for the audio data segment within the corresponding time boundary, the score indicating the degree of similarity between the user's pronunciation and the standard pronunciation in that audio data segment. For example, suppose pronunciation evaluation is performed on the word bit, and a phoneme to be corrected exists within the time boundary corresponding to the word: the user pronounced the short vowel i as the confusing phoneme i:. If the acoustic likelihood of this word in the user's pronunciation audio, as measured by the CE model, is greater than the preset threshold, the CE model alone would give the user a high pronunciation evaluation score for the word; but the phoneme recognition network has detected a substitution error, so in this case the score of the word is adjusted downwards based on the phoneme to be corrected detected by the phoneme recognition network.
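The adjustment rule of S205 can be sketched as follows. The function name, the penalty size and the thresholds are assumptions for illustration only; the patent specifies the condition (phoneme to be corrected present and likelihood above a preset threshold) but not the exact adjustment formula:

```python
# Minimal sketch of S205: a word the CE model scores highly is still
# penalised when the phoneme recognition network detects one or more
# phonemes to be corrected inside the word's time boundary.

def adjust_word_score(ce_score, acoustic_likelihood, phones_to_correct,
                      likelihood_threshold=0.5, penalty_per_phone=20):
    if acoustic_likelihood > likelihood_threshold and phones_to_correct:
        # Lower the score in proportion to the detected errors,
        # never dropping below zero.
        return max(0, ce_score - penalty_per_phone * len(phones_to_correct))
    return ce_score

# "bit" scores 90 under the CE model, but the recognizer heard the long
# vowel i: where the short i was expected -> the score is adjusted down.
print(adjust_word_score(90, 0.8, ["i:"]))  # -> 70
print(adjust_word_score(90, 0.8, []))      # -> 90 (no detected error)
```

This captures why the two models are combined: the CE model supplies the time boundary and a segment-level similarity score, while the phoneme recognizer vetoes scores that overlook confusing-phoneme substitutions.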
After S204 or S205, pronunciation error correction content to be pushed to the user is determined based on the phoneme to be corrected and/or the adjusted score, where the pronunciation error correction content is used to instruct the user to perform improvement exercises on the phoneme to be corrected. For example, the phonetic symbol corresponding to the phoneme to be corrected, or a phonetic symbol course, may be pushed; or the user's pronunciation score may be determined based on the adjusted score.
Through the pronunciation evaluation method shown in fig. 2, the phoneme to be corrected which is inconsistent with the standard phoneme feature sequence can be identified from the pronunciation audio of the user, so that the corresponding score of the pronunciation audio of the user can be adjusted, the identification rate of the confusion phoneme in the pronunciation audio is greatly improved, more targeted pronunciation evaluation feedback is provided for the user, and the user experience is improved.
Exemplary devices
Having described the method of an exemplary embodiment of the present invention, an apparatus of an exemplary embodiment of the present invention is described next. The pronunciation evaluation apparatus provided by the invention can implement any of the methods provided by the embodiment corresponding to Fig. 2. Referring to Fig. 5, the pronunciation evaluation apparatus at least includes:
the extraction module is configured to extract at least one audio data segment from the pronunciation audio to be tested of the user aiming at the content to be evaluated;
the first evaluation module is configured to acquire a time boundary corresponding to at least one audio data segment and acoustic likelihood in the corresponding time boundary;
the second evaluation module is configured to acquire a phoneme characteristic sequence to be tested corresponding to at least one audio data segment; identifying a phoneme to be corrected which is inconsistent with the standard phoneme feature sequence of the content to be evaluated from the phoneme feature sequence to be evaluated based on the time boundary, the confusion phoneme table and a threshold value corresponding to the confusion phoneme;
and the adjusting module is configured to adjust the corresponding score based on the phoneme to be corrected if the phoneme to be corrected exists within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold.
Optionally, the determining module is further configured to determine pronunciation error correction content to be pushed to the user based on the phonemes to be corrected and/or the adjusted score, wherein the pronunciation error correction content is used for instructing the user to perform improved exercise on the phonemes to be corrected.
Optionally, the second evaluation module is further provided with a phoneme recognition network, and the phoneme recognition network is specifically configured to obtain a phoneme feature sequence to be detected corresponding to at least one audio data segment.
Optionally, constructing a phoneme recognition network by at least one word in the content to be evaluated, a pronunciation dictionary and a confusion phoneme table; and the word graph of each network path in the phoneme recognition network is adjusted according to the pre-entered development set.
Optionally, the second evaluation module is specifically configured to, when the to-be-corrected phoneme inconsistent with the standard phoneme feature sequence of the evaluation content is identified from the to-be-detected phoneme feature sequence based on the time boundary, the confusion phoneme table, and the threshold corresponding to the confusion phoneme, perform:
acquiring a standard phoneme characteristic sequence generated based on the evaluation content;
according to the time boundary, the phoneme characteristic sequence to be detected corresponding to each word in at least one audio data segment is aligned with the standard phoneme characteristic sequence corresponding to the word in editing distance to obtain distinguishing phoneme information;
and determining the phonemes to be corrected corresponding to the different phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes.
Optionally, the distinguished phoneme information includes position information, within the phoneme feature sequence to be tested, of a phoneme inconsistent with the standard phoneme feature sequence.
Optionally, the phonemes to be corrected include confusing phonemes that are similar in acoustic pronunciation to the standard phonemes in the content under evaluation.
Optionally, the first evaluation module is provided with a cross entropy criterion CE model, and the CE model is specifically configured to obtain a time boundary corresponding to the at least one audio data segment and an acoustic likelihood within the corresponding time boundary.
Exemplary Medium
Having described the method and apparatus of the exemplary embodiments of the present invention, a computer-readable storage medium of the exemplary embodiments of the present invention is described next. The medium is, for example, an optical disc having a computer program (i.e. a program product) stored thereon which, when executed by a processor, performs the steps recited in the method embodiments above, for example: extracting at least one audio data segment from the user's pronunciation audio to be tested for the content to be evaluated; acquiring a time boundary corresponding to the at least one audio data segment and the acoustic likelihood within the corresponding time boundary; acquiring the phoneme feature sequence to be tested corresponding to the at least one audio data segment; identifying, from the phoneme feature sequence to be tested, a phoneme to be corrected that is inconsistent with the standard phoneme feature sequence of the content to be evaluated, based on the time boundary, the confusion phoneme table and the threshold corresponding to the confusing phonemes; and, if the phoneme to be corrected exists within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, adjusting the corresponding score based on the phoneme to be corrected. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
Exemplary computing device
Having described the methods, apparatus and media of exemplary embodiments of the present invention, computing devices of exemplary embodiments of the present invention are described next. Such a computing device may be a computer system or a server; the accompanying figure shows a block diagram of an exemplary computing device suitable for implementing embodiments of the present invention. The computing device shown is only an example and should not impose any limitation on the scope of use or functionality of embodiments of the invention.
Components of the computing device may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the various system components (including the system memory and the processing units).
The computing device typically includes a variety of computer system readable media. Such media may be any available media that is accessible by a computing device and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory. The computing device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, a storage system may be provided for reading from and writing to non-removable, nonvolatile magnetic media (not shown, but commonly referred to as a "hard drive"). Although not shown in the figures, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus by one or more data media interfaces. The system memory may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of embodiments of the present invention.
A program/utility having a set (at least one) of program modules may be stored, for example, in system memory, and such program modules include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the described embodiments of the invention.
The computing device may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.). Such communication may be through an input/output (I/O) interface. Also, the computing device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through a network adapter. The network adapter communicates with other modules of the computing device (e.g., processing unit, etc.) over the bus. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computing device.
The processing unit executes various functional applications and data processing by running programs stored in the system memory, for example: extracting at least one audio data segment from the user's pronunciation audio to be tested for the content to be evaluated; acquiring a time boundary corresponding to the at least one audio data segment and the acoustic likelihood within the corresponding time boundary; acquiring the phoneme feature sequence to be tested corresponding to the at least one audio data segment; identifying the phoneme to be corrected from it; and adjusting the corresponding score. The specific implementation of each step is not repeated here. It should be noted that although several units/modules or sub-units/sub-modules of the pronunciation evaluation apparatus are mentioned in the detailed description above, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided among a plurality of units/modules.
Further, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features in those aspects cannot be combined to advantage; that division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Through the above description, the embodiments of the present invention provide the following technical solutions, but are not limited thereto:

1. A pronunciation assessment method, comprising:
extracting at least one audio data segment from the audio frequency of the pronunciation to be tested of the user aiming at the content to be evaluated;
acquiring a time boundary corresponding to at least one audio data segment and acoustic likelihood in the corresponding time boundary;
acquiring a phoneme feature sequence to be detected corresponding to at least one audio data segment;
identifying a phoneme to be corrected which is inconsistent with the standard phoneme feature sequence of the evaluation content from the phoneme feature sequence to be detected based on the time boundary, the confusion phoneme table and a threshold value corresponding to the confusion phoneme;
and if the phonemes to be corrected exist in the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, adjusting the corresponding score based on the phonemes to be corrected.
2. The method of claim 1, further comprising:
and determining pronunciation error correction content pushed to the user based on the phonemes to be corrected and/or the adjusted scores, wherein the pronunciation error correction content is used for instructing the user to improve the training of the phonemes to be corrected.
3. The method according to claim 1 or 2, wherein a phoneme recognition network is used to obtain the phoneme feature sequence to be tested corresponding to the at least one audio data segment.
4. The method of claim 3, wherein the phoneme recognition network is constructed from at least one single word in the content being assessed, a pronunciation dictionary, and the confusing phoneme table; and is
And the word graph of each network path in the phoneme recognition network is adjusted according to a pre-input development set.
5. The method of any one of claims 1 to 4, wherein the identifying of the phoneme to be corrected from the phoneme feature sequence to be tested, which is inconsistent with the standard phoneme feature sequence of the content to be evaluated based on the time boundary, the confusion phoneme table and the threshold corresponding to the confusion phoneme, comprises:
acquiring a standard phoneme characteristic sequence generated based on the evaluation content;
according to the time boundary, the phoneme feature sequence to be detected corresponding to each word in at least one audio data segment is aligned with the standard phoneme feature sequence corresponding to the word by editing distance to obtain distinguished phoneme information;
and determining the phonemes to be corrected corresponding to the distinguished phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes.
6. The method of claim 5, wherein the distinguished phoneme information includes position information, within the phoneme feature sequence to be tested, of a phoneme inconsistent with the standard phoneme feature sequence.
7. The method of any one of claims 1 to 6, wherein the phonemes to be corrected comprise confusing phonemes that are acoustically similar to the standard phonemes in the content under evaluation.
8. The method of any one of claims 1 to 7, wherein the cross-entropy criterion CE model is used to obtain the time boundaries and acoustic likelihoods within the respective time boundaries for the at least one audio data segment.
9. A pronunciation assessment device, comprising:
the extraction module is configured to extract at least one audio data segment from the pronunciation audio to be tested of the user aiming at the content to be evaluated;
the first evaluation module is configured to acquire a time boundary corresponding to at least one audio data segment and acoustic likelihood in the corresponding time boundary;
the second evaluation module is configured to acquire a phoneme characteristic sequence to be tested corresponding to at least one audio data segment; identifying a phoneme to be corrected which is inconsistent with the standard phoneme feature sequence of the evaluation content from the phoneme feature sequence to be detected based on the time boundary, the confusion phoneme table and a threshold value corresponding to the confusion phoneme;
and the adjusting module is configured to adjust the corresponding score based on the phoneme to be corrected if the phoneme to be corrected exists in the corresponding time boundary and the acoustic likelihood is greater than a preset threshold.
10. The pronunciation assessment apparatus of claim 9, further comprising a determination module configured to determine pronunciation error correction content to be pushed to the user based on the phonemes to be corrected and/or the adjusted scores, wherein the pronunciation error correction content is used to instruct the user to improve the practice of the phonemes to be corrected.
11. The pronunciation assessment device as claimed in claim 9 or 10, wherein the second assessment module is further provided with a phoneme recognition network, and the phoneme recognition network is specifically configured to obtain a phoneme feature sequence to be tested corresponding to at least one audio data segment.
12. The pronunciation assessment apparatus as claimed in claim 11, wherein the phoneme recognition network is constructed of at least one word in the assessment contents, a pronunciation dictionary and the confusion phoneme table; and is
And the word graph of each network path in the phoneme recognition network is adjusted according to a pre-input development set.
13. The pronunciation assessment device of any one of claims 9 to 12, wherein the second assessment module, when identifying the phonemes to be corrected that do not conform to the standard phoneme feature sequence of the assessment content from the phoneme feature sequence to be assessed based on the time boundary, the confusion phoneme table and the threshold corresponding to the confusion phoneme, is specifically configured to:
acquiring a standard phoneme characteristic sequence generated based on the evaluation content;
according to the time boundary, the phoneme feature sequence to be detected corresponding to each word in at least one audio data segment is aligned with the standard phoneme feature sequence corresponding to the word by editing distance to obtain distinguished phoneme information;
and determining the phonemes to be corrected corresponding to the distinguished phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes.
14. The pronunciation assessment device of claim 13, wherein the discriminating phoneme information includes position information that phonemes which do not coincide with a standard phoneme feature sequence are in the phoneme feature sequence to be tested.
15. The pronunciation assessment apparatus as claimed in any one of claims 9 to 14, wherein the phonemes to be corrected include confusing phonemes which are acoustically close to the standard phonemes in the assessment content.
16. The pronunciation assessment device of any one of claims 9 to 15, wherein the first assessment module is provided with a cross entropy criterion CE model, which is specifically configured to obtain a time boundary corresponding to at least one audio data segment and an acoustic likelihood within the corresponding time boundary.
17. A computer-readable storage medium storing program code which, when executed by a processor, implements a method according to one of claims 1 to 8.
18. A computing device comprising a processor and a storage medium storing program code which, when executed by the processor, implements the method of one of claims 1 to 8.

Claims (10)

1. A pronunciation assessment method, comprising:
extracting at least one audio data segment from the audio frequency of the pronunciation to be tested of the user aiming at the content to be evaluated;
acquiring a time boundary corresponding to at least one audio data segment and acoustic likelihood in the corresponding time boundary;
acquiring a phoneme feature sequence to be detected corresponding to at least one audio data segment;
identifying a phoneme to be corrected which is inconsistent with the standard phoneme feature sequence of the evaluation content from the phoneme feature sequence to be detected based on the time boundary, the confusion phoneme table and a threshold value corresponding to the confusion phoneme;
and if the phonemes to be corrected exist in the corresponding time boundary and the acoustic likelihood is greater than a preset threshold, adjusting the corresponding score based on the phonemes to be corrected.
2. The method of claim 1, further comprising:
and determining pronunciation error correction content pushed to the user based on the phoneme to be corrected and/or the adjusted score, wherein the pronunciation error correction content is used for instructing the user to perform improved exercise on the phoneme to be corrected.
3. The method according to claim 1 or 2, wherein a phoneme recognition network is used to obtain the phoneme feature sequence to be tested corresponding to at least one audio data segment.
4. The method of claim 3, wherein the phoneme recognition network is constructed from at least one word in the content being assessed, a pronunciation dictionary, and the confusing phoneme table; and is
And the word graph of each network path in the phoneme recognition network is adjusted according to a pre-input development set.
5. The method of any one of claims 1 to 4, wherein the identifying of the phoneme to be corrected from the phoneme feature sequence to be tested, which is inconsistent with the standard phoneme feature sequence of the content to be evaluated based on the time boundary, the confusion phoneme table and the threshold corresponding to the confusion phoneme, comprises:
acquiring a standard phoneme characteristic sequence generated based on the evaluation content;
according to the time boundary, the phoneme feature sequence to be detected corresponding to each word in at least one audio data segment and the standard phoneme feature sequence corresponding to the word are subjected to editing distance alignment to obtain distinguished phoneme information;
and determining the phonemes to be corrected corresponding to the distinguished phoneme information through a Bayesian judgment module according to the confusion phoneme table and the threshold corresponding to the confusion phonemes.
6. The method of claim 5, wherein the distinguished phoneme information includes position information, within the phoneme feature sequence to be tested, of a phoneme inconsistent with the standard phoneme feature sequence.
7. The method of any one of claims 1 to 6, wherein a cross-entropy (CE) criterion model is used to obtain the time boundary corresponding to the at least one audio data segment and the acoustic likelihood within the corresponding time boundary.
8. A pronunciation evaluation device, comprising:
an extraction module configured to extract at least one audio data segment from audio to be tested of the user's pronunciation of the content to be evaluated;
a first evaluation module configured to acquire a time boundary corresponding to the at least one audio data segment and an acoustic likelihood within the corresponding time boundary;
a second evaluation module configured to acquire a phoneme feature sequence to be tested corresponding to the at least one audio data segment, and to identify, from the phoneme feature sequence to be tested, a phoneme to be corrected that is inconsistent with the standard phoneme feature sequence of the content to be evaluated, based on the time boundary, the confusion phoneme table, and the threshold corresponding to each confusion phoneme; and
an adjusting module configured to adjust the corresponding score based on the phoneme to be corrected, if the phoneme to be corrected exists within the corresponding time boundary and the acoustic likelihood is greater than a preset threshold.
9. A computer-readable storage medium storing program code which, when executed by a processor, implements the method of any one of claims 1 to 7.
10. A computing device, comprising a processor and a storage medium storing program code which, when executed by the processor, implements the method of any one of claims 1 to 7.
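Claim 4 constructs the phoneme recognition network from the words under evaluation, a pronunciation dictionary, and the confusion phoneme table. A minimal sketch of one ingredient of such a network, expanding a word's canonical pronunciation with confusable alternatives to enumerate candidate recognition paths, might look like this (the function name, phone symbols, and confusion pairs are illustrative assumptions, not the patent's implementation):

```python
from itertools import product

def build_phoneme_paths(pronunciation, confusion_table):
    """Enumerate candidate phoneme paths: the canonical pronunciation
    plus every combination of confusable-phoneme substitutions."""
    # For each canonical phone, its allowed alternatives (itself first).
    alternatives = [[p] + confusion_table.get(p, []) for p in pronunciation]
    return [tuple(path) for path in product(*alternatives)]

# Hypothetical dictionary entry and confusion pairs for "this".
canonical = ["DH", "IH", "S"]
confusion = {"DH": ["D"], "IH": ["IY"]}   # common L2 confusions (assumed)

paths = build_phoneme_paths(canonical, confusion)
```

In a full system each path would become one branch of the recognition network, with per-path weights then tuned on the development set as claim 4 describes.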
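The edit-distance alignment in claim 5, which compares the phoneme feature sequence to be tested against the standard sequence to locate differing phonemes, can be sketched with a standard Levenshtein alignment plus backtrace (symbols are illustrative; the patent's method additionally uses the time boundary, which is omitted here):

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_idx, hyp_idx),
    where op is "match", "sub", "del", or "ins"."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append(("match" if ref[i - 1] == hyp[j - 1] else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return list(reversed(ops))

# "differing phoneme information": the non-matching ops with positions,
# as in claim 6 (toy phone sequences, assumed for illustration).
diffs = [op for op in align(["DH", "IH", "S"], ["D", "IH", "S"]) if op[0] != "match"]
```

Each remaining `("sub", i, j)` entry carries exactly the position information claim 6 requires, and would then be passed to the Bayesian decision step against the confusion phoneme table.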
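Claim 8's adjusting module fires only when the acoustic likelihood exceeds a preset threshold and phonemes to be corrected exist within the time boundary. A hedged sketch of such a gated score adjustment, where the gate value, penalty size, per-pair thresholds, and record layout are all assumed for illustration:

```python
def adjust_score(base_score, likelihood, phones_to_correct,
                 pair_thresholds, gate=0.5, penalty=0.2):
    """Adjust the segment score only when the acoustic likelihood passes
    the preset gate AND a confusable-phoneme error clears its per-pair
    threshold; otherwise the base score is kept unchanged."""
    if likelihood <= gate or not phones_to_correct:
        return base_score
    confirmed = [
        p for p in phones_to_correct
        if p["posterior"] >= pair_thresholds.get((p["expected"], p["observed"]), 1.0)
    ]
    # Penalize each confirmed phoneme to be corrected; floor at zero.
    return max(0.0, base_score - penalty * len(confirmed))

score = adjust_score(
    base_score=0.9, likelihood=0.8,
    phones_to_correct=[{"expected": "DH", "observed": "D", "posterior": 0.7}],
    pair_thresholds={("DH", "D"): 0.6},
)
```

The two-condition gate mirrors the claim's logic: a low-likelihood segment is left alone (the recognition itself is unreliable there), and only well-attested confusable substitutions pull the score down.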
CN201910405363.6A 2019-05-16 2019-05-16 Pronunciation evaluation method, medium, device and computing equipment Pending CN111951825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910405363.6A CN111951825A (en) 2019-05-16 2019-05-16 Pronunciation evaluation method, medium, device and computing equipment

Publications (1)

Publication Number Publication Date
CN111951825A true CN111951825A (en) 2020-11-17

Family

ID=73335464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910405363.6A Pending CN111951825A (en) 2019-05-16 2019-05-16 Pronunciation evaluation method, medium, device and computing equipment

Country Status (1)

Country Link
CN (1) CN111951825A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090150154A1 (en) * 2007-12-11 2009-06-11 Institute For Information Industry Method and system of generating and detecting confusing phones of pronunciation
CN102122507A (en) * 2010-01-08 2011-07-13 龚澍 Speech error detection method by front-end processing using artificial neural network (ANN)
CN105609114A (en) * 2014-11-25 2016-05-25 科大讯飞股份有限公司 Method and device for detecting pronunciation
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 The detection method of a kind of mistake utterance detection based on degree of depth study and device
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluating method, device, electronic equipment and storage medium
CN109712643A (en) * 2019-03-13 2019-05-03 北京精鸿软件科技有限公司 The method and apparatus of Speech Assessment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Liyong et al.: "Phoneme Recognition Based on Discriminative Features", Journal of Information Engineering University, vol. 14, no. 06, 15 December 2013 (2013-12-15), pages 692-699 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908363A (en) * 2021-01-21 2021-06-04 北京乐学帮网络技术有限公司 Pronunciation detection method and device, computer equipment and storage medium
CN112908363B (en) * 2021-01-21 2022-11-22 北京乐学帮网络技术有限公司 Pronunciation detection method and device, computer equipment and storage medium
CN112908359A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Voice evaluation method and device, electronic equipment and computer readable medium
CN112562731A (en) * 2021-02-24 2021-03-26 北京读我网络技术有限公司 Spoken language pronunciation evaluation method and device, electronic equipment and storage medium
CN112562731B (en) * 2021-02-24 2021-07-06 北京读我网络技术有限公司 Spoken language pronunciation evaluation method and device, electronic equipment and storage medium
CN112992184A (en) * 2021-04-20 2021-06-18 北京世纪好未来教育科技有限公司 Pronunciation evaluation method and device, electronic equipment and storage medium
CN113782059A (en) * 2021-09-24 2021-12-10 苏州声通信息科技有限公司 Musical instrument audio evaluation method and device and non-transient storage medium
CN113782059B (en) * 2021-09-24 2024-03-22 苏州声通信息科技有限公司 Musical instrument audio evaluation method and device and non-transient storage medium
CN115083437A (en) * 2022-05-17 2022-09-20 北京语言大学 Method and device for determining uncertainty of learner pronunciation

Similar Documents

Publication Publication Date Title
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
US9466289B2 (en) Keyword detection with international phonetic alphabet by foreground model and background model
KR101183344B1 (en) Automatic speech recognition learning using user corrections
CN110085261B (en) Pronunciation correction method, device, equipment and computer readable storage medium
US11282511B2 (en) System and method for automatic speech analysis
CN108389573B (en) Language identification method and device, training method and device, medium and terminal
CN110782921A (en) Voice evaluation method and device, storage medium and electronic device
CN107886968B (en) Voice evaluation method and system
Stan et al. ALISA: An automatic lightly supervised speech segmentation and alignment tool
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
KR102199246B1 (en) Method And Apparatus for Learning Acoustic Model Considering Reliability Score
CN111951629A (en) Pronunciation correction system, method, medium and computing device
CN111951828A (en) Pronunciation evaluation method, device, system, medium and computing equipment
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
JP6148150B2 (en) Acoustic analysis frame reliability calculation device, acoustic model adaptation device, speech recognition device, their program, and acoustic analysis frame reliability calculation method
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
CN112967711B (en) Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages
CN111833859B (en) Pronunciation error detection method and device, electronic equipment and storage medium
CN110853669B (en) Audio identification method, device and equipment
CN109065024B (en) Abnormal voice data detection method and device
CN111950327A (en) Mouth shape correcting method, mouth shape correcting device, mouth shape correcting medium and computing equipment
JP6527000B2 (en) Pronunciation error detection device, method and program
Abdou et al. Enhancing the confidence measure for an Arabic pronunciation verification system
CN112992184B (en) Pronunciation evaluation method and device, electronic equipment and storage medium
CN113053414A (en) Pronunciation evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination