CN117043857A - Method, apparatus and computer program product for English pronunciation assessment - Google Patents


Info

Publication number
CN117043857A
CN117043857A
Authority
CN
China
Prior art keywords
word
phoneme
speech
model
english
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180090828.9A
Other languages
Chinese (zh)
Inventor
陈子意
朱益兴
初伟
于欣璐
夏天
常鹏
韩玫
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of CN117043857A
Pending legal-status Critical Current


Classifications

    • G09B 19/04: Teaching not covered by other main groups of this subclass; Speaking
    • G09B 5/02: Electrically-operated educational appliances with visual presentation of the material to be studied, e.g. using film strip
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/142: Speech classification or search using statistical models; Hidden Markov Models [HMMs]
    • G10L 15/187: Speech classification or search using natural language modelling; Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/197: Grammatical context; Probabilistic grammars, e.g. word n-grams
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 25/21: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use

Abstract

An English pronunciation assessment method includes: receiving an audio file comprising English speech and a text transcript corresponding to the English speech; inputting the audio signal into one or more acoustic models to obtain speech information for each phoneme in each word, wherein the one or more acoustic models are trained with speech uttered by native speakers and are further trained with speech uttered by non-native speakers without marking pronunciation errors, such that pronunciation errors are more accurately detected based on the obtained speech information; extracting time-series features of each word; and inputting the extracted time-series features of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech, for words with different syllable counts, without padding short words, which would introduce input approximation.

Description

Method, apparatus and computer program product for English pronunciation assessment
Cross Reference to Related Applications
The present application claims priority to U.S. patent application Ser. No. 17/145,136, filed on January 8, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of pronunciation assessment technology, and more particularly, to a method, apparatus, and computer program product for English pronunciation assessment based on machine learning techniques.
Background
Non-native speakers often mispronounce words or misplace accents in their English speech. Their pronunciation can be improved by practice, i.e., making mistakes, receiving feedback, and making corrections. Traditionally, practicing English pronunciation requires interaction with a human English teacher. In addition to human English teachers, computer-aided language learning (CALL) systems are commonly available to provide a goodness of pronunciation (GOP) score as feedback on English spoken by non-native speakers. In this case, an audio recording of the English speech of a non-native speaker reciting an English text transcript is input to the pronunciation assessment system. The pronunciation assessment system assesses the English pronunciation of the non-native speaker and outputs words with pronunciation errors, such as mispronounced phonemes and misplaced word accents. However, there is a need to improve the accuracy and sensitivity of computer-aided pronunciation assessment systems.
The disclosure provides an English pronunciation assessment method based on machine learning techniques. The method incorporates non-native speech that is not annotated with pronunciation errors into the training of the acoustic model used to generate GOP scores. The acoustic model also employs accent-based features as auxiliary inputs. Furthermore, time-series features are input into the acoustic model for word accent detection to fully exploit the input information and accommodate words with different syllable counts. Therefore, the accuracy and recall of detecting pronunciation errors and misplaced accents are improved.
Disclosure of Invention
One aspect of the present disclosure includes a computer-implemented English pronunciation assessment method. The method comprises the following steps: receiving an audio file comprising English speech and a text transcript corresponding to the English speech; inputting the audio signal included in the audio file into one or more acoustic models to obtain the speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by native speakers and are further trained with speech uttered by non-native speakers without marking pronunciation errors, thereby detecting pronunciation errors more accurately based on the one or more acoustic models trained with the speech of native and non-native speakers; extracting a time-series feature of each word contained in the input audio signal to convert each word of varying length into a fixed-length feature vector; inputting the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal contained in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech, for words with different syllable counts, without padding short words, which would introduce input approximation; and outputting, in the text transcript, each word having a pronunciation error corresponding to at least the word accent.
Another aspect of the present disclosure includes an English pronunciation assessment apparatus. The apparatus includes: a memory storing program instructions; and a processor configured to execute the program instructions stored in the memory to perform: receiving an audio file comprising English speech and a text transcript corresponding to the English speech; inputting the audio signal included in the audio file into one or more acoustic models to obtain the speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by native speakers and are further trained with speech uttered by non-native speakers without marking pronunciation errors, thereby detecting pronunciation errors more accurately based on the one or more acoustic models trained with the speech of native and non-native speakers; extracting a time-series feature of each word contained in the input audio signal to convert each word of varying length into a fixed-length feature vector; inputting the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal contained in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech, for words with different syllable counts, without padding short words, which would introduce input approximation; and outputting, in the text transcript, each word having a pronunciation error corresponding to at least the word accent.
Another aspect of the disclosure includes a computer program product comprising a non-transitory computer-readable storage medium and program instructions stored therein, the program instructions being configured to be executable by a computer to cause the computer to perform operations comprising: receiving an audio file comprising English speech and a text transcript corresponding to the English speech; inputting the audio signal included in the audio file into one or more acoustic models to obtain the speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by native speakers and are further trained with speech uttered by non-native speakers without marking pronunciation errors, thereby detecting pronunciation errors more accurately based on the one or more acoustic models trained with the speech of native and non-native speakers; extracting a time-series feature of each word contained in the input audio signal to convert each word of varying length into a fixed-length feature vector; inputting the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal contained in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech, for words with different syllable counts, without padding short words, which would introduce input approximation; and outputting, in the text transcript, each word having a pronunciation error corresponding to at least the word accent.
Other aspects of the present disclosure will be appreciated by those skilled in the art from the specification, claims and drawings of the present disclosure.
Drawings
FIG. 1 illustrates an exemplary English pronunciation assessment method, according to embodiments of the present disclosure;
FIG. 2 illustrates another exemplary English pronunciation assessment method, according to embodiments of the present disclosure;
FIG. 3 illustrates an exemplary method of obtaining phonetic information for each phoneme in each word in accordance with an embodiment of the disclosure;
FIG. 4 illustrates an exemplary method of detecting pronunciation errors, according to an embodiment of the present invention;
FIG. 5 illustrates an exemplary method of detecting misplaced word accents in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates an exemplary time-delayed neural network (TDNN) in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates an exemplary factorization layer with semi-orthogonal constraints in accordance with an embodiment of the disclosure;
FIG. 8 illustrates an exemplary state-clustered triphone Hidden Markov Model (HMM) according to an embodiment of the disclosure;
FIG. 9 illustrates an exemplary posterior probabilistic acoustic model in accordance with an embodiment of the present disclosure;
FIG. 10 illustrates an exemplary neural network architecture for acoustic modeling for pronunciation error detection according to embodiments of the present disclosure;
FIG. 11 illustrates comparing accuracy versus recall curves for various AMs consistent with embodiments of the present disclosure;
FIG. 12 illustrates an exemplary neural network architecture for acoustic modeling of misplaced word stress detection in accordance with an embodiment of the present disclosure; and
fig. 13 illustrates an exemplary english pronunciation assessment device according to an embodiment of the disclosure.
Detailed Description
The following describes the technical scheme in the embodiment of the present invention with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. It will be apparent that the described embodiments are merely some, but not all, embodiments of the invention. Other embodiments, which are obtained based on embodiments of the present invention without inventive effort by those skilled in the art, will fall within the scope of the present disclosure. Certain terms used in this disclosure are explained first below.
Acoustic Model (AM): acoustic models are used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up the speech. The model is learned from a set of audio recordings and their corresponding transcripts, and machine learning software algorithms are used to create a statistical representation of the sounds that make up each word.
Automatic Speech Recognition (ASR): ASR is a technique that converts spoken words into text.
Bidirectional Encoder Representations from Transformers (BERT): BERT is a method of pre-training language representations.
Cross Entropy (CE): The cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if the coding scheme used for the set is optimized for an estimated probability distribution q rather than the true distribution p.
Deep Neural Network (DNN): DNN is an artificial neural network with multiple layers between input and output layers for modeling complex nonlinear relationships. DNNs are typically feed-forward networks in which data flows from an input layer to an output layer without loopback.
Goodness of Pronunciation (GOP): The GOP algorithm calculates a likelihood ratio between the realized phone and the phone that should have been spoken according to the canonical pronunciation.
Hidden Markov Model (HMM): HMMs are statistical Markov models in which the system being modeled is assumed to be a Markov process with unobservable (hidden) states.
Word accent detection: word accent detection is a deep learning model that identifies whether the vowel phone in an isolated word is accented or not.
Light Gradient Boosting Machine (LightGBM): LightGBM is an open-source gradient boosting framework for machine learning. It is based on decision tree algorithms and is used for ranking, classification, and other machine learning tasks.
Long Short-Term Memory (LSTM): LSTM is an artificial Recurrent Neural Network (RNN) architecture used in deep learning applications.
Mel-frequency cepstral coefficient (MFCC): The mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a non-linear mel scale of frequency. MFCCs are the coefficients that collectively make up the mel-frequency cepstrum.
Mixture model (MM): A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed dataset identify the subpopulation to which an individual observation belongs. As one kind of mixture model, a Gaussian mixture model is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
Multi-task learning (MTL): MTL is a subfield of machine learning in which multiple learning tasks are solved simultaneously while exploiting commonalities and differences across tasks. Compared with training the models separately, MTL can improve the learning efficiency and prediction accuracy of the task-specific models.
Mutual Information (MI): MI of two random variables is a measure of the interdependence between the two variables. More specifically, it quantifies the amount of information obtained about one random variable by observing another random variable.
One-hot encoding (OHE): One-hot encoding is typically used to indicate the state of a state machine, which is in the n-th state if and only if the n-th bit is high.
Phones and phonemes: A phone is any distinct speech sound or gesture, regardless of whether the exact sound is critical to the meaning of a word. In contrast, a phoneme is a speech sound in a given language that, if swapped with another, could change one word into another.
Senone: Senones are sub-phonetic units defined as tied states within context-dependent phones.
Time Delay Neural Network (TDNN): TDNN is a multi-layer artificial neural network architecture that aims to classify patterns using shift invariance and model the context of the layers of the network.
Universal Background Model (UBM): A UBM is a model often used in biometric verification systems to represent general, person-independent feature characteristics, which are compared against a model of person-specific feature characteristics when making an accept-or-reject decision.
Word Error Rate (WER): WER is a measure of speech recognition performance.
The present disclosure provides an english pronunciation assessment method. The method takes advantage of various machine learning techniques to improve the performance of detecting mispronounced and misplaced accents of words spoken by non-native speakers.
FIG. 1 illustrates an exemplary English pronunciation assessment method, according to embodiments of the present disclosure. As shown in fig. 1, an audio file including English speech is received as input, together with a text transcript corresponding to the English speech (at S110).
The audio file includes an audio signal of human speech. The audio signal is a time-varying signal. Typically, the audio signal is divided into a plurality of segments for audio analysis. Such segments are also referred to as analysis frames or phones and typically have a duration of 10 ms to 250 ms. Audio frames or phones string together to form words.
At S120, a time-series feature of each word included in the input audio file is extracted to convert each word of varying length into a feature vector of fixed length.
Specifically, extracting the time-series features includes windowing the audio signal into a plurality of frames, performing a Discrete Fourier Transform (DFT) on each frame, taking a logarithm of an amplitude of each DFT-transformed frame, warping frequencies contained in the DFT-transformed frame on a mel scale, and performing an inverse Discrete Cosine Transform (DCT).
The time-series features may include frequency, energy, and mel-frequency cepstral coefficient (MFCC) features. After the frequency, energy, and MFCC features are extracted, they are normalized at the word level and along each feature dimension. For example, the extracted features are linearly scaled to a minimum-maximum range and the mean value is subtracted.
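For illustration only, the following is a minimal NumPy sketch of the feature extraction described above (windowing, DFT, log magnitude, mel warping, DCT, and word-level normalization). The frame length, hop size, FFT size, filter count, and the min-max-then-mean-subtraction normalization are assumed values and choices, not parameters specified by this disclosure.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mfcc=40, n_filters=40):
    # Assumes len(signal) >= frame_len.
    # 1) Window the signal into overlapping frames (25 ms frames, 10 ms hop at 16 kHz).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2) DFT of each frame, then log magnitude.
    spec = np.abs(np.fft.rfft(frames, n_fft))
    # 3) Warp the frequencies onto the mel scale with a triangular filterbank.
    mel_energies = np.log(spec @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4) Inverse transform via a DCT; keep the first n_mfcc coefficients.
    n = np.arange(n_filters)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filters))
    return mel_energies @ dct_basis.T        # shape: (n_frames, n_mfcc)

def normalize_word_features(feats):
    # Word-level normalization per feature dimension: min-max scale, then subtract the mean.
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    scaled = (feats - lo) / np.maximum(hi - lo, 1e-10)
    return scaled - scaled.mean(axis=0)
```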
At S130, the audio signal included in the audio file and the extracted time-series feature of each word are input to one or more acoustic models to obtain voice information of each phoneme in each word of english voice. In particular, one or more acoustic models may be cascaded together.
In computer aided pronunciation training, speech recognition related techniques are used to detect pronunciation errors and misplaced word accents in non-native speaker uttered speech. At the segmentation level, the speech is analyzed to detect the pronunciation errors of each phoneme in each word. At the super-segmentation level, the speech is analyzed to detect misplaced word accents for each word.
Pronunciation errors may include substitution, insertion, and deletion errors. Detecting insertion and deletion errors involves building an extended recognition network from phonological rules either summarized by English-as-a-second-language (ESL) teachers or automatically learned from data labeled by ESL teachers. English pronunciation assessment methods consistent with embodiments of the present disclosure do not require the involvement of ESL teachers. In this specification, pronunciation error detection focuses only on substitution errors.
In the prior art, acoustic Models (AM) for pronunciation error detection are typically trained with a native speaker data set. The AM may also be further trained with non-native speaker datasets. However, pronunciation errors in the non-native speaker data set must be annotated by the ESL teacher, which limits the size of the non-native speaker data set and thus provides less than ideal accuracy.
In embodiments of the present disclosure, a large non-native speaker dataset (1,700 hours of speech) without pronunciation error annotations is incorporated into the AM training together with the native speaker dataset, which substantially improves the performance of pronunciation error detection without requiring ESL teachers to mark pronunciation errors in the non-native speaker dataset.
In acoustic modeling for speech recognition, matched training and testing conditions are assumed. In speech assessment, however, a baseline AM trained on native speakers' training speech must be applied to mismatched test speech from non-native speakers. In embodiments of the present disclosure, accent-based embeddings and accent one-hot encodings are incorporated into the acoustic modeling. The AM is trained in a multi-task learning (MTL) manner, and when the auxiliary feature is extracted in the inference stage, the speech of non-native speakers is intentionally treated as the speech of native speakers. This approach substantially improves the performance of pronunciation error detection.
The AM typically includes feed-forward neural networks such as Time Delay Neural Networks (TDNNs) and 1D dilated convolutional neural networks with ResNet-type connections.
Fig. 3 illustrates an exemplary method of obtaining phonetic information for each phoneme in each word in accordance with an embodiment of the disclosure. As shown in fig. 3, inputting the audio signal included in the audio file into one or more acoustic models to obtain the speech information of each phoneme in each word of the English speech may include the following steps.
At S310, an audio signal included in the audio file is input into the alignment acoustic model to obtain time boundaries of each word and each phoneme in each word.
Specifically, the alignment acoustic model is used to determine the temporal boundaries of each phoneme and each word given the corresponding text transcript. The aligned acoustic model may include a combination of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM) or a combination of a Neural Network Model (NNM) and an HMM.
The GMM is used for density estimation. It consists of a linear combination of Gaussian densities:

$$f(x)=\sum_{m=1}^{M} \alpha_m\, \phi(x;\mu_m,\Sigma_m)$$

where the $\alpha_m$ are mixing proportions with $\sum_m \alpha_m = 1$, and each $\phi(x;\mu_m,\Sigma_m)$ is a Gaussian density with mean $\mu_m$ and covariance matrix $\Sigma_m$.
In one embodiment, the NNM is a factorized Time Delay Neural Network (TDNN). Factorized TDNNs (also known as TDNNFs) use sub-sampling to reduce computation during training. In the TDNN architecture, the initial transform is learned over a narrower context, while the deeper layers handle hidden activations from a wider temporal context. Higher layers have the ability to learn a wider temporal relationship. Each layer in TDNN operates at a different time resolution, which increases at higher layers of the neural network.
Fig. 6 illustrates an exemplary time-delayed neural network (TDNN) according to an embodiment of the present disclosure. As shown in fig. 6, hidden activations are calculated at all time steps of each layer, and the correlation between activations is cross-layer and localized in time. The super parameters defining the TDNN network are the input context for each layer required for computing output activation in time steps. The layer-by-layer context specification corresponding to TDNN is shown in table 1.
TABLE 1
For example, in Table 1, the notation {-7,2} indicates that the network splices together the inputs at the current frame minus 7 and the current frame plus 2. As shown in fig. 6, wider and wider contexts are stitched together by higher layers of the network. At the input layer, the network splices frames t-2 through t+2 (i.e., {-2,-1,0,1,2}, or more compactly [-2,2]). At the three hidden layers, the network splices frames with offsets {-1,2}, {-3,3} and {-7,2}. In Table 1, the context is compared to the hypothetical setting in the middle column without sub-sampling. The differences between the offsets at the hidden layers are configured as multiples of 3 to ensure that only a small number of hidden-layer activations are evaluated for each output frame. In addition, the network uses an asymmetric input context with more left context, because this reduces the latency of the neural network in online decoding.
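For illustration, the following NumPy sketch shows frame splicing at the offsets listed above; the helper name and the edge-clamping behavior at utterance boundaries are illustrative assumptions rather than details specified by this disclosure.

```python
import numpy as np

def splice(frames, offsets):
    """Concatenate, for every time step t, the feature vectors at t + each offset.

    frames:  (T, D) array of frame-level features (e.g., MFCCs).
    offsets: e.g., (-1, 2) splices the frame one step back and two steps ahead.
    Edge frames are clamped to the first/last frame (an illustrative padding choice).
    """
    T = frames.shape[0]
    cols = []
    for off in offsets:
        idx = np.clip(np.arange(T) + off, 0, T - 1)
        cols.append(frames[idx])
    return np.concatenate(cols, axis=1)       # (T, D * len(offsets))

# Example: the input layer splices [-2, 2]; hidden layers use {-1, 2}, {-3, 3}, {-7, 2}.
x = np.random.randn(100, 40)                  # 100 frames of 40-dim MFCCs
layer1_in = splice(x, (-2, -1, 0, 1, 2))
hidden_in = splice(np.random.randn(100, 1536), (-7, 2))
```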
FIG. 7 illustrates an exemplary factorization layer having semi-orthogonal constraints in accordance with an embodiment of the disclosure. As shown in fig. 7, the TDNN acoustic model is trained with a parameter matrix expressed as a product of two or more smaller matrices, where all factors except one are constrained to be semi-orthogonal, so that the TDNN becomes factorized TDNN or TDNNF.
In one embodiment, the factorized convolution in each hidden layer is a three-stage convolution. It includes a 2x1 convolution constrained to a bottleneck dimension of 256, followed by a 2x1 convolution projecting back to the hidden dimension of 1536. That is, 1536 => 256 => 1536 within one layer. Due to the additional 2x1 convolution, the effective temporal context is wider than that of a TDNN without factorization. The dropout rate rises from 0 at the beginning of training to 0.5 halfway through, and returns to 0 at the end. Dropout is applied after the ReLU and batch normalization.
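The following is a minimal NumPy sketch of the factorized layer (1536 => 256 => 1536) described above. Projecting the first factor onto the nearest semi-orthogonal matrix via an SVD is an illustrative simplification standing in for the iterative constraint update used in actual TDNNF training; it is not the exact procedure of this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, bottleneck = 1536, 256

# Factorize one layer's parameter matrix into two smaller factors:
# A maps 1536 -> 256 (constrained to be semi-orthogonal), B maps 256 -> 1536.
A = rng.standard_normal((bottleneck, hidden_dim)) * 0.01
B = rng.standard_normal((hidden_dim, bottleneck)) * 0.01

def nearest_semi_orthogonal(M):
    # Project M onto the nearest matrix satisfying M @ M.T = I
    # (a simplified stand-in for the periodic semi-orthogonal update applied in training).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def tdnnf_layer(x):
    # x: (T, 1536) hidden activations; returns (T, 1536).
    bottleneck_out = x @ A.T          # 1536 -> 256
    out = bottleneck_out @ B.T        # 256 -> 1536
    return np.maximum(out, 0.0)       # ReLU

A = nearest_semi_orthogonal(A)        # enforce the semi-orthogonal constraint on the first factor
y = tdnnf_layer(np.zeros((10, hidden_dim)))
```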
In one embodiment, the HMM is a state-clustered triphone model for time-series data with a set of discrete states. The triphone HMM models three distinct states for each phone. A phonetic decision tree is used to cluster similar states together. The triphone HMM naturally produces an alignment between states and observations. A neural network model or Gaussian mixture model estimates the likelihoods, and the triphone HMM uses the estimated likelihoods in the Viterbi algorithm to determine the most likely phone sequence.
Fig. 8 illustrates an exemplary state-clustered triphone Hidden Markov Model (HMM) according to an embodiment of the disclosure. As shown in fig. 8, the phonetic context of a speech unit is modeled using state-clustered triphones, where the realization of a phone depends on its neighboring phones. The context may be modeled using longer units that incorporate the context, or using multiple models per context, i.e., context-dependent phone models. For triphones, each phone has a distinct model for each left and right context. A phone x with left context l and right context r may be denoted l-x+r.
A context-dependent model is more specific than a context-independent model. As more context-dependent models are defined, each model is responsible for a smaller region of the acoustic-phonetic space. In general, the number of possible triphone types is much greater than the number of triphone tokens actually observed. Techniques such as smoothing and parameter sharing are used to reduce the number of triphone models. Smoothing combines less specific and more specific models. Parameter sharing enables different contexts to share models. Various examples of smoothing and parameter sharing are described below.
In one example, as a type of smoothing, back-off uses a less specific model when there is insufficient data to train a more specific model. For example, if a triphone is not observed, or only a few examples are observed, a diphone model may be used instead of the triphone model. Diphones can be further reduced to monophones if diphone occurrences are rare. A minimum training-example count may be used to decide whether to model a triphone or back off to a diphone. This approach ensures that each model is well trained. Because the training data is sparse, relatively few specific triphone models are actually trained.
In another example, interpolation combines less specific models with more specific models as another type of smoothing. For example, the triphone parameters $\lambda_{tri}$ are interpolated with the diphone parameters $\lambda_{di}$ and the monophone parameters $\lambda_{mono}$, i.e., $\lambda = \alpha_1 \lambda_{tri} + \alpha_2 \lambda_{di} + \alpha_3 \lambda_{mono}$, where the interpolation parameters $\alpha_1$, $\alpha_2$ and $\alpha_3$ are estimated by deleted interpolation. Interpolation enables more triphone models to be estimated, and robustness is also improved because the diphone and monophone models share data from other contexts.
In another example, parameter sharing explicitly shares models or parameters between different contexts. Sharing may occur at different levels. At the Gaussian level, all distributions share the same set of Gaussians but have different mixture weights (i.e., tied mixtures). At the state level, different models are allowed to share the same states (i.e., state clustering). In state clustering, states responsible for acoustically similar data are shared. By clustering similar states, the training data associated with each state can be pooled, resulting in better parameter estimates for that state. The left context and the right context may be clustered separately. At the model level, similar context-dependent models are merged (i.e., generalized triphones). The context-dependent phones are referred to as allophonic variants of the parent phone. Allophone models with different triphone contexts are compared, and similar models are combined. A combined model can be estimated from more data than a single model, resulting in more accurate models and generally fewer models. The combined model is called a generalized triphone.
Furthermore, a phonetic decision tree is used for state clustering. A phonetic decision tree is built for each state of each parent phone, with a yes/no question at each node. At the root of the phonetic decision tree, all states are pooled. The yes/no questions are used to split the state pool. The resulting state clusters are the leaves of the phonetic decision tree. The question at each node is selected from a large set of predefined questions. The question is selected to maximize the likelihood of the data for the given state cluster. Splitting is terminated when the likelihood increase does not exceed a predetermined likelihood threshold or the amount of data associated with the split node is less than a predetermined data threshold.
The likelihood of a state cluster is determined as follows. First, the log-likelihood of the data associated with the state pool is calculated. In this case, all states are pooled into a single cluster at the root, and all states have Gaussian output probability density functions. Let $S=\{s_1, s_2, \ldots, s_K\}$ be the pool of K states forming a cluster, sharing a common mean $\mu_S$ and covariance $\Sigma_S$. Let X be the training dataset, and let $\gamma_s(x)$ be the probability that state s generates $x \in X$, i.e., the state occupancy probability. The log-likelihood of the data associated with cluster S is then:

$$L(S)=\sum_{x \in X} \log p(x;\mu_S,\Sigma_S)\sum_{s \in S}\gamma_s(x)$$

Furthermore, the likelihood calculation does not require all the data to be passed through each state. When the output probability density function is Gaussian, the log-likelihood can be written as:

$$L(S)=-\frac{1}{2}\left(\log\left[(2\pi)^d\,|\Sigma_S|\right]+d\right)\sum_{s \in S}\sum_{x \in X}\gamma_s(x)$$

where d is the dimension of the data. Therefore, L(S) depends only on the pooled state covariance $\Sigma_S$, which can be calculated from the means and variances of the individual states in the pool, and on the state occupancy probabilities, which have already been computed during the forward-backward procedure.
The splitting question is selected based on the likelihood of the parent state cluster and the likelihoods of the split clusters. A question about the phonetic context aims to partition S into two parts $S_y$ and $S_n$. The states in partition $S_y$ are now clustered together to form a single Gaussian output distribution with mean $\mu_{S_y}$ and covariance $\Sigma_{S_y}$, and the states in partition $S_n$ are clustered together to form a single Gaussian output distribution with mean $\mu_{S_n}$ and covariance $\Sigma_{S_n}$. The likelihood of the data after the split is $L(S_y)+L(S_n)$. The total likelihood increase from partitioning the data is $\Delta = L(S_y)+L(S_n)-L(S)$. The splitting question can be determined by looping over all possible questions, calculating $\Delta$ for each question, and selecting the question with the greatest $\Delta$.
Splitting then continues on the new clusters $S_y$ and $S_n$ until the maximum $\Delta$ falls below a predetermined likelihood threshold or the amount of data associated with the split node falls below a predetermined data threshold. For Gaussian output distributions, the state likelihoods can be estimated using only the state occupancy counts (obtained during alignment) and the parameters of the Gaussians; no acoustic data is required. The state occupancy count is the sum of a state's occupancy probabilities over time.
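An illustrative NumPy sketch of this split-selection procedure is given below, using the pooled-cluster log-likelihood formula above with diagonal covariances; the dictionary representation of states and questions is an assumption for illustration, not a data structure defined by this disclosure.

```python
import numpy as np

def cluster_log_likelihood(states):
    """states: list of dicts with keys 'occ' (state occupancy count, a scalar),
    'mean' (d-dim vector) and 'var' (d-dim diagonal variance)."""
    occ = np.array([s["occ"] for s in states])
    means = np.stack([s["mean"] for s in states])
    var = np.stack([s["var"] for s in states])
    total = occ.sum()
    # Pooled mean and (diagonal) covariance of the merged cluster.
    mu = (occ[:, None] * means).sum(0) / total
    sigma = (occ[:, None] * (var + means ** 2)).sum(0) / total - mu ** 2
    d = mu.shape[0]
    log_det = np.sum(np.log(sigma))
    # L(S) = -1/2 (log[(2*pi)^d |Sigma_S|] + d) * sum of occupancy counts
    return -0.5 * (d * np.log(2 * np.pi) + log_det + d) * total

def best_split(states, questions):
    """questions: list of (name, predicate) pairs; predicate(state) -> True/False."""
    parent = cluster_log_likelihood(states)
    best = None
    for name, pred in questions:
        yes = [s for s in states if pred(s)]
        no = [s for s in states if not pred(s)]
        if not yes or not no:
            continue
        delta = cluster_log_likelihood(yes) + cluster_log_likelihood(no) - parent
        if best is None or delta > best[1]:
            best = (name, delta)
    return best   # the question with the greatest likelihood increase
```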
The state clustering above assumes that the state outputs are Gaussian, which makes the calculation very simple. However, Gaussian mixtures provide better acoustic models than single Gaussians. In one embodiment, an HMM system based on single Gaussian distributions can be converted into a system based on Gaussian mixtures. The conversion may include performing state clustering using single Gaussian distributions, splitting the Gaussian in each clustered state by cloning it and perturbing the means by a small fraction of the standard deviation and then retraining, and repeating the process by splitting the dominant (highest state occupancy count) mixture component in each state.
Returning to fig. 3, at S320, the audio signal included in the audio file and the obtained time boundaries of each word and each phoneme in each word are input to the posterior probability acoustic model to obtain a posterior probability distribution of each senone of each phoneme in each word.
The posterior probabilistic acoustic model may be the same as the alignment acoustic model, with different inputs and outputs. Specifically, the posterior probabilistic acoustic model is a combination of a neural network model and an HMM model. Fig. 9 illustrates an exemplary posterior probabilistic acoustic model in accordance with an embodiment of the present disclosure. As shown in fig. 9, the neural network and the HMM are cascaded to form the posterior probabilistic acoustic model. Since the neural network in fig. 9 is the same as the TDNNF in the alignment acoustic model in fig. 6, a detailed description thereof is omitted. Similarly, since the HMM in fig. 9 is the same as that in the alignment acoustic model in fig. 8, a detailed description is omitted.
Unlike the alignment acoustic model, the input of the posterior probabilistic acoustic model includes the audio signal aligned with the time boundary and the MFCC features extracted from the audio signal at S120, and the output of the posterior probabilistic acoustic model includes the posterior probability distribution of each senone of each phoneme in each word.
Returning to fig. 3, at S330, the obtained time boundaries of each word and each phoneme in each word and the posterior probability distributions of each senone of each phoneme in each word are correlated to obtain the posterior probability distribution of each phoneme in each word. Subsequently, at S340, the time boundaries of each word and each phoneme in each word, as well as the posterior probability distribution of each phoneme in each word, are output for further processing. Specifically, the time boundaries of each word and each phoneme in each word, and the posterior probability distribution of each phoneme in each word, will be used to detect mispronounced phonemes and misplaced word accents, respectively, in speech uttered by non-native speakers. The acoustic models for detecting pronunciation errors and misplaced word accents in speech uttered by non-native speakers are described in detail below.
Returning to fig. 1, at S140, the extracted time-series features of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file are input into the word accent model to obtain misplaced word accents in each word of the English speech, for words with different syllable counts, without padding short words, which would introduce input approximation. As shown in fig. 1, after detecting pronunciation errors, the method detects misplaced word accents in the English speech.
FIG. 5 illustrates an exemplary method of detecting misplaced word accents in accordance with an embodiment of the present disclosure. As shown in fig. 5, at S510, the extracted time-series characteristics of each word, the time boundaries of each word and each phoneme in each word, the posterior probability distribution of each phoneme in each word, the audio signal included in the audio file, and the corresponding text transcript are received. The time series characteristics of each word may be extracted at S120 in fig. 1. The time boundaries of each word and each phoneme in each word may be obtained at S310 in fig. 3. The posterior probability distribution of each phoneme in each word may be obtained at S330 in fig. 3.
At S520, the time-series characteristics of each word, the time boundaries of each word and each phoneme in each word, the posterior probability distribution of each phoneme in each word, the audio signal included in the audio file, and the corresponding text transcript are input to the word accent model to obtain the word accent in each word.
Word accent relates to the prominent syllable of a word. In many cases, the location of the stressed syllable carries important information that disambiguates the word's meaning. For example, "SUBject" (noun) and "subJECT" (verb) have entirely different meanings, as do "PERmit" (a permission) and "perMIT" (to allow). After the word accent of a word is detected, the result is compared with its typical word accent pattern from an English dictionary to determine whether the word accent of the word is misplaced.
In one embodiment, the word stress detection process includes an internal attention process. In conjunction with LSTM machine learning techniques, time series features are modeled using internal attention processing by extracting important information from each word of the input speech and converting the words of varying length into feature vectors of fixed length.
In extracting the time-series features, a plurality of the highest frequencies, or pitch values, are extracted from each audio frame. Because accented syllables exhibit higher energy than their neighboring syllables, energy is also extracted from each audio frame. Further, mel-frequency cepstral coefficient (MFCC) features with delta and delta-delta information are extracted by performing dimension reduction on each frame. When extracting MFCC features, a relatively large dimension is preferred.
FIG. 12 illustrates an exemplary neural network architecture for acoustic modeling of misplaced word stress detection in accordance with an embodiment of the present disclosure. As shown in fig. 12, the word "projector" is divided into three syllables, "pro", "jec" and "tor", by linguistic rules. Several time-series features (e.g., MFCCs, pitch, and energy) represented at the frame level are concatenated; each syllable is encoded by an LSTM block and then converted into a fixed-length feature vector by internal attention processing. After each syllable in the word is processed, all feature vectors representing the syllables interact with each other through self-attention and are trained to fit their final target labels. In this architecture, all LSTM models share the same parameters, and all position-wise feed-forward networks (FFNs) also share the same parameters.
As shown in fig. 12, the neural network architecture of the accent model includes six logic levels. Logic levels 2, 3 and 4 illustrate the internal structure of a syllable coding module comprising a bi-directional LSTM block, unidirectional LSTM blocks and residual edges, and an internal attention handling layer. Bi-directional LSTM is a recurrent neural network architecture for modeling sequence-to-sequence problems. In this case, the input sequence is a time-dependent feature (time sequence), and the output sequence is syllable-level word accent probability.
Based on statistics of syllable duration, the maximum LSTM step size is limited to 50, which corresponds to a duration of 500 ms. As shown in fig. 12, the nodes at logic levels 2 and 3 represent LSTM cell states at different time steps. At logic level 2, two frame-level LSTMs run in opposite directions and their outputs are summed element-wise to enrich the left and right context of each frame state. Logic level 3 comprises a plurality of identical blocks. Each block has a unidirectional LSTM, and its input is added to its output stage by stage via residual edges. The horizontal arrows at logic levels 2 and 3 indicate the directional connections of the cell states in the corresponding LSTM layer. The bidirectional LSTM includes two LSTM cell-state sequences: one with forward connections (left to right) and the other with backward connections (right to left). At the output, the two sequences are added element-wise to serve as the input sequence of the next stage (indicated by the upward arrows). Logic level 4 includes internal attention processing as a special weighted-combination strategy. Because the duration of syllables varies greatly and the maximum number of LSTM steps (or maximum number of frames) is limited, only real frame information is weighted and padded frame information is ignored, as shown in the following equation.
$$S=\sum_i \alpha_i \cdot S_i$$

where $S_i$ is the LSTM state vector corresponding to each speech frame, $\alpha_i$ is the attention weight derived from the score $f(H, S_i)$, H is a global, trainable vector shared by all syllables, and the function f defines how the importance of each state vector is calculated from its content. For example, the simplest definition of the function f is the inner product.
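A minimal NumPy sketch of this internal attention pooling is given below, assuming the inner product as the scoring function f and a softmax normalization of the scores; the masking convention for padded frames is an illustrative choice rather than the exact formulation of this disclosure.

```python
import numpy as np

def internal_attention(states, real_mask, H):
    """states:    (T, D) LSTM state vectors for one syllable (T <= 50 frames).
    real_mask: (T,) boolean, True for real frames, False for padded frames.
    H:         (D,) global trainable vector shared by all syllables.
    Returns the fixed-length vector S = sum_i alpha_i * S_i over real frames only."""
    scores = states @ H                               # f(H, S_i) as an inner product
    scores = np.where(real_mask, scores, -np.inf)     # ignore padded frames
    scores = scores - scores[real_mask].max()         # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()                       # normalized attention weights
    return alpha @ states                             # (D,) fixed-length syllable vector

T, D = 50, 64
states = np.random.randn(T, D)
mask = np.arange(T) < 37                              # e.g., 37 real frames, 13 padded
H = np.random.randn(D)
syllable_vec = internal_attention(states, mask, H)
```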
In one embodiment, the word stress detection process also includes self-attention techniques. Self-attention techniques inherently support words with different syllable numbers and are used to model context information.
As shown in fig. 12, logic level 5 illustrates the internal structure of the syllable interaction module, which includes a self-attention based network for digesting words with different syllable counts without expanding the input by padding empty positions. Logic level 5 comprises two parts: a self-attention operation with complexity $O(n^2)$ and a position-wise feed-forward network operation with complexity $O(n)$. In the self-attention part, the attention weight $\alpha_{i,j}$ is computed using a bilinear form in which the matrix M is a globally trainable parameter. The bilinear formulation is easy to implement and fits naturally into the overall network architecture. Alternatively, the attention weight $\alpha_{i,j}$ may be calculated by multi-head attention as in the BERT model. The self-attention processing is represented by the following equation:

$$S_i=\sum_j \alpha_{i,j} \cdot S_j$$
The position-wise feed-forward network comprises two dense layers: one includes a ReLU activation function and the other does not. The position-wise feed-forward network is represented by the following equation:

$$FFN(x)=\max(0,\, xW_1+b_1)W_2+b_2$$
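For illustration, the following NumPy sketch of the syllable interaction module assumes the bilinear score is computed as $S_i^\top M S_j$ and normalized with a softmax over j; the normalization and initialization details are assumptions, not the exact formulation of this disclosure.

```python
import numpy as np

def self_attention(S, M):
    """S: (n, D) syllable feature vectors; M: (D, D) globally trainable matrix.
    Returns updated syllable vectors S'_i = sum_j alpha_{i,j} * S_j."""
    scores = S @ M @ S.T                              # bilinear scores, shape (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over j
    return alpha @ S

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2: two dense layers, the first with ReLU.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

n, D, Dff = 3, 64, 128                                # e.g., "projector" has 3 syllables
S = np.random.randn(n, D)
M = np.random.randn(D, D) * 0.01
S = self_attention(S, M)
S = position_wise_ffn(S, np.random.randn(D, Dff) * 0.01, np.zeros(Dff),
                      np.random.randn(Dff, D) * 0.01, np.zeros(D))
```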
At logic level 6, scores of 1, 0.5, and 0 are assigned to the target labels for primary stress, secondary stress, and no stress, respectively. Each target label corresponds to a syllable. For example, the word "projector" has 3 syllables, and its target labels may be 0, 1, 0. The label scores are converted into a probability distribution via the l1 norm. The probability distribution is then used in a cross-entropy-based loss function. It should be noted that a word may have more than one primary stress. Thus, this is not a multi-class problem but a multi-label problem. The loss function is represented by the following equation:
$$\mathcal{L}=-\sum_i p_i \log \hat{p}_i$$

where $p_i$ is the normalized target label probability of syllable i, and $\hat{p}_i$ is the corresponding output probability from the self-attention block.
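A short NumPy sketch of this loss computation follows, normalizing the target label scores (1 / 0.5 / 0 for primary stress, secondary stress, no stress) with the l1 norm and applying the cross-entropy formula above; the output probabilities shown are placeholders, not results of this disclosure.

```python
import numpy as np

target_scores = np.array([0.0, 1.0, 0.0])            # syllables of "projector": pro-JEC-tor
p = target_scores / np.abs(target_scores).sum()       # l1-normalized target probabilities

p_hat = np.array([0.10, 0.80, 0.10])                   # placeholder outputs from the self-attention block
loss = -np.sum(p * np.log(p_hat + 1e-10))              # cross-entropy-based loss
print(loss)
```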
The training and test datasets for the acoustic models include two public datasets and one proprietary dataset. One of the two public datasets is the LibriSpeech dataset: 360 hours of clean read English speech are used as the training set and 50 hours as the test set. The other public dataset is the TedLium dataset: a 400-hour talk set covering various speakers and topics is used as the training set and a 50-hour talk set as the test set. The proprietary dataset is a dictionary-based dataset: 2,000 words spoken by 10 speakers were recorded, most of them having three or four syllables. Each word was pronounced and recorded three times. Of the 10 speakers, 5 are male and 5 are female. The proprietary dataset includes 6,000 word-based samples in total, half of which contain incorrect accents.
In the inference phase, the word accent detection model is used to detect misplaced word accents at the word level. The detection result is reported as the F-score, which balances precision and recall.
Specifically, the input audio signal is decoded by an automatic speech recognizer (ASR) to extract syllables from the phoneme sequence. Features such as duration, energy, pitch, and MFCCs are then extracted from each syllable. Because the absolute duration of the same syllable within a word can vary significantly from person to person, the duration is measured relative to the other syllables in the word. The same applies to the other features. Features are extracted at the frame level and normalized at word boundaries to compute their relative values. The 25%, 50%, and 75% percentiles, the minimum, and the maximum are computed within each syllable window. The MFCC dimension is set to 40; with the additional delta and delta-delta information, the dimension becomes 120.
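An illustrative NumPy sketch of these per-syllable statistics is given below; the z-score style word-level normalization and the data layout are assumptions made for illustration, not details fixed by this disclosure.

```python
import numpy as np

def syllable_statistics(word_feats, syllable_bounds):
    """word_feats:      (T, D) frame-level features for one word (e.g., energy, pitch, MFCCs).
    syllable_bounds: list of (start_frame, end_frame) pairs, one per syllable.
    Frame features are first normalized at word boundaries so that values are relative
    within the word; then the 25%, 50%, 75% percentiles, min, and max are taken per syllable."""
    mean = word_feats.mean(axis=0)
    std = word_feats.std(axis=0) + 1e-10
    rel = (word_feats - mean) / std
    stats = []
    for start, end in syllable_bounds:
        window = rel[start:end]
        stats.append(np.concatenate([
            np.percentile(window, 25, axis=0),
            np.percentile(window, 50, axis=0),
            np.percentile(window, 75, axis=0),
            window.min(axis=0),
            window.max(axis=0),
        ]))
    return np.stack(stats)      # (num_syllables, 5 * D)
```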
The attention-based network model for word stress detection directly extracts time-series features on a frame-by-frame basis, including energy, frequency, and MFCC features, but not duration. The model is implemented in TensorFlow, and the optimizer is Adam with default hyperparameters. The learning rate is 1e-3. After at least 10 training epochs, the model achieves the desired performance. The performance results (i.e., F-scores) of the attention-based network model are compared in Table 2 with two baseline models: an SVM-based model and a gradient-boosted tree model.
TABLE 2
As can be seen from table 2, the attention-based network model outperforms the two baseline models. Constructing even larger proprietary datasets can also improve performance.
In some embodiments, model performance with different numbers of LSTM blocks (or layers) is explored. Table 3 shows that more LSTM blocks at logic level 3 in fig. 12 improve performance until the number of LSTM blocks reaches 5. In this case, the number of self-attention blocks is set to 1. On the other hand, more LSTM blocks make training significantly slower.
TABLE 3
# LSTM blocks   1       2       3       4       5       6
LibriSpeech     0.920   0.928   0.939   0.944   0.951   0.948
Dictionary      0.743   0.751   0.760   0.768   0.770   0.764
In some embodiments, model performance with a different number of self-attention blocks (or layers) is explored. Table 4 shows that more self-attention blocks at logic level 5 in fig. 12 do not improve performance due to potential overfitting. In this case, the number of LSTM blocks is set to 5.
TABLE 4
# self-attention blocks   0       1       2
LibriSpeech               0.941   0.951   0.929
Dictionary                0.743   0.770   0.760
At S530, it is determined, based on a dictionary, whether the word accent obtained for each word is misplaced. Specifically, the dictionary may be an English dictionary, and the word accent obtained for each word may be compared with the word accent defined in the English dictionary. If the word accent obtained for a word differs from the word accent defined in the English dictionary, the word is determined to contain a misplaced word accent. When more than one word accent is defined in the English dictionary for a word, a match between the obtained word accent and any of the defined word accents is treated as no misplaced word accent being found.
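For illustration, a small Python sketch of this dictionary comparison follows; the dictionary format (a word mapped to one or more acceptable per-syllable stress patterns, with 1 = primary stress, 0.5 = secondary, 0 = unstressed) and the example entries are hypothetical.

```python
# Hypothetical dictionary: each word maps to one or more acceptable stress patterns.
STRESS_DICT = {
    "projector": [(0, 1, 0)],            # pro-JEC-tor
    "subject":   [(1, 0), (0, 1)],       # SUB-ject (noun) or sub-JECT (verb)
}

def is_misplaced(word, detected_pattern):
    """Return True if the detected per-syllable stress pattern matches none of the
    acceptable patterns for the word; unknown words are skipped (not flagged)."""
    acceptable = STRESS_DICT.get(word.lower())
    if not acceptable:
        return False
    return tuple(detected_pattern) not in [tuple(p) for p in acceptable]

print(is_misplaced("projector", (1, 0, 0)))   # True: stress detected on the wrong syllable
print(is_misplaced("subject", (0, 1)))        # False: matches the verb pattern
```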
At S540, each word with misplaced word accents is output in the text transcript.
Returning to fig. 1, at S150, each word having a pronunciation error corresponding to the accent is output in the text transcript. In particular, text transcripts may be displayed to the user and misplaced word accents highlighted in the text transcripts. Alternatively, statistics regarding word accents of text transcripts may be presented to a user in various formats. The present disclosure is not limited to formats that present misplaced word accents.
In embodiments of the present disclosure, the acoustic model for detecting pronunciation errors is trained with a combination of speech spoken by native speakers and speech spoken by non-native speakers without marking pronunciation errors. In addition, accent-based features are employed as auxiliary inputs to the acoustic model.
In embodiments of the present disclosure, the acoustic model for detecting misplaced word accents takes time-series features as input to fully exploit the input information. The network architecture of the acoustic model inherently adapts to words with different syllable counts without padding short words, thereby reducing input approximation. Thus, the detection accuracy is improved.
In addition, the English pronunciation assessment method detects pronunciation errors. FIG. 2 illustrates another exemplary English pronunciation assessment method, according to embodiments of the present disclosure. As shown in fig. 2, S150 in fig. 1 is replaced by S210 and S220.
At S210, the obtained phonetic information of each phoneme in each word is input into a vowel model or a consonant model to obtain each mispronounced phoneme in each word of the English speech.
In particular, the vowel model and the consonant model may be used to determine whether a vowel or a consonant, respectively, is mispronounced. The speech information includes the audio signal aligned with the time boundaries of each phoneme in each word of the English speech and the posterior probability distribution of each phoneme in each word of the English speech. Pronunciation error detection is described in detail below.
FIG. 4 illustrates an exemplary method of detecting pronunciation errors, according to an embodiment of the present invention. As shown in fig. 4, at S410, the time boundaries of each word and each phoneme in each word, the phoneme-level posterior probability distributions, and the corresponding text transcript are received. Specifically, the output of S130 in fig. 1 is the input of S410 in fig. 4.
At S420, the actual label (vowel or consonant) of each phoneme in each word is determined based on a dictionary. In particular, the acoustic models used for detecting pronunciation errors of vowels and consonants may be identical: knowing whether each phoneme is a vowel or a consonant does not make a substantial difference, even though this knowledge is given by the dictionary. The dictionary may also be an English pronunciation dictionary.
At S430, each phoneme whose posterior probability is below a pre-configured threshold is identified as a mispronounced phoneme. Specifically, the posterior probability of each phoneme is calculated by the posterior probability acoustic model described above.
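As a hedged illustration of S430, the following sketch averages the log posterior of a phoneme over its aligned frames and flags the phoneme when the average falls below a pre-configured threshold; the matrix layout, the frame boundaries, and the threshold value are assumptions for the example and are not taken from the disclosure.

    import numpy as np

    # Illustrative sketch: posteriors is a (num_frames, num_phones) matrix of
    # per-frame phoneme posteriors produced by the posterior probability acoustic
    # model; (start, end) are the frame boundaries of one phoneme from alignment.
    def is_mispronounced(posteriors, phone_id, start, end, threshold=-6.0):
        frame_scores = np.log(posteriors[start:end, phone_id] + 1e-10)
        avg_log_posterior = float(frame_scores.mean())  # GOP-style score
        return avg_log_posterior < threshold

    # Hypothetical usage with random posteriors for a 20-frame phoneme.
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(42), size=100)        # 100 frames, 42 phones
    print(is_mispronounced(post, phone_id=7, start=30, end=50))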
FIG. 10 illustrates an exemplary neural network for acoustic modeling for pronunciation error detection according to embodiments of the present disclosure. As shown in FIG. 10, X represents the frame-level MFCC features, and X_e represents the auxiliary feature. FIG. 10A is the neural network architecture of the i-vector extractor, which may be speaker-based or accent-based. The switches in FIG. 10B and FIG. 10C are used to select only one auxiliary input. FIG. 10B is the neural network architecture using a homogeneous speaker i-vector extractor or an accent i-vector extractor. FIG. 10C is the neural network architecture using accent one-hot encoding.
The i-vector is typically used for speaker recognition and verification. It is also effective as a speaker embedding for the AM in speech recognition tasks. In one embodiment, a modified version that allows the i-vector to be updated at a fixed rate, i.e., an "online i-vector", is used.
For each frame of features extracted from the training data set or the test data set, as shown in fig. 6a, a speaker i-vector is concatenated with the MFCC features as an auxiliary network input.
Training the accent i-vector extractor is identical to training the speaker i-vector extractor, except that every speaker tag is replaced with its corresponding accent tag, which is either native or non-native. In the inference phase, the accent i-vector is used in the same way as the speaker i-vector shown in FIG. 10A.
It should be noted that pronunciation error detection is performed only on non-native speech. This information is exploited in training the homogeneous speaker i-vector extractor.
In one embodiment, during the training phase, a Universal Background Model (UBM) is trained on both native and non-native speech to collect sufficient statistics. The UBM is then used to train a homogeneous speaker i-vector extractor on native speech only; this extractor is called the L1 speaker i-vector extractor. The L2 speaker i-vector extractor can be trained in the same manner, except that only non-native speech is used. Unlike the heterogeneous speaker i-vector extractor, whose training uses heterogeneous data containing both native and non-native accents, training of a homogeneous speaker i-vector extractor uses homogeneous data with only one accent.
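The data-selection logic that distinguishes the heterogeneous and homogeneous extractors may be sketched as follows; the record fields and the commented training calls (train_ubm, train_ivector_extractor) are hypothetical placeholders, since UBM statistics and total-variability training are not reproduced here.

    # Illustrative data partitioning only; train_ubm / train_ivector_extractor are
    # hypothetical stand-ins for a real i-vector training pipeline.
    utterances = [
        {"speaker": "spk001", "accent": "native",     "feats": None},
        {"speaker": "spk734", "accent": "non_native", "feats": None},
    ]

    all_data = utterances                                               # heterogeneous
    l1_data = [u for u in utterances if u["accent"] == "native"]        # homogeneous L1
    l2_data = [u for u in utterances if u["accent"] == "non_native"]    # homogeneous L2

    # The UBM is trained on both accents to collect sufficient statistics,
    # while each homogeneous extractor sees only one accent.
    # ubm = train_ubm(all_data)
    # l1_extractor = train_ivector_extractor(ubm, l1_data)   # used at inference
    # l2_extractor = train_ivector_extractor(ubm, l2_data)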
In one embodiment, only one i-vector extractor needs to be selected as the auxiliary feature extractor for the neural network architecture shown in FIG. 10B during the inference phase. In this case, the L1 speaker i-vector extractor is used for all non-native speech. That is, non-native speakers are intentionally treated as native speakers during the inference phase. In this way, the performance of pronunciation error detection is improved compared with using the L2 speaker i-vector extractor. It should be noted that speech recognition applications require a match between the type of i-vector extractor and the type of speech. In contrast, a mismatch between the type of i-vector extractor and the type of speech helps to improve pronunciation error detection, which requires distinct GOP scores.
Because speakers with the same accent are grouped together during the training phase, a homogeneous speaker i-vector may also be regarded as an implicit accent representation. For the homogeneous accent i-vector of native speech, i.e., the L1 accent i-vector, every process and configuration is identical to the L1 speaker i-vector except that all speaker tags are replaced with a single tag, i.e., native. The same applies to the non-native case.
In one embodiment, the L1 and L2 accent one-hot encodings (OHE) are defined as [1,0] and [0,1], respectively. For each frame of features extracted from native or non-native speech in the training dataset, the L1 OHE or the L2 OHE, respectively, is concatenated with the MFCC features, as shown in FIG. 10C.
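The accent one-hot encoding of FIG. 10C amounts to appending a two-dimensional indicator to every MFCC frame, roughly as sketched below; the frame dimensionality and array names are assumptions for the example.

    import numpy as np

    L1_OHE = np.array([1.0, 0.0])   # native accent
    L2_OHE = np.array([0.0, 1.0])   # non-native accent

    def append_accent_ohe(mfcc_frames, accent_ohe):
        # mfcc_frames: (num_frames, num_ceps); the same OHE is tiled onto every frame.
        tiled = np.tile(accent_ohe, (mfcc_frames.shape[0], 1))
        return np.concatenate([mfcc_frames, tiled], axis=1)

    frames = np.random.randn(200, 40)               # 200 frames of 40-dim MFCCs (illustrative)
    augmented = append_accent_ohe(frames, L1_OHE)   # L1 OHE is applied even to non-native speech at inference
    print(augmented.shape)                          # (200, 42)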
In one embodiment, the L1 accent OHE is used for non-native speech in pronunciation error detection. The reason is the same as in the case of the homogeneous accent or speaker i-vector: the training engine sees data of both native and non-native speech and learns the speaker or accent tags from the data, while the inference engine uses the trained model and tags every input as native.
In one embodiment, an x-vector or a neural-network-activation-based embedding may also be used instead of the i-vector.
The training data sets and the test data set are summarized in Table 5, where k denotes thousands. The test data set includes 267 sentences and paragraphs read by 56 non-native speakers; on average, each recording includes 26 words. The entire test data set includes 10386 vowel samples, 5.4% of which are labeled as pronunciation errors; the phonemes they are mispronounced as are not labeled.
TABLE 5
Data set                          Hours    Utterances    Speakers
Native training data set A        452      268k          2.3k
Native training data set B        1262     608k          33.0k
Non-native training data set      1696     1430k         6.0k
Non-native test data set          1.1      267           56
The AM used for pronunciation error detection is a ResNet-type TDNN-F model with five layers. The output sizes of the factorized layer and the TDNN layer are set to 256 and 1536, respectively. The final output size is 5184. The initial and final learning rates are set to 1e-3 and 1e-4, respectively. The number of epochs is set to 4. Dropout is not used. The size of the accent/speaker i-vector is set to 100. 60k utterances from each accent are used for i-vector training.
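A minimal PyTorch sketch of one factorized layer with the quoted sizes (bottleneck 256, hidden dimension 1536) is given below; the semi-orthogonal constraint on the first factor and the ResNet-type bypass are only noted in comments, and the sketch is not claimed to reproduce the exact model of the disclosure.

    import torch
    import torch.nn as nn

    class TdnnfLayerSketch(nn.Module):
        # Illustrative factorized TDNN layer: a kernel-size-2 convolution down to a
        # 256-dim bottleneck, then a kernel-size-2 convolution back up to 1536.
        # A real TDNN-F additionally keeps the first factor semi-orthogonal and
        # adds a scaled residual (ResNet-type) bypass, both omitted here for brevity.
        def __init__(self, hidden_dim=1536, bottleneck_dim=256):
            super().__init__()
            self.down = nn.Conv1d(hidden_dim, bottleneck_dim, kernel_size=2)
            self.up = nn.Conv1d(bottleneck_dim, hidden_dim, kernel_size=2)
            self.act = nn.ReLU()
            self.norm = nn.BatchNorm1d(hidden_dim)

        def forward(self, x):                 # x: (batch, hidden_dim, time)
            return self.norm(self.act(self.up(self.down(x))))

    layer = TdnnfLayerSketch()
    out = layer(torch.randn(8, 1536, 100))    # the time axis shrinks by one per convolution
    print(out.shape)                          # torch.Size([8, 1536, 98])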
FIG. 11 compares the precision-recall curves of various AMs consistent with embodiments of the present disclosure. As shown in FIG. 11, at a recall of 0.50, the precision increases from 0.58 to 0.74 after non-native speech is included in the training data set. After the L1 homogeneous i-vector is included as an auxiliary feature for acoustic modeling, the precision further increases to 0.77. After the L1 accent one-hot encoding is included as an auxiliary feature for acoustic modeling, the precision finally increases to 0.81.
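The comparison in FIG. 11 is a standard precision-recall analysis; the sketch below shows how the precision at a fixed recall (here 0.50) can be read off with scikit-learn, using synthetic scores rather than the data of the disclosure.

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # Synthetic example: y_true marks phonemes labelled as mispronounced,
    # y_score is the detector's confidence that a phoneme is mispronounced.
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=1000)
    y_score = 0.6 * y_true + 0.4 * rng.random(1000)

    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # Precision of the operating point whose recall is closest to 0.50.
    idx = int(np.argmin(np.abs(recall - 0.50)))
    print(f"precision at recall ~0.50: {precision[idx]:.2f}")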
In one embodiment, the neural network architecture for acoustic modeling of pronunciation error detection includes a factorized feed-forward neural network, i.e., TDNN-F. Alternatively, more complex networks may be used, such as RNNs or attention-based sequence-to-sequence models. Compared with the baseline, the accent OHE adds little additional computational cost, as it introduces only two additional input dimensions.
At S440, after each mispronounced phoneme is identified, each mispronounced phoneme is output in the text transcript.
Returning to FIG. 2, at S220, each word having a pronunciation error corresponding to at least one or more of a vowel, a consonant, and a word accent is output in the text transcript. In particular, the text transcript may be displayed to the user with the words having pronunciation errors corresponding to one or more of vowels, consonants, and word accents highlighted. Alternatively, statistics regarding pronunciation errors of the text transcript may be presented to the user in various formats. The present disclosure does not limit the format in which pronunciation errors are presented.
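One simple way to present the combined result, shown below with hypothetical formatting that is not the disclosed user interface, is to mark each flagged word directly in the transcript.

    # Illustrative only: wraps each flagged word in asterisks when rendering the
    # transcript; word_errors maps a word index to the type of error detected.
    def render_transcript(words, word_errors):
        out = []
        for i, w in enumerate(words):
            out.append(f"*{w}*({word_errors[i]})" if i in word_errors else w)
        return " ".join(out)

    transcript = ["please", "record", "the", "lecture"]
    errors = {1: "word accent", 3: "vowel"}      # hypothetical detection results
    print(render_transcript(transcript, errors))
    # please *record*(word accent) the *lecture*(vowel)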
In an embodiment of the present disclosure, the acoustic model for detecting a pronunciation error is trained with a combination of speech uttered by native speakers and speech uttered by non-native speakers without marking out pronunciation errors, which significantly improves the pronunciation error detection precision from 0.58 to 0.74 at a recall of 0.5. Accent-based features are input into the acoustic model as auxiliary inputs, and accent one-hot encoding is used to further improve the detection precision to 0.81 on a proprietary test data set. Its generalizability is demonstrated by a relative improvement of 6.9% in detection precision on the public L2-ARCTIC test data set using the same acoustic model trained on the proprietary data set.
The present disclosure also provides an English pronunciation assessment device. The device examines speech uttered by non-native speakers and provides a phoneme-level pronunciation assessment to the user by identifying mispronounced phonemes and misplaced word accents. The device also provides an overall Goodness of Pronunciation (GOP) score. The device is able to accommodate various accents of non-native speakers and to process long sentences of up to 120 seconds.
FIG. 13 illustrates an exemplary English pronunciation assessment device according to an embodiment of the present disclosure. As shown in FIG. 13, the English pronunciation assessment device 1300 includes a training engine 1310 and an inference engine 1320. In the training phase, the training engine 1310 trains the acoustic model 1322 using speech uttered by native speakers, speech uttered by non-native speakers, and the corresponding text transcripts. In the inference phase, the inference engine 1320 takes the audio file of the English speech to be assessed and the corresponding text transcript as inputs to the acoustic model. The inference engine 1320 outputs the mispronounced phonemes and the misplaced word accents in the text transcript.
The English pronunciation assessment device 1300 may include a processor and a memory. The memory may be used to store computer program instructions. The processor may be configured to invoke and execute the computer program instructions stored in the memory to implement the English pronunciation assessment method.
In one embodiment, the processor is configured to: receive an audio file including English speech and a text transcript corresponding to the English speech; input an audio signal included in the audio file into one or more acoustic models to obtain speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by native speakers and further with speech uttered by non-native speakers without marking out pronunciation errors, so that pronunciation errors are detected more accurately based on the one or more acoustic models trained with speech of both native and non-native speakers; extract a time-series feature of each word included in the input audio signal to convert each word of varying length into a feature vector of fixed length; input the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech with different numbers of syllables, without expanding short words to cause input approximation; and output each word having a pronunciation error corresponding to at least a word accent in the text transcript.
In another embodiment, the processor is further configured to input the obtained speech information of each phoneme in each word into a vowel model or a consonant model to obtain each mispronounced phoneme in each word of the English speech, and to output, in the text transcript, each word having a pronunciation error corresponding to at least one or more of a vowel, a consonant, and a word accent.
In one embodiment, the processor is further configured to input the audio signal included in the audio file into an alignment acoustic model to obtain the time boundaries of each word and each phoneme in each word, input the audio signal included in the audio file and the obtained time boundaries of each word and each phoneme in each word into a posterior probability acoustic model to obtain the posterior probability distribution of each senone of each phoneme in each word, correlate the obtained time boundaries of each word and each phoneme in each word with the obtained posterior probability distribution of each senone of each phoneme in each word to obtain the posterior probability distribution of each phoneme in each word, and output the time boundaries of each word and each phoneme in each word and the posterior probability distribution of each phoneme in each word.
In one embodiment, the processor is further configured to receive a time boundary for each word and each phoneme in each word and a posterior probability distribution for each phoneme in each word, determine an actual label (vowel or consonant) for each phoneme in each word based on the dictionary, identify each phoneme having a corresponding posterior probability below a pre-configured threshold as a mispronounced phoneme, and output each mispronounced phoneme in the text transcript.
In one embodiment, the processor is further configured to receive the extracted time series characteristics of each word, the time boundaries of each word and each phoneme in each word, the posterior probability distribution of each phoneme in each word, the audio signal included in the audio file, and the corresponding text transcript, input the time series characteristics of each word, the time boundaries of each word and each phoneme in each word, the posterior probability distribution of each phoneme in each word, the audio signal included in the audio file, and the corresponding text transcript to the word stress model to obtain word stress in each word, determine whether the word stress in each word is misplaced based on the dictionary, and output each word with misplaced stress in the text transcript.
In one embodiment, the processor is further configured to combine each word with at least one mispronounced phoneme and each word with a misplaced word accent as a word with a pronunciation error, and to output each word having a pronunciation error in the text transcript.
Various embodiments may also provide a computer program product. The computer program product may include a non-transitory computer readable storage medium and program instructions stored therein that are configured to be executable by a computer to cause the computer to perform operations comprising the disclosed methods.
In some embodiments, the English pronunciation assessment device may further include a user interface for a user to input an audio file and corresponding text transcript and to view pronunciation errors in the text transcript.
In an embodiment of the present disclosure, the English pronunciation assessment device includes an acoustic model trained with a combination of speech uttered by native speakers and speech uttered by non-native speakers without marking out pronunciation errors, which significantly improves the precision of pronunciation error detection from 0.58 to 0.74 at a recall of 0.5. Furthermore, accent-based features and accent one-hot encoding are incorporated into the acoustic model to further improve detection precision. The acoustic model for detecting misplaced word accents takes time-series features as input to fully exploit the input information. The network architecture of the acoustic model natively accommodates words with different numbers of syllables without expanding short words, thereby reducing input approximation and improving detection accuracy. Thus, the English pronunciation assessment device detects pronunciation errors and misplaced word accents more accurately and provides a better user experience.
While the principles and implementations of the present disclosure have been described in this specification through specific embodiments, the foregoing description of the embodiments is only intended to assist in understanding the methods of the present disclosure and their core ideas. Meanwhile, those skilled in the art may make modifications to the specific embodiments and the application scope based on the ideas of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (20)

1. A computer-implemented english pronunciation assessment method comprising:
receiving an audio file comprising English speech and a text transcript corresponding to the English speech;
inputting an audio signal included in the audio file into one or more acoustic models to obtain speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by a native speaker and further trained with speech uttered by a non-native speaker without marking a pronunciation error, thereby more accurately detecting the pronunciation error based on the one or more acoustic models trained with speech of both the native speaker and the non-native speaker;
extracting a time-series feature of each word included in the input audio signal to convert each word of varying length into a fixed-length feature vector;
inputting the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech with different numbers of syllables without expanding short words to cause input approximation; and
outputting each word having the pronunciation error corresponding to at least a word accent in the text transcript.
2. The method of claim 1, further comprising:
inputting the obtained speech information of each phoneme in each word into a vowel model or a consonant model to obtain each mispronounced phoneme in each word of the English speech; and
outputting each word having the pronunciation error corresponding to at least one or more of a vowel, a consonant, and the word accent in the text transcript.
3. The method according to claim 1, wherein:
the speech information includes at least a time boundary of each word and each phoneme in each word and a posterior probability distribution of each senone of each phoneme in each word; and
the time series characteristics include at least frequency, energy, and mel-frequency cepstral coefficient (MFCC) characteristics.
4. The method of claim 1, wherein inputting the audio signals included in the audio file into the one or more acoustic models comprises:
inputting the audio signals included in the audio file into an aligned acoustic model to obtain time boundaries of each word and each phoneme in each word;
inputting the audio signal included in the audio file and the obtained time boundaries of each word and each phoneme in each word to a posterior probability acoustic model to obtain a posterior probability distribution of each senone of each phoneme in each word;
correlating the obtained time boundaries of each word and each phoneme in each word and the posterior probability distribution of each senone of each phoneme in each word to obtain the posterior probability distribution of each phoneme in each word; and
outputting the time boundaries of each word and each phoneme in each word, and the posterior probability distribution of each phoneme in each word.
5. The method of claim 2, wherein inputting the obtained speech information of each phoneme in each word into the vowel model or the consonant model comprises:
receiving a time boundary for each word and each phoneme in each word, and the posterior probability distribution for each phoneme in each word;
determining an actual label (vowel or consonant) of each phoneme in each word based on a dictionary;
identifying each phoneme having a corresponding posterior probability below a pre-configured threshold as a mispronounced phoneme; and
outputting each mispronounced phoneme in the text transcript.
6. The method of claim 1, wherein inputting the extracted time-series features of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech with different numbers of syllables without expanding short words to cause input approximation comprises:
receiving the extracted time-series characteristics of each word, time boundaries of each word and each phoneme in each word, a posterior probability distribution of each phoneme in each word, the audio signal included in the audio file, and the corresponding text transcript;
inputting the time-series feature of each word, the time boundaries of each word and each phoneme in each word, the posterior probability distribution of each phoneme in each word, the audio signal included in the audio file, and the corresponding text transcript into the word accent model to obtain a word accent in each word;
determining whether the word accent in each word is misplaced based on a dictionary; and
outputting each word with a misplaced word accent in the text transcript.
7. The method of claim 2, wherein outputting each word having the pronunciation error corresponding to at least one or more of the vowels, the consonants, and the word accents in the text transcript comprises:
combining each word with at least one mispronounced phoneme and each word with a misplaced word accent as a word having the pronunciation error; and
outputting each word having the pronunciation error in the text transcript.
8. The method according to claim 4, wherein:
the aligned acoustic model includes a Gaussian Mixture Model (GMM) cascaded with a Hidden Markov Model (HMM) or a Neural Network Model (NNM) cascaded with an HMM.
9. The method according to claim 8, wherein:
the GMM consists of a linear combination of Gaussian densities:
$p(x) = \sum_{m} \alpha_m \, \phi(x; \mu_m, \Sigma_m)$,
wherein $\alpha_m$ is a mixing proportion with $\sum_{m} \alpha_m = 1$, and each $\phi(x; \mu_m, \Sigma_m)$ is a Gaussian density with mean $\mu_m$ and covariance matrix $\Sigma_m$.
10. The method according to claim 8, wherein:
The NNM is a factorized Time Delay Neural Network (TDNNF).
11. The method according to claim 10, wherein:
the TDNNF includes five hidden layers;
each hidden layer of the TDNNF is a 3-level convolution; and
the 3-level convolution includes a 2 x 1 convolution constrained to a dimension of 256, and a 2 x 1 convolution back to the hidden layer dimension of 1536.
12. The method according to claim 8, wherein:
the HMM is a state-clustered triphone model that models three different states for each phoneme; and
a phonetic decision tree is used to cluster similar states together.
13. The method according to claim 4, wherein:
the posterior probabilistic acoustic model includes a Neural Network Model (NNM) cascaded with a Hidden Markov Model (HMM);
the inputs of the posterior probabilistic acoustic model include the audio signal aligned with the time boundary and the time series features extracted from the audio signal; and
the output of the posterior probability acoustic model includes the posterior probability distribution for each senone of each phoneme in each word.
14. The method of claim 6, wherein the word stress model comprises at least:
A second logic level comprising a bidirectional Long Short Term Memory (LSTM) model;
a third logic level comprising a plurality of LSTM modules and a highway layer;
a fourth logic level including an inner attention layer;
a fifth logic level comprising a plurality of self-attention blocks and a positional feed forward network layer; and
a sixth logic level including target labels corresponding to all syllables of each word.
15. The method according to claim 14, wherein:
the maximum number of LSTM steps is limited to 50; and
each LSTM step corresponds to a duration of 10 ms.
16. An english pronunciation assessment device comprising:
a memory for storing program instructions; and
a processor for executing program instructions stored in the memory to perform the following:
receiving an audio file comprising English speech and a text transcript corresponding to the English speech;
inputting an audio signal included in the audio file into one or more acoustic models to obtain speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by a native speaker and further trained with speech uttered by a non-native speaker without marking a pronunciation error, thereby more accurately detecting the pronunciation error based on the one or more acoustic models trained with speech of both the native speaker and the non-native speaker;
extracting a time-series feature of each word included in the input audio signal to convert each word of varying length into a fixed-length feature vector;
inputting the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech with different numbers of syllables without expanding short words to cause input approximation; and
outputting each word having the pronunciation error corresponding to at least a word accent in the text transcript.
17. The device of claim 16, further comprising a human-machine interface configured to:
allowing a user to input the audio file including the English speech and the text transcript corresponding to the English speech; and
displaying, to the user, each word having the pronunciation error corresponding to at least the word accent in the text transcript.
18. The device of claim 16, wherein the processor is further configured to perform:
inputting the obtained speech information of each phoneme in each word into a vowel model or a consonant model to obtain each mispronounced phoneme in each word of the English speech; and
outputting each word having the pronunciation error corresponding to at least one or more of a vowel, a consonant, and the word accent in the text transcript.
19. The apparatus of claim 16, wherein:
the speech information includes at least a time boundary of each word and each phoneme in each word and a posterior probability distribution of each senone of each phoneme in each word; and
the time series characteristics include at least frequency, energy, and mel-frequency cepstral coefficient (MFCC) characteristics.
20. A computer program product comprising a non-transitory computer readable storage medium and program instructions stored therein, the program instructions being configured to be executable by a computer to cause the computer to:
receiving an audio file comprising English speech and a text transcript corresponding to the English speech;
inputting an audio signal included in the audio file into one or more acoustic models to obtain speech information of each phoneme in each word of the English speech, wherein the one or more acoustic models are trained with speech uttered by a native speaker and further trained with speech uttered by a non-native speaker without marking a pronunciation error, thereby more accurately detecting the pronunciation error based on the one or more acoustic models trained with speech of both the native speaker and the non-native speaker;
extracting a time-series feature of each word included in the input audio signal to convert each word of varying length into a fixed-length feature vector;
inputting the extracted time-series feature of each word, the obtained speech information of each phoneme in each word, and the audio signal included in the audio file into a word accent model to obtain misplaced word accents in each word of the English speech with different numbers of syllables without expanding short words to cause input approximation; and
outputting each word having the pronunciation error corresponding to at least a word accent in the text transcript.
CN202180090828.9A 2021-01-08 2021-11-27 Method, apparatus and computer program product for English pronunciation assessment Pending CN117043857A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/145136 2021-01-08
US17/145,136 US20220223066A1 (en) 2021-01-08 2021-01-08 Method, device, and computer program product for english pronunciation assessment
PCT/CN2021/133747 WO2022148176A1 (en) 2021-01-08 2021-11-27 Method, device, and computer program product for english pronunciation assessment

Publications (1)

Publication Number Publication Date
CN117043857A true CN117043857A (en) 2023-11-10

Family

ID=82322942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180090828.9A Pending CN117043857A (en) 2021-01-08 2021-11-27 Method, apparatus and computer program product for English pronunciation assessment

Country Status (3)

Country Link
US (1) US20220223066A1 (en)
CN (1) CN117043857A (en)
WO (1) WO2022148176A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220319505A1 (en) * 2021-02-12 2022-10-06 Ashwarya Poddar System and method for rapid improvement of virtual speech agent's natural language understanding
US11736773B2 (en) * 2021-10-15 2023-08-22 Rovi Guides, Inc. Interactive pronunciation learning system

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE602004023134D1 (en) * 2004-07-22 2009-10-22 France Telecom LANGUAGE RECOGNITION AND SYSTEM ADAPTED TO THE CHARACTERISTICS OF NON-NATIVE SPEAKERS
US7640159B2 (en) * 2004-07-22 2009-12-29 Nuance Communications, Inc. System and method of speech recognition for non-native speakers of a language
US7472061B1 (en) * 2008-03-31 2008-12-30 International Business Machines Corporation Systems and methods for building a native language phoneme lexicon having native pronunciations of non-native words derived from non-native pronunciations
US8392190B2 (en) * 2008-12-01 2013-03-05 Educational Testing Service Systems and methods for assessment of non-native spontaneous speech
US8401856B2 (en) * 2010-05-17 2013-03-19 Avaya Inc. Automatic normalization of spoken syllable duration
US10068569B2 (en) * 2012-06-29 2018-09-04 Rosetta Stone Ltd. Generating acoustic models of alternative pronunciations for utterances spoken by a language learner in a non-native language
US9076347B2 (en) * 2013-03-14 2015-07-07 Better Accent, LLC System and methods for improving language pronunciation
US9489943B2 (en) * 2013-10-16 2016-11-08 Interactive Intelligence Group, Inc. System and method for learning alternate pronunciations for speech recognition
CN105741832B (en) * 2016-01-27 2020-01-07 广东外语外贸大学 Spoken language evaluation method and system based on deep learning
GB201706078D0 (en) * 2017-04-18 2017-05-31 Univ Oxford Innovation Ltd System and method for automatic speech analysis
CN107945788B (en) * 2017-11-27 2021-11-02 桂林电子科技大学 Method for detecting pronunciation error and scoring quality of spoken English related to text
GB2575423B (en) * 2018-05-11 2022-05-04 Speech Engineering Ltd Computer implemented method and apparatus for recognition of speech patterns and feedback
US11043213B2 (en) * 2018-12-07 2021-06-22 Soundhound, Inc. System and method for detection and correction of incorrectly pronounced words
CN109979257B (en) * 2019-04-27 2021-01-08 深圳市数字星河科技有限公司 Method for performing accurate splitting operation correction based on English reading automatic scoring
US20210319786A1 (en) * 2020-04-08 2021-10-14 Oregon Health & Science University Mispronunciation detection with phonological feedback
CN111653292B (en) * 2020-06-22 2023-03-31 桂林电子科技大学 English reading quality analysis method for Chinese students

Also Published As

Publication number Publication date
WO2022148176A1 (en) 2022-07-14
US20220223066A1 (en) 2022-07-14

Similar Documents

Publication Publication Date Title
Xiong Fundamentals of speech recognition
Ghai et al. Literature review on automatic speech recognition
Odell The use of context in large vocabulary speech recognition
JP3933750B2 (en) Speech recognition method and apparatus using continuous density Hidden Markov model
JP4141495B2 (en) Method and apparatus for speech recognition using optimized partial probability mixture sharing
Nasereddin et al. Classification techniques for automatic speech recognition (ASR) algorithms used with real time speech translation
US11935523B2 (en) Detection of correctness of pronunciation
EP0453649A2 (en) Method and apparatus for modeling words with composite Markov models
US20100324897A1 (en) Audio recognition device and audio recognition method
CN117043857A (en) Method, apparatus and computer program product for English pronunciation assessment
Paliwal Lexicon-building methods for an acoustic sub-word based speech recognizer
Soltau et al. Reducing the computational complexity for whole word models
US20040006469A1 (en) Apparatus and method for updating lexicon
Ettaouil et al. A hybrid ANN/HMM models for arabic speech recognition using optimal codebook
Walter et al. An evaluation of unsupervised acoustic model training for a dysarthric speech interface
Austin et al. Continuous speech recognition using segmental neural nets
Pylkkönen Towards efficient and robust automatic speech recognition: decoding techniques and discriminative training
Sirigos et al. A hybrid syllable recognition system based on vowel spotting
Getman End-to-End Low-Resource Automatic Speech Recognition for Second Language Learners
Nwe et al. Myanmar language speech recognition with hybrid artificial neural network and hidden Markov model
Kurian et al. Automated Transcription System for Malayalam Language
Azim et al. Using Character-Level Sequence-to-Sequence Model for Word Level Text Generation to Enhance Arabic Speech Recognition
JPH10254477A (en) Phonemic boundary detector and speech recognition device
Schwartz et al. Acoustic-Phonetic Decoding of Speech: Statistical Modeling for Phonetic Recognition
Dumitru et al. Vowel, Digit and Continuous Speech Recognition Based on Statistical, Neural and Hybrid Modelling by Using ASRS_RL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination