CN114373481A - Pronunciation error detection method and device and pronunciation error detection model training method and device - Google Patents

Pronunciation error detection method and device and pronunciation error detection model training method and device

Info

Publication number
CN114373481A
Authority
CN
China
Prior art keywords
phoneme
error detection
pair
acoustic features
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111660932.5A
Other languages
Chinese (zh)
Inventor
李�浩
朱磊
盛志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111660932.5A
Publication of CN114373481A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a pronunciation error detection method and device and a pronunciation error detection model training method and device. The pronunciation error detection method comprises the following steps: acquiring a speech signal to be error-detected and the reading text corresponding to the speech signal; extracting acoustic features of the speech signal, and converting the reading text into a phoneme sequence; acquiring acoustic features of at least one confusing phoneme pair; and performing pronunciation error detection, using a pronunciation error detection model, based on the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence, so that the accuracy of pronunciation error detection can be improved.

Description

Pronunciation error detection method and device and pronunciation error detection model training method and device
Technical Field
The invention relates to the technical field of pronunciation error detection, in particular to a pronunciation error detection method and device and a pronunciation error detection model training method and device.
Background
With the development of computer technology and speech recognition technology, Computer-Aided Pronunciation Training (CAPT) has become a research hotspot in the field of intelligent speech technology. A CAPT system can automatically assess a learner's pronunciation level and provide feedback and guidance on pronunciation errors.
Pronunciation error detection, i.e., detecting errors in a user's pronunciation, is an important link in a CAPT system. However, most existing pronunciation error detection methods depend on forced alignment of speech segments, which places high demands on the precision of the forced alignment technique; the errors introduced by forced alignment strongly affect the subsequent steps, so the accuracy of pronunciation error detection is greatly reduced.
Disclosure of Invention
In view of this, embodiments of the present invention provide a pronunciation error detection method and apparatus, and a pronunciation error detection model training method and apparatus, which can improve the accuracy of pronunciation error detection.
According to a first aspect of the embodiments of the present invention, there is provided a pronunciation error detection method, including: acquiring a speech signal to be error-detected and the reading text corresponding to the speech signal; extracting acoustic features of the speech signal, and converting the reading text into a phoneme sequence; acquiring acoustic features of at least one confusing phoneme pair; and performing pronunciation error detection, using a pronunciation error detection model, based on the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence.
In an embodiment of the present invention, acquiring the acoustic features of the at least one confusing phoneme pair includes: acquiring acoustic features of a speech sample, wherein the speech sample comprises a plurality of speech segments corresponding to each confusing phoneme pair of the at least one confusing phoneme pair, and each confusing phoneme pair comprises a first phoneme and a second phoneme; obtaining a covariance matrix corresponding to each confusing phoneme pair from the speech segments corresponding to the first phoneme and the speech segments corresponding to the second phoneme of that pair; and fusing the acoustic features of the speech sample with the covariance matrix corresponding to each confusing phoneme pair to obtain the acoustic features of that pair.
In an embodiment of the present invention, obtaining the covariance matrix corresponding to each confusing phoneme pair includes: segmenting the speech sample to obtain a plurality of speech segments corresponding to the first phoneme and a plurality of speech segments corresponding to the second phoneme of each confusing phoneme pair; extracting acoustic features of the speech segments corresponding to the first phoneme and clustering them to obtain N first-class center vectors; extracting acoustic features of the speech segments corresponding to the second phoneme and clustering them to obtain N second-class center vectors; and reducing the dimensions of the N first-class center vectors and the N second-class center vectors to obtain the covariance matrix.
In an embodiment of the present invention, the pronunciation error detection method further includes: judging whether the reading text contains a confusing phoneme; and, when the reading text contains a confusing phoneme, performing pronunciation error detection according to the pronunciation error detection model and the output of the phoneme classification model corresponding to that confusing phoneme.
According to a second aspect of the embodiments of the present invention, there is provided a method for training a pronunciation error detection model, including: acquiring a training sample, wherein the training sample comprises a speech signal sample and a text sample corresponding to the speech signal sample, the speech signal sample comprising speech formed by a speaker reading the text sample aloud; extracting acoustic features of the speech signal sample, and converting the text sample into a phoneme sequence; acquiring acoustic features of at least one confusing phoneme pair; and performing error detection training on the pronunciation error detection model based on the acoustic features of the speech signal sample, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence.
In an embodiment of the present invention, acquiring the acoustic features of the at least one confusing phoneme pair includes: acquiring acoustic features of a speech sample, wherein the speech sample comprises a plurality of speech segments corresponding to each confusing phoneme pair of the at least one confusing phoneme pair, and each confusing phoneme pair comprises a first phoneme and a second phoneme; obtaining a covariance matrix corresponding to each confusing phoneme pair from the speech segments corresponding to the first phoneme and the speech segments corresponding to the second phoneme of that pair; and fusing the acoustic features of the speech sample with the covariance matrix corresponding to each confusing phoneme pair to obtain the acoustic features of that pair.
In an embodiment of the present invention, obtaining the covariance matrix corresponding to each confusing phoneme pair includes: segmenting the speech sample to obtain a plurality of speech segments corresponding to the first phoneme and a plurality of speech segments corresponding to the second phoneme of each confusing phoneme pair; extracting acoustic features of the speech segments corresponding to the first phoneme and clustering them to obtain N first-class center vectors; extracting acoustic features of the speech segments corresponding to the second phoneme and clustering them to obtain N second-class center vectors; and reducing the dimensions of the N first-class center vectors and the N second-class center vectors to obtain the covariance matrix.
In an embodiment of the present invention, before the error detection training is performed on the pronunciation error detection model based on the acoustic features of the speech signal sample, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence, the training method further includes: replacing some phonemes in the phoneme sequence with a mask; and performing speech recognition training on the pronunciation error detection model based on the acoustic features of the speech signal sample, the acoustic features of the at least one confusing phoneme pair, and the mask-replaced phoneme sequence, wherein the pronunciation error detection model recognizes and outputs the phonemes corresponding to the replaced positions.
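The mask-replacement step can be sketched as follows; the mask token, masking ratio, and random seed are illustrative assumptions, not values from the patent:

```python
import random

MASK = "<mask>"

def mask_phonemes(phoneme_sequence, mask_ratio=0.45, seed=0):
    """Replace a random subset of phonemes with a mask token, as in the
    speech recognition pre-training step; the model is then trained to
    recover the phoneme at each replaced position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, ph in enumerate(phoneme_sequence):
        if rng.random() < mask_ratio:
            masked.append(MASK)
            targets[i] = ph      # supervision target at the replaced position
        else:
            masked.append(ph)
    return masked, targets

original = ["f", "a", "y", "in", "j", "ian", "c", "uo"]
masked, targets = mask_phonemes(original)
```

The `targets` dictionary pairs each masked position with the phoneme the model must output there.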
In an embodiment of the present invention, the training method further includes: constructing at least one phoneme classification model corresponding to the at least one confusing phoneme pair, so that, when the text sample and/or the speech signal sample contains a confusing phoneme, pronunciation error detection is performed according to the pronunciation error detection model and the output of the phoneme classification model. Each confusing phoneme pair corresponds to one phoneme classification model, each confusing phoneme pair comprises a first phoneme and a second phoneme, and the phoneme classification model outputs the probability that a confusable phoneme is the first phoneme or the second phoneme.
In an embodiment of the present invention, constructing the at least one phoneme classification model corresponding to the at least one confusing phoneme pair includes: segmenting the speech sample to obtain a plurality of speech segments corresponding to each confusing phoneme pair of the at least one confusing phoneme pair; acquiring a vector for each of the speech segments corresponding to the first phoneme, and a vector for each of the speech segments corresponding to the second phoneme; and training the phoneme classification model corresponding to the first phoneme and the second phoneme from the vectors of the speech segments corresponding to the first phoneme and the vectors of the speech segments corresponding to the second phoneme.
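A sketch of training one per-pair phoneme classification model. The patent does not fix the classifier's architecture, so a logistic-regression stand-in in numpy, trained on synthetic segment vectors, is used purely for illustration:

```python
import numpy as np

def train_pair_classifier(vecs_a, vecs_b, lr=0.1, epochs=200):
    """Train a binary classifier on the segment vectors of the two phonemes
    of one confusing pair; returns a function giving P(second phoneme)."""
    x = np.vstack([vecs_a, vecs_b])
    y = np.concatenate([np.zeros(len(vecs_a)), np.ones(len(vecs_b))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # current predictions
        grad = p - y                              # logistic-loss gradient
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return lambda v: 1.0 / (1.0 + np.exp(-(v @ w + b)))

# Fake 16-dim segment vectors for an "in"/"ing" pair, well separated for the demo.
rng = np.random.default_rng(2)
clf = train_pair_classifier(rng.standard_normal((100, 16)) - 1.0,
                            rng.standard_normal((100, 16)) + 1.0)
prob_second = clf(np.ones(16))    # a vector clearly on the second phoneme's side
```

In use, the classifier's probability would be combined with the pronunciation error detection model's output whenever the text contains a confusing phoneme.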
According to a third aspect of the embodiments of the present invention, there is provided a pronunciation error detection apparatus, including: a first acquisition module, configured to acquire a speech signal to be error-detected and the corresponding reading text; an extraction module, configured to extract acoustic features of the speech signal and convert the reading text into a phoneme sequence; a second acquisition module, configured to acquire acoustic features of at least one confusing phoneme pair; and an error detection module, configured to perform pronunciation error detection, using a pronunciation error detection model, based on the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence.
According to a fourth aspect of the embodiments of the present invention, there is provided a training apparatus for a pronunciation error detection model, including: a first acquisition module, configured to acquire a training sample, wherein the training sample comprises a speech signal sample and a text sample corresponding to the speech signal sample, the speech signal sample comprising speech formed by a speaker reading the text sample aloud; an extraction module, configured to extract acoustic features of the speech signal sample and convert the text sample into a phoneme sequence; a second acquisition module, configured to acquire acoustic features of at least one confusing phoneme pair; and a training module, configured to perform error detection training on the pronunciation error detection model based on the acoustic features of the speech signal sample, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence.
According to a fifth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon computer-executable instructions, wherein the executable instructions, when executed by a processor, implement any one of the methods described above.
According to a sixth aspect of the embodiments of the present invention, there is provided an electronic apparatus including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any one of the methods described above.
According to the technical solution provided by the embodiments of the present invention, the speech signal to be error-detected and the corresponding reading text are acquired; acoustic features of the speech signal are extracted, and the reading text is converted into a phoneme sequence; acoustic features of at least one confusing phoneme pair are acquired; and pronunciation error detection is performed, using the pronunciation error detection model, based on the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence. In this way, the errors introduced by forced alignment can be eliminated, the distinguishability of confusable phonemes can be improved, and the accuracy of pronunciation error detection can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a pronunciation error detection method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating a training method of a pronunciation error detection model according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating a process of obtaining acoustic features of at least one confusing phoneme pair according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating a method for training a pronunciation error detection model according to an embodiment of the present invention.
Fig. 5 is a flowchart illustrating a method for training a pronunciation error detection model according to another embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating a training method of speech recognition training according to an embodiment of the present invention.
Fig. 7 is a schematic diagram illustrating a training method for performing error detection training on a pronunciation error detection model based on acoustic features of a speech sample and a phoneme sequence after replacement of an error phoneme according to an embodiment of the present invention.
Fig. 8 is a schematic diagram illustrating a training method for performing error detection training on a pronunciation error detection model based on acoustic features and phoneme sequences of a voice sample labeled with pronunciation error levels according to an embodiment of the present invention.
Fig. 9 is a flowchart illustrating a process of constructing at least one phoneme classification model corresponding to at least one confusing phoneme pair according to an embodiment of the present invention.
Fig. 10 is a block diagram of a pronunciation error detection apparatus according to an embodiment of the present invention.
Fig. 11 is a block diagram of a training apparatus for pronunciation error detection models according to an embodiment of the present invention.
Fig. 12 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Most existing pronunciation error detection methods rely on forced alignment of speech segments: the reading text and the speech are input, forced alignment is used to obtain the speech segment corresponding to each phoneme, a representation vector is constructed for each segment, and the error detection result is predicted from that vector. However, when error detection is performed at the phoneme level in a continuous speech stream, phoneme durations are short, so the precision requirements on the forced alignment technique are high, and the errors it introduces strongly affect the subsequent steps, greatly reducing the accuracy of pronunciation error detection. In addition, existing pronunciation error detection methods have low error detection accuracy for confusable sounds.
To address these problems, embodiments of the present invention provide a pronunciation error detection method and apparatus and a pronunciation error detection model training method and apparatus. The pronunciation error detection model is trained on the acoustic features of the speech sample as a whole, which eliminates the errors introduced by forced alignment; in addition, constructing acoustic features for at least one confusing phoneme pair improves the model's ability to distinguish confusable sounds, thereby improving the accuracy of pronunciation error detection.
Fig. 1 is a flowchart illustrating a pronunciation error detection method according to an embodiment of the present invention. The method may be performed by a computer device (e.g., a server). As shown in Fig. 1, the method includes the following steps.
S110: and acquiring the speech signal to be detected and the corresponding reading text.
The speech signal to be error-detected may be formed by people of different genders, ages, and dialects reading the text content aloud.
Specifically, the embodiment of the invention can be applied to spoken-language error detection: given the reading text, phoneme-level error detection is performed on the tester's read-aloud speech. For Chinese, for example, this detects whether the initial, final, and tone of each character are pronounced correctly; for English, whether the vowels, consonants, and stress of each word are correct.
S120: and extracting acoustic features of the speech signal to be detected mistakenly, and converting the reading text into a phoneme sequence.
Specifically, acoustic feature extraction, for example filter-bank feature extraction or MFCC feature extraction, may be performed on the speech signal to obtain the corresponding acoustic features. It should be understood that this is only an exemplary description; the present invention does not specifically limit the way the acoustic features of the speech signal are extracted.
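As a concrete illustration of this step, a minimal numpy sketch of log-mel filter-bank extraction. The frame length, hop size, and mel-band count below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def fbank_features(signal, sample_rate=16000, n_fft=512,
                   frame_len=400, hop=160, n_mels=40):
    """Log-mel filter-bank features (one row per frame, one column per band)."""
    # Split the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filter bank between 0 Hz and Nyquist.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)   # rising edge
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)   # falling edge
    return np.log(power @ fb.T + 1e-10)   # (n_frames, n_mels)

feats = fbank_features(np.random.randn(16000))  # 1 s of fake 16 kHz audio
```

With these settings, one second of audio yields 98 frames of 40-dimensional features.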
Specifically, the reading text may be converted into a phoneme sequence according to the characteristics of the language, and information such as the start and end time of each phoneme may further be marked. A phoneme (phone) is the smallest unit of speech; analysis can be based on the articulatory actions within a syllable, one action constituting one phoneme. For example, Chinese has 32 phonemes, which can be divided into initials and finals, so for a Chinese reading text the corresponding phoneme sequence may consist of the initial and final of each character in turn. Taking the Chinese reading text "pronunciation error detection" as an example, the corresponding phoneme sequence may be "f, a, y, in, j, ian, c, uo". It should be understood that the corresponding phoneme sequence may also be "f, a, y, i, n, j, i, a, n, c, u, o"; the present invention does not specifically limit the form of the phoneme sequence.
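As an illustration of the conversion step, a toy grapheme-to-phoneme lookup for the example above. The four-entry lexicon is purely hypothetical; a real system would use a full pronunciation dictionary:

```python
# Toy syllable -> (initial, final) lexicon; a hypothetical illustration of the
# phoneme-sequence conversion in S120, not a complete G2P system.
LEXICON = {
    "fa":   ("f", "a"),
    "yin":  ("y", "in"),
    "jian": ("j", "ian"),
    "cuo":  ("c", "uo"),
}

def to_phoneme_sequence(syllables):
    """Expand each syllable into its initial and final, in order."""
    phonemes = []
    for syl in syllables:
        initial, final = LEXICON[syl]
        phonemes.extend([initial, final])
    return phonemes

# "pronunciation error detection" (fa yin jian cuo) from the example above:
seq = to_phoneme_sequence(["fa", "yin", "jian", "cuo"])
# seq == ["f", "a", "y", "in", "j", "ian", "c", "uo"]
```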
It should be noted that the pronunciation error detection method provided by the embodiment of the present invention is applicable to different languages. For example, English has 48 phonemes, which can be divided into vowels and consonants, and the phoneme sequence in the embodiment of the present invention may be a sequence of phonetic symbols. The embodiments are described by way of example; the languages of the speech signal to be error-detected and of the reading text are not specifically limited.
S130: acoustic features of at least one pair of aliased phoneme pairs are obtained.
For example, in Chinese, some phonemes have close pronunciations, or are highly confusable in some dialects, which makes pronunciation error detection difficult: the front vs. back nasal finals such as in/ing and en/eng, the easily confused alveolar initials such as l/n and d/t, and the flat-tongue vs. retroflex initials such as z/zh, s/sh, and c/ch.
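The pairs named above can be written down as data. A sketch of how a system might record them and check a phoneme sequence against them; the list contains only the pairs mentioned in the text, not a complete inventory:

```python
# The confusing phoneme pairs named in the text, expressed as data.
CONFUSING_PAIRS = [
    ("in", "ing"), ("en", "eng"),            # front vs back nasal finals
    ("l", "n"), ("d", "t"),                  # easily confused initials
    ("z", "zh"), ("s", "sh"), ("c", "ch"),   # flat-tongue vs retroflex
]

def confusing_phonemes(phoneme_sequence):
    """Return the phonemes of a sequence that belong to some confusing pair."""
    members = {p for pair in CONFUSING_PAIRS for p in pair}
    return [p for p in phoneme_sequence if p in members]

hits = confusing_phonemes(["f", "a", "y", "in", "j", "ian", "c", "uo"])
# hits == ["in", "c"]
```

A check like this also serves the later step of deciding whether the reading text contains a confusing phoneme at all.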
S140: and performing pronunciation error detection based on the acoustic features of the speech signal to be detected mistaken, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence by using a pronunciation error detection model.
Specifically, the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence may be input into the pronunciation error detection model. Based on the acoustic features and the phoneme sequence, the model judges whether the pronunciation of the speech signal matches the reading text, i.e., whether the pronunciation is correct.
In one embodiment of the present invention, the pronunciation error detection model may be an Encoder-Decoder model, as shown in Fig. 2. The acoustic features of the speech signal are fused with the acoustic features of the at least one confusing phoneme pair as the input of the encoder, the phoneme sequence serves as the input of the decoder, and the decoder outputs, for each phoneme, an error detection result indicating whether that phoneme is mispronounced.
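A shape-level numpy sketch of this data flow. The single-projection "encoder", the soft-attention "decoder", and every dimension below are illustrative stand-ins, not the patent's actual network; the point is that each text phoneme is aligned to the audio softly, with no forced alignment:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_in, D_h, V = 98, 130, 64, 60   # frames, fused feature dim, hidden size, phoneme vocab

# Stand-in "encoder": one projection of the fused acoustic features.
acoustic = rng.standard_normal((T, D_in))          # speech features fused with pair features
W_enc = rng.standard_normal((D_in, D_h)) / np.sqrt(D_in)
memory = np.tanh(acoustic @ W_enc)                 # (T, D_h)

# Stand-in "decoder": for each phoneme of the reading text, attend softly over
# the encoder memory and emit a mispronunciation probability.
phoneme_ids = np.array([3, 7, 12, 25, 9, 30, 5, 41])   # "f a y in j ian c uo" (toy ids)
E = rng.standard_normal((V, D_h))                  # phoneme embedding table
queries = E[phoneme_ids]                           # (8, D_h)
scores = queries @ memory.T                        # (8, T) alignment scores
scores -= scores.max(axis=1, keepdims=True)        # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = attn @ memory                            # (8, D_h) per-phoneme context
w_out = rng.standard_normal(D_h) / np.sqrt(D_h)
p_error = 1.0 / (1.0 + np.exp(-(context @ w_out))) # per-phoneme error probability
```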
For example, the pair-specific acoustic features of all the above confusing phoneme pairs can be concatenated with the acoustic features of the speech signal as the acoustic input of the encoder in the pronunciation error detection model. Suppose the acoustic features of the speech signal are 10-dimensional, and the acoustic features for each confusing phoneme pair (in/ing, en/eng, l/n, and so on) are also 10-dimensional; then the features of the 12 confusing phoneme pairs and the features of the speech signal can be concatenated into 130-dimensional acoustic features, which serve as the input features of the encoder.
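The dimension bookkeeping in this example can be checked directly; the random arrays below merely stand in for real features:

```python
import numpy as np

# Frame-wise concatenation matching the dimensions in the example above:
# 10-dim speech features plus twelve 10-dim confusion-pair features -> 130 dims.
n_frames = 100
speech_feats = np.random.randn(n_frames, 10)
pair_feats = [np.random.randn(n_frames, 10) for _ in range(12)]  # one per confusing pair
fused = np.concatenate([speech_feats] + pair_feats, axis=1)
assert fused.shape == (100, 130)   # encoder input: 10 + 12 * 10 = 130 dims per frame
```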
According to the technical solution provided by the embodiment of the invention, the speech signal to be error-detected and the corresponding reading text are acquired; acoustic features of the speech signal are extracted, and the reading text is converted into a phoneme sequence; acoustic features of at least one confusing phoneme pair are acquired; and pronunciation error detection is performed, using the pronunciation error detection model, based on the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence. In this way, the errors introduced by forced alignment can be eliminated, the distinguishability of confusable phonemes improved, and the accuracy of pronunciation error detection improved.
In another embodiment of the present invention, in addition to converting the reading text into a phoneme sequence, a tone sequence of the reading text may be obtained. That is, the above step S120 may include: converting the reading text into a phoneme sequence and a tone sequence. Taking the Chinese reading text "pronunciation error detection" (fa1 yin1 jian3 cuo4) as an example, the corresponding tone sequence is 1, 1, 3, 4. It should be noted that, to make the tone sequence the same length as the phoneme sequence "f, a, y, in, j, ian, c, uo", a null symbol may be added before each element of the tone sequence, giving "null, 1, null, 1, null, 3, null, 4".
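A minimal helper for this alignment step. It assumes one tone per syllable, attached to the final (for "fa1 yin1 jian3 cuo4", tones 1, 1, 3, 4), and interleaves a null symbol before each tone so the tone sequence matches the phoneme sequence length:

```python
def align_tones(phoneme_sequence, tones, pad="null"):
    """Interleave a null symbol so the tone sequence matches the phoneme
    sequence length (one initial + one final per syllable, tone on the final)."""
    assert len(phoneme_sequence) == 2 * len(tones)
    aligned = []
    for t in tones:
        aligned.extend([pad, str(t)])
    return aligned

aligned = align_tones(["f", "a", "y", "in", "j", "ian", "c", "uo"], [1, 1, 3, 4])
# aligned == ["null", "1", "null", "1", "null", "3", "null", "4"]
```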
In this embodiment, step S140 may specifically include: performing pronunciation error detection, using the pronunciation error detection model, based on the acoustic features of the speech signal to be error-detected, the acoustic features of the at least one confusing phoneme pair, the phoneme sequence, and the tone sequence.
Specifically, the acoustic features of the speech signal to be detected are fused with the acoustic features of at least one pair of confusing phoneme pairs as input of the encoder, the phoneme sequence and the tone sequence are used as input of the decoder, and the decoder of the pronunciation error detection model outputs an error detection result about whether each phoneme is in pronunciation error or not.
According to the technical scheme provided by this embodiment of the invention, tone error detection can be performed by the pronunciation error detection model based on the acoustic features of the speech signal, the acoustic features of the at least one confusing phoneme pair, the phoneme sequence, and the tone sequence, so that the accuracy of pronunciation error detection is further improved.
In order to improve the ability of the pronunciation error detection model to distinguish confusing phonemes, in this embodiment of the present invention, a set of acoustic features may be constructed for each confusing phoneme pair. Specifically, as shown in fig. 3, the above step S130 may include:
S131: obtaining acoustic features of a speech sample, wherein the speech sample includes a plurality of speech segments corresponding to each confusing phoneme pair of the at least one confusing phoneme pair, and each confusing phoneme pair includes a first phoneme and a second phoneme.
In particular, large-scale speech recognition data can be used as speech samples for pronunciation error detection tasks.
S132: obtaining a covariance matrix corresponding to each confusing phoneme pair according to the plurality of speech segments corresponding to the first phoneme and the plurality of speech segments corresponding to the second phoneme in that pair.
Specifically, the speech sample may be segmented to obtain a plurality of speech segments corresponding to the first phoneme and a plurality of speech segments corresponding to the second phoneme in each confusing phoneme pair; acoustic features are extracted from the speech segments corresponding to the first phoneme and clustered to obtain N first-class center vectors; acoustic features are extracted from the speech segments corresponding to the second phoneme and clustered to obtain N second-class center vectors; and dimensionality reduction is performed on the N first-class center vectors and the N second-class center vectors to obtain the covariance matrix.
This embodiment of the invention takes the in/ing confusing phoneme pair as an example.
Specifically, large-scale speech recognition data may be used as the speech samples for the pronunciation error detection task, and a forced alignment technique may be used to obtain the speech segments corresponding to all occurrences of the first phoneme (in) and the second phoneme (ing) in the speech samples.
First, filter bank features can be extracted from all the speech segments corresponding to in, and k-means clustering is performed on these features to obtain N class center vectors c_1^in, ..., c_N^in. Similarly, filter bank features are extracted from all the speech segments corresponding to ing, and k-means clustering is performed on them to obtain N class center vectors c_1^ing, ..., c_N^ing. Second, the 2N class center vectors c_1^in, ..., c_N^in and c_1^ing, ..., c_N^ing can be subjected to dimensionality reduction using the Principal Component Analysis (PCA) method to obtain the covariance matrix W.
It should be understood that other clustering methods may also be used to cluster the filter bank features, and the clustering method is not specifically limited in the present invention. In addition, the dimensionality reduction of the 2N class center vectors may be performed in other manners, and the dimensionality reduction method is not specifically limited in the present invention.
S133: fusing the acoustic features of the speech sample with the covariance matrix corresponding to each confusing phoneme pair, respectively, to obtain the acoustic features for each confusing phoneme pair.
Specifically, the acoustic features for each confusing phoneme pair may be obtained by multiplying the acoustic features of the speech sample by the covariance matrix corresponding to that pair. For example, the filter bank features of all the speech samples are extracted and multiplied by the covariance matrix W to obtain the acoustic features for the in/ing confusing phoneme pair.
It should be noted that, according to the above method, the acoustic features for each pair of confusing phonemes, such as en/eng, l/n, d/t, z/zh, s/sh, c/ch, etc., can be sequentially constructed.
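The construction of a pair-specific feature (steps S131-S133) can be sketched as follows; the dimensions, cluster count, and random data are assumptions, and the k-means/PCA calls use scikit-learn as one possible toolchain (the patent does not prescribe a library). Note the patent calls the PCA result a "covariance matrix W"; in this sketch W is the PCA projection matrix used to project the filter bank features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feat_dim, N = 40, 8                              # filter-bank dim, clusters per phoneme (assumed)
in_frames = rng.normal(size=(500, feat_dim))     # frames of all "in" segments (synthetic)
ing_frames = rng.normal(size=(500, feat_dim))    # frames of all "ing" segments (synthetic)

# N class centers per phoneme via k-means clustering
centers_in = KMeans(n_clusters=N, n_init=10, random_state=0).fit(in_frames).cluster_centers_
centers_ing = KMeans(n_clusters=N, n_init=10, random_state=0).fit(ing_frames).cluster_centers_

# dimensionality reduction of the 2N class centers with PCA -> matrix W
pca = PCA(n_components=10).fit(np.vstack([centers_in, centers_ing]))
W = pca.components_                              # shape (10, feat_dim)

# fuse a speech sample's filter-bank features with W (multiplication, step S133)
sample_frames = rng.normal(size=(120, feat_dim))
in_ing_feats = sample_frames @ W.T               # targeted features for the in/ing pair
print(in_ing_feats.shape)  # (120, 10)
```

Repeating this for en/eng, l/n, d/t, z/zh, s/sh, c/ch, etc. yields one targeted feature stream per confusing phoneme pair.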
It should be noted that the acoustic features of the at least one confusing phoneme pair may be constructed in advance (for example, the acoustic features of the at least one confusing phoneme pair used in training the pronunciation error detection model may be reused), and during pronunciation error detection they may be directly spliced with the acoustic features of the speech signal to be detected to serve as the input of the encoder.
In addition, it should be understood that the above description is only an exemplary description, and the present invention does not specifically limit the manner of obtaining the acoustic features of at least one pair of confusing phoneme pairs.
In an embodiment of the present invention, the pronunciation error detection method further includes: judging whether the read-aloud text contains a confusing phoneme; and when the read-aloud text contains a confusing phoneme, performing pronunciation error detection according to the output results of the pronunciation error detection model and the phoneme classification model corresponding to that confusing phoneme.
It should be noted that the phoneme classification model can be obtained by training as follows: segmenting a speech sample to obtain a plurality of speech segments corresponding to the at least one confusing phoneme pair, wherein each confusing phoneme pair includes a first phoneme and a second phoneme; obtaining a vector for each of the speech segments corresponding to the first phoneme, and obtaining a vector for each of the speech segments corresponding to the second phoneme; and training a phoneme classification model corresponding to the first phoneme and the second phoneme according to the vectors of the speech segments corresponding to the first phoneme and the vectors of the speech segments corresponding to the second phoneme, wherein the phoneme classification model outputs the probability that a confusable speech segment belongs to the first phoneme or the second phoneme.
The embodiment of the invention takes in/ing confusing phoneme pairs as an example to explain the training phoneme classification model.
Specifically, the speech samples may be forcibly aligned to obtain the isolated speech segments corresponding to all occurrences of the in and ing phonemes.
For each speech segment corresponding to in or ing, the filter bank features of the segment can be extracted and passed through a forward pass of the encoder of the trained pronunciation error detection model to obtain a group of output vectors; averaging this group of output vectors yields the vector representation of the speech segment.
The vector representations corresponding to all in speech segments are denoted as v_1^in, ..., v_{N_in}^in, and the vector representations corresponding to all ing speech segments are denoted as v_1^ing, ..., v_{N_ing}^ing, where N_in and N_ing represent the number of occurrences of in and ing in the corpus, respectively. A phoneme classification model is then trained on these two sets of vector representations.
In an embodiment of the invention, a binary phoneme classification model can be trained on v_1^in, ..., v_{N_in}^in and v_1^ing, ..., v_{N_ing}^ing through the Support Vector Machine (SVM) algorithm, and denoted as SVM_in/ing. Repeating the above steps for each confusing phoneme pair yields a plurality of SVM classification models, such as SVM_z/zh, SVM_c/ch, SVM_s/sh, and the like. It should be understood that the phoneme classification model may also be a neural network model such as a CNN, and the present invention does not limit the specific method of training the phoneme classification model.
In this embodiment of the invention, when the read-aloud text does not contain a confusing phoneme, error detection can be performed directly according to the output result of the decoder of the trained pronunciation error detection model.
When the read-aloud text contains a confusing phoneme, the results of the pronunciation error detection model and the phoneme classification model can be fused, i.e., pronunciation error detection is performed according to the output results of both models.
Taking the phoneme in as an example: first, the misreading probability p_ed-error at that position is obtained from the output of the decoder of the trained pronunciation error detection model; then the corresponding SVM model SVM_in/ing is found among the phoneme classification models, and the probability that the SVM predicts the segment at that position as ing, i.e., the probability p_svm-error that the SVM model considers the position misread, is obtained. The final prediction may be the average of the two models' outputs, i.e., p_error = (p_ed-error + p_svm-error) / 2.
It should be understood that, in addition to averaging the outputs of the two models, the fusion of the results of the pronunciation error detection model and the phoneme classification model may be a weighted addition of the two outputs, which is not specifically limited by the present invention.
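The two fusion variants can be expressed in one small function; the weight value 0.75 below is purely illustrative.

```python
# Fuse the decoder misreading probability with the SVM misreading probability.
# w = 0.5 reproduces the simple average p_error = (p_ed-error + p_svm-error) / 2;
# other weights give the weighted-addition variant.
def fuse_error_probs(p_ed_error: float, p_svm_error: float, w: float = 0.5) -> float:
    return w * p_ed_error + (1.0 - w) * p_svm_error

print(fuse_error_probs(0.8, 0.6))        # 0.7  (average)
print(fuse_error_probs(0.8, 0.6, 0.75))  # 0.75 (weighted toward the decoder)
```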
According to the technical scheme provided by this embodiment of the invention, when the read-aloud text contains a confusing phoneme, the phoneme classification model classifies the isolated speech segment obtained by forced alignment, and the output results of the pronunciation error detection model and the phoneme classification model are fused for pronunciation error detection, which improves the error detection effect of the pronunciation error detection model on confusing phonemes. When the read-aloud text does not contain a confusing phoneme, pronunciation error detection is performed according to the output result of the pronunciation error detection model alone.
Fig. 4 is a flowchart illustrating a method for training a pronunciation error detection model according to an embodiment of the present invention. The method may be performed by a computer device (e.g., a server). As shown in fig. 4, the method includes the following.
S410: the method comprises the steps of obtaining training samples, wherein the training samples comprise voice signal samples and text samples corresponding to the voice signal samples, and the voice signal samples comprise voice information formed by reading the text samples by readers.
Specifically, the speech signal samples may be speech recognition corpora formed by reading the content of the same or different texts by people of different genders, ages and dialects, where the text samples may be understood as labels for the speech signal samples, each speech signal sample may correspond to one text sample, and one text sample may correspond to multiple speech signal samples.
S420: and extracting acoustic features of the voice signal samples, and converting the text samples into phoneme sequences.
Specifically, acoustic feature parameter extraction, for example filter bank feature extraction or MFCC feature extraction, may be performed on the speech signal samples to obtain the corresponding acoustic features. It should be understood that the above description is only exemplary, and the present invention does not specifically limit the way the acoustic features of the speech signal samples are extracted.
Specifically, the text sample can be converted into a phoneme sequence according to the characteristics of the language, and information such as the start and end time of each phoneme can be annotated. A phoneme (phone) is the smallest unit of speech; it can be analyzed in terms of the articulatory actions within the syllables of words, with one action constituting one phoneme. For example, Chinese has 32 phonemes, which can be divided into initials and finals, so for a Chinese text sample the corresponding phoneme sequence may consist of the initial and final of each character in turn. Taking the Chinese text sample "pronunciation error detection" as an example, the corresponding phoneme sequence may be "f, a, y, in, j, ian, c, uo". It should be understood that the corresponding phoneme sequence may also be "f, a, y, i, n, j, i, a, n, c, u, o", and the form of the phoneme sequence is not specifically limited in the present invention.
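The initial/final conversion for the example text can be illustrated with a tiny syllable table; the `LEXICON` dictionary below is a hypothetical stand-in for a real grapheme-to-phoneme lexicon and covers only the four syllables of the example.

```python
# Hypothetical mini-lexicon mapping pinyin syllables to (initial, final);
# a real system would use a full pronunciation lexicon.
LEXICON = {
    "fa": ("f", "a"), "yin": ("y", "in"),
    "jian": ("j", "ian"), "cuo": ("c", "uo"),
}

def to_phoneme_sequence(syllables):
    phonemes = []
    for s in syllables:
        initial, final = LEXICON[s]
        phonemes.extend([initial, final])   # initial then final, per character
    return phonemes

print(to_phoneme_sequence(["fa", "yin", "jian", "cuo"]))
# ['f', 'a', 'y', 'in', 'j', 'ian', 'c', 'uo']
```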
It should be noted that the pronunciation error detection model training method provided by this embodiment of the present invention is applicable to different languages. For example, English has 48 phonemes, which can be divided into vowels and consonants, and the phoneme sequence in this embodiment may then be a sequence of phonetic symbols. It should be understood that the embodiments of the present invention are described using Chinese as an example, and the language of the speech signal samples and text samples is not specifically limited by the embodiments of the present invention.
S430: acoustic features of at least one pair of aliased phoneme pairs are obtained.
For example, in Chinese, some phonemes are pronounced similarly, or are highly confusable in certain dialects, which tends to make pronunciation error detection difficult; examples include the front/back nasal pairs in/ing and en/eng, the tongue-tip pairs l/n and d/t, and the flat/retroflex pairs z/zh, s/sh, and c/ch.
S440: and performing error detection training on the pronunciation error detection model based on the acoustic features of the voice signal samples, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence.
Specifically, the pronunciation error detection model can be trained by taking the acoustic features of the speech signal samples, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence as input of the pronunciation error detection model. The pronunciation error detection model judges whether the pronunciation of the voice signal sample is matched with the text sample or not, namely whether the pronunciation of the voice signal sample is correct or not, based on the acoustic characteristics of the voice signal sample, the acoustic characteristics of at least one pair of confusing phoneme pairs and the phoneme sequence corresponding to the text sample.
In one embodiment of the present invention, the pronunciation error detection model may be an encoder-decoder model. The fused acoustic features of the speech signal samples and the at least one confusing phoneme pair are used as the input of the encoder, the phoneme sequence is used as the input of the decoder, and the decoder of the pronunciation error detection model outputs an error detection result indicating whether each phoneme is mispronounced.
For example, the targeted acoustic features of all the confusing phoneme pairs can be spliced with the acoustic features of the speech signal samples to serve as the input acoustic features of the encoder in the pronunciation error detection model. For example, suppose the acoustic features of the speech signal samples are 10-dimensional and the acoustic features for the in/ing, en/eng, l/n, and other confusing phoneme pairs are each 10-dimensional; then the targeted acoustic features of the 12 confusing phoneme pairs and the acoustic features of the speech signal samples can be spliced to obtain 130-dimensional acoustic features, which serve as the input features of the encoder, while the phoneme sequence serves as the input features of the decoder for training the pronunciation error detection model.
It should be noted that the decoder in this embodiment of the present invention adopts a non-autoregressive bidirectional structure to enhance the modeling capability of the model. For convenience of description, the pronunciation error detection model is treated as an encoder-decoder model in this and the following embodiments; it should be understood that the present invention is not limited thereto.
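A schematic PyTorch sketch of such an encoder-decoder (all sizes, layer counts, and the vocabulary are assumptions, not the patent's actual architecture): the encoder consumes the 130-dim spliced acoustic features, and the decoder is non-autoregressive and bidirectional, i.e. no causal mask is applied, so every phoneme position attends to the whole sequence, and a per-position classification head predicts misread/correct.

```python
import torch
import torch.nn as nn

class ErrorDetector(nn.Module):
    def __init__(self, feat_dim=130, d_model=64, n_phonemes=60, n_classes=2):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.phone_emb = nn.Embedding(n_phonemes, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.out = nn.Linear(d_model, n_classes)   # per-position error classification

    def forward(self, feats, phonemes):
        memory = self.encoder(self.feat_proj(feats))
        # no tgt_mask: non-autoregressive, bidirectional decoding
        h = self.decoder(self.phone_emb(phonemes), memory)
        return self.out(h)                          # (batch, phoneme_len, n_classes)

model = ErrorDetector()
feats = torch.randn(2, 120, 130)       # 2 utterances, 120 frames, 130-dim spliced features
phones = torch.randint(0, 60, (2, 8))  # e.g. "f a y in j ian c uo" as indices
logits = model(feats, phones)
print(logits.shape)  # torch.Size([2, 8, 2])
```

Tone inputs (for the tone-sequence embodiment) could be added as a second embedding summed with the phoneme embedding; this sketch omits them for brevity.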
According to the technical scheme provided by this embodiment of the invention, performing error detection training on the pronunciation error detection model using the whole acoustic features of the speech signal samples eliminates errors introduced by the forced alignment technique; in addition, constructing acoustic features for at least one confusing phoneme pair and performing error detection training based on the acoustic features of the speech signal samples, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence improves the model's ability to distinguish confusing sounds and thus its accuracy in detecting errors on confusing phonemes.
In another embodiment of the present invention, in addition to converting the text sample into a phoneme sequence, a tone sequence of the text sample may be obtained. That is, the above step S420 may include: converting the text sample into a phoneme sequence and a tone sequence. Taking the Chinese text sample "pronunciation error detection" as an example, the corresponding tone sequence is "1, 1, 3, 4". It should be noted that, in order to make the tone sequence the same length as the phoneme sequence "f, a, y, in, j, ian, c, uo", a null symbol may be inserted before each element of the tone sequence, i.e., it becomes "null, 1, null, 1, null, 3, null, 4".
In this embodiment, the above step S440 may specifically include: performing error detection training on the pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of the at least one confusing phoneme pair, the phoneme sequence, and the tone sequence.
Specifically, the acoustic features of the speech signal samples and the acoustic features of at least one pair of confusing phoneme pairs are fused as input to an encoder, the phoneme sequence and the pitch sequence are used as input to a decoder, and the decoder of the pronunciation error detection model outputs an error detection result as to whether each phoneme is in pronunciation error or not.
According to the technical scheme provided by the embodiment of the invention, the pronunciation error detection model is subjected to error detection training by the aid of the acoustic features of the voice signal samples, the acoustic features of at least one pair of confusing phoneme pairs, the phoneme sequences and the tone sequences, and tone error detection can be performed, so that the accuracy of pronunciation error detection is further improved.
In order to improve the ability of the pronunciation error detection model to distinguish confusing phonemes, in this embodiment of the present invention, a set of acoustic features may be constructed for each confusing phoneme pair. Specifically, the above step S430 may include:
S4310: obtaining acoustic features of a speech sample, wherein the speech sample includes a plurality of speech segments corresponding to each confusing phoneme pair of the at least one confusing phoneme pair, and each confusing phoneme pair includes a first phoneme and a second phoneme.
S4320: and obtaining a covariance matrix corresponding to each pair of confusing phonemes according to the plurality of speech fragments corresponding to the first phoneme and the plurality of speech fragments corresponding to the second phoneme in each pair of confusing phonemes.
Specifically, first, a speech sample is segmented, and a plurality of speech segments corresponding to a first phoneme and a plurality of speech segments corresponding to a second phoneme in each pair of confusing phonemes are obtained.
Secondly, respectively extracting acoustic features of a plurality of voice segments corresponding to the first phoneme and clustering the acoustic features to obtain N first-class central vectors; and respectively extracting acoustic features of a plurality of voice segments corresponding to the second phoneme and clustering the acoustic features to obtain N second-class central vectors.
And then, performing dimensionality reduction on the N first-class central vectors and the N second-class central vectors to obtain a covariance matrix.
S4330: and respectively fusing the acoustic features of the voice sample with the covariance matrixes corresponding to each pair of the confusion phonemes to obtain the acoustic features of each pair of the confusion phonemes.
The embodiment of the invention takes in/ing confusing phoneme pairs as an example.
Specifically, the large-scale speech recognition data may be used as a training sample set for a pronunciation error detection task, and all speech segments corresponding to the first phoneme (in) and the second phoneme (ing) in the training sample set may be obtained by using a forced alignment technique.
First, filter bank features can be extracted from all the speech segments corresponding to in, and k-means clustering is performed on these features to obtain N class center vectors c_1^in, ..., c_N^in. Similarly, filter bank features are extracted from all the speech segments corresponding to ing and clustered with k-means to obtain N class center vectors c_1^ing, ..., c_N^ing. Second, the 2N class center vectors can be subjected to dimensionality reduction using the Principal Component Analysis (PCA) method to obtain the covariance matrix W.
Then, the filter bank features of all the speech samples in the training sample set can be extracted and multiplied by the covariance matrix W to obtain the acoustic features for the in/ing confusing phoneme pair.
It should be noted that, according to the above method, the acoustic features for each pair of confusing phonemes, such as en/eng, l/n, d/t, z/zh, s/sh, c/ch, etc., can be sequentially constructed.
It should be noted that the acoustic features of the at least one confusing phoneme pair may be constructed before the pronunciation error detection model is trained; during training, they may be directly spliced with the acoustic features of the speech samples to serve as the input of the encoder.
In addition, it should be understood that the above description is only an exemplary description, and the present invention does not specifically limit the manner of obtaining the acoustic features of at least one pair of confusing phoneme pairs.
Fig. 5 is a flowchart illustrating a method for training a pronunciation error detection model according to another embodiment of the present invention. The embodiment shown in fig. 5 of the present invention is extended on the basis of the embodiment shown in fig. 4 of the present invention, and the differences between the embodiment shown in fig. 5 and the embodiment shown in fig. 4 will be emphasized below, and the descriptions of the same parts will not be repeated.
As shown in fig. 5, in the method for training a pronunciation error detection model according to an embodiment of the present invention, before performing error detection training on the pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of at least one pair of confusing phoneme pairs, and the phoneme sequence, the method further includes:
S450: replacing some of the phonemes in the phoneme sequence with a mask, and performing speech recognition training on the pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of the at least one confusing phoneme pair, and the mask-replaced phoneme sequence, wherein the pronunciation error detection model recognizes and outputs the phonemes corresponding to the replaced positions.
That is, the pronunciation error detection model may be trained for speech recognition first, and then the pronunciation error detection model may be trained for error detection.
Specifically, the large-scale speech recognition data may be used as the speech signal samples for training the pronunciation error detection model in the embodiment of the present invention, and the text samples of the large-scale speech recognition data may be converted into a phoneme sequence and a tone sequence. At the input of the decoder, parts of the phonemes and tones are randomly deleted, and the deleted parts can be replaced with a mask. For the replaced positions, the corresponding phonemes and tones are predicted at the output of the decoder, and for the positions without replacement, the output is not predicted. Then, a loss value is calculated from the prediction result and the annotation data (i.e., the real phoneme and the tone corresponding to the replaced position), and the parameters of the pronunciation error detection model are updated by back-propagating the loss value until the loss value converges.
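The masked-phoneme data preparation can be sketched as follows; the mask ratio, the `[mask]` token string, and the seed are assumptions, and tones (which the text says are masked the same way) are omitted for brevity.

```python
import random

# Randomly replace phonemes with a [mask] token; the model is trained to
# recover the phoneme only at the replaced positions (step S450).
def mask_phonemes(phonemes, mask_ratio=0.3, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, p in enumerate(phonemes):
        if rng.random() < mask_ratio:
            masked.append("[mask]")
            targets[i] = p            # predict only at replaced positions
        else:
            masked.append(p)
    return masked, targets

seq = ["f", "a", "y", "in", "j", "ian", "c", "uo"]
masked, targets = mask_phonemes(seq)
print(masked)   # ['f', 'a', 'y', '[mask]', 'j', 'ian', 'c', 'uo']
print(targets)  # {3: 'in'}
```

The prediction loss would then be computed only over the positions listed in `targets`, matching the rule that unreplaced positions make no prediction.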
Taking the two-character word "error detection" as an example, as shown in fig. 6, the targeted acoustic features of the 12 confusing phoneme pairs and the acoustic features of the speech signal sample containing the speech "error detection" may be spliced to serve as the input acoustic features of the encoder in the pronunciation error detection model; at the input of the decoder, the phonemes "j" and "ian" in the phoneme sequence and the tone "3" in the tone sequence are deleted, and the deleted positions are replaced with [mask]. The decoder can then recognize, from the input acoustic features, the phonemes "j" and "ian" and the tone "3" corresponding to the positions replaced by [mask].
According to the technical scheme provided by the embodiment of the invention, partial phonemes in the phoneme sequence are replaced by utilizing the mask; based on the acoustic features of the voice signal sample, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence after mask replacement, the pronunciation error detection model is subjected to voice recognition training, so that the pronunciation error detection accuracy of a subsequent pronunciation error detection model can be improved.
In an embodiment of the present invention, the above step S440 may include: replacing some of the phonemes in the phoneme sequence with wrong phonemes; and performing error detection training on the pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence after the wrong-phoneme replacement, wherein the pronunciation error detection model outputs whether each phoneme position was replaced.
Specifically, the large-scale speech recognition data may be used as the speech signal samples for training the pronunciation error detection model in the embodiment of the present invention, and the text samples of the large-scale speech recognition data may be converted into a phoneme sequence and a tone sequence. Randomly replacing a part of the phonemes or tones in the sequence of phonemes with the wrong phonemes at the input of the decoder, so that the output of the decoder predicts whether each position is replaced, for example, 0 may indicate no replacement, i.e. the position is read correctly, and 1 may indicate replacement, i.e. the position is read incorrectly.
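This replace-and-label data preparation can be sketched as follows; the small confusion table, replacement ratio, and seed are illustrative assumptions (a real system could also substitute arbitrary wrong phonemes, not only confusable ones).

```python
import random

# Hypothetical subset of confusable substitutions for illustration.
CONFUSIONS = {"in": "ing", "c": "ch", "z": "zh", "l": "n"}

# Replace some phonemes with a wrong phoneme; label each position
# 0 = kept ("read correctly") or 1 = replaced ("read incorrectly").
def corrupt_phonemes(phonemes, replace_ratio=0.3, seed=1):
    rng = random.Random(seed)
    corrupted, labels = [], []
    for p in phonemes:
        if p in CONFUSIONS and rng.random() < replace_ratio:
            corrupted.append(CONFUSIONS[p])
            labels.append(1)
        else:
            corrupted.append(p)
            labels.append(0)
    return corrupted, labels

seq = ["f", "a", "y", "in", "j", "ian", "c", "uo"]
corrupted, labels = corrupt_phonemes(seq)
print(corrupted)  # ['f', 'a', 'y', 'ing', 'j', 'ian', 'c', 'uo']
print(labels)     # [0, 0, 0, 1, 0, 0, 0, 0]
```

The `labels` vector is exactly the per-position replaced/not-replaced target the decoder is trained to predict.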
Taking the word "error detection" as an example, the phoneme "c" is randomly replaced with the phoneme "d" at the input of the decoder, as shown in fig. 7. The pronunciation error detection model takes the phoneme sequence "j, ian, d, uo" as a reference and judges, according to the input acoustic features, whether each position in the speech signal sample is read correctly. For example, the decoder output in fig. 7 predicts that the position of the phoneme "d" is misread and the remaining positions are read correctly.
According to the technical scheme provided by the embodiment of the invention, part of phonemes in the phoneme sequence are replaced by using wrong phonemes; based on the acoustic features of the voice signal sample, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence after the replacement of the error phoneme, the pronunciation error detection model is subjected to error detection training, pre-training can be provided for the pronunciation error detection model, meanwhile, the problem of obvious misreading can be solved, and the error detection effect of the pronunciation error detection model is improved.
In another embodiment of the present invention, the step S440 may include: and performing error detection training on a pronunciation error detection model based on the acoustic characteristics of the voice signal sample subjected to pronunciation error level labeling, the acoustic characteristics of at least one pair of confusing phoneme pairs and the phoneme sequence, wherein the pronunciation error detection model outputs pronunciation error level classification of each phoneme position.
Specifically, the speech signal samples can be finely annotated manually; for example, according to the pronunciation of each phoneme and tone, each position can be labeled "correct", "wrong", or "defective", where "defective" denotes a relatively mild error between "correct" and "wrong". As shown in fig. 8, the pronunciation error detection model is fine-tuned on this annotated data and performs a three-way classification of "correct", "wrong", and "defective" for each phoneme and tone position, for example with 0 indicating correct, 1 indicating wrong, and 2 indicating defective. It should be understood that "correct", "wrong", and "defective" are only exemplary, and the classification of pronunciation error levels is not specifically limited by the present invention.
According to the technical scheme provided by this embodiment of the invention, the speech signal samples are labeled with pronunciation error levels, and the pronunciation error detection model is given error detection training based on the acoustic features of the labeled samples, the acoustic features of the at least one pair of confusing phoneme pairs, and the phoneme sequence, so that the model outputs a pronunciation error level classification for each phoneme position; this realizes fine-tuning training of the model. Existing schemes are usually trained on general large-scale speech recognition data, so their effect may degrade when the use scene is a specific one (such as a dialect scene), and improving the effect for that scene would require collecting large-scale speech recognition corpora for it at higher cost; the fine-tuning described above alleviates this.
In the case where speech recognition training is performed on the pronunciation error detection model first and error detection training second, after the speech recognition training is completed the speech recognition module (the left-hatched block in fig. 6) in the pronunciation error detection model may be replaced with the pronunciation error detection module (the right-hatched block in fig. 7 or fig. 8) to facilitate the pronunciation error detection training.
In another embodiment of the present invention, the step S440 includes: replacing part of the phonemes in the phoneme sequence with the wrong phonemes; performing error detection training on a pronunciation error detection model based on the acoustic features of the voice signal sample, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence after the replacement of the error phoneme, wherein the pronunciation error detection model outputs whether the position of each phoneme is replaced; and performing error detection training on a pronunciation error detection model based on the acoustic features of the voice signal samples subjected to pronunciation error level labeling, the acoustic features of at least one pair of confusing phoneme pairs and the phoneme sequence, wherein the pronunciation error detection model outputs pronunciation error level classification of each phoneme position. According to the technical scheme provided by the embodiment of the invention, the pronunciation error detection model is trained by the two error detection training methods, so that the pronunciation error detection model can obtain a better error detection effect.
In an embodiment of the present invention, the method for training the pronunciation error detection model further includes: and constructing at least one phoneme classification model corresponding to at least one pair of confused phoneme pairs so as to carry out pronunciation error detection according to the pronunciation error detection model and the output result of the phoneme classification model when the text sample and/or the voice signal sample contains the confused phoneme, wherein each pair of confused phoneme pairs corresponds to one phoneme classification model, each pair of confused phoneme pairs comprises a first phoneme and a second phoneme, and the phoneme classification model is used for outputting the probability that the confused phoneme belongs to the first phoneme or the second phoneme.
Specifically, as shown in fig. 9, the constructing at least one phoneme classification model corresponding to at least one confusing phoneme pair may include:
s910: and segmenting the voice sample to obtain a plurality of voice fragments corresponding to each pair of confusion phoneme pairs in at least one pair of confusion phoneme pairs.
Specifically, taking the in/ing confusing phoneme pair as an example, the speech samples may be force-aligned to obtain all isolated speech segments corresponding to the phonemes in and ing.
S920: and acquiring a vector of each voice fragment in the plurality of voice fragments corresponding to the first phoneme, and acquiring a vector of each voice fragment in the plurality of voice fragments corresponding to the second phoneme.
In an embodiment of the present invention, for each speech segment corresponding to in or ing, the filter bank features of the segment may be extracted, a forward pass may be run through the encoder of the trained pronunciation error detection model to obtain a group of output vectors, and that group of output vectors may be averaged to obtain the vector representation corresponding to the segment.
It should be understood that the vector of each speech segment may also be obtained by other ways, which is not specifically limited by the present invention.
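The averaging just described can be sketched as follows. A minimal sketch: the encoder outputs are assumed to already be available as a frames-by-dimension array, and `segment_vector` is an illustrative name.

```python
import numpy as np

def segment_vector(encoder_outputs):
    """Mean-pool a (frames, dim) matrix of encoder outputs into a single
    fixed-size vector representing the whole speech segment."""
    return np.asarray(encoder_outputs, dtype=float).mean(axis=0)
```

Mean pooling keeps the vector size independent of segment length, so segments of different durations all map to vectors of the encoder's output dimension.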
S930: and training a phoneme classification model corresponding to the first phoneme and the second phoneme according to the vectors of the plurality of speech segments corresponding to the first phoneme and the vectors of the plurality of speech segments corresponding to the second phoneme.
For example, the vector representations corresponding to all in speech segments in step S920 may be recorded as V_in = {v_in(1), …, v_in(N_in)}, and the vector representations corresponding to all ing speech segments as V_ing = {v_ing(1), …, v_ing(N_ing)}, where N_in and N_ing represent the numbers of occurrences of in and ing in the corpus, respectively. A phoneme classification model is then trained according to V_in and V_ing.
In an embodiment of the invention, a phoneme binary classification model may be trained from the in and ing vector representations above through a Support Vector Machine (SVM) algorithm and recorded as SVM_in/ing. Repeating the above steps for each type of confusing phoneme pair yields a plurality of SVM classification models, such as SVM_z/zh, SVM_c/ch, SVM_s/sh, and the like. It should be understood that the phoneme classification model may also be a neural network model such as a CNN; the present invention does not limit the specific method for training the phoneme classification model.
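One possible implementation of the per-pair SVM training described above is sketched below, using scikit-learn's `SVC`. This is a hedged sketch: the segment vectors are assumed to be already computed as in step S920, and `train_pair_classifier` is an illustrative name.

```python
import numpy as np
from sklearn.svm import SVC

def train_pair_classifier(vecs_first, vecs_second):
    """Train a binary classifier for one confusing phoneme pair.

    Label 0 = first phoneme (e.g. 'in'), label 1 = second (e.g. 'ing').
    probability=True lets the trained classifier output the probability
    that a segment belongs to either phoneme, as the text requires.
    """
    X = np.vstack([vecs_first, vecs_second])
    y = np.array([0] * len(vecs_first) + [1] * len(vecs_second))
    return SVC(probability=True).fit(X, y)
```

A dictionary keyed by pair name (e.g. "in/ing", "z/zh") can hold one such classifier per confusing phoneme pair, mirroring SVM_in/ing, SVM_z/zh and so on in the text.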
In an embodiment of the invention, when the text sample and/or the voice signal sample does not contain a confusing phoneme, error detection can be performed directly according to the output result of the decoder of the trained pronunciation error detection model.
When the confusing phoneme is included in the text sample and/or the speech signal sample, the pronunciation error detection model and the result of the phoneme classification model may be fused, that is, pronunciation error detection is performed according to the output results of the pronunciation error detection model and the phoneme classification model.
Taking phoneme in as an example, the probability p_ed-error that the position is misread is first obtained from the output result of the decoder of the trained pronunciation error detection model; the corresponding SVM model SVM_in/ing is then found among the phoneme classification models, and the probability that the SVM predicts ing at this position, that is, the probability p_svm-error that the SVM model considers the position misread, is obtained. The final prediction result may be the average of the output results of the two models, that is, p_error = (p_ed-error + p_svm-error)/2.
It should be understood that, besides averaging the output results of the two models, the fusion of the pronunciation error detection model and phoneme classification model results may also be a weighted addition of the two outputs; the present invention does not specifically limit this.
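The averaging (or weighted addition) of the two models' outputs can be written as a one-liner; `fuse_error_probability` is an illustrative name, and `weight=0.5` reproduces the plain average from the text.

```python
def fuse_error_probability(p_ed_error, p_svm_error, weight=0.5):
    """Fuse the decoder's misreading probability with the pair
    classifier's misreading probability. weight=0.5 reproduces the
    plain average p_error = (p_ed_error + p_svm_error) / 2; other
    weights give the weighted addition also mentioned in the text."""
    return weight * p_ed_error + (1.0 - weight) * p_svm_error
```

The weight would be tuned on held-out data if the two models are not equally reliable.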
According to the technical scheme provided by this embodiment of the invention, a phoneme classification model is trained; when the text sample and/or the voice signal sample contains a confusing phoneme, the phoneme classification model classifies the confusing phoneme on the isolated speech segment obtained by forced alignment, and the output results of the pronunciation error detection model and the phoneme classification model are fused for pronunciation error detection, so that the error detection effect of the pronunciation error detection model on confusing phonemes can be improved.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 10 is a block diagram of a pronunciation error detection apparatus according to an embodiment of the present invention. As shown in fig. 10, the pronunciation error detection apparatus 1000 includes:
the first obtaining module 1010 is configured to obtain a speech signal to be error-detected and a corresponding reading text;
an extracting module 1020, configured to extract an acoustic feature of the speech signal to be detected, and convert the read-aloud text into a phoneme sequence;
a second obtaining module 1030, configured to obtain acoustic features of at least one confusing phoneme pair;
and the error detection module 1040 is configured to perform pronunciation error detection, using a pronunciation error detection model, based on the acoustic features of the speech signal to be error-detected, the acoustic features of at least one pair of confusing phoneme pairs, and the phoneme sequence.
According to the technical scheme provided by this embodiment of the invention, the speech signal to be error-detected and the corresponding reading text are obtained; the acoustic features of the speech signal are extracted and the reading text is converted into a phoneme sequence; the acoustic features of at least one pair of confusing phoneme pairs are obtained; and pronunciation error detection is performed by the pronunciation error detection model based on the acoustic features of the speech signal, the acoustic features of the at least one pair of confusing phoneme pairs, and the phoneme sequence. Errors introduced by the forced alignment technology in pronunciation error detection methods can thereby be eliminated, the distinguishability of confusing phonemes improved, and the accuracy of pronunciation error detection increased.
In an embodiment of the present invention, the second obtaining module 1030 is configured to obtain an acoustic feature of a speech sample, where the speech sample includes a plurality of speech segments corresponding to each confusing phoneme pair in at least one confusing phoneme pair, where each confusing phoneme pair includes a first phoneme and a second phoneme; obtaining a covariance matrix corresponding to each pair of confusion phonemes according to a plurality of voice fragments corresponding to the first phoneme and a plurality of voice fragments corresponding to the second phoneme in each pair of confusion phonemes; and respectively fusing the acoustic features of the voice sample with the covariance matrixes corresponding to each pair of the confusion phonemes to obtain the acoustic features of each pair of the confusion phonemes.
In an embodiment of the present invention, the second obtaining module 1030 is configured to segment the speech sample, and obtain a plurality of speech segments corresponding to a first phoneme and a plurality of speech segments corresponding to a second phoneme in each confusing phoneme pair; respectively extracting acoustic features of a plurality of voice segments corresponding to the first phoneme and clustering the acoustic features to obtain N first-class central vectors; respectively extracting acoustic features of a plurality of voice segments corresponding to the second phoneme and clustering the acoustic features to obtain N second-class central vectors; and reducing the dimensions of the N first-class central vectors and the N second-class central vectors to obtain a covariance matrix.
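The clustering-and-dimension-reduction step performed by this module can be sketched roughly as follows. This is a hedged sketch: the patent does not specify the clustering algorithm or how the 2N class-center vectors are reduced into a covariance matrix, so k-means and a plain covariance over the stacked centroids (via `np.cov`) are used here as one plausible reading; `pair_covariance`, `n_centers`, and the feature arrays are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

def pair_covariance(feats_first, feats_second, n_centers=4, seed=0):
    """For one confusing phoneme pair: cluster each phoneme's segment
    features into n_centers class-center vectors, then derive a single
    covariance matrix from the stacked 2*n_centers centroids."""
    km1 = KMeans(n_clusters=n_centers, n_init=10, random_state=seed)
    km2 = KMeans(n_clusters=n_centers, n_init=10, random_state=seed)
    c1 = km1.fit(feats_first).cluster_centers_   # (N, dim) centers, first phoneme
    c2 = km2.fit(feats_second).cluster_centers_  # (N, dim) centers, second phoneme
    centers = np.vstack([c1, c2])                # (2N, dim) stacked centers
    return np.cov(centers, rowvar=False)         # (dim, dim) covariance matrix
```

The resulting (dim, dim) matrix is what the module would then fuse with the acoustic features of the voice sample to form the pair's acoustic features.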
In an embodiment of the present invention, the pronunciation error detection apparatus further includes a determining module 1050, configured to determine whether the reading text contains a confusing phoneme; when the reading text contains the confusing phoneme, the error detection module is configured to perform pronunciation error detection according to the output results of the pronunciation error detection model and the phoneme classification model corresponding to the confusing phoneme.
The detailed implementation process of the functions and actions of each module in the pronunciation error detection apparatus 1000 is shown in the implementation process of corresponding steps in fig. 1, and is not described herein again.
Fig. 11 is a block diagram of a training apparatus for pronunciation error detection models according to an embodiment of the present invention. As shown in fig. 11, the pronunciation error detection model training apparatus 1100 includes:
a first obtaining module 1110, configured to obtain a training sample, where the training sample includes a speech signal sample and a text sample corresponding to the speech signal sample, where the speech signal sample includes speech information formed by a reader reading the text sample;
an extracting module 1120, configured to extract acoustic features of the speech signal samples, and convert the text samples into a phoneme sequence;
a second obtaining module 1130, configured to obtain acoustic features of at least one confusing phoneme pair;
a training module 1140 for performing error detection training on the pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of at least one pair of confusing phoneme pairs, and the phoneme sequence.
According to the technical scheme provided by the embodiment of the invention, the pronunciation error detection model is subjected to error detection training by utilizing the integral acoustic characteristics of the voice signal sample, so that errors caused by a forced alignment technology can be eliminated; in addition, by constructing acoustic features for at least one pair of confusing phoneme pairs; based on the acoustic characteristics of the voice signal sample, the acoustic characteristics of at least one pair of confusing phoneme pairs and the phoneme sequence, error detection training is carried out on the pronunciation error detection model, so that the distinguishability of the pronunciation error detection model on confusing sounds can be improved, and the accuracy of the pronunciation error detection model on the error detection of the confusing phonemes is improved.
In an embodiment of the present invention, the second obtaining module 1130 is configured to obtain the acoustic features of the speech sample, where the speech sample includes a plurality of speech segments corresponding to each of at least one pair of confusing phoneme pairs, where each pair of confusing phoneme pairs includes a first phoneme and a second phoneme; obtaining a covariance matrix corresponding to each pair of confusion phonemes according to a plurality of voice fragments corresponding to the first phoneme and a plurality of voice fragments corresponding to the second phoneme in each pair of confusion phonemes; and respectively fusing the acoustic features of the voice sample with the covariance matrixes corresponding to each pair of the confusion phonemes to obtain the acoustic features of each pair of the confusion phonemes.
In an embodiment of the present invention, the second obtaining module 1130 is configured to segment the speech sample, and obtain a plurality of speech segments corresponding to the first phoneme and a plurality of speech segments corresponding to the second phoneme in each pair of confusing phonemes; respectively extracting acoustic features of a plurality of voice segments corresponding to the first phoneme and clustering the acoustic features to obtain N first-class central vectors; respectively extracting acoustic features of a plurality of voice segments corresponding to the second phoneme and clustering the acoustic features to obtain N second-class central vectors; and reducing the dimensions of the N first-class central vectors and the N second-class central vectors to obtain a covariance matrix.
In an embodiment of the present invention, before performing error detection training on the pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of at least one pair of confusing phonemes, and the phoneme sequence, the training module 1140 is further configured to replace a part of the phonemes in the phoneme sequence with a mask; and performing voice recognition training on the pronunciation error detection model based on the acoustic characteristics of the voice signal samples and the phoneme sequence after mask replacement, wherein the pronunciation error detection model recognizes and outputs the phonemes corresponding to the replaced positions.
In an embodiment of the present invention, the apparatus for training the pronunciation error detection model further includes a classification module 1150, configured to construct at least one phoneme classification model corresponding to at least one pair of confusing phoneme pairs, so as to perform pronunciation error detection according to the pronunciation error detection model and the output result of the phoneme classification model when the text sample and/or the speech signal sample contains the confusing phoneme, where each pair of confusing phonemes corresponds to one phoneme classification model, each pair of confusing phonemes includes a first phoneme and a second phoneme, and the phoneme classification model is configured to output a probability that the confusing phoneme belongs to the first phoneme or the second phoneme.
In an embodiment of the present invention, the classifying module 1150 is configured to segment a speech sample to obtain a plurality of speech segments corresponding to each pair of confusing phoneme pairs in at least one pair of confusing phoneme pairs; acquiring a vector of each voice fragment in a plurality of voice fragments corresponding to the first phoneme, and acquiring a vector of each voice fragment in a plurality of voice fragments corresponding to the second phoneme; and training a phoneme classification model corresponding to the first phoneme and the second phoneme according to the vectors of the plurality of speech segments corresponding to the first phoneme and the vectors of the plurality of speech segments corresponding to the second phoneme.
In one embodiment of the present invention, the pronunciation error detection model includes an encoder-decoder model.
The detailed implementation process of the functions and actions of each module in the apparatus 1100 is shown in the implementation process of corresponding steps in the embodiments of fig. 4 to fig. 9, and is not described herein again.
Fig. 12 is a block diagram of an electronic device 1200 according to an embodiment of the invention.
Referring to fig. 12, electronic device 1200 includes a processing component 1210 that further includes one or more processors, and memory resources, represented by memory 1220, for storing instructions, such as applications, that are executable by processing component 1210. The application programs stored in memory 1220 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1210 is configured to execute instructions to perform the pronunciation error detection method or the training method of the pronunciation error detection model described above.
The electronic device 1200 may also include a power supply component configured to perform power management of the electronic device 1200, a wired or wireless network interface configured to connect the electronic device 1200 to a network, and an input/output (I/O) interface. The electronic device 1200 may operate based on an operating system stored in the memory 1220, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by a processor of the electronic device 1200, enable the electronic device 1200 to perform a pronunciation error detection method or a pronunciation error detection model training method.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, or the part thereof that essentially contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that the combination of the features in the present application is not limited to the combination described in the claims or the combination described in the embodiments, and all the features described in the present application may be freely combined or combined in any manner unless contradictory to each other.
It should be noted that the above-mentioned embodiments are only specific examples of the present invention, and obviously, the present invention is not limited to the above-mentioned embodiments, and many similar variations exist. All modifications which would occur to one skilled in the art and which are, therefore, directly derived or suggested from the disclosure herein are deemed to be within the scope of the present invention.
It should be understood that the terms such as first, second, etc. used in the embodiments of the present invention are only used for clearly describing the technical solutions of the embodiments of the present invention, and are not used to limit the protection scope of the present invention.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A pronunciation error detection method, comprising:
acquiring a speech signal to be error-detected and a reading text corresponding to the speech signal;
extracting acoustic features of the speech signal to be error-detected, and converting the reading text into a phoneme sequence;
acquiring acoustic features of at least one pair of confusing phoneme pairs;
and performing pronunciation error detection, by using a pronunciation error detection model, based on the acoustic features of the speech signal to be error-detected, the acoustic features of the at least one pair of confusing phoneme pairs, and the phoneme sequence.
2. The pronunciation error detection method of claim 1, wherein the obtaining of the acoustic features of at least one confusing phoneme pair comprises:
obtaining acoustic features of a voice sample, wherein the voice sample comprises a plurality of voice fragments corresponding to each confusing phoneme pair in the at least one confusing phoneme pair, and each confusing phoneme pair comprises a first phoneme and a second phoneme;
obtaining a covariance matrix corresponding to each pair of confusion phonemes according to the plurality of speech fragments corresponding to the first phoneme and the plurality of speech fragments corresponding to the second phoneme in each pair of confusion phonemes;
and respectively fusing the acoustic features of the voice sample with the covariance matrixes corresponding to each pair of confusion phoneme pairs to obtain the acoustic features of each pair of confusion phoneme pairs.
3. The method of claim 2, wherein obtaining a covariance matrix for each confusing phone pair based on the plurality of speech segments for the first phone and the plurality of speech segments for the second phone in the confusing phone pair comprises:
segmenting the voice sample to obtain a plurality of voice fragments corresponding to the first phoneme and a plurality of voice fragments corresponding to the second phoneme in each pair of confusion phoneme pairs;
respectively extracting acoustic features of a plurality of voice segments corresponding to the first phoneme and clustering the acoustic features to obtain N first-class central vectors;
respectively extracting acoustic features of a plurality of voice fragments corresponding to the second phoneme and clustering the acoustic features to obtain N second-class central vectors;
and reducing the dimensions of the N first-class central vectors and the N second-class central vectors to obtain the covariance matrix.
4. The pronunciation error detection method according to any one of claims 1 to 3, further comprising:
judging whether the reading text contains a confusing phoneme;
and when the reading text contains the confused phoneme, performing pronunciation error detection according to the pronunciation error detection model and the output result of the phoneme classification model corresponding to the confused phoneme.
5. A method for training a pronunciation error detection model is characterized by comprising the following steps:
acquiring a training sample, wherein the training sample comprises a voice signal sample and a text sample corresponding to the voice signal sample, and the voice signal sample comprises voice information formed by reading the text sample by a reader;
extracting acoustic features of the voice signal samples, and converting the text samples into phoneme sequences;
acquiring acoustic features of at least one pair of confusing phoneme pairs;
and performing error detection training on a pronunciation error detection model based on the acoustic features of the voice signal samples, the acoustic features of the at least one pair of confusing phoneme pairs and the phoneme sequence.
6. The training method of claim 5, wherein the obtaining the acoustic features of at least one pair of aliased phone pairs comprises:
obtaining acoustic features of a voice sample, wherein the voice sample comprises a plurality of voice fragments corresponding to each confusing phoneme pair in the at least one confusing phoneme pair, and each confusing phoneme pair comprises a first phoneme and a second phoneme;
obtaining a covariance matrix corresponding to each pair of confusion phonemes according to the plurality of speech fragments corresponding to the first phoneme and the plurality of speech fragments corresponding to the second phoneme in each pair of confusion phonemes;
and respectively fusing the acoustic features of the voice sample with the covariance matrixes corresponding to each pair of confusion phoneme pairs to obtain the acoustic features of each pair of confusion phoneme pairs.
7. The training method of claim 6, wherein obtaining the covariance matrix for each confusing phone pair based on the plurality of speech segments for the first phone and the plurality of speech segments for the second phone in each confusing phone pair comprises:
segmenting the voice sample to obtain a plurality of voice fragments corresponding to the first phoneme and a plurality of voice fragments corresponding to the second phoneme in each pair of confusion phoneme pairs;
respectively extracting acoustic features of a plurality of voice segments corresponding to the first phoneme and clustering the acoustic features to obtain N first-class central vectors;
respectively extracting acoustic features of a plurality of voice fragments corresponding to the second phoneme and clustering the acoustic features to obtain N second-class central vectors;
and reducing the dimensions of the N first-class central vectors and the N second-class central vectors to obtain the covariance matrix.
8. A training method according to any of claims 5 to 7, wherein before said error detection training of a pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of the at least one pair of confusing phoneme pairs and the phoneme sequence, the method further comprises:
replacing part of the phonemes in the phoneme sequence with a mask;
and performing voice recognition training on the pronunciation error detection model based on the acoustic characteristics of the voice signal samples and the phoneme sequence after mask replacement, wherein the pronunciation error detection model recognizes and outputs phonemes corresponding to the replaced positions.
9. Training method according to any of claims 5 to 7, further comprising:
constructing at least one phoneme classification model corresponding to the at least one pair of confused phoneme pairs so as to perform pronunciation error detection according to the pronunciation error detection model and the output result of the phoneme classification model when the text sample and/or the voice signal sample contains the confused phoneme, wherein each pair of confused phoneme pairs corresponds to one phoneme classification model, each pair of confused phoneme pairs comprises a first phoneme and a second phoneme, and the phoneme classification model is used for outputting the probability that the confused phoneme belongs to the first phoneme or the second phoneme.
10. The training method according to claim 9, wherein the constructing at least one phoneme classification model corresponding to the at least one confusing phoneme pair comprises:
segmenting a speech sample to obtain a plurality of speech segments corresponding to each confusing phoneme pair of the at least one confusing phoneme pair;
acquiring a vector of each speech segment in the plurality of speech segments corresponding to the first phoneme, and acquiring a vector of each speech segment in the plurality of speech segments corresponding to the second phoneme;
and training a phoneme classification model corresponding to the first phoneme and the second phoneme according to the vectors of the plurality of speech segments corresponding to the first phoneme and the vectors of the plurality of speech segments corresponding to the second phoneme.
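The per-pair classification model of claim 10 learns to separate the two phonemes' segment vectors. A minimal sketch, assuming 8-dimensional segment vectors and a plain logistic-regression classifier trained by gradient descent (the claim does not specify the classifier type or the vector dimensionality):

```python
import numpy as np

def train_pair_classifier(vecs_first, vecs_second, lr=0.1, epochs=200):
    """Fit a logistic-regression separator: label 0 for the first phoneme's
    segment vectors, label 1 for the second phoneme's."""
    X = np.vstack([vecs_first, vecs_second])
    y = np.concatenate([np.zeros(len(vecs_first)), np.ones(len(vecs_second))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(second phoneme) per vector
        grad = p - y                            # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def classify(vec, w, b):
    """Probability that a segment vector belongs to the second phoneme."""
    return float(1.0 / (1.0 + np.exp(-(vec @ w + b))))

# Hypothetical 8-dimensional segment vectors for one confusing pair.
rng = np.random.default_rng(0)
vecs_first = rng.normal(-1.0, 0.5, size=(100, 8))
vecs_second = rng.normal(1.0, 0.5, size=(100, 8))
w, b = train_pair_classifier(vecs_first, vecs_second)
```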
11. A pronunciation error detection apparatus, comprising:
a first acquisition module, configured to acquire a speech signal to be error-detected and a corresponding read-aloud text;
an extraction module, configured to extract acoustic features of the speech signal to be error-detected and to convert the read-aloud text into a phoneme sequence;
a second acquisition module, configured to acquire acoustic features of at least one confusing phoneme pair;
and an error detection module, configured to perform pronunciation error detection, by using a pronunciation error detection model, based on the acoustic features of the speech signal to be error-detected, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence.
12. An apparatus for training a pronunciation error detection model, comprising:
a first acquisition module, configured to acquire training samples, wherein the training samples comprise speech signal samples and text samples corresponding to the speech signal samples, and the speech signal samples comprise speech formed by readers reading the text samples aloud;
an extraction module, configured to extract acoustic features of the speech signal samples and to convert the text samples into phoneme sequences;
a second acquisition module, configured to acquire acoustic features of at least one confusing phoneme pair;
and a training module, configured to perform error detection training on a pronunciation error detection model based on the acoustic features of the speech signal samples, the acoustic features of the at least one confusing phoneme pair, and the phoneme sequence.
13. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1 to 10.
14. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 10.
CN202111660932.5A 2021-12-30 2021-12-30 Pronunciation error detection method and device and pronunciation error detection model training method and device Pending CN114373481A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111660932.5A CN114373481A (en) 2021-12-30 2021-12-30 Pronunciation error detection method and device and pronunciation error detection model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111660932.5A CN114373481A (en) 2021-12-30 2021-12-30 Pronunciation error detection method and device and pronunciation error detection model training method and device

Publications (1)

Publication Number Publication Date
CN114373481A (en) 2022-04-19

Family

ID=81142829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111660932.5A Pending CN114373481A (en) 2021-12-30 2021-12-30 Pronunciation error detection method and device and pronunciation error detection model training method and device

Country Status (1)

Country Link
CN (1) CN114373481A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999526A (en) * 2022-04-29 2022-09-02 北京语言大学 Pronunciation bias detection method and device by using pronunciation attributes and reference texts
CN115440103A (en) * 2022-09-27 2022-12-06 广州优谷信息技术有限公司 Reading evaluation method, system, device and storage medium
WO2024088262A1 (en) * 2022-10-27 2024-05-02 阿里巴巴达摩院(杭州)科技有限公司 Data processing system and method for speech recognition model, and speech recognition method


Similar Documents

Publication Publication Date Title
CN109036464B (en) Pronunciation error detection method, apparatus, device and storage medium
CN110263322B (en) Audio corpus screening method and device for speech recognition and computer equipment
CN114373481A (en) Pronunciation error detection method and device and pronunciation error detection model training method and device
CN112233653B (en) Method, device and equipment for training multi-dialect accent mandarin speech recognition model
CN111429946A (en) Voice emotion recognition method, device, medium and electronic equipment
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN108536654A (en) Identify textual presentation method and device
Prasad et al. How accents confound: Probing for accent information in end-to-end speech recognition systems
CN112002323A (en) Voice data processing method and device, computer equipment and storage medium
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN109166569B (en) Detection method and device for phoneme mislabeling
CN112927679A (en) Method for adding punctuation marks in voice recognition and voice recognition device
Kyriakopoulos et al. Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN109918677B (en) English word semantic analysis method and system
CN115240655A (en) Chinese voice recognition system and method based on deep learning
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN116189657A (en) Multi-mode voice recognition error correction method and system
CN115512689A (en) Multi-language phoneme recognition method based on phoneme pair iterative fusion
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
CN113990351A (en) Sound correction method, sound correction device and non-transient storage medium
JP2007322984A (en) Model learning method, information extracting method, model learning device, information extracting device, model learning program, information extracting program, and recording medium where those programs are recorded
CN110992986B (en) Word syllable stress reading error detection method, device, electronic equipment and storage medium
CN112133325A (en) Wrong phoneme recognition method and device
CN113345409A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination