CN113990351A - Sound correction method, sound correction device and non-transitory storage medium - Google Patents

Sound correction method, sound correction device and non-transitory storage medium

Info

Publication number
CN113990351A
Authority
CN
China
Prior art keywords
phoneme
standard
pronunciation
phonemes
decoded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111283587.8A
Other languages
Chinese (zh)
Inventor
董秋思 (Dong Qiusi)
杨晓飞 (Yang Xiaofei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Shengtong Information Technology Co., Ltd.
Original Assignee
Suzhou Shengtong Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Shengtong Information Technology Co., Ltd.
Priority to CN202111283587.8A
Publication of CN113990351A
Status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 - specially adapted for particular use
    • G10L 25/51 - specially adapted for particular use, for comparison or discrimination
    • G10L 25/03 - characterised by the type of extracted parameters
    • G10L 25/24 - the extracted parameters being the cepstrum
    • G10L 25/27 - characterised by the analysis technique
    • G10L 25/30 - using neural networks
    • G10L 25/72 - specially adapted for particular use, for transmitting results of analysis

Abstract

A sound correction method, a sound correction device, and a non-transitory storage medium. The sound correction method includes: acquiring a word and first audio data; and performing a pronunciation diagnosis operation on the first audio data based on the word to generate a pronunciation diagnosis result. The standard pronunciation of the word includes at least one standard phoneme, and the pronunciation diagnosis operation includes: aligning the first audio data with the standard pronunciation based on a first acoustic model to obtain a time boundary of each standard phoneme; determining a score of each standard phoneme based on its time boundary; performing a recognition operation on the first audio data based on a second acoustic model to obtain a decoded phoneme sequence and a time boundary of each decoded phoneme; determining a score of each decoded phoneme based on its time boundary; determining a correspondence between each standard phoneme and each decoded phoneme; and generating the pronunciation diagnosis result based on the correspondence, the scores of the standard phonemes, and the scores of the decoded phonemes.

Description

Sound correction method, sound correction device and non-transitory storage medium
Technical Field
Embodiments of the present disclosure relate to a sound correction method, a sound correction apparatus, and a non-transitory storage medium.
Background
With the development of science and technology, more and more language learners use language learning applications (APPs) to assist their learning. In some language learning applications, the application provider sends learning materials to a client via the Internet, and the user acquires the learning materials through the client to carry out the corresponding learning. Besides learning grammar and vocabulary, improving pronunciation is also an extremely important part of the language learning process. In general, a user can improve his or her pronunciation by reading aloud, reciting from memory, and the like. However, in most cases, the user cannot know whether the pronunciation is accurate.
Disclosure of Invention
At least some embodiments of the present disclosure provide a sound correction method. The sound correction method comprises: acquiring a word and first audio data; and performing a pronunciation diagnosis operation on the first audio data based on the word to generate a pronunciation diagnosis result; wherein the standard pronunciation of the word comprises at least one standard phoneme, and performing the pronunciation diagnosis operation on the first audio data based on the word to generate the pronunciation diagnosis result comprises: aligning the first audio data with the standard pronunciation based on a first acoustic model to obtain a time boundary of each standard phoneme of the standard pronunciation in the first audio data; determining a score of each standard phoneme according to the audio segment determined by the time boundary of that standard phoneme; performing a recognition operation on the first audio data based on a second acoustic model to obtain a decoded phoneme sequence and a time boundary of each decoded phoneme of the decoded phoneme sequence in the first audio data, wherein the decoded phoneme sequence comprises at least one decoded phoneme; determining a score of each decoded phoneme according to the audio segment determined by the time boundary of that decoded phoneme; determining a correspondence between each standard phoneme in the standard pronunciation and each decoded phoneme in the decoded phoneme sequence; and generating the pronunciation diagnosis result based on the correspondence, the scores of the standard phonemes, and the scores of the decoded phonemes.
For example, in the sound correction method provided by some embodiments of the present disclosure, determining the correspondence between each standard phoneme in the standard pronunciation and each decoded phoneme in the decoded phoneme sequence comprises: performing an edit distance operation on the standard pronunciation and the decoded phoneme sequence, with phonemes as the edit elements, to determine the correspondence.
For example, in the sound correction method provided by some embodiments of the present disclosure, the edit distance operation includes a phoneme replacement operation, and the weights of the phoneme replacement operation between different phonemes are at least not all identical.
For example, in a sound correction method provided by some embodiments of the present disclosure, generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme includes: in response to any one of the standard phonemes having a corresponding decoded phoneme, determining whether a score of the any one of the standard phonemes is lower than a first score threshold; in response to the score of the any one of the standard phonemes being lower than the first score threshold, calculating a boundary coincidence degree between the any one of the standard phonemes and the decoded phoneme corresponding to the any one of the standard phonemes from a time boundary of the any one of the standard phonemes and a time boundary of the decoded phoneme corresponding to the any one of the standard phonemes; and indicating in the pronunciation diagnosis result that a misreading condition has occurred for the any one of the standard phonemes in response to the any one of the standard phonemes being different from the decoded phoneme corresponding to the any one of the standard phonemes and a boundary coincidence degree between the any one of the standard phonemes and the decoded phoneme corresponding to the any one of the standard phonemes being not less than a coincidence degree threshold.
For example, in a sound correction method provided in some embodiments of the present disclosure, the generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme further includes: judging whether the difference between the score of the decoded phoneme corresponding to the standard phoneme and the score of the standard phoneme is not less than a second score threshold value; and indicating in the misreading case that the any one of the standard phonemes is misread into a decoded phoneme corresponding to the any one of the standard phonemes in response to a difference between a score of the decoded phoneme corresponding to the any one of the standard phonemes and a score of the any one of the standard phonemes being not less than the second score threshold.
For example, in the sound correction method provided in some embodiments of the present disclosure, the boundary coincidence degree is calculated according to the following formula:
$$BC = \frac{\min(y_1, y_2) - \max(x_1, x_2)}{\max(y_1, y_2) - \min(x_1, x_2)}$$
where BC denotes the boundary coincidence degree, x1 and y1 denote the start time boundary and the end time boundary of the standard phoneme, respectively, x2 and y2 denote the start time boundary and the end time boundary of the decoded phoneme, respectively, min() is the minimum function, and max() is the maximum function.
For example, in the sound correction method provided by some embodiments of the present disclosure, generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme includes: in response to any standard phoneme not having a decoded phoneme corresponding thereto, indicating in the pronunciation diagnosis result that a missed-reading condition has occurred for that standard phoneme.
For example, in a sound correction method provided by some embodiments of the present disclosure, generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme includes: in response to any of the decoded phonemes not having a corresponding standard phoneme, it is indicated in the pronunciation diagnosis result that a multiple-reading condition has occurred.
For example, in a sound correction method provided in some embodiments of the present disclosure, the generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme further includes: indicating, in the multi-reading case, that the any decoded phoneme is multi-read in response to the score of the any decoded phoneme not being below a third score threshold.
For example, in the sound correction method provided by some embodiments of the present disclosure, performing the pronunciation diagnosis operation on the first audio data based on the word to generate the pronunciation diagnosis result further includes: determining a time boundary of the vowel phoneme in the stressed syllable based on the time boundary of each standard phoneme in the standard pronunciation and the stressed syllable of the standard pronunciation; extracting feature information of a first audio segment determined by the time boundary of the vowel phoneme in the stressed syllable; judging, through a classification model and based on the feature information of the first audio segment, whether the stressed syllable is stressed; and indicating in the pronunciation diagnosis result that the stressed syllable is not stressed in response to the stressed syllable being judged not to be stressed.
For example, in the sound correction method provided by some embodiments of the present disclosure, performing the pronunciation diagnosis operation on the first audio data based on the word to generate the pronunciation diagnosis result further includes: determining a time boundary of the vowel phoneme in the non-stressed syllable based on the time boundary of each standard phoneme in the standard pronunciation and the non-stressed syllable of the standard pronunciation; extracting feature information of a second audio segment determined by the time boundary of the vowel phoneme in the non-stressed syllable; judging, through a classification model and based on the feature information of the second audio segment, whether the non-stressed syllable is stressed; and indicating in the pronunciation diagnosis result that the non-stressed syllable is stressed in response to the non-stressed syllable being judged to be stressed.
For example, in the sound correction method provided by some embodiments of the present disclosure, the score of each standard phoneme and the score of each decoded phoneme are determined based on a pronunciation accuracy algorithm.
For example, the sound correction method provided by some embodiments of the present disclosure further includes: performing sound correction guidance according to the pronunciation diagnosis result.
For example, in the sound correction method provided in some embodiments of the present disclosure, performing the sound correction guidance according to the pronunciation diagnosis result includes: in response to a sound correction operation, displaying the standard pronunciation of the word, the pronunciation diagnosis result, and a text guide, wherein the text guide is used to guide the user to make the correct pronunciation.
For example, in the sound correction method provided in some embodiments of the present disclosure, performing the sound correction guidance according to the pronunciation diagnosis result further includes: playing the text guide as speech synchronously while the text guide is displayed.
For example, the sound correction method provided by some embodiments of the present disclosure further includes: acquiring second audio data for the word, and providing exercise feedback for the second audio data.
For example, in the sound correction method provided by some embodiments of the present disclosure, the pronunciation diagnosis result includes at least one of a syllable error and a pronunciation error, the syllable error includes at least one of a syllable number error and an accent error, and the pronunciation error includes at least one of a misread vowel phone, a misread consonant phone, and a missing consonant phone.
At least some embodiments of the present disclosure also provide a sound correction device. The sound correction device includes: a memory for non-transitory storage of computer-readable instructions; and a processor for executing the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform the sound correction method provided by any embodiment of the present disclosure.
For example, the sound correction device provided by some embodiments of the present disclosure further includes an audio acquisition device configured to acquire the first audio data.
At least some embodiments of the present disclosure also provide a non-transitory storage medium that non-transitorily stores computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, can perform the sound correction method provided by any embodiment of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
Fig. 1 is a flow chart of a method of correcting a tone according to at least some embodiments of the present disclosure;
fig. 2 is an exemplary flowchart corresponding to step S200 shown in fig. 1 provided by at least some embodiments of the present disclosure;
fig. 3 is another exemplary flowchart corresponding to step S200 shown in fig. 1 provided by at least some embodiments of the present disclosure;
FIG. 4A is a schematic diagram of a pronunciation diagnosis result displayed on a sound correction interaction interface according to at least some embodiments of the present disclosure;
FIG. 4B is a schematic diagram of another pronunciation diagnosis result displayed on a sound correction interaction interface according to at least some embodiments of the present disclosure;
FIG. 4C is a schematic diagram of yet another pronunciation diagnosis result displayed on a sound correction interaction interface according to at least some embodiments of the present disclosure;
FIG. 5A is a schematic diagram of a sound correction guidance interface provided in accordance with at least some embodiments of the present disclosure;
FIG. 5B is a schematic diagram of another sound correction guidance interface provided by at least some embodiments of the present disclosure;
FIG. 5C is a schematic diagram of yet another sound correction guidance interface provided by at least some embodiments of the present disclosure;
FIG. 5D is a schematic diagram of yet another sound correction guidance interface provided in at least some embodiments of the present disclosure;
FIG. 6 is a schematic illustration of a transition interface provided in accordance with at least some embodiments of the present disclosure;
FIG. 7A is a schematic illustration of an exercise interface provided in accordance with at least some embodiments of the present disclosure;
FIG. 7B is a schematic illustration of another exercise interface provided in at least some embodiments of the present disclosure;
FIG. 7C is a schematic illustration of a feedback interface provided in accordance with at least some embodiments of the present disclosure;
FIG. 7D is a schematic illustration of another feedback interface provided in at least some embodiments of the present disclosure;
fig. 8 is a schematic block diagram of a sound correction device provided in at least some embodiments of the present disclosure; and
fig. 9 is a schematic block diagram of a non-transitory storage medium provided in at least some embodiments of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The present disclosure is illustrated by the following specific examples. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components have been omitted from the present disclosure. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is denoted by the same or similar reference numeral in each drawing.
A traditional pronunciation evaluation technology can give a score on a 100-point scale for the pronunciation read aloud or repeated by the user, but because it lacks pronunciation diagnosis, the score has little guiding significance for correcting the user's pronunciation. Influenced by deep-rooted native-language habits, the user may be unable to discern the difference between his or her own pronunciation and the demonstration pronunciation; even after hearing the difference, it is often difficult to adjust the vocal organs to the correct position. In this context, more detailed pronunciation diagnosis and correction technologies have begun to emerge. However, existing pronunciation diagnosis and correction technologies support only a limited set of confusable phoneme errors and cannot give targeted pronunciation correction feedback.
At least some embodiments of the present disclosure provide a method of correcting a tone. The sound correcting method comprises the following steps: acquiring a word and first audio data; performing pronunciation diagnosis operation on the first audio data based on the words to generate a pronunciation diagnosis result; wherein the standard pronunciation of the word comprises at least one standard phoneme; performing pronunciation diagnosis operations on the first audio data based on the words to generate pronunciation diagnosis results, comprising: aligning the first audio data with the standard pronunciation based on the first acoustic model to obtain a time boundary of each standard phoneme in the standard pronunciation in the first audio data; determining the score of each standard phoneme according to the audio segment determined by the time boundary of each standard phoneme; performing a recognition operation on the first audio data based on the second acoustic model to obtain a decoded phoneme sequence and a time boundary of each decoded phoneme in the decoded phoneme sequence in the first audio data, wherein the decoded phoneme sequence comprises at least one decoded phoneme; determining a score for each decoded phoneme based on the audio segment determined by the time boundary for each decoded phoneme; determining the corresponding relation between each standard phoneme in the standard pronunciation and each decoding phoneme in the decoding phoneme sequence; and generating a pronunciation diagnosis result based on the corresponding relation, the scores of the standard phonemes and the scores of the decoding phonemes.
Some embodiments of the present disclosure also provide a sound correction apparatus and a non-transitory storage medium corresponding to the sound correction method described above.
The sound correction method provided by the embodiments of the present disclosure performs the pronunciation diagnosis operation based on dual-model, two-pass decoding (the two models being the first acoustic model and the second acoustic model, and the two decoding passes being the alignment operation and the recognition operation). It can obtain a pronunciation diagnosis result conveniently and quickly, enables the user to correct existing pronunciation problems in a targeted manner according to the pronunciation diagnosis result, improves the user's language learning efficiency, and has high practicability.
Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart of a sound correction method provided by at least some embodiments of the present disclosure. For example, the sound correction method may be applied to a computing device, which includes any electronic device having a computing function, such as a smartphone, a notebook computer, a tablet computer, a desktop computer, a server, and the like; the embodiments of the present disclosure are not limited in this respect. For example, the computing device has a central processing unit (CPU) or a graphics processing unit (GPU), and further includes a memory. The memory is, for example, a non-volatile memory (e.g., a read-only memory (ROM)) in which the code of an operating system is stored. For example, the memory further stores code or instructions, and by executing the code or instructions, the sound correction method provided by the embodiments of the present disclosure can be implemented.
For example, as shown in fig. 1, the sound correction method includes the following steps S100 to S400.
Step S100: words and first audio data are obtained.
For example, in some embodiments, the sound correction method shown in fig. 1 may be performed locally, for example, by a client. In this case, the word may be a word arbitrarily selected by the user from various words stored in the client, or may be a preset word provided by the client (for example, a preset word provided by a language learning application on the client); the embodiments of the present disclosure are not limited in this respect. The first audio data (i.e., user audio data) may include speech captured by an audio capture module or device of the client, and embodiments of the present disclosure include but are not limited thereto. For example, the word and the first audio data in step S100 may also be acquired by the client from the network.
For example, the client includes but is not limited to a smartphone, a tablet computer, a personal computer, a personal digital assistant (PDA), a wearable device, a head-mounted display device, a scanning pen, a point-reading pen, and the like; for example, the audio capture module or device includes but is not limited to a microphone built into or external to the client. For example, the first audio data may be pre-recorded or recorded in real time, and the embodiments of the present disclosure are not limited in this respect.
For example, in other embodiments, the sound correction method shown in fig. 1 may also be performed remotely, for example, by a server. In this case, the server may receive the first audio data uploaded by the user through the client (the word in step S100 may be pre-stored on the server, or may be uploaded to the server by the user through the client), then perform the sound correction processing, and return the pronunciation diagnosis result and the like to the client for the user's reference.
For example, the word may be a word in english, french, german, russian, spanish, chinese, japanese, korean, etc., and embodiments of the present disclosure include, but are not limited to, this.
For example, in some embodiments, the standard pronunciation of the word may be looked up through a pronunciation dictionary, but is not so limited. For example, a pronunciation dictionary may include a collection of words and their pronunciations that a general-purpose speech recognition engine can handle. For example, the standard pronunciation of the word is typically a sequence of phonemes, which may include at least one phoneme (i.e., a standard phoneme). In the present disclosure, for convenience of explanation and distinction, a phoneme in the standard pronunciation of a word and a phoneme in a decoded phoneme sequence described later are referred to as a "standard phoneme" and a "decoded phoneme", respectively. It should be understood that in practical applications, most words include multiple phonemes. In the case where the standard pronunciation of the word includes only one standard phoneme, "each standard phoneme" and "individual standard phonemes" in the standard pronunciation are used to refer to the standard phoneme; similarly, where the decoded phoneme sequence includes only one decoded phoneme, "each decoded phoneme" and "individual decoded phonemes" in the decoded phoneme sequence are used to refer to the decoded phoneme. For example, there are 48 kinds of phonemes for the international phonetic alphabet in english, including 20 kinds of vowel phonemes and 28 kinds of consonant phonemes.
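For example, purely as an illustrative sketch (not part of the claimed method), the standard pronunciation of a word may be looked up in a pronunciation dictionary as follows; the dictionary entries and the ARPAbet-style phoneme symbols used here are assumptions made for illustration:

```python
# Minimal sketch of a pronunciation-dictionary lookup (illustrative only).
# The dictionary contents and ARPAbet-style symbols below are assumptions.
PRONUNCIATION_DICT = {
    "apple":  ["AE", "P", "AH", "L"],
    "banana": ["B", "AH", "N", "AE", "N", "AH"],
}

def standard_pronunciation(word: str) -> list:
    """Return the sequence of standard phonemes for a word."""
    phonemes = PRONUNCIATION_DICT.get(word.lower())
    if phonemes is None:
        raise KeyError(f"word not in pronunciation dictionary: {word}")
    return phonemes
```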
For example, in general, the first audio data is speech data of the word read aloud or read after a demonstration by the user (misread phonemes, extra phonemes, accent errors, and the like are allowed to occur), so that the sound correction method can accurately diagnose the user's pronunciation problems and evaluate how close the user's pronunciation of the word is to the standard pronunciation.
Step S200: based on the words, pronunciation diagnosis operations are performed on the first audio data to generate pronunciation diagnosis results.
For example, in some embodiments, the pronunciation diagnosis operation may be used to diagnose problems of misread phonemes, missed phonemes, and multi-read phonemes that may exist, so as to generate the corresponding pronunciation diagnosis result.
Fig. 2 is an exemplary flowchart corresponding to step S200 shown in fig. 1 provided by at least some embodiments of the present disclosure. For example, as shown in fig. 2, step S200 may include the following steps S210 to S260.
Step S210: based on the first acoustic model, the first audio data is aligned with the standard pronunciation to obtain a time boundary of each standard phoneme in the standard pronunciation in the first audio data.
For example, in some examples, step S210 may include: based on the first acoustic model, performing a forced alignment operation on the first audio data and the standard pronunciation by using a forced alignment algorithm, so as to obtain a time boundary of each standard phoneme of the standard pronunciation in the first audio data.
For example, an acoustic model is generally trained using a large amount of training data (e.g., speaker recordings). The acoustic model can determine the likelihood that an audio frame in the first audio data corresponds to any given phoneme, and can be used either to forcibly align the first audio data with the standard pronunciation of the word or to perform phoneme-level free phoneme recognition on the first audio data. For example, in some examples, the acoustic model may be a neural-network-based model, and embodiments of the present disclosure include but are not limited thereto; for example, the neural network may include but is not limited to a time-delay neural network (TDNN), a recurrent neural network (RNN), a long short-term memory (LSTM) network, a bidirectional LSTM (Bi-LSTM) network, and the like. For example, for the specific technical details of acoustic models and the forced alignment algorithm, reference may be made to the related art in the field of natural language processing, which is not repeated here.
For example, the first acoustic model may be an acoustic model suitable for the forced alignment operation. For example, in some examples, the first acoustic model may be a TDNN model or the like, and embodiments of the present disclosure include but are not limited thereto.
For example, through the above forced alignment operation, the first audio data may be divided into at least one audio segment corresponding to the at least one standard phoneme in the standard pronunciation of the word. For example, in some examples, the standard pronunciation of the word includes a plurality of standard phonemes, and the first audio data may be divided by the forced alignment operation into a plurality of audio segments in one-to-one correspondence with the plurality of standard phonemes. For example, in some examples, the audio segment corresponding to each standard phoneme may be identified by the time boundary of that audio segment in the first audio data, e.g., the time boundary includes a start time boundary (start time) and an end time boundary (end time) of the audio segment.
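For example, purely as an illustrative sketch, the output of the forced alignment operation may be represented as a list of phoneme segments, each carrying its start time boundary and end time boundary in the first audio data (the field names and time values below are assumptions):

```python
from dataclasses import dataclass

@dataclass
class PhonemeSegment:
    """One standard phoneme with its time boundary in the first audio data."""
    phoneme: str
    start: float  # start time boundary, in seconds
    end: float    # end time boundary, in seconds

# Hypothetical alignment output for the word "cat" (values are illustrative):
alignment = [
    PhonemeSegment("K",  0.05, 0.18),
    PhonemeSegment("AE", 0.18, 0.40),
    PhonemeSegment("T",  0.40, 0.52),
]
```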
Step S220: the score for each of the canonical phonemes is determined based on the audio segment determined by the time boundary for each of the canonical phonemes.
For example, in some examples, a Goodness of Pronunciation (GOP) algorithm may be used to calculate the score of each standard phoneme based on the audio segment determined by the time boundary of that standard phoneme, and embodiments of the present disclosure include but are not limited thereto.
For example, in some examples, the pronunciation accuracy algorithm may include: first, extracting acoustic feature information of the audio segment determined by the time boundary of each standard phoneme, the acoustic feature information including but not limited to Mel-frequency cepstral coefficients (MFCC) and the like; then, inputting the acoustic feature information into a pre-trained phoneme evaluation model for phoneme evaluation to obtain a GOP value of each standard phoneme; and finally, determining the score of each standard phoneme based on its GOP value. For example, for the specific technical details of the pronunciation accuracy algorithm, reference may be made to the related art in the field of speech processing, which is not repeated here.
For example, the value range of the scores of the standard phonemes may be set according to actual needs, and the embodiments of the present disclosure are not limited in this respect. For example, in some examples, the score of a standard phoneme may range over [0, 100], and embodiments of the present disclosure include but are not limited thereto.
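For example, the following is a simplified sketch, under stated assumptions, of a GOP-style score computation: it assumes that frame-level phoneme posterior probabilities for the aligned audio segment are already available from the phoneme evaluation model, and the mapping from GOP values to a [0, 100] score is an illustrative assumption rather than the formula actually used by the disclosure:

```python
import numpy as np

def gop_score(posteriors: np.ndarray, phoneme_index: int) -> float:
    """Simplified Goodness-of-Pronunciation value for one aligned audio segment.

    posteriors: (num_frames, num_phonemes) frame-level phoneme posterior
                probabilities for the frames inside the phoneme's time boundary.
    phoneme_index: index of the expected (standard or decoded) phoneme.
    """
    eps = 1e-10
    expected = np.log(posteriors[:, phoneme_index] + eps)   # log-posterior of expected phoneme
    best = np.log(posteriors.max(axis=1) + eps)              # log-posterior of best competitor
    return float(np.mean(expected - best))                   # <= 0; closer to 0 is better

def to_percent_score(gop: float, floor: float = -5.0) -> float:
    """Map a GOP value to a [0, 100] score (the mapping and floor are assumptions)."""
    return 100.0 * max(0.0, 1.0 - gop / floor) if gop < 0 else 100.0
```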
Step S230: based on the second acoustic model, a recognition operation is performed on the first audio data to obtain a decoded phoneme sequence and a time boundary of each decoded phoneme in the decoded phoneme sequence in the first audio data.
For example, the second acoustic model may be an acoustic model suitable for a free phoneme recognition operation. For example, the second acoustic model is different from the first acoustic model. For example, compared with the first acoustic model, the second acoustic model may typically be a larger model whose structure may be more refined and complex, and it may be trained with a larger amount of training data. For example, in some examples, the second acoustic model may be a factorized time-delay neural network (TDNN-F) model, an LSTM network, or the like, and embodiments of the present disclosure include but are not limited thereto.
For example, in some examples, the recognition operation may be performed on the first audio data in conjunction with the second acoustic model and a language model to obtain the decoded phoneme sequence and the time boundary of each decoded phoneme of the decoded phoneme sequence in the first audio data. In this case, the acoustic features of the first audio data may be extracted by the second acoustic model and converted into candidate phoneme sequences, and then the final decoded phoneme sequence may be determined from the candidate phoneme sequences by the language model and a decoding operation. For example, the language model may be a unigram language model trained on the pronunciation phoneme sequences of a large amount of training text (e.g., words), including but not limited thereto. For example, the decoding operation may be performed using the Viterbi algorithm, and embodiments of the present disclosure include but are not limited thereto. For example, based on the Viterbi algorithm, an optimal decoding path may be found to determine the decoded phoneme sequence; further, the time boundary of each decoded phoneme may be obtained during the backtracking process after the Viterbi algorithm ends. For example, for the specific technical details of the language model and the Viterbi algorithm, reference may be made to the related art in the field of natural language processing, which is not repeated here.
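For example, the following toy sketch illustrates the idea of free phoneme recognition with a unigram prior and a Viterbi-style search over frame-level posteriors; it is a strong simplification of a real decoder (which would search an HMM/WFST decoding graph), and the switch penalty and frame shift are assumptions:

```python
import numpy as np

def free_phoneme_decode(posteriors: np.ndarray,
                        unigram_logprob: np.ndarray,
                        switch_penalty: float = 2.0,
                        frame_shift: float = 0.01):
    """Toy free-phoneme recognition (illustrative only, not the claimed decoder).

    Staying in the current phoneme is free; switching to a new phoneme pays a
    penalty plus the unigram language-model cost.  Returns a list of
    (phoneme_index, start_time, end_time) tuples with time boundaries.
    """
    T, N = posteriors.shape
    log_post = np.log(posteriors + 1e-10)
    score = log_post[0] + unigram_logprob          # best score ending in each phoneme at frame 0
    back = np.zeros((T, N), dtype=int)             # back-pointers for the backtrace
    for t in range(1, T):
        stay = score                               # remain in the same phoneme
        switch = score.max() - switch_penalty + unigram_logprob   # enter a new phoneme
        best_prev = int(score.argmax())
        choose_switch = switch > stay
        back[t] = np.where(choose_switch, best_prev, np.arange(N))
        score = np.where(choose_switch, switch, stay) + log_post[t]
    # Backtrace the best path, then collapse runs of frames into phoneme segments.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            segments.append((path[start], start * frame_shift, t * frame_shift))
            start = t
    return segments
```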
Step S240: a score is determined for each decoded phoneme based on the audio segment determined by the time boundary for each decoded phoneme.
For example, in some examples, a Goodness of Pronunciation (GOP) algorithm may be employed to calculate the score of each decoded phoneme based on the audio segment determined by the time boundary of that decoded phoneme, and embodiments of the present disclosure include but are not limited thereto. For example, for the details of the pronunciation accuracy algorithm, reference may be made to the related description in step S220, which is not repeated here.
For example, in some examples, the scores of each of the standard phonemes and the scores of each of the decoded phonemes may be calculated using the same pronunciation accuracy algorithm (e.g., using the same phoneme evaluation model) to improve the comparability of the scores of the standard phonemes and the scores of the decoded phonemes to facilitate a more reliable pronunciation diagnosis result.
For example, the score of a decoded phoneme generally has the same value range as the score of a standard phoneme. For example, in some examples, the score of a decoded phoneme may also range over [0, 100], and embodiments of the present disclosure include but are not limited thereto.
Step S250: a correspondence between each of the standard phonemes in the standard pronunciation and each of the decoded phonemes in the sequence of decoded phonemes is determined.
For example, in some examples, the standard pronunciation and the decoded phoneme sequence may be subjected to an edit distance operation with phonemes as edit elements to determine a correspondence between each standard phoneme in the standard pronunciation and each decoded phoneme in the decoded phoneme sequence.
The manner of calculating the edit distance between any two character strings is briefly described below. The edit distance between character strings, also called the Levenshtein edit distance, refers to the minimum number of operations required to convert a character string a{i} into a character string b{j} using character operations (i.e., using characters as the edit elements), where the character operations include: (1) deleting a character, (2) inserting a character, and (3) replacing a character.
For character strings a{i} and b{j}, i represents the length of the character string a{i} (i.e., the number of characters it includes), j represents the length of the character string b{j}, i and j are integers, i ≥ 0, and j ≥ 0. lev(a{i}, b{j}) represents the edit distance between the character strings a{i} and b{j}, and the Levenshtein edit distance algorithm includes the following formula:
$$\mathrm{lev}(a\{i\}, b\{j\}) = \begin{cases} \max(i, j), & \text{if } \min(i, j) = 0 \\ \mathrm{lev}(a\{i{-}1\}, b\{j{-}1\}), & \text{if } \min(i, j) \neq 0 \text{ and } a[i] = b[j] \\ 1 + \min\bigl(\mathrm{lev}(a\{i{-}1\}, b\{j\}),\ \mathrm{lev}(a\{i\}, b\{j{-}1\}),\ \mathrm{lev}(a\{i{-}1\}, b\{j{-}1\})\bigr), & \text{if } \min(i, j) \neq 0 \text{ and } a[i] \neq b[j] \end{cases}$$
where max() is the maximum function, min() is the minimum function, a{i-1} represents the character string formed by the first i-1 characters of the character string a{i}, b{j-1} represents the character string formed by the first j-1 characters of the character string b{j}, a[i] represents the i-th character (i.e., the last character) in the character string a{i}, and b[j] represents the j-th character (i.e., the last character) in the character string b{j}.
The Levenshtein edit distance algorithm expressed by the above formula includes the following:
(1) when the length of one character string is 0 (corresponding to the case min(i, j) = 0), the edit distance is the length of the other character string;
(2) when the lengths of the character strings a{i} and b{j} are both not 0 (corresponding to the case min(i, j) ≠ 0):
if the last characters of the two character strings are the same (corresponding to the case a[i] = b[j]), the last characters of the two character strings a{i} and b{j} may be deleted to obtain two new character strings a{i-1} and b{j-1}, and determining the edit distance between the character strings a{i} and b{j} translates into determining the edit distance between the new character strings a{i-1} and b{j-1};
if the last characters of the two character strings are different (corresponding to the case a[i] ≠ b[j]), determining the edit distance between the character strings a{i} and b{j} translates into determining the minimum of the edit distance between the character strings a{i-1} and b{j}, the edit distance between the character strings a{i} and b{j-1}, and the edit distance between the character strings a{i-1} and b{j-1}, plus 1 for the corresponding character operation.
It should be understood that in the above Levenshtein edit distance algorithm, the weight of each character operation is set to 1, that is, each character operation contributes 1 to the edit distance. In practical applications, the weight of each character operation may be set according to actual requirements.
It should be understood that the above Levenshtein edit distance algorithm starts the comparison from the last characters of the two character strings (i.e., it compares whether the last characters of the two character strings are the same); in practice, this is not a limitation. For example, another Levenshtein edit distance algorithm may start the comparison from the first characters of the two character strings.
It should be appreciated that determining the Levenshtein edit distance is a dynamic programming problem that can be computed by a recursive process according to the above formula. Meanwhile, it should also be understood that, in the dynamic programming process of calculating the Levenshtein edit distance, the correspondence between each character in the character string a{i} and each character in the character string b{j} can be determined (i.e., in the case where the edit distance takes its minimum value, a certain character in the character string a{i} is the same as a certain character in the character string b{j}, and the two correspond to each other).
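For example, the recursion above is usually evaluated bottom-up with a dynamic programming table; a minimal sketch with unit weights:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein edit distance with unit weights (dynamic programming)."""
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between the first i characters of a and the first j of b
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = 1 + min(dist[i - 1][j],      # delete a[i]
                                     dist[i][j - 1],      # insert b[j]
                                     dist[i - 1][j - 1])  # replace a[i] with b[j]
    return dist[m][n]
```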
For example, in some examples, by analogy with the Levenshtein edit distance algorithm described above, an edit distance operation may be performed on the standard pronunciation and the decoded phoneme sequence, with phonemes as the edit elements, to determine the correspondence between each standard phoneme in the standard pronunciation and each decoded phoneme in the decoded phoneme sequence. For example, the standard pronunciation may be regarded as a phoneme string a{i} and the decoded phoneme sequence as a phoneme string b{j}; the phoneme edit distance lev(a{i}, b{j}) between the phoneme strings a{i} and b{j} may then be expressed as:
$$\mathrm{lev}(a\{i\}, b\{j\}) = \begin{cases} \max(i, j), & \text{if } \min(i, j) = 0 \\ \mathrm{lev}(a\{i{-}1\}, b\{j{-}1\}), & \text{if } \min(i, j) \neq 0 \text{ and } a[i] = b[j] \\ \min\bigl(\mathrm{lev}(a\{i{-}1\}, b\{j\}) + f_1,\ \mathrm{lev}(a\{i\}, b\{j{-}1\}) + f_2,\ \mathrm{lev}(a\{i{-}1\}, b\{j{-}1\}) + f_3\bigr), & \text{if } \min(i, j) \neq 0 \text{ and } a[i] \neq b[j] \end{cases}$$
where max() is the maximum function, min() is the minimum function, a{i-1} represents the phoneme string formed by the first i-1 phonemes of the phoneme string a{i}, b{j-1} represents the phoneme string formed by the first j-1 phonemes of the phoneme string b{j}, a[i] represents the i-th phoneme (i.e., the last phoneme) in the phoneme string a{i}, b[j] represents the j-th phoneme (i.e., the last phoneme) in the phoneme string b{j}, f1 represents the weight of the operation of deleting (missing) the phoneme a[i] (i.e., the contribution of that operation to the phoneme edit distance), f2 represents the weight of the operation of inserting (multi-reading) the phoneme b[j] (i.e., the contribution of that operation to the phoneme edit distance), and f3 represents the weight of the operation of replacing (misreading) the phoneme a[i] as the phoneme b[j] (i.e., the contribution of that operation to the phoneme edit distance).
It is to be understood that, in the above-described phoneme edit distance algorithm, the phoneme operation (also referred to as "edit distance operation") may include: (1) deleting a phoneme, (2) inserting a phoneme, and (3) replacing a phoneme. As can be seen from the above phoneme editing distance formula, the phoneme editing distance algorithm includes:
(1) when the length of one phoneme string is 0 (corresponding to the case min(i, j) = 0), the phoneme edit distance is the length of the other phoneme string;
(2) when the lengths of the phoneme strings a{i} and b{j} are both not 0 (corresponding to the case min(i, j) ≠ 0):
if the last phonemes of the two phoneme strings are the same (corresponding to the case a[i] = b[j], in which case the standard phoneme a[i] has the decoded phoneme b[j] corresponding to it), the last phonemes of the two phoneme strings a{i} and b{j} may be deleted to obtain two new phoneme strings a{i-1} and b{j-1}, and determining the phoneme edit distance between the phoneme strings a{i} and b{j} translates into determining the phoneme edit distance between the new phoneme strings a{i-1} and b{j-1};
if the last phonemes of the two phoneme strings are different (corresponding to the case a[i] ≠ b[j]), determining the phoneme edit distance between the phoneme strings a{i} and b{j} translates into determining the minimum of: the phoneme edit distance between the phoneme strings a{i-1} and b{j} plus f1 (corresponding to deleting/missing the phoneme a[i], in which case the phoneme a[i] has no decoded phoneme corresponding to it), the phoneme edit distance between the phoneme strings a{i} and b{j-1} plus f2 (corresponding to inserting/multi-reading the phoneme b[j], in which case the decoded phoneme b[j] has no standard phoneme corresponding to it), and the phoneme edit distance between the phoneme strings a{i-1} and b{j-1} plus f3 (corresponding to replacing/misreading the phoneme a[i] as the phoneme b[j], in which case the standard phoneme a[i] has the decoded phoneme b[j] corresponding to it).
For example, in some examples, the weight f1 may be a constant value; in this case, the deletion/missing possibilities of different phonemes are the same. For example, in other examples, the weight f1 is related to the class of the phoneme a[i], i.e., f1 = f1(a[i]), that is, the weight f1 is a function of the phoneme a[i]; in this case, the deletion/missing possibilities of different phonemes are at least not all the same, and may of course all differ from each other.
For example, in some examples, the weight f2 may be a constant value; in this case, the insertion/multi-reading possibilities of different phonemes are the same. For example, in other examples, the weight f2 is related to the class of the phoneme b[j], i.e., f2 = f2(b[j]), that is, the weight f2 is a function of the phoneme b[j]; in this case, the insertion/multi-reading possibilities of different phonemes are at least not all the same, and may of course all differ from each other.
For example, in some examples, the weight f3 may be a constant value; in this case, the weights of the phoneme replacement operation between different phonemes are the same, i.e., the likelihood of phoneme replacement/misreading between different phonemes is the same. For example, in other examples, the weight f3 is related to the classes of the phoneme a[i] and the phoneme b[j], i.e., f3 = f3(a[i], b[j]), that is, the weight f3 is a function of the phoneme a[i] and the phoneme b[j]; in this case, the weights of the phoneme replacement operation between different phoneme pairs are at least not all identical, i.e., the likelihood of phoneme replacement/misreading between different phoneme pairs is at least not all identical, and may of course all differ from each other.
It should be understood that the types (constant values or function values) of the weights f1, f2 and f3 and their corresponding specific values may be set according to actual needs, and the embodiments of the present disclosure are not limited in this respect. For example, the types and values (or functional relationships) of the weights f1, f2 and f3 may be set according to teaching and research experience, or the pronunciation problems present in a large amount of user pronunciation data may be counted or learned in order to set the types and values (or functional relationships) of the weights f1, f2 and f3. For example, in one specific example, to simplify the sound correction method provided by the embodiments of the present disclosure, the weight f1 may be set to a constant value t1 (e.g., t1 = 1); the weight f2 may be set to a constant value t2 (e.g., t2 = 1); and the weight f3 (corresponding to the case a[i] ≠ b[j]) may be set as a function value, where the weight f3 may be set to a constant value t31 (e.g., t31 = 0.5) when both the phoneme a[i] and the phoneme b[j] are vowel phonemes (i.e., a vowel phoneme is misread as another, confusable vowel phoneme), the weight f3 may be set to a constant value t32 (e.g., t32 = 0.5) when both the phoneme a[i] and the phoneme b[j] are consonant phonemes (i.e., a consonant phoneme is misread as another, confusable consonant phoneme), and the weight f3 may be set to a constant value t33 when one of the phoneme a[i] and the phoneme b[j] is a vowel phoneme and the other is a consonant phoneme. Here, the probability of misreading a vowel phoneme as a consonant phoneme, or a consonant phoneme as a vowel phoneme, is generally smaller than the probability of confusing one vowel phoneme with another or one consonant phoneme with another; therefore, t33 is generally larger than t31 and t32. The values of the constants t1, t2, t31, t32 and t33 may be set according to actual needs, as long as they are greater than 0.
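For example, the following sketch combines the weighted phoneme edit distance with a backtrace that yields the correspondence of step S250; the vowel inventory and the value of t33 are assumptions (the example above only requires t33 to be larger than t31 and t32), and the function is illustrative rather than the claimed implementation:

```python
VOWELS = {"AE", "AH", "IY", "UW", "EH"}   # assumed vowel-phoneme inventory (illustrative)

def f1(p):            # weight of deleting / missing standard phoneme p
    return 1.0        # t1 = 1 (constant, as in the example above)

def f2(p):            # weight of inserting / multi-reading decoded phoneme p
    return 1.0        # t2 = 1

def f3(p, q):         # weight of misreading standard phoneme p as decoded phoneme q
    if (p in VOWELS) == (q in VOWELS):
        return 0.5    # t31 = t32 = 0.5 (vowel-to-vowel or consonant-to-consonant)
    return 1.0        # t33: assumed value, only required to be larger than t31 and t32

def phoneme_alignment(standard, decoded):
    """Weighted phoneme edit distance plus backtrace (correspondence of step S250)."""
    m, n = len(standard), len(decoded)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + f1(standard[i - 1])   # with t1 = 1 this equals i
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + f2(decoded[j - 1])    # with t2 = 1 this equals j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if standard[i - 1] == decoded[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = min(dist[i - 1][j] + f1(standard[i - 1]),
                                 dist[i][j - 1] + f2(decoded[j - 1]),
                                 dist[i - 1][j - 1] + f3(standard[i - 1], decoded[j - 1]))
    # Backtrace: pairs of (standard index or None, decoded index or None).
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and (
                standard[i - 1] == decoded[j - 1]
                or dist[i][j] == dist[i - 1][j - 1] + f3(standard[i - 1], decoded[j - 1])):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1   # match or misreading
        elif i > 0 and dist[i][j] == dist[i - 1][j] + f1(standard[i - 1]):
            pairs.append((i - 1, None)); i -= 1            # missed standard phoneme
        else:
            pairs.append((None, j - 1)); j -= 1            # extra (multi-read) decoded phoneme
    pairs.reverse()
    return dist[m][n], pairs
```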
Step S260: and generating a pronunciation diagnosis result based on the corresponding relation, the scores of the standard phonemes and the scores of the decoding phonemes.
For example, it is possible to determine, based on the correspondence, the scores of the standard phonemes, and the scores of the decoded phonemes, whether a pronunciation problem exists in the first audio data, and further, when a pronunciation problem exists, to determine what kind of pronunciation problem it is, so as to generate the corresponding pronunciation diagnosis result. For example, the pronunciation problem here includes at least one of the problems of misread phonemes, missed phonemes, and multi-read phonemes.
For example, step S260 may include the following steps S261 to S263 to diagnose a possible problem of misreading phonemes and generate a corresponding pronunciation diagnosis result.
Step S261: in response to any of the standard phonemes having a decoded phoneme corresponding thereto, it is determined whether a score of the any of the standard phonemes is below a first score threshold.
For example, in practical applications, the first score threshold may be reasonably set according to the value range of the scores of the standard phonemes. For example, in some examples, assuming that the scores of the standard phonemes range over [0, 100], the value range of the first score threshold may be set to, for example, [50, 70], and embodiments of the present disclosure include but are not limited thereto. For example, in the above example, the first score threshold may be set to 50, 55, 60, 65, 70, etc., as actually required. It will be appreciated that the pronunciation of any standard phoneme can generally be considered accurate if the score of that phoneme is not below the first score threshold.
Step S262: in response to the score of the any one of the standard phonemes being lower than the first score threshold, a boundary coincidence degree between the any one of the standard phonemes and the decoded phoneme corresponding to the any one of the standard phonemes is calculated based on the time boundary of the any one of the standard phonemes and the time boundary of the decoded phoneme corresponding to the any one of the standard phonemes.
For example, in some examples, the boundary overlap ratio may be calculated according to the following formula:
$$BC = \frac{\min(y_1, y_2) - \max(x_1, x_2)}{\max(y_1, y_2) - \min(x_1, x_2)}$$
where BC denotes the boundary coincidence degree, x1 and y1 denote the start time boundary and the end time boundary of the standard phoneme, respectively, x2 and y2 denote the start time boundary and the end time boundary of the decoded phoneme, respectively, min() is the minimum function, and max() is the maximum function. It can be understood that when the boundary coincidence degree calculated according to the above formula is less than or equal to 0, it indicates that the time boundaries of the two phonemes do not overlap.
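For example, assuming the intersection-over-union form of the formula reconstructed above, the boundary coincidence degree may be computed as follows (the numeric example is illustrative):

```python
def boundary_coincidence(x1: float, y1: float, x2: float, y2: float) -> float:
    """Boundary coincidence degree BC between a standard phoneme [x1, y1] and a
    decoded phoneme [x2, y2]; values <= 0 mean the time boundaries do not overlap."""
    return (min(y1, y2) - max(x1, x2)) / (max(y1, y2) - min(x1, x2))

# Example: 80 ms of overlap over a 200 ms union gives BC = 0.4
bc = boundary_coincidence(0.10, 0.26, 0.18, 0.30)   # (0.26 - 0.18) / (0.30 - 0.10) = 0.4
```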
Step S263: and indicating that a misreading condition has occurred for the any standard phoneme in the pronunciation diagnosis result in response to the any standard phoneme being different from the decoded phoneme corresponding to the any standard phoneme and a boundary coincidence degree between the any standard phoneme and the decoded phoneme corresponding to the any standard phoneme being not less than a coincidence degree threshold.
For example, in practical applications, the coincidence degree threshold may be set reasonably according to actual needs. For example, in some examples, the value range of the coincidence degree threshold may be set to, for example, [40%, 60%], and embodiments of the present disclosure include but are not limited thereto. For example, in the above example, the coincidence degree threshold may be set to 40%, 45%, 50%, 55%, 60%, etc., as actually required. It is to be understood that, if the boundary coincidence degree between the any standard phoneme and the decoded phoneme corresponding to it is not less than the coincidence degree threshold, the correspondence between the any standard phoneme and that decoded phoneme determined in the foregoing step S250 can generally be considered accurate.
For example, in steps S261 to S263, if the any standard phoneme is different from the decoded phoneme corresponding to it and the boundary coincidence degree between them is smaller than the coincidence degree threshold, there is generally a high possibility that the correspondence itself is wrong; therefore, it may be indicated in the pronunciation diagnosis result that a misreading situation has occurred for the any standard phoneme, without indicating that the any standard phoneme has been misread as the decoded phoneme corresponding to it. In this way, the adverse effect on the pronunciation diagnosis result of a possible wrong correspondence in the correspondence relation can be avoided. For example, in some examples, step S260 may further include: in response to the any standard phoneme being different from the decoded phoneme corresponding to it and the boundary coincidence degree between them being smaller than the coincidence degree threshold, indicating in the pronunciation diagnosis result that a misreading situation has occurred for the any standard phoneme, without indicating that the any standard phoneme has been misread as the decoded phoneme corresponding to it.
It is to be understood that, in steps S261 to S263, if the standard phoneme is the same as its corresponding decoded phoneme and the boundary coincidence degree between them is not less than the coincidence degree threshold, it usually means that the pronunciation of the standard phoneme is accurate; this case rarely arises here, since it contradicts the precondition in step S262 that the score of the standard phoneme is lower than the first score threshold. In addition, in steps S261 to S263, if the standard phoneme is the same as its corresponding decoded phoneme and the boundary coincidence degree between them is less than the coincidence degree threshold, it usually means that a multi-reading has occurred in the vicinity of the standard phoneme and that the pronunciation of the standard phoneme itself is likely to be accurate; therefore, it is usually not necessary to report a misreading for the standard phoneme in the pronunciation diagnosis result (a multi-reading may instead be indicated with reference to subsequent step S267).
It is to be understood that when the pronunciation diagnosis result indicates that the misreading condition occurs for the any one of the standard phonemes, the position where the misreading condition occurs (for example, the position of the any one of the standard phonemes in the standard pronunciation, etc.) may also be generally indicated.
For example, on the basis of steps S261 to S263, step S260 may further include the following steps S264 to S265 to generate a more detailed pronunciation diagnosis result for the problem of misreading the phoneme.
Step S264: judging whether the difference between the score of the decoded phoneme corresponding to the standard phoneme and the score of the standard phoneme is not less than a second score threshold.
For example, in practical applications, the second score threshold may be set reasonably according to the value range of the scores of the standard phonemes and the decoded phonemes. For example, in some examples, assuming that the value ranges of the scores of the standard phonemes and the decoded phonemes are both [0,100], the value range of the second score threshold may be set to, for example, [20,40], and embodiments of the present disclosure include but are not limited to this. For example, in the above example, the second score threshold may be set to 20, 25, 30, 35, 40, etc., as actually required.
Step S265: in response to the difference between the score of the decoded phoneme corresponding to the standard phoneme and the score of the standard phoneme being not less than the second score threshold, indicating, as part of the misreading indication, that the standard phoneme was misread as its corresponding decoded phoneme.
It is to be understood that, in the case that the score of the standard phoneme is lower than the first score threshold (i.e., its pronunciation is likely to be inaccurate), if the difference between the score of the corresponding decoded phoneme and the score of the standard phoneme is not less than the second score threshold, the decoded phoneme recognized in step S230 may generally be considered reliable; therefore, the misreading indication may specifically state that the standard phoneme was misread as that decoded phoneme, so that the user can understand his or her pronunciation problem more clearly. On the other hand, if the difference between the two scores is smaller than the second score threshold, the decoded phoneme recognized in step S230 may be unreliable; therefore, the pronunciation diagnosis result may only indicate that a misreading has occurred for the standard phoneme, without specifically stating that it was misread as that decoded phoneme, so as to avoid misleading the user.
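For example, the decision logic of steps S261 to S265 for a single standard phoneme and its corresponding decoded phoneme could be sketched as follows. This is only an illustrative sketch: the dictionary-based phoneme representation, the function names, and the concrete threshold values are assumptions, the low-coincidence branch is simplified, and boundary_coincidence() is reused from the sketch above.

    def diagnose_misreading(std, dec, first_score_threshold=60,
                            coincidence_threshold=0.5, second_score_threshold=30):
        """Sketch of steps S261-S265 for one (standard phoneme, decoded phoneme) pair.

        Each phoneme is assumed to be a dict such as
        {"symbol": "ae", "score": 42, "start": 0.10, "end": 0.25}.
        Returns a diagnosis dict, or None if no misreading is reported.
        """
        # S261: only standard phonemes scoring below the first threshold are suspect
        if std["score"] >= first_score_threshold:
            return None

        # S262: boundary coincidence degree between the two time intervals
        bc = boundary_coincidence(std["start"], std["end"], dec["start"], dec["end"])

        if std["symbol"] != dec["symbol"]:
            if bc >= coincidence_threshold:
                # S263: the correspondence is trustworthy, so report a misreading
                diagnosis = {"type": "misread", "phoneme": std["symbol"]}
                # S264/S265: name the substituted phoneme only when the decoded
                # phoneme scores clearly higher, i.e. the recognition is reliable
                if dec["score"] - std["score"] >= second_score_threshold:
                    diagnosis["read_as"] = dec["symbol"]
                return diagnosis
            # Low coincidence: the correspondence is probably wrong; report the
            # misreading and flag the decoded phoneme as an extra (multi-read) phoneme
            return {"type": "misread", "phoneme": std["symbol"],
                    "extra_phoneme": dec["symbol"]}
        return None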
For example, step S260 may include the following step S266 to diagnose possible problems with missing phonemes and generate a corresponding pronunciation diagnosis result.
Step S266: in response to any standard phoneme not having a corresponding decoded phoneme, indicating in the pronunciation diagnosis result that a missed-reading (omission) has occurred for that standard phoneme.
It will be appreciated that, when a missed-reading is indicated in the pronunciation diagnosis result for the standard phoneme, the position where the missed-reading occurs (e.g., the position of the standard phoneme in the standard pronunciation, etc.) may also generally be indicated. It will also be appreciated that, if a standard phoneme has no corresponding decoded phoneme, its score will generally be low (e.g., much lower than the first score threshold); thus, in this case, there is generally no need to simultaneously or further consider the influence of the score of the standard phoneme.
For example, step S260 may further include the following step S267 to diagnose a possible multi-read phoneme problem and generate a corresponding pronunciation diagnosis result.
Step S267: in response to any of the decoded phonemes not having a corresponding standard phoneme, it is indicated in the pronunciation diagnosis result that a multiple-reading condition has occurred.
It will be appreciated that, when it is indicated in the pronunciation diagnosis result that a multi-reading condition has occurred, the position where the multi-reading occurs (e.g., the position of the decoded phoneme relative to a standard phoneme, etc.) may also generally be indicated.
For example, on the basis of step S267, step S260 may further include the following step S268 to generate a more detailed pronunciation diagnosis result for the problem of the multiple reading phoneme.
Step S268: in response to the score of the decoded phoneme being not lower than a third score threshold, indicating, as part of the multi-reading indication, the specific decoded phoneme that was multi-read.
For example, in practical applications, the third score threshold may be reasonably set according to a value range of scores of decoded phonemes. For example, in some examples, assuming that the value range of the score of the decoded phoneme is [0,100], the value range of the third score threshold may be set to [50,70], for example, and embodiments of the present disclosure include, but are not limited to, this. For example, in the above example, the third score threshold value may be set to 50, 55, 60, 65, 70, etc., as actually required. For example, the third score threshold may be the same as or different from the first score threshold.
It is understood that if the score of any decoded phoneme is not lower than the third score threshold, it can be generally considered that any decoded phoneme identified in step S230 is accurate, and therefore, it can be specifically indicated that any decoded phoneme is overread in the case of overreading, so that the user can more clearly know the pronunciation problem of the user; on the other hand, if the score of any decoded phoneme is lower than the third score threshold, it can be generally considered that any decoded phoneme identified in step S230 may be inaccurate, and therefore, it may be indicated in the pronunciation diagnosis result only that the multi-reading condition occurs (and the position where the multi-reading condition occurs), and it may not be specifically indicated what phoneme was multi-read, so as to avoid misleading the user.
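For example, steps S266 to S268 for phonemes left unmatched by the correspondence could be sketched as follows. The mapping-based representation of the correspondence, the function name, and the threshold value are assumptions used only for illustration.

    def diagnose_unmatched(standard_phonemes, decoded_phonemes, correspondence,
                           third_score_threshold=60):
        """Sketch of steps S266-S268: report missed and extra (multi-read) phonemes.

        `correspondence` is assumed to map indices of standard phonemes to indices
        of decoded phonemes; indices absent on either side are unmatched.
        """
        results = []

        # S266: a standard phoneme with no corresponding decoded phoneme was missed
        for i, std in enumerate(standard_phonemes):
            if i not in correspondence:
                results.append({"type": "missed", "phoneme": std["symbol"], "position": i})

        # S267/S268: a decoded phoneme with no corresponding standard phoneme is extra
        matched_dec = set(correspondence.values())
        for j, dec in enumerate(decoded_phonemes):
            if j not in matched_dec:
                extra = {"type": "multi_read", "position": j}
                # Name the extra phoneme only when its score suggests reliable recognition
                if dec["score"] >= third_score_threshold:
                    extra["phoneme"] = dec["symbol"]
                results.append(extra)

        return results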
For example, in some embodiments, the pronunciation diagnosis operation may also be used to diagnose problems such as stress errors to generate corresponding pronunciation diagnosis results. For example, the stress error problem herein includes at least one of a stressed syllable not being stressed and a non-stressed syllable being stressed.
Fig. 3 is another exemplary flowchart corresponding to step S200 shown in fig. 1 provided by at least some embodiments of the present disclosure. For example, as shown in fig. 3, on the basis of step S210 to step S260 shown in fig. 2 (for simplicity of illustration, specific contents of step S210 to step S260 are omitted in fig. 3), step S200 may further include step S271 to step S274 below, so as to diagnose a stress error problem that stress is not applied to the stress syllable, which may exist in the case where the standard pronunciation of the word includes the stress syllable, and generate a corresponding pronunciation diagnosis result.
Step S271: determining a time boundary for a vowel phone in the standard pronunciation based on the time boundary for each standard phone in the standard pronunciation and an accented syllable for the standard pronunciation;
step S272: extracting feature information of a first audio segment determined by a time boundary of a vowel phoneme in the stressed syllable;
step S273: judging whether the stressed syllables are stressed or not through a classification model based on the characteristic information of the first audio frequency segment; and
step S274: in response to the stressed syllable being judged not to be stressed, indicating in the pronunciation diagnosis result that the stressed syllable is not stressed.
For example, in step S271, for an accented syllable of a standard pronunciation, a vowel phone in the accented syllable (the vowel phone is one of the standard phones in the standard pronunciation) may be determined first so that the time boundary of the vowel phone may be determined from the time boundaries of the standard phones in the standard pronunciation.
For example, in some examples, in step S272, the feature information may include at least one of energy (e.g., including a normalized energy value), fundamental frequency (e.g., including a normalized fundamental frequency value), short-time average zero-crossing rate, mel-frequency cepstral coefficients, first order mel-frequency cepstral coefficients, second order mel-frequency cepstral coefficients, and so forth. For example, the method for extracting the characteristic information such as energy, fundamental frequency, short-time average zero-crossing rate, mel-frequency cepstrum coefficient, first-order mel-frequency cepstrum coefficient, second-order mel-frequency cepstrum coefficient, etc. may refer to the related art in the field of natural language processing, and is not described herein again. For example, in one specific example, the feature information includes a normalized energy value and a normalized fundamental frequency value of the vowel phoneme. For example, the normalized energy value of the vowel phoneme may be expressed as a ratio of the average energy value of the first audio segment and the average energy value of the first audio data; similarly, the normalized fundamental frequency value may be expressed as a ratio of the average fundamental frequency value of the first audio segment and the average fundamental frequency value of the first audio data. It should be noted that the embodiments of the present disclosure include but are not limited thereto.
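For example, the normalized energy value and normalized fundamental frequency value of a vowel segment described above could be extracted roughly as follows. This sketch uses the librosa library as one possible toolkit; the normalization by the whole first audio data follows the ratio definitions above, but the concrete functions, sampling rate, and pitch range are assumptions.

    import numpy as np
    import librosa

    def vowel_stress_features(audio_path, seg_start, seg_end, sr=16000):
        """Normalized energy and fundamental-frequency features of a vowel segment.

        `seg_start` / `seg_end` are the time boundaries (in seconds) of the vowel
        phoneme obtained from the forced alignment of the first audio data.
        """
        y, sr = librosa.load(audio_path, sr=sr)
        seg = y[int(seg_start * sr):int(seg_end * sr)]

        # Normalized energy: average energy of the segment / average energy of the utterance
        norm_energy = librosa.feature.rms(y=seg).mean() / (librosa.feature.rms(y=y).mean() + 1e-8)

        # Normalized fundamental frequency: average F0 of the segment / average F0 of the utterance
        f0_seg, _, _ = librosa.pyin(seg, fmin=60, fmax=400, sr=sr)
        f0_utt, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
        norm_f0 = np.nanmean(f0_seg) / (np.nanmean(f0_utt) + 1e-8)

        return np.array([norm_energy, norm_f0])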
For example, in some examples, in step S273, the classification model may be a binary classification model, and may include any one of a Support Vector Machine (SVM) classifier, a Softmax classifier, and the like, for example. It is understood that the classification model may be derived through machine learning. In the machine learning process, feature information of sample audio segments corresponding to vowels in a large number of sample syllables (including stressed syllables and non-stressed syllables) can be extracted as input of a classification model, and the classification model is trained according to stress (for example, artificial labeling) of the sample syllables in sample audio data, so that the trained classification model can predict whether the vowel in a certain syllable is stressed. For example, the training process and details of the classification model can refer to the related art in the field of machine learning, and are not described herein.
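For example, the binary classification model used in step S273 could be trained roughly as follows with scikit-learn, using feature vectors such as those sketched above and manually annotated stress labels of the sample syllables. The SVM matches one of the classifier options named above; the pipeline, kernel, and variable names are illustrative assumptions.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def train_stress_classifier(X, y):
        """X: feature vectors of vowel segments from sample syllables
        (e.g. [normalized energy, normalized F0] as sketched above);
        y: 1 if the sample syllable is stressed in the sample audio data, else 0."""
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        clf.fit(X, y)
        return clf

    # Prediction for step S273 (and S283): is the vowel of this syllable stressed?
    # is_stressed = clf.predict([vowel_features])[0] == 1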
For example, as shown in fig. 3, on the basis of the aforementioned steps S210 to S260 (and steps S271 to S274), step S200 may further include steps S281 to S284 for diagnosing a stress error problem of stress on non-stressed syllables that may exist in the case where the standard pronunciation of the word includes the non-stressed syllables, and generating a corresponding pronunciation diagnosis result.
Step S281: determining a time boundary for a vowel phone in the non-stressed syllable based on the time boundary for each standard phone in the standard pronunciation and the non-stressed syllables of the standard pronunciation;
step S282: extracting feature information of a second audio segment determined by the time boundary of the vowel phoneme in the non-stressed syllable;
step S283: judging whether the non-stressed syllables are stressed or not through a classification model based on the characteristic information of the second audio frequency segment; and
step S284: in response to the non-stressed syllable being judged stressed, indicating in the pronunciation diagnosis result that the non-stressed syllable is stressed.
For example, in step S281, for a non-stressed syllable of a standard pronunciation, a vowel phone in the non-stressed syllable (the vowel phone is one of the standard phones in the standard pronunciation) may be determined first so that the time boundary of the vowel phone may be determined from the time boundaries of the standard phones in the standard pronunciation.
For example, in some examples, the kind of the feature information extracted in step S282 and the kind of the feature information extracted in step S272 may be the same; in this case, the details of the feature information extraction in step S282 may refer to the description related to the feature information extraction in step S272, which is not described herein again. It should be noted that the embodiments of the present disclosure include but are not limited thereto.
For example, in some examples, the classification model used in step S283 may be the same as the classification model used in step S273; in this case, the details of the classification model in step S283 may refer to the related description of the classification model in step S273, which is not repeated herein. It should be noted that the embodiments of the present disclosure include but are not limited thereto. For example, in some embodiments, different classification models may be employed in step S273 and step S283, respectively. For example, the classification model used in step S273 may be a classification model specifically used for predicting whether a stressed syllable is stressed (i.e., whether the vowel in a stressed syllable is stressed), and the classification model used in step S283 may be a classification model specifically used for predicting whether a non-stressed syllable is stressed (i.e., whether the vowel in a non-stressed syllable is stressed); in this case, the type of the feature information extracted in step S282 may be the same as or different from the type of the feature information extracted in step S272.
It is understood that, in some embodiments, before performing steps S271 to S274 and/or steps S281 to S284, it may be determined whether the standard pronunciation of the word includes stressed syllables and/or non-stressed syllables, and then, in the case that the standard pronunciation of the word includes stressed syllables and/or non-stressed syllables, steps S271 to S274 and/or steps S281 to S284 are performed accordingly.
For example, in some embodiments, the pronunciation correction method may classify pronunciation problems existing in the pronunciation diagnosis result into syllabic errors and/or pronunciation errors according to phonetic rules and phonological rules of the language.
For example, syllable errors mainly refer to pronunciation rhythm problems, which are usually expressed as rhythm and/or stress errors of the syllables of a word. For example, the syllable error may include at least one of a syllable number error and a stress error.
For example, the syllable number error mainly refers to syllables being added to or missing from the pronunciation of a word, and typically includes at least one of a multi-read vowel phoneme and a missed vowel phoneme. The syllable number error is usually closely related to native-language transfer. For example, English learners whose native language is Chinese are prone to adding syllables in pronunciation. Since a Chinese syllable mostly consists of an initial consonant plus a final (similar to a consonant plus a vowel in English), if a syllable ends with a consonant similar to a Chinese initial, such as /p/, /b/, /d/, /t/, etc., the learner easily and subconsciously appends a weak vowel (such as a schwa) after it to match Chinese pronunciation rules, so that the pronunciation of the word is erroneously increased by one syllable. It is understood that a multi-read consonant phoneme rarely occurs alone; it generally occurs together with a multi-read syllable (i.e., a multi-read consonant phoneme and vowel phoneme that together form a syllable), and of course this case also belongs to the syllable number error.
For example, the stress error refers to stressing the wrong syllable position during pronunciation, and generally includes at least one of a stressed syllable not being stressed and a non-stressed syllable being stressed.
For example, pronunciation errors mainly refer to confusion in the pronunciation of single or multiple syllables or phonemes of a word, which often manifests as poor pronunciation quality that makes the word hard to understand; it may even manifest as directly pronouncing another word, which changes the meaning. For example, the pronunciation error may include at least one of a misread vowel phoneme, a misread consonant phoneme, and a missed consonant phoneme.
For example, misreading vowel phonemes (also referred to as "confusing vowel phonemes") mainly refers to confusion between vowel phonemes that are similar in pronunciation to some degree, including not only cases where tense and lax vowels are easily confused by such English learners, but also cases where English learners whose native language is Chinese often confuse /e/ and /aI/ due to the influence of Chinese and/or dialects, and the like.
For example, misreading consonant phonemes (also referred to as "confusing consonant phonemes") mainly refers to confusion between consonant phonemes that are similar in pronunciation to some degree, including not only cases where voiced and unvoiced consonants are easily confused by such English learners, but also cases where English learners whose native language is Chinese often confuse /n/ and /l/ due to the influence of Chinese and/or dialects, and the like.
For example, missing consonant phones typically include instances where syllables end with consonant dropout, and the like.
It is to be understood that, in practical applications, the pronunciation problem in the pronunciation diagnosis result may include one or more syllable errors, may also include one or more pronunciation errors, may also include both syllable errors and pronunciation errors, and so on, and the embodiments of the present disclosure are not limited thereto.
For example, the pronunciation diagnosis result generated in step S200 can be presented on an interactive interface of a language learning application program, for example, for a user to read and browse.
Fig. 4A is a schematic diagram of a pronunciation diagnosis result displayed on a word sound correction interaction interface according to at least some embodiments of the present disclosure, fig. 4B is a schematic diagram of another pronunciation diagnosis result displayed on a word sound correction interaction interface according to at least some embodiments of the present disclosure, and fig. 4C is a schematic diagram of another pronunciation diagnosis result displayed on a word sound correction interaction interface according to at least some embodiments of the present disclosure.
For example, as shown in fig. 4A-4C, at least one of the following may be shown on the word-tone correction interactive interface (as shown by the solid black line boxes in the figure): word spelling according to syllable division (as shown, word spelling is divided by small circles), word part of speech (as shown in the figure, "adj." indicating adjectives and "n." indicating nouns, etc.), word phonetic symbol according to syllable division (as shown, word phonetic symbol is divided by short-dashed lines), indication of the number of pronunciation problems (as shown, "pronunciation problem at x", in fig. 4A-4B, x is 2; in fig. 4C, x is 1), "standard pronunciation" button, "your pronunciation" button, and pronunciation diagnosis result. For example, when a user clicks a "standard pronunciation" button on the interactive interface, an exemplary recording corresponding to the standard pronunciation of the word may be played. For example, when the user clicks the "your pronunciation" button on the interactive interface, the user audio data (i.e., the first audio data described above) may be played.
For example, as shown in fig. 4A-4C, the pronunciation diagnosis result may include whether a syllable error and/or a pronunciation error exists, and what kind of syllable error and/or pronunciation error specifically exists in the case that the syllable error and/or the pronunciation error exists.
For example, fig. 4A exemplarily reports the presence of a syllable error (specifically, a syllable number error) and a pronunciation error (specifically, a vowel confusion error, i.e., a misread vowel phoneme). For example, as shown in fig. 4A, when a syllable number error is reported, the number of syllables in the word phonetic symbol, the number of syllables in the user's pronunciation, the difference (more/fewer) between the two, and the like may be indicated. For example, as shown in fig. 4A, when a vowel confusion error is reported, the syllable position containing the erroneous phoneme, the correct phoneme, the classification of the correct phoneme (monophthong/diphthong), the erroneous phoneme pronounced by the user, the classification of the erroneous phoneme (monophthong/diphthong), and the like may be indicated.
For example, fig. 4B exemplarily reports the presence of a syllable error (specifically, a syllable stress error, i.e., a stress error) and a pronunciation error (specifically, a consonant omission error, i.e., a missed consonant phoneme). For example, as shown in fig. 4B, when a syllable stress error is reported, the syllable position that should be stressed in the word, the syllable position that the user actually stressed, whether the originally stressed position was stressed, and the like may be indicated. For example, as shown in fig. 4B, when a missed consonant is reported, the syllable position containing the missed consonant (the second syllable), the position of the missed consonant within the syllable (beginning/end/middle), the missed consonant itself, and the like may be indicated.
For example, fig. 4C exemplarily reports that there is no syllable error and that there is a pronunciation problem (specifically, a consonant confusion error, i.e., a misread consonant phoneme). For example, as shown in fig. 4C, when no syllable error is reported, this can be indicated by a symbol (e.g., a check-mark icon) and text (e.g., "no syllable problem found"), or the like. For example, when a consonant confusion error is reported, the syllable position containing the consonant confusion error (the second syllable), the correct phoneme, the classification of the correct phoneme (consonant), the erroneous phoneme, and the like may be indicated.
It will be appreciated that the number x in the pronunciation issue number indication ("pronunciation issue at x") is equal to the sum of the number of specific pronunciation issues in the pronunciation diagnosis result.
It will be appreciated that, in actual practice, the indicator arrows shown in FIGS. 4A-4C are not part of the interactive interface itself; that is, the interactive interface is displayed without the indicator arrows.
It is understood that the pronunciation diagnosis results other than those shown in fig. 4A-4C can be reported by referring to the reporting manner shown in fig. 4A-4C, and will not be described herein again. It should be noted that the word-correcting interaction interfaces shown in fig. 4A-4C are all exemplary, and the reporting manner of the pronunciation diagnosis result is not limited by the embodiments of the present disclosure.
For example, as shown in FIGS. 4A-4C, a "start sound correction" button may also be displayed on the word sound correction interactive interface. When the user clicks the "start sound correction" button, the sound correction guidance interface may be entered. For example, in some examples, as shown in FIGS. 4A-4C, a "start sound correction" button is provided for each specific pronunciation problem, and when the user clicks that button, a sound correction guidance interface for that specific pronunciation problem may be entered; of course, a sound correction guidance interface for all specific pronunciation problems may also be entered. For example, in other examples, only one "start sound correction" button is provided in the word sound correction interactive interface, and when the user clicks it, the sound correction guidance interface for all specific pronunciation problems can be entered.
It should be noted that, the embodiments of the present disclosure do not limit the layout of the content displayed on the word sound correction interactive interface.
It should be noted that, taking english as an example, the above specific pronunciation problem basically covers all errors of english learners who use chinese as the mother language. Meanwhile, the pronunciation diagnosis operation in the sound correction method provided by the embodiment of the disclosure can realize an error detection rate of more than 90%.
Step S300: carrying out sound correction guidance according to the pronunciation diagnosis result.
For example, in some embodiments, step S300 may include: in response to a sound correction operation (e.g., an operation of clicking the aforementioned "start sound correction" button by a user), a standard pronunciation of a word (e.g., a word phonetic symbol divided according to syllables, etc.), a pronunciation diagnosis result, and a text guide for guiding the user to make a correct pronunciation are presented.
Fig. 5A is a schematic diagram of a rectification guidance interface according to at least some embodiments of the present disclosure. The interface shown in fig. 5A is a rectification guidance interface for the wrong number of syllables. For example, as shown in fig. 5A, the rectification guidance interface may include: word phonetic symbol display according to syllable division (as shown, word phonetic symbol is divided by dash line), visual display of correct syllable (number, lightness), phonetic symbol display of user pronunciation, visual display of wrong syllable of user pronunciation (number, lightness), text guidance, etc. For example, as shown in fig. 5A, in the visual presentation of the syllables, the number of syllables can be represented by the number of circles (of course, other graphic shapes are also possible); at the same time, the lightness of a syllable may also be indicated by the size of the circle, e.g. a larger circle indicates that the corresponding syllable is a stressed syllable, and a smaller circle indicates that the corresponding syllable is a non-stressed syllable. For example, as shown in fig. 5A, in the presentation of the phonetic symbol, the question syllable may be highlighted (for example, brightness and/or color of the question syllable may be changed relative to the normal syllable), and further, the display color of the question phone in the question syllable may be changed (for example, the normal phone is displayed as black, and the question phone is displayed as red, but is not limited thereto); correspondingly, for example, a graphical representation of the question syllable (e.g., the circle in fig. 5A) may also be highlighted in the visualization of the syllable. For example, as shown in fig. 5A, in the interface for guiding the correction of the number of syllables error, the text guide may indicate the number of syllables included in the word phonetic symbol, the number of syllables included in the pronunciation of the user, the consonant of the problem syllable of the pronunciation of the user, the vowel after the consonant of the problem syllable of the pronunciation of the user, the guiding method of the user for correcting the number of syllables error, and the like.
Fig. 5B is a schematic diagram of another audio correction guidance interface provided in at least some embodiments of the present disclosure. The interface shown in fig. 5B is a rectification guidance interface for accent errors. For example, as shown in fig. 5B, the sound correction guidance interface may include: word phonetic symbols segmented according to syllables (as shown, word phonetic symbols are segmented by dashes), visual display of correct syllables (number, lightness), visual display of incorrect syllables pronounced by the user (number, lightness), text guidance, etc. For example, as shown in FIG. 5B, in the presentation of the phonetic symbols, the question syllable that requires the user to focus on may be highlighted (e.g., change the brightness and/or color of the question syllable relative to the normal syllable, etc.). For example, as shown in fig. 5B, in the visual presentation of the syllables, the number of syllables can be represented by the number of circles (of course, other graphic shapes are also possible); at the same time, the lightness of a syllable may also be indicated by the size of the circle, e.g. a larger circle indicates that the corresponding syllable is a stressed syllable, and a smaller circle indicates that the corresponding syllable is a non-stressed syllable. For example, as shown in FIG. 5B, in a visual presentation of syllables (e.g., a visual presentation of the wrong syllable of the user's pronunciation), a graphical representation of the problematic syllable (e.g., a circle in FIG. 5B) may also be highlighted. For example, as shown in fig. 5B, in the pronunciation correction guidance interface for the accent error, the text guidance may indicate the number of syllables included in the word phonetic symbol, a description of the user pronunciation problem, a guidance method for the user to correct the accent error, and the like.
Fig. 5C is a schematic diagram of yet another rectification guidance interface provided in at least some embodiments of the present disclosure. The interface shown in fig. 5C is a correction guidance interface for consonant confusion errors. For example, as shown in fig. 5C, the rectification guidance interface may include: a presentation of word spellings divided according to syllables (as shown, the word spellings are divided by small circles), a presentation of word phonetic symbols divided according to syllables (as shown, the word phonetic symbols are divided by dashes), an actual pronunciation phoneme of the user, and a text guide, etc. For example, as shown in FIG. 5C, the question letter may be highlighted in the presentation of the spelling. For example, as shown in FIG. 5C, in the presentation of the phonetic symbols, the question phonemes may be highlighted. For example, as shown in fig. 5C, in the sound-correction guidance interface for a consonant confusion error, the text guidance may indicate a guidance method (e.g., a correct pronunciation method of a consonant) or the like by which the user corrects the consonant confusion error.
Fig. 5D is a schematic diagram of yet another sound correction guidance interface provided in at least some embodiments of the present disclosure. The interface shown in fig. 5D is a sound correction guidance interface for a consonant omission (missed consonant) error. For example, as shown in fig. 5D, the sound correction guidance interface may include: word spelling segmented according to syllables (as shown, the word spelling is segmented by small circles), word phonetic symbol segmented according to syllables (as shown, the word phonetic symbol is segmented by dashes), the phoneme missed by the user, text guidance, and the like. For example, as shown in fig. 5D, the question letter may be highlighted in the presentation of the spelling. For example, as shown in fig. 5D, in the presentation of the phonetic symbols, the question phoneme may be highlighted. For example, as shown in fig. 5D, in the sound correction guidance interface for the consonant omission error, the text guidance may indicate a guidance method for the user to correct the consonant omission error (e.g., the correct pronunciation method of the missed consonant), and the like.
For example, in some embodiments, as shown in fig. 5A-5D, step S300 may further include: while presenting the text guide, the text guide is played using voice synchronization (see the text prompt "playing rectification guide …" in the upper left corner of FIGS. 5A-5D).
It is understood that the sound correction guidance of other pronunciation problems besides the pronunciation problem shown in fig. 5A-5D can be performed by referring to the sound correction guidance manner shown in fig. 5A-5D, and will not be described herein again. It is also understood that, although the sound correction guidance interface of fig. 5A-5D is used for performing sound correction guidance for one specific pronunciation problem, in practical applications, the sound correction guidance interface may perform sound correction guidance for a plurality of specific pronunciation problems at the same time.
It should be noted that the sound correction guidance interfaces shown in fig. 5A to 5D are all exemplary, and the embodiments of the present disclosure do not limit the sound correction guidance manner. It should be further noted that the embodiments of the present disclosure do not limit the layout of the content displayed on the sound correction guidance interface.
For example, in some embodiments, after the user clicks the aforementioned "start sound correction" button, the interface may switch directly from the word sound correction interactive interface to the sound correction guidance interface. For example, in other embodiments, after the user clicks the "start sound correction" button, the interface may switch from the word sound correction interactive interface to a transition interface and then from the transition interface to the sound correction guidance interface.
Fig. 6 is a schematic illustration of a transition interface provided in at least some embodiments of the present disclosure. For example, as shown in fig. 6, the transition interface may include: the display of word spelling segmented according to syllables (as shown, the word spelling is segmented by small circles), word part of speech (as shown in the figure as "adj." representing adjectives), word phonetic symbol segmented according to syllables (as shown, the word phonetic symbol is segmented by dashes), visual display of correct syllables (number, lightness), an "exemplary recording" control, a "your pronunciation" control, and the like. For example, as shown in fig. 6, in the visualized presentation of the syllables, the number of syllables can be represented by the number of circles (of course, other graphic shapes are also possible); at the same time, the lightness of a syllable may also be indicated by the size of the circle, e.g., a larger circle indicates that the corresponding syllable is a stressed syllable, and a smaller circle indicates that the corresponding syllable is a non-stressed syllable. For example, when a user clicks on the "exemplary recording" control on the transition interface, an exemplary recording corresponding to the standard pronunciation of the word may be played. For example, when the user clicks on the "your pronunciation" control on the transition interface, the user audio data (i.e., the first audio data described above) may be played. For example, while displaying the transition interface, the language learning application may compare the user audio data with the exemplary recording in the background (see the text prompt "comparing pronunciation …" in the upper left corner of FIG. 6) and prepare the content to be presented on the sound correction guidance interface; when the content is ready, the interface switches to the sound correction guidance interface.
Step S400: second audio data about the word is obtained and feedback for the second audio data is provided.
For example, after completing the sound correction guidance, the user may be provided with a pronunciation practice opportunity to verify the sound correction effect. For example, in some embodiments, user exercise audio data (i.e., second audio data) may be captured by an audio capture module or device of the client; then, referring to the step S200, a pronunciation diagnosis operation is performed on the user practice audio data (in this case, the user practice audio data is regarded as the first audio data), if there is no pronunciation problem in the pronunciation diagnosis result of the user practice audio data, a feedback of correct pronunciation is provided to the user, and if there is a pronunciation problem in the pronunciation diagnosis result of the user practice audio data, the steps S300 and S400 are referred to continue to perform the sound correction guidance and provide the practice feedback.
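For example, the practice-and-feedback cycle described above could be organized roughly as follows. The callables passed in are placeholders for the client audio capture, the pronunciation diagnosis of step S200, and the interface components; the function names, the result structure, and the round limit are assumptions for illustration only.

    def practice_loop(word, record_audio, diagnose, show_feedback, show_guidance,
                      max_rounds=3):
        """Sketch of step S400: collect practice audio and give feedback until the
        pronunciation diagnosis result no longer contains pronunciation problems."""
        for _ in range(max_rounds):
            second_audio = record_audio()            # capture user practice audio (client side)
            result = diagnose(word, second_audio)    # pronunciation diagnosis, as in step S200
            if not result["problems"]:
                show_feedback(result)                # correct-pronunciation feedback (cf. Figs. 7C/7D)
                return
            show_guidance(result)                    # repeat the sound correction guidance (step S300)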
Fig. 7A is a schematic illustration of an exercise interface provided in accordance with at least some embodiments of the present disclosure. For example, the interface shown in fig. 7A is a practice interface for a wrong number of syllables (e.g., the wrong number of syllables shown in fig. 5A). For example, as shown in fig. 7A, the exercise interface may include: a presentation of word spellings segmented according to syllables (segmented by small dots as shown), a presentation of word phonetic symbols segmented according to syllables (segmented by dashes as shown), a visual presentation of correct syllables (quantity, weight), "exemplary recording" control (as shown by the sound control icon in the figure), and a "start recording" button, etc.
Fig. 7B is a schematic view of another exercise interface provided in at least some embodiments of the present disclosure. For example, the interface shown in fig. 7B is a practice interface for a consonant confusion error (e.g., the consonant confusion error shown in fig. 5C). For example, as shown in fig. 7B, the exercise interface may include: a word spelling segmented according to syllables (segmented by small circles as shown), a word phonetic symbol segmented according to syllables (segmented by dash as shown), an "exemplary recording" control (shown as a sound control icon in the figure), and a "start recording" button, etc.
It is understood that the practice interfaces shown in fig. 7A-7B may be referenced to provide corresponding practice interfaces for pronunciation issues other than those referred to in fig. 7A-7B, and will not be described in detail herein.
It should be noted that the exercise interfaces shown in fig. 7A-7B are exemplary, and the embodiments of the disclosure do not limit the content displayed on the exercise interface and the layout of the content on the exercise interface.
For example, in some embodiments, upon entering the practice interface, an exemplary recorded voice and follow-up voice prompt may be automatically played (see the text prompt "please follow-up …" in the upper left corner of FIGS. 7A-7B); user exercise audio data is then captured by an audio capture module or device in response to the user clicking (e.g., long pressing) a "start recording" button. It should be noted that the playing order of the exemplary recording and the follow-up voice prompt is not limited by the embodiments of the present disclosure. For example, before clicking (e.g., long pressing) the "start recording" button, the user may click the "exemplary recording" control to repeatedly listen to the exemplary recording.
For example, if there is a pronunciation problem in the pronunciation diagnosis result of the user practice audio data, the pronunciation diagnosis result may be displayed on the corresponding word rectification interactive interface (for example, the word rectification interactive interface shown in fig. 4A-4C may be referred to). On the basis, the related operations of step S300 and step S400 can be further performed.
For example, if there is no pronunciation problem in the results of pronunciation diagnosis of the user practice audio data, a feedback interface with correct pronunciation can be provided to the user.
Fig. 7C is a schematic diagram of a feedback interface provided in accordance with at least some embodiments of the present disclosure. For example, the interface shown in fig. 7C is a feedback interface corresponding to a stress error (e.g., the stress error shown in fig. 5B). For example, as shown in fig. 7C, the feedback interface may include: the presentation of word spelling segmented according to syllables (as shown, the word spelling is segmented by small circles), the presentation of word phonetic symbols segmented according to syllables (as shown, the word phonetic symbols are segmented by dashes), the visual presentation of correct syllables (quantity, lightness), a problem correction prompt (as shown in the figure, including a check-mark icon and the text "your pronunciation syllable is correct"), a "done" button, and the like. For example, as shown in fig. 7C, in the presentation of the spelling, the stressed syllable may be highlighted for emphasis. For example, as shown in fig. 7C, in the presentation of the phonetic symbols, the stressed syllable may be highlighted for emphasis.
Fig. 7D is a schematic diagram of another feedback interface provided in at least some embodiments of the present disclosure. For example, the interface shown in fig. 7D is a feedback interface corresponding to a consonant confusion error (e.g., the consonant confusion error shown in fig. 5C). For example, as shown in fig. 7D, the feedback interface may include: word spelling divided according to syllables (as shown, the word spelling is divided by small circles), word phonetic symbols divided according to syllables (as shown, the word phonetic symbols are divided by dashes), a problem correction prompt (as shown in the figure, including a check-mark icon and the text "your pronunciation no longer confuses /v/ and /w/!"), a "done" button, and the like. For example, as shown in fig. 7D, in the presentation of the spelling, the question letter may be highlighted for emphasis. For example, as shown in fig. 7D, in the presentation of the phonetic symbols, the question phoneme may be highlighted for emphasis.
For example, in some embodiments, the feedback interface may also include an encouraging feedback message (as shown in the upper left corner of FIGS. 7C-7D) to encourage the user. For example, the feedback message may also be played synchronously using voice.
For example, in response to the user clicking (e.g., long pressing) the "done" button, the feedback interface may be closed, ending the exercise of the current word.
It should be noted that, in the embodiment of the present disclosure, the flow of the sound correction method described above may include more or less operations, and these operations may be executed sequentially or in parallel. Although the flow of the sound correction method described above includes a plurality of operations occurring in a specific order, it should be clearly understood that the order of the plurality of operations is not limited. The above-described tone correction method may be performed once or may be performed a plurality of times according to a predetermined condition.
The pronunciation correction method provided by the embodiment of the disclosure is based on the 'dual-model two-pass decoding' to carry out pronunciation diagnosis operation, and can conveniently and quickly obtain pronunciation diagnosis results, so that a user can pertinently correct existing pronunciation problems according to the pronunciation diagnosis results, the language learning efficiency of the user is improved, and the method has higher practicability. In addition, the sound correction method provided by the embodiment of the disclosure can also identify confusion, addition, deletion errors, stress errors and the like of single or multiple phonemes, and can provide sound correction feedback from two dimensions of syllables and pronunciations to guide a user to correct from wrong pronunciations to exemplary pronunciations.
At least some embodiments of the present disclosure also provide a sound correction device. Fig. 8 is a schematic block diagram of a tone correction apparatus provided in at least some embodiments of the present disclosure. For example, as shown in fig. 8, the sound correction device 100 includes a memory 110 and a processor 120.
For example, the memory 110 is used for non-transitory storage of computer readable instructions, and the processor 120 is used for executing the computer readable instructions, and the computer readable instructions are executed by the processor 120 to execute the sound correction method provided by any embodiment of the disclosure.
For example, the memory 110 and the processor 120 may be in direct or indirect communication with each other. For example, in some examples, as shown in fig. 8, the sound correcting apparatus 100 may further include a system bus 130, and the memory 110 and the processor 120 may communicate with each other through the system bus 130, for example, the processor 120 may access the memory 110 through the system bus 130. For example, in other examples, components such as memory 110 and processor 120 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things (Internet of Things) based on the Internet and/or a telecommunications network, and/or any combination thereof, and/or the like. The wired network may communicate by using twisted pair, coaxial cable, or optical fiber transmission, for example, and the wireless network may communicate by using 3G/4G/5G mobile communication network, bluetooth, Zigbee, or WiFi, for example. The present disclosure is not limited herein as to the type and function of the network.
For example, processor 120 may control other components in the sound correction device to perform desired functions. The processor 120 may be a device having data processing capability and/or program execution capability, such as a Central Processing Unit (CPU), Tensor Processor (TPU), or Graphics Processor (GPU). The Central Processing Unit (CPU) may be an X86 or ARM architecture, etc. The GPU may be separately integrated directly onto the motherboard, or built into the north bridge chip of the motherboard. The GPU may also be built into the Central Processing Unit (CPU).
For example, memory 110 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory (or the like). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer instructions may be stored on the memory 110 and executed by the processor 120 to implement various functions. Various applications and various data, such as words, first audio data, first acoustic models, second acoustic models, pronunciation diagnosis results, second audio data, and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, some of the computer instructions stored in the memory 110, when executed by the processor 120, may perform one or more steps of the sound correction method described above.
For example, as shown in fig. 8, the sound correction apparatus 100 may further include an input interface 140 that allows an external device to communicate with the sound correction apparatus 100. For example, the input interface 140 may be used to receive instructions from an external computer device, from a user, and the like. Sound correction apparatus 100 may also include an output interface 150 that interconnects sound correction apparatus 100 and one or more external devices. For example, the sound correction device 100 can output the sound correction result and the like through the output interface 150. External devices that communicate with sound correction apparatus 100 through input interface 140 and output interface 150 may be included in an environment that provides any type of user interface with which a user may interact. Examples of user interface types include graphical user interfaces, natural user interfaces, and the like. For example, a graphical user interface may accept input from a user using an input device such as a keyboard, mouse, remote control, etc., and provide output on an output device such as a display. Furthermore, a natural user interface may enable a user to interact with sound correction apparatus 100 in a manner that does not require the constraints imposed by input devices such as keyboards, mice, remote controls, and the like. Instead, natural user interfaces may rely on speech recognition, touch and stylus recognition, gesture recognition on and near the screen, air gestures, head and eye tracking, speech and semantics, vision, touch, gestures, and machine intelligence, among others.
For example, in some embodiments, the sound correction device 100 may further include an audio capture device (not shown in fig. 8). For example, the audio acquisition apparatus may be an audio acquisition module or device described in the foregoing embodiments of the sound correction method, such as a microphone that is built in or externally connected to the client.
In addition, although the sound correction apparatus 100 is illustrated as a single system in fig. 8, it is understood that the sound correction apparatus 100 may be a distributed system, and may be arranged as a cloud facility (including a public cloud or a private cloud). Thus, for example, several devices may communicate over a network connection and may collectively perform the tasks described as being performed by the sound correction apparatus 100. For example, in some embodiments, the word and the first audio data may be obtained by the client and uploaded to the server; the server returns the pronunciation diagnosis result to the client after executing the pronunciation diagnosis operation process so as to provide the pronunciation diagnosis result for the user, and the server can further provide a sound correction guidance operation; and then, second audio data can be acquired through the client and uploaded to the server, and the server executes pronunciation diagnosis operation on the second audio data and provides feedback.
For example, for detailed description of the implementation process of the sound correction method, reference may be made to the related description in the above embodiment of the sound correction method, and repeated parts are not described herein again.
For example, in some examples, the sound correction apparatus may include, but is not limited to, a smartphone, a tablet, a Personal computer, a Personal Digital Assistant (PDA), a wearable device, a head-mounted display device, a swipe pen, a point-and-click pen, a server, and so forth.
It should be noted that the sound correcting device provided in the embodiment of the present disclosure is illustrative and not restrictive, and the sound correcting device may further include other conventional components or structures according to practical application needs, for example, in order to implement the necessary function of the sound correcting device, a person skilled in the art may set other conventional components or structures according to a specific application scenario, and the embodiment of the present disclosure is not limited thereto.
For technical effects of the sound correction device provided by the embodiment of the present disclosure, reference may be made to corresponding descriptions about the sound correction method in the foregoing embodiments, and details are not repeated here.
At least some embodiments of the present disclosure also provide a non-transitory storage medium. Fig. 9 is a schematic block diagram of a non-transitory storage medium provided by some embodiments of the present disclosure. For example, as shown in fig. 9, the non-transitory storage medium 200 stores computer-readable instructions 201 in a non-transitory manner, and when the computer-readable instructions 201 are executed by a computer (including a processor), the sound correction method provided by any embodiment of the present disclosure can be performed.
For example, one or more computer readable instructions may be stored on the non-transitory storage medium 200. Some of the computer readable instructions stored on the non-transitory storage medium 200 may be, for example, instructions for implementing one or more steps of the sound correction method described above.
For example, the non-transitory storage medium may include a storage component of a tablet computer, a hard disk of a personal computer, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a compact disc read only memory (CD-ROM), a flash memory, or any combination of the above storage media, as well as other suitable storage media.
For technical effects of the non-transitory storage medium provided by the embodiments of the present disclosure, reference may be made to corresponding descriptions about the sound correction method in the foregoing embodiments, and details are not repeated here.
For the present disclosure, there are the following points to be explained:
(1) in the drawings of the embodiments of the present disclosure, only the structures related to the embodiments of the present disclosure are referred to, and other structures may refer to general designs.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (20)

1. A method of correcting a tone, comprising:
acquiring a word and first audio data;
performing a pronunciation diagnosis operation on the first audio data based on the word to generate a pronunciation diagnosis result;
wherein the standard pronunciation of the word comprises at least one standard phoneme;
wherein performing the pronunciation diagnosis operation on the first audio data based on the word to generate the pronunciation diagnosis result comprises:
aligning the first audio data with the standard pronunciation based on a first acoustic model to obtain a time boundary of each standard phoneme in the standard pronunciation in the first audio data;
determining a score of each standard phoneme according to the audio segment determined by the time boundary of each standard phoneme;
performing a recognition operation on the first audio data based on a second acoustic model to obtain a decoded phoneme sequence and a time boundary of each decoded phoneme in the decoded phoneme sequence in the first audio data, wherein the decoded phoneme sequence comprises at least one decoded phoneme;
determining a score for each decoded phoneme according to the audio segment determined by the time boundary of each decoded phoneme;
determining a correspondence between each of the standard phonemes in the standard pronunciation and each of the decoded phonemes in the sequence of decoded phonemes; and
generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme.
2. The sound correction method according to claim 1, wherein determining the correspondence between each of the standard phonemes in the standard pronunciation and each of the decoded phonemes in the decoded phoneme sequence comprises:
performing an edit distance operation on the standard pronunciation and the decoded phoneme sequence, with phonemes as the edit elements, to determine the correspondence.
3. The sound correction method according to claim 2, wherein the edit distance operation includes a phoneme substitution operation, and weights of the phoneme substitution operation between different phonemes are at least not all identical.
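The correspondence of claims 2 and 3 can be pictured as a weighted edit distance over phoneme sequences. The sketch below is only an illustration, not the patented implementation: the phoneme labels, the example substitution weights, and the helper names (`phoneme_edit_align`, `SUBSTITUTION_WEIGHTS`) are assumptions; the claims only require that substitution weights differ between at least some phoneme pairs.

```python
# Illustrative sketch of claims 2-3: align a standard phoneme sequence with a
# decoded phoneme sequence using an edit distance whose substitution weights
# differ between phoneme pairs. All weights below are made-up examples.

INSERT_COST = 1.0   # decoded phoneme with no standard counterpart (extra phoneme)
DELETE_COST = 1.0   # standard phoneme with no decoded counterpart (missed phoneme)

# Hypothetical substitution weights: acoustically close phonemes substitute cheaply.
SUBSTITUTION_WEIGHTS = {
    ("IY", "IH"): 0.4,
    ("AE", "EH"): 0.5,
    ("S", "Z"): 0.4,
}

def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUBSTITUTION_WEIGHTS.get((a, b), SUBSTITUTION_WEIGHTS.get((b, a), 1.0))

def phoneme_edit_align(standard: list[str], decoded: list[str]):
    """Return (total_cost, alignment); alignment pairs standard phonemes with
    decoded phonemes, with None marking a missed or extra phoneme."""
    n, m = len(standard), len(decoded)
    # dp[i][j] = minimal cost of aligning standard[:i] with decoded[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + DELETE_COST
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + INSERT_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + substitution_cost(standard[i - 1], decoded[j - 1]),
                dp[i - 1][j] + DELETE_COST,
                dp[i][j - 1] + INSERT_COST,
            )
    # Backtrace to recover the phoneme-level correspondence.
    alignment, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + substitution_cost(standard[i - 1], decoded[j - 1]):
            alignment.append((standard[i - 1], decoded[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + DELETE_COST:
            alignment.append((standard[i - 1], None))   # missed phoneme
            i -= 1
        else:
            alignment.append((None, decoded[j - 1]))    # extra phoneme
            j -= 1
    return dp[n][m], list(reversed(alignment))

# Example: standard "bed" /B EH D/ decoded as /B AE D/ -> EH aligned to AE.
print(phoneme_edit_align(["B", "EH", "D"], ["B", "AE", "D"]))
```

Giving acoustically similar phoneme pairs smaller substitution weights makes the alignment prefer plausible confusions over insertions and deletions, which is one way of realizing the "not exactly the same" weights of claim 3.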
4. The sound correction method according to any one of claims 1 to 3, wherein generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme and the score of each decoded phoneme comprises:
in response to any one of the standard phonemes having a corresponding decoded phoneme, determining whether a score of the standard phoneme is lower than a first score threshold;
in response to the score of the standard phoneme being lower than the first score threshold, calculating a boundary coincidence degree between the standard phoneme and its corresponding decoded phoneme according to the time boundary of the standard phoneme and the time boundary of the corresponding decoded phoneme; and
in response to the standard phoneme being different from its corresponding decoded phoneme and the boundary coincidence degree between the standard phoneme and the corresponding decoded phoneme being not less than a coincidence degree threshold, indicating in the pronunciation diagnosis result that a misreading condition has occurred for the standard phoneme.
5. The sound correction method according to claim 4, wherein generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme further comprises:
determining whether a difference between the score of the decoded phoneme corresponding to the standard phoneme and the score of the standard phoneme is not less than a second score threshold; and
in response to the difference between the score of the decoded phoneme corresponding to the standard phoneme and the score of the standard phoneme being not less than the second score threshold, indicating in the misreading condition that the standard phoneme is misread as the corresponding decoded phoneme.
6. The sound correction method according to claim 4, wherein the boundary coincidence degree is calculated according to the following formula:
BC = (min(y1, y2) - max(x1, x2)) / (max(y1, y2) - min(x1, x2))
where BC denotes the boundary coincidence degree, x1 and y1 denote a start time boundary and an end time boundary of the standard phoneme, respectively, x2 and y2 denote a start time boundary and an end time boundary of the decoded phoneme, respectively, min() is a minimum function, and max() is a maximum function.
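A compact sketch of the misreading check in claims 4 to 6 follows. It is an illustration under stated assumptions: the coincidence degree is computed here as intersection over union of the two time intervals, which matches the variables listed above but is an assumed normalization, and all threshold values and helper names are made up for the example.

```python
# Illustrative sketch of claims 4-6: decide whether a standard phoneme was
# misread, using its score, the time overlap with the corresponding decoded
# phoneme, and the score difference. All thresholds are made-up examples.

def boundary_coincidence(x1: float, y1: float, x2: float, y2: float) -> float:
    """Overlap degree of intervals [x1, y1] (standard) and [x2, y2] (decoded),
    taken here as intersection over union of the two intervals."""
    intersection = min(y1, y2) - max(x1, x2)
    union = max(y1, y2) - min(x1, x2)
    return max(intersection, 0.0) / union if union > 0 else 0.0

def diagnose_misreading(std_phone, dec_phone, std_score, dec_score,
                        std_bounds, dec_bounds,
                        first_score_threshold=60.0,
                        coincidence_threshold=0.5,
                        second_score_threshold=20.0):
    """Diagnose one standard phoneme that has a corresponding decoded phoneme
    (claim 4), optionally naming the phoneme it was misread as (claim 5)."""
    if std_score >= first_score_threshold:
        return {"phoneme": std_phone, "misread": False}
    bc = boundary_coincidence(*std_bounds, *dec_bounds)
    if std_phone != dec_phone and bc >= coincidence_threshold:
        result = {"phoneme": std_phone, "misread": True}
        if dec_score - std_score >= second_score_threshold:
            result["misread_as"] = dec_phone
        return result
    return {"phoneme": std_phone, "misread": False}

# Example: standard /IY/ scored poorly and overlaps a confidently decoded /IH/.
print(diagnose_misreading("IY", "IH", 35.0, 80.0, (0.10, 0.25), (0.12, 0.27)))
```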
7. The sound correction method according to any one of claims 1 to 3, wherein generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme and the score of each decoded phoneme comprises:
in response to any one of the standard phonemes not having a corresponding decoded phoneme, indicating in the pronunciation diagnosis result that a misreading condition has occurred for the standard phoneme.
8. The sound correction method according to any one of claims 1 to 3, wherein generating the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme and the score of each decoded phoneme comprises:
in response to any one of the decoded phonemes not having a corresponding standard phoneme, indicating in the pronunciation diagnosis result that a multi-reading condition has occurred.
9. The sound correction method according to any one of claims 1 to 3, wherein the generating of the pronunciation diagnosis result based on the correspondence, the score of each standard phoneme, and the score of each decoded phoneme further comprises:
in response to the score of the decoded phoneme that has no corresponding standard phoneme being not lower than a third score threshold, indicating in the multi-reading condition that the decoded phoneme is multi-read.
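Claims 7 to 9 handle the two remaining alignment cases: a standard phoneme with no decoded counterpart, and a decoded phoneme with no standard counterpart. A minimal sketch under the same assumptions as above follows; the alignment format and the threshold value are illustrative only.

```python
# Illustrative sketch of claims 7-9: walk the phoneme-level alignment and flag
# standard phonemes with no decoded counterpart and extra (multi-read) decoded
# phonemes. `alignment` pairs (standard_phoneme, decoded_phoneme); None marks a gap.

def diagnose_missing_and_extra(alignment, decoded_scores,
                               third_score_threshold=60.0):
    diagnosis = []
    dec_index = 0
    for std_phone, dec_phone in alignment:
        if dec_phone is None:
            # Standard phoneme with no corresponding decoded phoneme (claim 7).
            diagnosis.append({"phoneme": std_phone, "error": "misread"})
            continue
        if std_phone is None:
            # Decoded phoneme with no corresponding standard phoneme (claim 8);
            # report it as multi-read only if it was decoded confidently (claim 9).
            if decoded_scores[dec_index] >= third_score_threshold:
                diagnosis.append({"phoneme": dec_phone, "error": "multi-read"})
        dec_index += 1
    return diagnosis

# Example: /T/ was never pronounced, and an extra /AH/ was confidently decoded.
alignment = [("S", "S"), ("T", None), ("AA", "AA"), (None, "AH"), ("P", "P")]
decoded_scores = [85.0, 90.0, 75.0, 80.0]   # one score per decoded phoneme, in order
print(diagnose_missing_and_extra(alignment, decoded_scores))
```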
10. The sound correction method according to any one of claims 1-3, wherein performing the pronunciation diagnosis operation on the first audio data based on the word to generate the pronunciation diagnosis result further comprises:
determining a time boundary of a vowel phoneme in a stressed syllable of the standard pronunciation based on the time boundary of each standard phoneme in the standard pronunciation and the stressed syllable of the standard pronunciation;
extracting feature information of a first audio segment determined by the time boundary of the vowel phoneme in the stressed syllable;
determining, through a classification model and based on the feature information of the first audio segment, whether the stressed syllable is stressed; and
indicating in the pronunciation diagnosis result that the stressed syllable is not stressed in response to the stressed syllable being determined not to be stressed.
11. The sound correction method according to any one of claims 1-3, wherein performing the pronunciation diagnosis operation on the first audio data based on the word to generate the pronunciation diagnosis result further comprises:
determining a time boundary of a vowel phoneme in a non-stressed syllable of the standard pronunciation based on the time boundary of each standard phoneme in the standard pronunciation and the non-stressed syllable of the standard pronunciation;
extracting feature information of a second audio segment determined by the time boundary of the vowel phoneme in the non-stressed syllable;
determining, through a classification model and based on the feature information of the second audio segment, whether the non-stressed syllable is stressed; and
indicating in the pronunciation diagnosis result that the non-stressed syllable is stressed in response to the non-stressed syllable being determined to be stressed.
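Claims 10 and 11 apply the same check in both directions: extract features from the vowel segment of a syllable and ask a classifier whether it sounds stressed. The claims do not specify the features or the classifier, so the sketch below uses simple prosodic features (duration, RMS energy, a crude pitch proxy) and a fixed logistic unit with made-up weights as stand-ins.

```python
import numpy as np

# Illustrative sketch of claims 10-11: decide whether the vowel segment of a
# syllable sounds stressed. The features and classifier weights below are
# placeholders, not the patented model.

def vowel_features(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Extract simple prosodic features from the vowel's audio segment."""
    duration = len(samples) / sample_rate
    rms_energy = float(np.sqrt(np.mean(samples ** 2)))
    # Crude pitch proxy: zero-crossing rate (a real system would track F0).
    zero_crossings = float(np.mean(np.abs(np.diff(np.sign(samples)))) / 2.0)
    return np.array([duration, rms_energy, zero_crossings])

def is_stressed(features: np.ndarray) -> bool:
    """Tiny stand-in for the classification model: a fixed logistic unit.
    The weights are made up for illustration only."""
    weights = np.array([8.0, 5.0, -1.0])
    bias = -2.0
    score = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
    return score >= 0.5

def check_syllable(samples, sample_rate, expected_stressed: bool):
    """Claim 10: flag an expected-stressed syllable that is not stressed.
    Claim 11: flag an expected-unstressed syllable that is stressed."""
    stressed = is_stressed(vowel_features(samples, sample_rate))
    if expected_stressed and not stressed:
        return "stressed syllable is not stressed"
    if not expected_stressed and stressed:
        return "non-stressed syllable is stressed"
    return None

# Example with a synthetic vowel segment (0.2 s of a 200 Hz tone),
# checked against an expectation of being unstressed (claim 11 path).
sr = 16000
t = np.arange(int(0.2 * sr)) / sr
segment = 0.3 * np.sin(2 * np.pi * 200 * t)
print(check_syllable(segment, sr, expected_stressed=False))
```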
12. The sound correction method according to any one of claims 1-3, wherein the score of each of the standard phonemes and the score of each of the decoded phonemes are determined based on a pronunciation accuracy algorithm.
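Claim 12 leaves the pronunciation accuracy algorithm unspecified. One common choice in pronunciation assessment is a Goodness of Pronunciation (GOP) style score derived from frame-level phoneme posteriors; the sketch below assumes such posteriors are already available from the acoustic model and is only one possible realization, not necessarily the patented one. The mapping to a 0-100 score is likewise an illustrative choice.

```python
import numpy as np

# Illustrative GOP-style pronunciation accuracy for one phoneme segment.
# `posteriors` has shape (num_frames, num_phonemes): per-frame phoneme
# posterior probabilities for the frames inside the phoneme's time boundary.

def gop_score(posteriors: np.ndarray, target_index: int) -> float:
    eps = 1e-10
    target = np.log(posteriors[:, target_index] + eps)
    best = np.log(posteriors.max(axis=1) + eps)
    # Average log-posterior of the expected phoneme minus that of the best
    # competing phoneme; 0 means the expected phoneme wins every frame.
    return float(np.mean(target - best))

def to_percentage(gop: float, scale: float = 2.0) -> float:
    """Map a (non-positive) GOP value to a 0-100 score; `scale` is illustrative."""
    return 100.0 * float(np.exp(gop / scale))

# Example: three frames, three phonemes; the target phoneme (index 0) dominates.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.8, 0.1, 0.1]])
print(to_percentage(gop_score(posteriors, target_index=0)))
```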
13. The sound correction method according to any one of claims 1-3, further comprising:
performing sound correction guidance according to the pronunciation diagnosis result.
14. The sound correction method according to claim 13, wherein performing the sound correction guidance according to the pronunciation diagnosis result comprises:
in response to a sound correction operation, displaying the standard pronunciation of the word, the pronunciation diagnosis result, and a text guide, wherein the text guide is used for guiding a user to produce a correct pronunciation.
15. The sound correction method according to claim 13, wherein performing the sound correction guidance according to the pronunciation diagnosis result further comprises:
playing the text guide by voice synchronously while the text guide is displayed.
16. The sound correction method according to claim 13, further comprising:
acquiring second audio data for the word, and providing practice feedback for the second audio data.
17. The sound correction method according to any one of claims 1-3, wherein the pronunciation diagnosis result includes at least one of a syllable error and a pronunciation error,
the syllable error includes at least one of a syllable quantity error and a stress error,
the pronunciation error includes at least one of a misread vowel phoneme and a misread consonant phoneme.
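One way to picture the diagnosis result taxonomy of claim 17 is as a small data structure. The field names and example values below are illustrative assumptions, not taken from the patent; the claim only requires that syllable errors (count or stress) and pronunciation errors (misread phonemes) be distinguishable.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative container for the pronunciation diagnosis result of claim 17.

@dataclass
class SyllableError:
    kind: str            # "syllable_count" or "stress"
    detail: str          # e.g. "stressed syllable is not stressed"

@dataclass
class PronunciationError:
    phoneme: str         # the standard phoneme concerned
    kind: str            # e.g. "misread_vowel" or "misread_consonant"
    misread_as: Optional[str] = None

@dataclass
class PronunciationDiagnosis:
    word: str
    syllable_errors: list[SyllableError] = field(default_factory=list)
    pronunciation_errors: list[PronunciationError] = field(default_factory=list)

# Example: "tomato" with a stress error and one misread vowel.
diagnosis = PronunciationDiagnosis(
    word="tomato",
    syllable_errors=[SyllableError("stress", "stressed syllable 'ma' is not stressed")],
    pronunciation_errors=[PronunciationError("EY", "misread_vowel", misread_as="AE")],
)
print(diagnosis)
```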
18. A sound correction apparatus comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform the sound correction method according to any one of claims 1-17.
19. The sound correction apparatus according to claim 18, further comprising:
an audio acquisition device for acquiring the first audio data.
20. A non-transitory storage medium that non-transitorily stores computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, are capable of performing the sound correction method according to any one of claims 1-17.
CN202111283587.8A 2021-11-01 2021-11-01 Sound correction method, sound correction device and non-transient storage medium Pending CN113990351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111283587.8A CN113990351A (en) 2021-11-01 2021-11-01 Sound correction method, sound correction device and non-transient storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111283587.8A CN113990351A (en) 2021-11-01 2021-11-01 Sound correction method, sound correction device and non-transient storage medium

Publications (1)

Publication Number Publication Date
CN113990351A true CN113990351A (en) 2022-01-28

Family

ID=79745401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111283587.8A Pending CN113990351A (en) 2021-11-01 2021-11-01 Sound correction method, sound correction device and non-transient storage medium

Country Status (1)

Country Link
CN (1) CN113990351A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116206496A (en) * 2023-01-30 2023-06-02 齐齐哈尔大学 Oral english practice analysis compares system based on artificial intelligence
CN116206496B (en) * 2023-01-30 2023-08-18 齐齐哈尔大学 Oral english practice analysis compares system based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN109036464B (en) Pronunciation error detection method, apparatus, device and storage medium
US8793118B2 (en) Adaptive multimodal communication assist system
CN103714048B (en) Method and system for correcting text
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
US9548052B2 (en) Ebook interaction using speech recognition
CN109256152A (en) Speech assessment method and device, electronic equipment, storage medium
KR20190125154A (en) An apparatus for machine learning the psychological counseling data and a method thereof
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN112397056B (en) Voice evaluation method and computer storage medium
US20210050004A1 (en) Method and system using phoneme embedding
KR102225435B1 (en) Language learning-training system based on speech to text technology
Lee Language-independent methods for computer-assisted pronunciation training
CN106537489B (en) Method and system for recognizing speech comprising word sequences
CN112599129B (en) Speech recognition method, apparatus, device and storage medium
CN110503956A (en) Audio recognition method, device, medium and electronic equipment
CN113990351A (en) Sound correction method, sound correction device and non-transient storage medium
KR20200140171A (en) Electronic device and Method for controlling the electronic device thereof
US20150127352A1 (en) Methods, Systems, and Tools for Promoting Literacy
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
Bang et al. An automatic feedback system for English speaking integrating pronunciation and prosody assessments
CN111899581A (en) Word spelling and reading exercise device and method for English teaching
CN113707178B (en) Audio evaluation method and device and non-transient storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination