CN113393864A - Spoken language pronunciation correction method, device, equipment and storage medium - Google Patents
- Publication number
- CN113393864A (application CN202110652097.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- word
- phoneme
- file
- pronunciation
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Educational Administration (AREA)
- Business, Economics & Management (AREA)
- Educational Technology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
An embodiment of the invention discloses a spoken language pronunciation correction method, device, equipment and storage medium. The method comprises: obtaining a read-aloud text and the read-aloud speech corresponding to it; determining, according to the read-aloud speech and a preset spoken language scoring standard, a target word in the read-aloud text and a target phoneme in the target word; and generating a pronunciation correction word file according to the target word and the target phoneme, the pronunciation correction word file being a standard pronunciation file of the target word in which the target phoneme has undergone variable-speed processing. This technical solution addresses the problem that existing spoken language practice software can only perform a simple spoken language evaluation and therefore struggles to correct pronunciation in a targeted way; it makes the correction of misread pronunciations in spoken language practice more targeted and the practice more effective.
Description
Technical Field
The embodiments of the invention relate to the technical field of language teaching, and in particular to a spoken language pronunciation correction method, device, equipment and storage medium.
Background
With increasing global economic integration, communication among countries keeps growing. As the most widely used language for international communication, English is becoming ever more important in Chinese education. Early childhood is a critical period, and the optimal one, for learning a language. As a foundation of early education, children's English education has become a focus of public attention; with economic development and growing cultural exchange, expectations for language education keep rising, and parents pay more and more attention to their children's spoken English.
At present, children's English education is delivered in two main forms, offline and online. With the rapid development of internet technology, online English education is gradually gaining ground, and at this stage AI and big-data technology are commonly used to cover the full cycle of "teaching, learning, practicing, testing and evaluating", so as to accompany children throughout their English learning.
However, in current AI-based online English education and in the children's read-along robot products on the market, spoken English practice mainly consists of scoring through pronunciation evaluation or follow-along reading of standard speech. Such products can only perform a simple spoken evaluation that tells the user whether the pronunciation is standard, and therefore find it difficult to offer children targeted pronunciation correction.
Disclosure of Invention
The invention provides a spoken language pronunciation correction method, device, equipment and storage medium. When a child's spoken pronunciation is captured, the misread words are extracted and a pronunciation correction word file is generated in which the misread syllables are highlighted and reinforced, so that the child can quickly find and correct the misreadings and spoken language practice becomes more effective.
In a first aspect, an embodiment of the present invention provides a spoken language pronunciation correction method, including:
obtaining a read-aloud text and the read-aloud speech corresponding to the read-aloud text;
determining a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud speech and a preset spoken language scoring standard;
generating a pronunciation correction word file according to the target word and the target phoneme;
wherein the pronunciation correction word file is a standard pronunciation file obtained by performing variable-speed processing on the target phoneme in the target word.
Further, the read-aloud text consists of at least one word, and after the read-aloud text and the corresponding read-aloud speech are obtained, the method further includes:
dividing the read-aloud speech into at least one speech word according to the read-aloud text, each speech word corresponding one-to-one to a word in the read-aloud text;
and dividing each speech word into phonemes according to a preset pronunciation dictionary, and determining the phoneme set corresponding to each speech word.
Further, determining a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud speech and a preset spoken language scoring standard includes:
scoring each phoneme in the phoneme set of each speech word according to the preset spoken language scoring standard, and determining the phoneme score of each phoneme;
determining the phonemes whose phoneme scores are below a preset score threshold as target phonemes;
and determining the speech word containing a target phoneme as a target word in the read-aloud text.
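The selection rule in this claim (phonemes scoring below a preset threshold become target phonemes, and the words containing them become target words) can be sketched as follows. This is a minimal illustration, assuming the phoneme scores are already available and that each score carries the index of its word and its position inside the word; the `ScoredPhoneme` type, the threshold of 60 and the scores are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPhoneme:
    word_index: int   # position of the speech word in the read-aloud text
    phone_index: int  # position of the phoneme inside that word
    phone: str        # CMU-style label, e.g. "ay1"
    score: float      # phoneme score from the spoken language scoring standard


def select_targets(words, scored, threshold):
    """Return [(word_index, word, [(phone_index, phone), ...]), ...] in text order,
    keeping only phonemes whose score is below `threshold`."""
    by_word = {}
    for p in scored:
        if p.score < threshold:
            by_word.setdefault(p.word_index, []).append((p.phone_index, p.phone))
    return [(i, words[i], sorted(by_word[i])) for i in sorted(by_word)]


words = ["welcome", "to", "China"]
scored = [  # made-up scores for illustration
    ScoredPhoneme(0, 0, "w", 92), ScoredPhoneme(0, 1, "eh1", 88),
    ScoredPhoneme(1, 1, "uw1", 95),
    ScoredPhoneme(2, 1, "ay1", 41), ScoredPhoneme(2, 3, "ah0", 85),
]
print(select_targets(words, scored, threshold=60))
# [(2, 'China', [(1, 'ay1')])]
```

Keying on the word index rather than the word string keeps repeated words in the text apart, and the phoneme position is what a later step needs to locate the target syllable.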
Further, generating a pronunciation correction word file according to the target word and the target phoneme includes:
generating a standard pronunciation file of the target word by speech synthesis;
performing forced alignment on each phoneme in the target word, and determining the boundary positions of the syllable containing each phoneme in the standard pronunciation file;
determining the target syllable boundary positions of the target phoneme in the standard pronunciation file, and taking the syllable between the target syllable boundary positions as the target syllable;
and lengthening the target syllable, and taking the standard pronunciation file with the lengthened target syllable as the pronunciation correction word file.
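The boundary lookup and lengthening steps above can be sketched in Python. This is a minimal sketch, assuming a forced aligner has already produced phone-level timestamps for the synthesized word (the hypothetical `alignment` list) and that the word's syllabification is known (the hypothetical `syllables` list, for example taken from "-" boundaries as in the "welcome to China" example later in this document); all function names are illustrative. For the lengthening itself it swaps in plain overlap-add (OLA) time stretching, which the patent does not name; WSOLA or PSOLA would give fewer artifacts on real speech.

```python
import math


def ola_stretch(segment, factor, frame=256):
    """Time-stretch `segment` by `factor` with plain overlap-add (OLA).

    Frames are read every hs / factor samples and written every hs samples,
    so the output is about `factor` times longer. Each frame keeps its local
    waveform, so pitch is roughly preserved (unlike plain resampling), though
    plain OLA can add phasing artifacts.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(segment)
    out_len = int(round(n * factor))
    if n < 2:
        return [segment[0]] * out_len if segment else []
    frame = min(frame, n)
    frame -= frame % 2                       # even, so the 50 % hop is exact
    hs = frame // 2                          # synthesis hop
    ha = hs / factor                         # analysis hop
    # Half-sample-shifted Hann window: never exactly zero, and copies spaced
    # hs apart sum to 1, so the normalisation below never divides by zero.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 0.5) / frame) for i in range(frame)]
    out = [0.0] * (out_len + frame)
    norm = [0.0] * (out_len + frame)
    k = 0
    while k * hs < out_len:
        a = min(int(round(k * ha)), n - frame)   # keep the read frame inside the segment
        s = k * hs
        for i in range(frame):
            out[s + i] += win[i] * segment[a + i]
            norm[s + i] += win[i]
        k += 1
    return [out[j] / norm[j] for j in range(out_len)]


def find_syllable_span(alignment, syllables, target_phone):
    """Return (start_s, end_s) of the syllable containing phone index `target_phone`.

    alignment: [(phone, start_s, end_s), ...] for the target word (forced alignment).
    syllables: one list of phone indices per syllable.
    """
    for syl in syllables:
        if target_phone in syl:
            return alignment[syl[0]][1], alignment[syl[-1]][2]
    raise ValueError("target phoneme is not in any syllable")


def lengthen_target_syllable(samples, sr, alignment, syllables, target_phone, factor=2.0):
    """Splice a lengthened copy of the target syllable back into the word audio."""
    start_s, end_s = find_syllable_span(alignment, syllables, target_phone)
    a, b = int(round(start_s * sr)), int(round(end_s * sr))
    return samples[:a] + ola_stretch(samples[a:b], factor) + samples[b:]


sr = 16000
word = [0.3 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]  # stand-in for synthesized "China"
alignment = [("ch", 0.00, 0.10), ("ay1", 0.10, 0.45), ("n", 0.45, 0.60), ("ah0", 0.60, 1.00)]
syllables = [[0, 1], [2, 3]]                 # ch ay1 - n ah0
out = lengthen_target_syllable(word, sr, alignment, syllables, target_phone=1)
print(len(word), len(out))                   # 16000 23200
```

Stretching only the span between the target syllable boundaries leaves the rest of the word at its normal speed, which is what makes the misread part stand out.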
Further, generating a pronunciation correction word file according to the target word and the target phoneme includes:
inputting the target word and the target phoneme into a preset variable-speed speech synthesis model;
and taking the output of the preset variable-speed speech synthesis model as the pronunciation correction word file, the pronunciation correction word file being a standard pronunciation file in which the syllable of the target phoneme in the target word has been lengthened.
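The patent does not say how the preset variable-speed speech synthesis model works internally. One plausible realisation, offered here only as an assumption, is a duration-controllable text-to-speech model in which a duration predictor outputs a frame count per phoneme and a length regulator expands each phoneme's hidden state to that many frames, as in FastSpeech. The sketch below shows only that duration-control part: the target phoneme's predicted duration is multiplied before length regulation. The function names, the factor of 2 and the frame counts are hypothetical; the hidden states are stand-in strings, and a real model would pass the expanded sequence on to its decoder and vocoder.

```python
def scale_durations(durations, target_indices, factor=2.0):
    """Lengthen the predicted durations (in frames) of the target phonemes."""
    targets = set(target_indices)
    return [max(1, int(round(d * factor))) if i in targets else d
            for i, d in enumerate(durations)]


def length_regulate(phone_states, durations):
    """Repeat each phoneme's hidden state `duration` times (FastSpeech-style)."""
    frames = []
    for state, d in zip(phone_states, durations):
        frames.extend([state] * d)
    return frames


phones = ["ch", "ay1", "n", "ah0"]           # stand-ins for per-phoneme hidden states
predicted = [6, 14, 5, 12]                   # made-up frame counts
slowed = scale_durations(predicted, target_indices=[1])
print(slowed)                                # [6, 28, 5, 12]
print(len(length_regulate(phones, slowed)))  # 51
```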
Further, after the read-aloud text is obtained, the method further includes:
generating standard read-aloud speech according to the read-aloud text;
and playing the standard read-aloud speech.
Further, the preset spoken language scoring standard is the Goodness of Pronunciation (GOP) scoring standard.
In a second aspect, an embodiment of the present invention further provides a spoken language pronunciation correction device, including:
a text and speech acquisition module, configured to obtain a read-aloud text and the read-aloud speech corresponding to the read-aloud text;
a target determination module, configured to determine a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud speech and a preset spoken language scoring standard;
a correction file generation module, configured to generate a pronunciation correction word file according to the target word and the target phoneme, the pronunciation correction word file being a standard pronunciation file obtained by performing variable-speed processing on the target phoneme in the target word.
In a third aspect, an embodiment of the present invention further provides spoken language pronunciation correction equipment, including:
a storage device and one or more processors;
the storage device is configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the spoken language pronunciation correction method according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the spoken language pronunciation correction method according to the first aspect.
With the spoken language pronunciation correction method, device, equipment and storage medium provided by the embodiments of the invention, a read-aloud text and the corresponding read-aloud speech are obtained; a target word in the read-aloud text and a target phoneme in the target word are determined according to the read-aloud speech and a preset spoken language scoring standard; and a pronunciation correction word file, namely a standard pronunciation file in which the target phoneme of the target word has undergone variable-speed processing, is generated according to the target word and the target phoneme. With this technical solution, when the user reads along with the provided read-aloud text, the corresponding read-aloud speech is captured, and the misread target words in it, together with the specific target phonemes misread within them, are determined according to the preset spoken language scoring standard. A standard pronunciation file of each target word, with variable-speed processing applied to its target phonemes, is then generated and provided to the user as the pronunciation correction word file, so that the user can quickly find the misread words and correct them by following the pronunciation in the file. This solves the problem that existing spoken language practice software can only perform a simple spoken language evaluation and struggles to correct pronunciation in a targeted way, making the correction of misread pronunciations more targeted and spoken language practice more effective.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the present invention and should not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 is a flowchart of a spoken language pronunciation correction method according to the first embodiment of the present invention;
FIG. 2 is a flowchart of a spoken language pronunciation correction method according to the second embodiment of the present invention;
FIG. 3 is an example of an interface for scoring the phonemes in the read-aloud speech according to the preset spoken language scoring standard in the second embodiment of the present invention;
FIG. 4 is a schematic flowchart of generating a pronunciation correction word file according to the target word and the target phoneme in the second embodiment of the present invention;
FIG. 5 is another schematic flowchart of generating a pronunciation correction word file according to the target word and the target phoneme in the second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a spoken language pronunciation correction device according to the third embodiment of the present invention;
FIG. 7 is a schematic structural diagram of spoken language pronunciation correction equipment according to the fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it should be understood that the terms "first", "second", "third" and the like are used only to distinguish one item from another; they do not necessarily describe a particular order or sequence and are not to be construed as indicating or implying relative importance. Those skilled in the art can understand the specific meanings of these terms in the present invention according to the specific situation. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes an association between objects and covers three cases; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
Example one
FIG. 1 is a flowchart of a spoken language pronunciation correction method according to the first embodiment of the present invention. This embodiment is applicable when a user needs targeted correction of mispronounced words during spoken language practice. The method may be executed by a spoken language pronunciation correction device, which may be implemented in software and/or hardware and configured on a computer device; the computer device may consist of a single physical entity or of two or more physical entities.
As shown in FIG. 1, the spoken language pronunciation correction method provided in this embodiment specifically includes the following steps:
S101, obtaining the read-aloud text and the read-aloud speech corresponding to the read-aloud text.
In this embodiment, the read-aloud text is a text file, made up of a number of words, that software or a device with a spoken language pronunciation correction function provides for the user to read aloud. The read-aloud speech is the sound file, captured by that software or device, of the user reading the read-aloud text along.
Specifically, the text file the user is to read at the current moment is obtained and taken as the read-aloud text; the sound file produced by the user reading it is received and taken as the read-aloud speech corresponding to the read-aloud text.
Optionally, the read-aloud text may be provided to the user at random by the software or device, chosen by the user according to actual needs, or matched automatically to the user by the software or device based on the user's learning history; any text selection logic may be used. The read-aloud text may be stored in the software or device, imported by the user, or downloaded over a network, which is not limited in the embodiments of the present invention.
S102, determining a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud speech and a preset spoken language scoring standard.
The read-aloud text consists of at least one word.
The preset spoken language scoring standard is the Goodness of Pronunciation (GOP) scoring standard.
In this embodiment, the preset spoken language scoring standard is a criterion for scoring whether the pronunciation of each word in the captured read-aloud speech is standard. Its finest scoring granularity can be the phoneme, so the pronunciation of every phoneme in every word can be scored. Optionally, the preset spoken language scoring standard is the Goodness of Pronunciation (GOP) scoring standard, proposed by Silke Witt of the University of Cambridge in her doctoral thesis; most other spoken language scoring standards are similar to or derived from GOP. The preset spoken language scoring standard may therefore be any existing scoring standard whose finest granularity reaches the phoneme level, which is not limited in the embodiments of the present invention; the GOP scoring standard is used here only as an example.
In this embodiment, a target word is a word in the read-aloud text whose counterpart in the read-aloud speech contains a pronunciation error. A target phoneme is a phoneme in the target word that was mispronounced.
Specifically, each word in the captured read-aloud speech is scored according to the preset spoken language scoring standard. Because the finest granularity of this standard is the phoneme, every phoneme in every word can be scored and assigned a phoneme score, which identifies the phonemes that were pronounced inaccurately or misread; the words containing those phonemes therefore also have pronunciation problems. Since the words in the read-aloud speech correspond to the words in the read-aloud text, each word with a pronunciation problem in the speech can be mapped to a target word in the read-aloud text. The position of the inaccurate or misread phoneme within that word then identifies the phoneme with the pronunciation problem in the target word, and this phoneme is determined as the target phoneme.
S103, generating a pronunciation correction word file according to the target word and the target phoneme.
The pronunciation correction word file is a standard pronunciation file obtained by performing variable-speed processing on the target phoneme in the target word.
In this embodiment, the pronunciation correction word file is a standard pronunciation file generated for a word the user mispronounced. It is provided to the user for playback so that the user can correct the mispronounced word by imitating its pronunciation in the file.
Specifically, a standard pronunciation file is generated for each determined target word in the read-aloud text according to the word's standard pronunciation. Because the target phoneme is the mispronounced phoneme of the target word, it must be highlighted in the pronunciation correction word file: the segment corresponding to the target phoneme is located in the standard pronunciation file and given variable-speed processing, and the processed standard pronunciation file is taken as the pronunciation correction word file of the target word. This file is then provided to the user, who can correct the mispronounced word according to the pronunciation in it.
Further, when the segment corresponding to the target phoneme in the standard pronunciation file is processed, pronunciation emphasis may also be applied to it, so that the misread phoneme is stressed in the generated pronunciation correction word file.
In the technical solution of this embodiment, a read-aloud text and the corresponding read-aloud speech are obtained; a target word in the read-aloud text and a target phoneme in the target word are determined according to the read-aloud speech and a preset spoken language scoring standard; and a pronunciation correction word file, namely a standard pronunciation file in which the target phoneme of the target word has undergone variable-speed processing, is generated according to the target word and the target phoneme. When the user reads along with the provided read-aloud text, the misread target words and the specific misread target phonemes within them are identified, and the standard pronunciation of each target word, with variable-speed processing applied to its target phonemes, is provided to the user as the pronunciation correction word file. The user can thus quickly find the misread words and correct them by following the file. This overcomes the limitation of existing spoken language practice software, which can only perform a simple spoken language evaluation and struggles to correct pronunciation in a targeted way, and makes spoken language practice more targeted and more effective.
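Stress is conveyed mainly by duration, pitch and loudness; the variable-speed processing already covers duration, and the pronunciation emphasis above can be illustrated for loudness. The following is a minimal sketch, assuming emphasis is implemented as a smooth gain boost over the target phoneme's sample range (known from forced alignment) on samples in [-1, 1]; the function name, the 6 dB default and the ramp length are hypothetical choices, not taken from the patent.

```python
import math


def emphasize(samples, start, end, gain_db=6.0, ramp=80):
    """Boost samples[start:end] by `gain_db`, fading the gain in and out over
    `ramp` samples to avoid clicks, and clip the result to [-1.0, 1.0]."""
    gain = 10 ** (gain_db / 20)
    out = list(samples)
    for j in range(start, end):
        # Weight rises from 0 to 1 at the start of the span and falls back at the end.
        w = 1.0 if ramp <= 0 else min(1.0, (j - start + 1) / ramp, (end - j) / ramp)
        g = 1.0 + (gain - 1.0) * w
        out[j] = max(-1.0, min(1.0, samples[j] * g))
    return out


x = [0.1] * 400
y = emphasize(x, 100, 300, gain_db=20 * math.log10(2), ramp=10)  # +6.02 dB, i.e. x2
print(round(y[200], 6), y[50], y[350])   # 0.2 0.1 0.1
```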
Example two
FIG. 2 is a flowchart of a spoken language pronunciation correction method according to the second embodiment of the present invention, which further refines the optional technical solutions above. After the read-aloud text is obtained, standard read-aloud speech is generated from it and played, so that the user can read along with it and produce the read-aloud speech corresponding to the read-aloud text. After the read-aloud speech is obtained, it is divided, according to the read-aloud text, into speech words corresponding to the words of the text, and each speech word is divided into phonemes according to a preset pronunciation dictionary to obtain its phoneme set. Each speech word is then scored according to the preset spoken language scoring standard, and the target word in the read-aloud text and the target phoneme in the target word are determined from the scores. Two methods of generating the pronunciation correction word file from the target word and the target phoneme are provided. In both, the file given to the user not only contains the standard pronunciation of the target word but also emphasizes the phoneme the user misread, so that the user can see more clearly which phoneme in the word was misread and correct it in a targeted way. This makes the correction of misread pronunciations more targeted and spoken language practice more effective.
As shown in FIG. 2, the spoken language pronunciation correction method provided by the second embodiment of the present invention specifically includes the following steps:
S201, obtaining the read-aloud text.
S202, generating standard read-aloud speech according to the read-aloud text.
Specifically, the language of the read-aloud text is determined and the current standard pronunciation rules of that language are obtained. According to these rules, the standard pronunciation of each word and of each sentence in the read-aloud text is synthesized by speech synthesis; the synthesized word and sentence pronunciations are combined, and the result is taken as the standard read-aloud speech corresponding to the read-aloud text.
S203, playing the standard read-aloud speech.
Specifically, the standard read-aloud speech may be played as soon as it is generated; alternatively, after it is generated, the device may listen for a trigger operation from the user and play it only when the trigger operation is received. The user then reads the read-aloud text along with the played standard speech, producing the read-aloud speech.
S204, obtaining the read-aloud speech corresponding to the read-aloud text.
S205, dividing the read-aloud speech into at least one speech word according to the read-aloud text.
Each speech word corresponds one-to-one to a word in the read-aloud text.
Specifically, the read-aloud text may consist of one or more words, each with its own pronunciation. The read-aloud speech can therefore be divided into speech words according to the words of the read-aloud text, using the pauses in the speech together with the differing pronunciations of the words, so that each resulting speech word corresponds one-to-one to a word in the read-aloud text.
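The pause-based part of this division can be sketched with a simple energy detector. This is a minimal sketch, assuming mono samples in [-1, 1]; the frame size, RMS threshold and minimum pause length are illustrative values, and the function name is hypothetical. It only finds voiced regions separated by pauses. Words in connected speech are often not separated by pauses, so a practical system would typically combine this with forced alignment against the known read-aloud text before assigning one speech word per text word.

```python
import math


def split_on_pauses(samples, sr, frame_ms=20, threshold=0.02, min_pause_ms=150):
    """Return (start, end) sample ranges of voiced regions separated by pauses."""
    hop = max(1, int(sr * frame_ms / 1000))
    min_pause = max(1, int(round(min_pause_ms / frame_ms)))   # pause length in frames
    voiced = []
    for s in range(0, len(samples), hop):
        chunk = samples[s:s + hop]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        voiced.append(rms >= threshold)

    segments, start, silent = [], None, 0
    for f, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = f
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_pause:             # pause long enough: close the segment
                segments.append((start * hop, (f - silent + 1) * hop))
                start, silent = None, 0
    if start is not None:                       # speech runs to the end of the signal
        segments.append((start * hop, min(len(samples), (len(voiced) - silent) * hop)))
    return segments


def tone(n, sr=16000):
    """Synthetic 440 Hz stand-in for a spoken word."""
    return [0.5 * math.sin(2 * math.pi * 440 * i / sr) for i in range(n)]


def silence(n):
    return [0.0] * n


sr = 16000
speech = tone(4800) + silence(4800) + tone(3200) + silence(4800) + tone(4000)
words = ["welcome", "to", "China"]
segments = split_on_pauses(speech, sr)
print(segments)   # [(0, 4800), (9600, 12800), (17600, 21600)]
if len(segments) == len(words):
    speech_words = list(zip(words, segments))   # one speech word per text word
```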
S206, dividing each speech word into phonemes according to a preset pronunciation dictionary, and determining the phoneme set corresponding to each speech word.
In this embodiment, the preset pronunciation dictionary is a dictionary that maps words to phonemes and serves as the basis for phoneme division. Optionally, it may be the CMU Pronouncing Dictionary of Carnegie Mellon University or another general-purpose pronunciation dictionary, which is not limited in the embodiments of the present invention.
For example, if the read-aloud text is "Welcome to China", dividing the obtained read-aloud speech into phonemes according to the CMU pronunciation dictionary yields the following phonemes, where "/" separates words and "-" separates syllables:
w eh1 l-k ah0 m/t uw1/ch ay1-n ah0
That is, the three speech words in the read-aloud speech are "welcome", "to", and "China", and their phoneme sets are [w eh1 l k ah0 m], [t uw1], and [ch ay1 n ah0], respectively.
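The dictionary lookup in the example above can be sketched as follows. The entries are taken from the CMU Pronouncing Dictionary and lower-cased to match the notation used here; the helper name `phoneme_sets` is hypothetical, and a full implementation would load the complete dictionary (for instance through NLTK's `cmudict` corpus) instead of a hand-written subset. Some words, such as "to", have more than one listed pronunciation; this sketch stores only one per word.

```python
import re

# Subset of the CMU Pronouncing Dictionary, lower-cased; stress digits are kept.
LEXICON = {
    "welcome": ["w", "eh1", "l", "k", "ah0", "m"],
    "to": ["t", "uw1"],
    "china": ["ch", "ay1", "n", "ah0"],
}


def phoneme_sets(text, lexicon=LEXICON):
    """Return [(word, phoneme list)] for each word of the read-aloud text, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    missing = [w for w in words if w not in lexicon]
    if missing:
        raise KeyError(f"not in pronunciation dictionary: {missing}")
    return [(w, list(lexicon[w])) for w in words]


print(phoneme_sets("Welcome to China"))
```

Raising on out-of-vocabulary words keeps the one-to-one word mapping honest; a real system might instead fall back to a grapheme-to-phoneme model.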
S207, scoring each phoneme in the phoneme set corresponding to each voice word according to a preset spoken language scoring standard, and determining a phoneme score corresponding to each phoneme.
Specifically, because the read-aloud text corresponding to each speech word is known, each speech word is force-aligned once against its corresponding word in the read-aloud text. The likelihood score obtained from this forced alignment is compared with the likelihood score obtained from the same speech word when the text is treated as unknown (that is, by free recognition), and the phoneme score of each phoneme in the phoneme set is determined from the resulting likelihood ratio.
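The likelihood-ratio scoring described above corresponds to the widely used goodness of pronunciation (GOP) measure, which this document names as one option for the scoring criterion. A minimal sketch, assuming an acoustic model has already produced, for each phoneme segment, the total log-likelihood under forced alignment to the expected phoneme and under an unconstrained phoneme loop. The function names and the linear mapping onto a 0 to 100 point scale (including its `floor` constant) are hypothetical calibration choices, not taken from the document.

```python
def gop(forced_loglik, free_loglik, n_frames):
    """Frame-normalised log-likelihood ratio for one phoneme segment.

    GOP = (log p(O | expected phoneme) - log p(O | best free phoneme)) / n_frames.
    Free decoding normally scores at least as well as forced alignment, so
    GOP <= 0, and 0 means the expected phoneme was also the best-scoring one.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return (forced_loglik - free_loglik) / n_frames


def gop_to_score(gop_value, floor=-5.0):
    """Hypothetical linear map: GOP 0 gives 100 points, GOP <= floor gives 0 points."""
    clamped = min(0.0, max(gop_value, floor))
    return 100.0 * (1.0 - clamped / floor)


# A 10-frame phoneme segment whose forced alignment is 20 nats worse than free decoding.
score = gop_to_score(gop(-120.0, -100.0, 10))   # GOP = -2.0, score of about 60 points
```

Normalising by the frame count keeps long and short phonemes comparable; the mapping to points is where a real system would use calibration data.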
For example, fig. 3 is an exemplary diagram of an interface for scoring the phonemes in the read-aloud speech according to the preset spoken language scoring criterion, in the second embodiment of the present invention. The interface displays the read-aloud text read by the user, the correct pronunciation of each speech word obtained by dividing it into phonemes with the preset pronunciation dictionary, and the score of each phoneme in the user's read-aloud speech.
S208, determining a phoneme whose phoneme score is less than a preset score threshold as a target phoneme.
Specifically, when the score of a phoneme is less than the preset score threshold, the phoneme, or its pronunciation within the word, may be considered incorrect, and the phoneme is determined as a target phoneme.
Optionally, the preset score threshold may be 60 points, or another threshold set according to the actual situation, which is not limited in the embodiments of the present invention. For example, the threshold may be set according to a pre-selected evaluation criterion. If the criterion is extensive reading, the threshold may be set to 60 points, meaning that the user's pronunciation passes once it is generally intelligible to listeners. If the criterion is accurate reading, the threshold may be set to 80 points, meaning that the user's pronunciation passes only when it closely matches the standard pronunciation.
S209, determining the speech word corresponding to the target phoneme as the target word in the read-aloud text.
Specifically, because the target phoneme is a misread phoneme, the speech word containing it can be regarded as a misread word. Since the speech words correspond one-to-one to the words in the read-aloud text, once the speech word corresponding to the target phoneme is identified, its corresponding word in the read-aloud text is determined as the target word.
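Steps S208 and S209 reduce to a threshold test followed by a lookup through the one-to-one word mapping. The sketch below assumes phoneme scores on a 0 to 100 point scale; the 60 and 80 point thresholds come from the description above, while the criterion keys, the function name `find_targets`, and the input layout are hypothetical.

```python
# Thresholds from the description; the key names are illustrative.
THRESHOLDS = {"extensive": 60.0, "accurate": 80.0}


def find_targets(scored_words, criterion="extensive"):
    """scored_words: [(word, [(phoneme, score), ...]), ...] in read-aloud text order.

    Returns one entry per target word with the indices of its target phonemes.
    """
    threshold = THRESHOLDS[criterion]
    targets = []
    for word_index, (word, phones) in enumerate(scored_words):
        # S208: phonemes scoring below the threshold are target phonemes.
        bad = [i for i, (_, score) in enumerate(phones) if score < threshold]
        if bad:
            # S209: a word containing a target phoneme is a target word.
            targets.append({"word_index": word_index, "word": word, "phonemes": bad})
    return targets
```

Keeping `word_index` alongside the word preserves the position in the read-aloud text, so repeated words are not confused with each other.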
S210, generating a sound correction word file according to the target word and the target phoneme.
The sound correction word file is a standard pronunciation file in which the target phoneme of the target word has undergone variable speed (time-stretching) processing.
Further, fig. 4 is a schematic flowchart of generating the sound correction word file according to the target word and the target phoneme in the second embodiment of the present invention. As shown in fig. 4, the process specifically includes the following steps:
S301, generating a standard pronunciation file of the target word by a speech synthesis technology.
Specifically, the standard word pronunciation of the target word is generated by a speech synthesis technology according to the current standard pronunciation rules, and this standard word pronunciation is determined as the standard pronunciation file of the target word.
S302, force-aligning each phoneme in the target word, and determining the syllable boundary positions, in the standard pronunciation file, of the syllable containing each phoneme.
Specifically, one word may correspond to several phonemes, so the target word can be divided into multiple phonemes, and each of them is force-aligned using the forced alignment function of an existing acoustic model. Since one syllable may contain several phonemes, the syllable to which each phoneme belongs can be determined from the phoneme's position, and from that the boundary positions of that syllable in the standard pronunciation file of the target word.
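Given a phoneme-level forced alignment of the synthesized standard pronunciation file, the syllable boundaries follow by grouping consecutive phonemes. The sketch assumes the alignment is available as (phoneme, start, end) tuples in seconds and that the word's syllabification is supplied in the hyphenated notation of the "welcome" example (the CMU dictionary itself does not mark syllables). The name `syllable_boundaries` is hypothetical, and the timings in the usage check are invented for illustration.

```python
def syllable_boundaries(alignment, syllabified):
    """Group aligned phonemes into syllables and return their time spans.

    alignment:   [(phoneme, start_s, end_s), ...] from forced alignment, in order.
    syllabified: e.g. "w eh1 l-k ah0 m" ("-" separates syllables).
    Returns [(phonemes_of_syllable, start_s, end_s), ...].
    """
    syllables = [s.split() for s in syllabified.split("-")]
    expected = [p for syl in syllables for p in syl]
    if [p for p, _, _ in alignment] != expected:
        raise ValueError("alignment does not match the syllabified phonemes")
    spans, k = [], 0
    for syl in syllables:
        first, last = alignment[k], alignment[k + len(syl) - 1]
        # A syllable starts where its first phoneme starts and ends where its last one ends.
        spans.append((syl, first[1], last[2]))
        k += len(syl)
    return spans
```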
S303, determining the target syllable boundary positions of the target phoneme in the standard pronunciation file, and determining the syllable between the target syllable boundary positions as the target syllable.
Specifically, the target phoneme is the mispronounced phoneme in the target word, and the misread part must be processed so that it is emphasized when the sound correction word file is generated. The target syllable boundary positions of the target phoneme in the standard pronunciation file, that is, the location of the syllable to be processed, are therefore determined from the position of the target phoneme in the target word, and the syllable between these boundary positions is determined as the target syllable for subsequent processing.
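Once syllable spans are known, locating the target syllables means finding each span that contains a target phoneme. A sketch with a hypothetical `target_syllable_spans` helper: it assumes syllable spans in a (phonemes, start, end) form in word order, plus the indices of the target phonemes within the word, and returns the distinct spans to be processed in S304.

```python
def target_syllable_spans(spans, target_indices):
    """Return the (start_s, end_s) of every syllable that contains a target phoneme.

    spans:          [(phonemes_of_syllable, start_s, end_s), ...] in word order.
    target_indices: positions of the target phonemes within the whole word.
    """
    result, offset = [], 0
    for phonemes, start, end in spans:
        # Word-level phoneme indices offset .. offset + len - 1 belong to this syllable.
        if any(offset <= i < offset + len(phonemes) for i in target_indices):
            result.append((start, end))
        offset += len(phonemes)
    return result
```

Each syllable is reported at most once even when several of its phonemes were misread, so it is lengthened only once.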
S304, lengthening the target syllable, and determining the standard pronunciation file with the lengthened target syllable as the sound correction word file.
Specifically, to emphasize the misread position, the target syllable between the target syllable boundary positions may be lengthened, for example to twice its original duration. When the resulting standard pronunciation file is played, the syllable corresponding to the target phoneme lasts longer, which emphasizes the target pronunciation; this lengthened standard pronunciation file is determined as the sound correction word file.
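Lengthening one syllable while leaving the rest of the file untouched can be done by time-stretching only the samples between the target syllable boundaries. The sketch below swaps in plain Hann-windowed overlap-add (OLA), the simplest time-stretching technique. It preserves pitch, unlike simple resampling, but can sound rougher than WSOLA or a phase vocoder, which a production system would more likely use. The function names and the frame and hop sizes are assumptions; the default factor of 2 follows the example above.

```python
import math

import numpy as np


def ola_stretch(segment, factor, frame_len=256, hop_out=64):
    """Stretch `segment` to about `factor` times its length with Hann-windowed OLA."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    segment = np.asarray(segment, dtype=float)
    n_out = int(round(len(segment) * factor))
    if len(segment) < frame_len:
        # Too short for framing: fall back to repeating samples.
        idx = np.minimum((np.arange(n_out) / factor).astype(int), len(segment) - 1)
        return segment[idx]
    window = np.hanning(frame_len)
    hop_in = hop_out / factor            # read the input more slowly than the output is written
    out = np.zeros(n_out + frame_len)
    norm = np.zeros(n_out + frame_len)
    for m in range(math.ceil(n_out / hop_out)):
        a = min(int(round(m * hop_in)), len(segment) - frame_len)  # analysis position
        s = m * hop_out                                             # synthesis position
        out[s:s + frame_len] += segment[a:a + frame_len] * window
        norm[s:s + frame_len] += window
    norm[norm < 1e-8] = 1.0              # avoid dividing by zero where no window overlaps
    return out[:n_out] / norm[:n_out]


def lengthen_syllable(signal, sr, start_s, end_s, factor=2.0):
    """Return a copy of `signal` whose [start_s, end_s) span is stretched by `factor`."""
    i, j = int(round(start_s * sr)), int(round(end_s * sr))
    stretched = ola_stretch(signal[i:j], factor)
    return np.concatenate([np.asarray(signal[:i], dtype=float), stretched,
                           np.asarray(signal[j:], dtype=float)])
```

Dividing by the summed window makes each output sample a weighted average of input samples, so the stretched syllable keeps roughly the original loudness.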
Further, fig. 5 is another schematic flowchart of generating the sound correction word file according to the target word and the target phoneme in the second embodiment of the present invention. As shown in fig. 5, the process specifically includes the following steps:
S401, inputting the target word and the target phoneme into a preset variable speed speech synthesis model.
In this embodiment, the preset variable speed speech synthesis model may be understood as a pre-trained speech generation model with an integrated speech speed-change algorithm, which directly stresses, lengthens, or shortens selected phonemes of the word while generating its speech.
S402, determining the output result of the preset variable speed voice synthesis model as a sound correction word file.
The sound correction word file is a standard pronunciation file which is obtained by carrying out syllable lengthening processing on a target phoneme in a target word.
Specifically, the target word and the target phoneme are input into the preset variable speed speech synthesis model, which generates and outputs a standard pronunciation file of the target word in which the syllable of the target phoneme is lengthened. This output is directly determined as the sound correction word file.
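One way such a model could lengthen only the target phoneme is sketched below. It assumes a non-autoregressive synthesizer with an explicit per-phoneme duration predictor and length regulator in the style of FastSpeech; the document does not say which architecture the preset model uses. The predicted durations of the target phonemes are multiplied before the phoneme encodings are expanded to frame level. `scale_durations`, `length_regulate`, the random stand-in encodings, and the example durations are all hypothetical.

```python
import numpy as np


def scale_durations(durations, target_indices, factor=2.0):
    """Multiply the predicted frame counts of the target phonemes by `factor`."""
    scaled = np.asarray(durations, dtype=float).copy()
    scaled[list(target_indices)] *= factor
    return np.maximum(1, np.round(scaled)).astype(int)   # every phoneme keeps at least 1 frame


def length_regulate(phoneme_encodings, durations):
    """Repeat each phoneme's encoding durations[i] times to obtain frame-level features."""
    return np.repeat(phoneme_encodings, durations, axis=0)


# "china" = ch ay1 n ah0, with 'ay1' (index 1) as the target phoneme.
enc = np.random.default_rng(0).normal(size=(4, 8))    # stand-in encoder outputs
dur = scale_durations([5, 12, 6, 9], target_indices=[1])
frames = length_regulate(enc, dur)                     # 'ay1' now spans 24 of 44 frames
```

Scaling durations before the decoder lets the model render the lengthened phoneme naturally, instead of stretching a finished waveform afterwards.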
According to the technical solution of this embodiment, after the read-aloud text is obtained, standard read-aloud speech is generated from it and played so that the user can read along with it, and the read-aloud speech corresponding to the read-aloud text is then obtained. The read-aloud speech is divided, according to the read-aloud text, into speech words corresponding to the words of the text, and each speech word is divided into phonemes according to the preset pronunciation dictionary to obtain its phoneme set. Each phoneme is then scored according to the preset spoken language scoring criterion, and the target word in the read-aloud text and the target phoneme in the target word are determined from the scores. The sound correction word file is obtained either by generating a standard pronunciation file for the target word and lengthening the target phoneme in it, or by inputting the target word and the target phoneme directly into the preset variable speed speech synthesis model and taking its output. When played, the sound correction word file emphasizes the target phoneme that the user misread, so the user can identify more intuitively which phonemes in the word were misread and correct them specifically. This makes misreading detection and pronunciation correction in spoken language practice more targeted and improves the effectiveness of the practice.
EXAMPLE III
Fig. 6 is a schematic structural diagram of a spoken language pronunciation correction apparatus according to the third embodiment of the present invention. The apparatus includes a text speech acquisition module 51, a target determining module 52, and a sound correction file generation module 53.
The text speech acquisition module 51 is configured to acquire the read-aloud text and the read-aloud speech corresponding to it. The target determining module 52 is configured to determine the target word in the read-aloud text and the target phoneme in the target word according to the read-aloud speech and the preset spoken language scoring criterion. The sound correction file generation module 53 is configured to generate the sound correction word file according to the target word and the target phoneme, where the sound correction word file is a standard pronunciation file obtained by performing variable speed processing on the target phoneme in the target word.
Optionally, the read-aloud text consists of at least one word.
Further, the spoken language pronunciation correction apparatus further includes:
a standard speech generation module, configured to generate the standard read-aloud speech according to the read-aloud text and to play it;
a speech word division module, configured to divide the read-aloud speech into at least one speech word according to the read-aloud text, where each speech word corresponds one-to-one to a word in the read-aloud text; and
a phoneme division module, configured to divide each speech word into phonemes according to the preset pronunciation dictionary and to determine the phoneme set corresponding to each speech word.
Further, the target determining module 52 includes:
a phoneme score determining unit, configured to score each phoneme in the phoneme set of each speech word according to the preset spoken language scoring criterion and to determine the phoneme score of each phoneme;
a target phoneme determining unit, configured to determine a phoneme whose phoneme score is less than the preset score threshold as the target phoneme; and
a target word determining unit, configured to determine the speech word corresponding to the target phoneme as the target word in the read-aloud text.
Further, the sound correction file generation module 53 is specifically configured to: generate a standard pronunciation file of the target word by a speech synthesis technology; force-align each phoneme in the target word, and determine the syllable boundary positions of the syllable containing each phoneme in the standard pronunciation file; determine the target syllable boundary positions of the target phoneme in the standard pronunciation file, and determine the syllable between them as the target syllable; and lengthen the target syllable, and determine the standard pronunciation file with the lengthened target syllable as the sound correction word file.
Further, the sound correction file generation module 53 is further configured to: input the target word and the target phoneme into the preset variable speed speech synthesis model; and determine the output of the preset variable speed speech synthesis model as the sound correction word file, where the sound correction word file is a standard pronunciation file obtained by lengthening the syllable of the target phoneme in the target word.
Optionally, the preset spoken language scoring criterion is the goodness of pronunciation (GOP) scoring criterion.
The spoken language pronunciation correction apparatus provided by this embodiment of the invention can execute the spoken language pronunciation correction method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method.
EXAMPLE IV
Fig. 7 is a schematic structural diagram of a spoken language pronunciation correction device according to the fourth embodiment of the present invention. The device includes a processor 60, a storage device 61, a display screen 62, an input device 63, and an output device 64. The device may contain one or more processors 60 and one or more storage devices 61; one of each is shown in fig. 7 as an example. The processor 60, storage device 61, display screen 62, input device 63, and output device 64 may be connected by a bus or in other ways; a bus connection is shown in fig. 7 as an example. In an embodiment, the device may be a computer, a notebook computer, a smart tablet, or the like.
The storage device 61, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the spoken language pronunciation correction apparatus of any embodiment of the present application (for example, the text speech acquisition module 51, the target determining module 52, and the sound correction file generation module 53). The storage device 61 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the device. In addition, the storage device 61 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 61 may further include memory located remotely from the processor 60, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The display screen 62 may be a touch-enabled display screen, such as a capacitive, electromagnetic, or infrared screen. In general, the display screen 62 displays data according to instructions from the processor 60, and also receives touch operations applied to it and sends the corresponding signals to the processor 60 or other devices.
The input device 63 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device; it may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 64 may include an audio device such as a speaker. The specific composition of the input device 63 and the output device 64 can be set according to actual conditions.
The processor 60 runs the software programs, instructions, and modules stored in the storage device 61 to execute the various functional applications and data processing of the device, that is, to implement the spoken language pronunciation correction method described above.
The computer device provided above can be used to execute the spoken language pronunciation correction method provided in any of the above embodiments, and has corresponding functions and advantages.
EXAMPLE V
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a spoken language pronunciation correction method, the method including:
obtaining a reading text and reading voice corresponding to the reading text;
determining a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud voice and a preset spoken language scoring standard;
generating a sound correction word file according to the target word and the target phoneme;
the sound correction word file is a standard pronunciation file obtained by performing variable speed processing on target phonemes in the target words.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the spoken language pronunciation correction method provided by any embodiments of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the spoken language pronunciation correction apparatus, the included units and modules are divided merely according to functional logic and are not limited to this division as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for distinguishing them from each other and are not used to limit the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A method for correcting spoken language pronunciation, comprising:
obtaining a reading text and reading voice corresponding to the reading text;
determining a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud voice and a preset spoken language scoring standard;
generating a sound correction word file according to the target word and the target phoneme;
the sound correction word file is a standard pronunciation file obtained by performing variable speed processing on the target phoneme in the target word.
2. The method according to claim 1, wherein the read-aloud text consists of at least one word; after the read-aloud text and the read-aloud speech corresponding to the read-aloud text are obtained, the method further comprises:
dividing the read-aloud speech into at least one speech word according to the read-aloud text, each speech word corresponding one-to-one to a word in the read-aloud text; and
performing phoneme division on each speech word according to a preset pronunciation dictionary, and determining a phoneme set corresponding to each speech word.
3. The method according to claim 2, wherein determining the target word in the read-aloud text and the target phoneme in the target word according to the read-aloud speech and the preset spoken language scoring criterion comprises:
scoring each phoneme in the phoneme set corresponding to each speech word according to a preset spoken language scoring standard, and determining a phoneme score corresponding to each phoneme;
determining the phoneme with the phoneme score smaller than a preset score threshold value as a target phoneme;
and determining the speech word corresponding to the target phoneme as the target word in the read-aloud text.
4. The method according to claim 1, wherein generating the sound correction word file according to the target word and the target phoneme comprises:
generating a standard pronunciation file of the target word by a speech synthesis technology;
force-aligning each phoneme in the target word, and determining the syllable boundary positions of the syllable containing each phoneme in the standard pronunciation file;
determining a target syllable boundary position of the target phoneme in the standard pronunciation file, and determining syllables between the target syllable boundary positions as target syllables;
and lengthening the target syllable, and determining the standard pronunciation file after the target syllable is lengthened as a sound correction word file.
5. The method according to claim 1, wherein generating the sound correction word file according to the target word and the target phoneme comprises:
inputting the target words and the target phonemes into a preset variable speed speech synthesis model;
and determining the output result of the preset variable speed speech synthesis model as a sound correction word file, wherein the sound correction word file is a standard pronunciation file obtained by performing syllable lengthening processing on the target phoneme in the target word.
6. The method according to claim 1, after the read-aloud text is obtained, further comprising:
generating standard reading voice according to the reading text;
and playing the standard reading voice.
7. The method according to any one of claims 1 to 6, wherein the preset spoken language scoring criterion is a goodness of pronunciation (GOP) scoring criterion.
8. A spoken language pronunciation correction apparatus, comprising:
the text voice acquisition module is used for acquiring the reading text and the reading voice corresponding to the reading text;
the target determining module is used for determining a target word in the read-aloud text and a target phoneme in the target word according to the read-aloud voice and a preset spoken language scoring standard;
the sound correction file generation module is used for generating a sound correction word file according to the target word and the target phoneme; the sound correction word file is a standard pronunciation file obtained by performing variable speed processing on the target phoneme in the target word.
9. A spoken language pronunciation correction device, comprising a storage device and one or more processors;
the storage device is configured to store one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the spoken language pronunciation correction method according to any one of claims 1 to 7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the spoken language pronunciation correction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110652097.4A CN113393864A (en) | 2021-06-11 | 2021-06-11 | Spoken language pronunciation correction method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113393864A true CN113393864A (en) | 2021-09-14 |
Family
ID=77620478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110652097.4A Pending CN113393864A (en) | 2021-06-11 | 2021-06-11 | Spoken language pronunciation correction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113393864A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036464A (en) * | 2018-09-17 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Pronounce error-detecting method, device, equipment and storage medium |
CN109545244A (en) * | 2019-01-29 | 2019-03-29 | 北京猎户星空科技有限公司 | Speech evaluating method, device, electronic equipment and storage medium |
US20200066251A1 (en) * | 2017-05-24 | 2020-02-27 | Nippon Hoso Kyokai | Audio guidance generation device, audio guidance generation method, and broadcasting system |
CN111370001A (en) * | 2018-12-26 | 2020-07-03 | Tcl集团股份有限公司 | Pronunciation correction method, intelligent terminal and storage medium |
CN112634862A (en) * | 2020-12-18 | 2021-04-09 | 北京大米科技有限公司 | Information interaction method and device, readable storage medium and electronic equipment |
CN112908360A (en) * | 2021-02-02 | 2021-06-04 | 早道(大连)教育科技有限公司 | Online spoken language pronunciation evaluation method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5787230A (en) | | System and method of intelligent Mandarin speech input for Chinese computers |
CN105845134B (en) | | Spoken language evaluation method and system for freely reading question types |
CN109817201B (en) | | Language learning method and device, electronic equipment and readable storage medium |
US11810471B2 (en) | | Computer implemented method and apparatus for recognition of speech patterns and feedback |
CN103714048B (en) | | Method and system for correcting text |
US9548052B2 (en) | | Ebook interaction using speech recognition |
JP2017167368A (en) | | Voice recognition error correction device, method, and program |
KR101819457B1 (en) | | Voice recognition apparatus and system |
KR20010096490A (en) | | Spelling speech recognition apparatus and method for mobile communication |
KR101487005B1 (en) | | Learning method and learning apparatus of correction of pronunciation by input sentence |
JP2009145856A (en) | | Method for constructing module of recognizing english pronunciation variation, and computer readable recording medium with program for achieving construction of module stored therein |
CN110675866A (en) | | Method, apparatus and computer-readable recording medium for improving at least one semantic unit set |
JP6678545B2 (en) | | Correction system, correction method and program |
CN113268981A (en) | | Information processing method and device and electronic equipment |
Liao et al. | | A prototype of an adaptive Chinese pronunciation training system |
CN105786204A (en) | | Information processing method and electronic equipment |
CN109473007B (en) | | English natural spelling teaching method and system combining phonemes with sound side |
CN115083222B (en) | | Information interaction method and device, electronic equipment and storage medium |
CN113393864A (en) | | Spoken language pronunciation correction method, device, equipment and storage medium |
JP6366179B2 (en) | | Utterance evaluation apparatus, utterance evaluation method, and program |
KR101487006B1 (en) | | Learning method and learning apparatus of correction of pronunciation for pronenciaion using linking |
KR101487007B1 (en) | | Learning method and learning apparatus of correction of pronunciation by pronunciation analysis |
JP6879521B1 (en) | | Multilingual Speech Recognition and Themes-Significance Analysis Methods and Devices |
CN113990351A (en) | | Sound correction method, sound correction device and non-transient storage medium |
KR20220032200A (en) | | User Equipment with Artificial Inteligence for Forign Language Education and Method for Forign Language Education |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |