CN109461436B - Method and system for correcting pronunciation errors of voice recognition - Google Patents


Info

Publication number
CN109461436B
CN109461436B (application CN201811239934.5A)
Authority
CN
China
Prior art keywords
pronunciation
acoustic model
audio
error
prone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811239934.5A
Other languages
Chinese (zh)
Other versions
CN109461436A (en
Inventor
魏誉荧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Genius Technology Co Ltd
Original Assignee
Guangdong Genius Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Guangdong Genius Technology Co Ltd
Priority to CN201811239934.5A
Publication of CN109461436A
Application granted
Publication of CN109461436B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/01: Assessment or evaluation of speech recognition systems
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 to G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
    • G10L2015/0635: Training; updating or merging of old and new templates; mean values; weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and a system for correcting pronunciation errors in speech recognition. The method comprises the following steps: establishing a mapping table between a standard acoustic model and an error acoustic model for error-prone characters; acquiring the user's voice input; recognizing the voice input and, when it contains an error-prone character, extracting the audio segment corresponding to the word containing that character; and, when the audio segment matches a speech audio in the error acoustic model, prompting the user that the error-prone character was mispronounced and outputting the corresponding speech audio from the standard acoustic model according to the mapping table. By establishing the mapping table between the standard acoustic model and the error acoustic model, the invention prompts the user and outputs the corresponding correct audio whenever it recognizes that an error-prone character has been mispronounced.

Description

Method and system for correcting pronunciation errors of voice recognition
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a method and system for correcting pronunciation errors in speech recognition.
Background
With the rapid development of the internet, daily life is becoming increasingly intelligent. Voice interaction, as one of the mainstream forms of human-computer interaction on intelligent terminals, is increasingly popular with users. Because the intelligent terminal acts on the voice the user inputs, the accuracy of that voice input strongly affects the feedback the terminal gives.
Chinese contains a large number of polyphonic characters (characters with multiple readings) and visually similar characters. Some users find it difficult to distinguish rarely used polyphonic or similar-looking characters, and some even habitually mispronounce them.
In addition, pupils in the learning process, especially those who can read only a small number of characters, often misread words containing polyphonic or similar-looking characters. When an intelligent terminal tries to recognize such a mispronunciation, recognition fails and the terminal cannot return the intended result or give accurate feedback. A method and a system for correcting pronunciation errors in speech recognition are therefore needed to solve these problems.
Disclosure of Invention
The invention aims to provide a method and a system for correcting pronunciation errors in speech recognition which, by establishing a mapping table between a standard acoustic model and an error acoustic model, prompt the user and output the corresponding correct audio whenever an error-prone character is recognized as mispronounced.
The technical scheme provided by the invention is as follows:
The invention provides a method for correcting pronunciation errors in speech recognition, comprising the following steps:
establishing a mapping table between a standard acoustic model and an error acoustic model for error-prone characters;
acquiring the user's voice input;
recognizing the voice input and, when it contains an error-prone character, extracting the audio segment corresponding to the word containing that character; and
when the audio segment matches a speech audio in the error acoustic model, prompting the user that the error-prone character was mispronounced, and outputting the corresponding speech audio from the standard acoustic model according to the mapping table.
Further, before establishing the mapping table between the standard acoustic model and the error acoustic model for the error-prone characters, the method further includes:
acquiring the error-prone characters and generating target words from them;
acquiring the speech audio of the target words and generating the standard acoustic model from that audio;
acquiring the confusable counterparts (alternative readings or similar-looking characters) of each error-prone character and substituting them for the error-prone character in the target words to generate confusion words; and
acquiring the speech audio of the confusion words and generating the error acoustic model from that audio.
Further, the method also includes:
when the audio segment matches a speech audio in the standard acoustic model, prompting the user that the error-prone character was pronounced correctly.
Further, the method also includes:
when the audio segment matches no speech audio in either acoustic model (standard or error), converting the audio segment into recognized text;
if the target words contain the recognized text, judging whether the audio segment is pronounced correctly: if so, updating the standard acoustic model with the audio segment; otherwise, updating the error acoustic model with the audio segment, prompting the user that the error-prone character was mispronounced, and outputting the corresponding speech audio from the standard acoustic model according to the mapping table; and
if the target words do not contain the recognized text, updating the target words with the recognized text and updating the acoustic models with the audio segment.
Further, when the target words do not contain the recognized text, updating the target words with the recognized text and updating the acoustic models with the audio segment specifically includes:
updating the target words with the recognized text;
if the audio segment is pronounced correctly, updating the standard acoustic model with the audio segment, updating the confusion words from the updated target words, and then updating the error acoustic model with the speech audio of the updated confusion words; and
if the audio segment is mispronounced, acquiring the correct speech audio of the recognized text, updating the standard acoustic model with that correct audio, updating the confusion words from the updated target words, and then updating the error acoustic model with the speech audio of the updated confusion words.
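The update flow in the further limitations above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the models are plain dicts of `{word: [audio, ...]}`, and `pronounced_correctly` and `fetch_correct_audio` stand in for steps the patent leaves unspecified (how correctness is judged and where correct audio for a new word comes from).

```python
def handle_unmatched_segment(audio, recognized_text, pronounced_correctly,
                             target_words, standard_model, error_model,
                             fetch_correct_audio=lambda w: "correct-audio:" + w):
    """Sketch of the update flow for a segment that matched neither model."""
    actions = []
    is_new = recognized_text not in target_words
    if is_new:
        target_words.add(recognized_text)  # update the target words
        actions.append("added target word")
    if pronounced_correctly:
        standard_model.setdefault(recognized_text, []).append(audio)
        actions.append("updated standard model")
    else:
        if is_new:
            # the correct audio of a brand-new word must be obtained elsewhere
            standard_model.setdefault(recognized_text, []).append(
                fetch_correct_audio(recognized_text))
            actions.append("added correct audio to standard model")
        error_model.setdefault(recognized_text, []).append(audio)
        actions.append("updated error model and prompted user")
    return actions
```

Running it on a known word with a wrong pronunciation updates only the error model; running it on a new, correctly pronounced word grows both the target-word set and the standard model.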
The invention also provides a system for correcting pronunciation errors in speech recognition, comprising:
a mapping-table establishing module, which establishes a mapping table between a standard acoustic model and an error acoustic model for error-prone characters;
an acquisition module, which acquires the user's voice input;
an extraction module, which recognizes the voice input acquired by the acquisition module and, when it contains an error-prone character, extracts the audio segment corresponding to the word containing that character; and
a processing module, which, when the audio segment extracted by the extraction module matches a speech audio in the error acoustic model, prompts the user that the error-prone character was mispronounced and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping-table establishing module.
Further, the system also includes:
an error-prone-character acquisition module, which acquires the error-prone characters;
a target-word generation module, which generates target words from the error-prone characters acquired by the error-prone-character acquisition module;
an audio acquisition module, which acquires the speech audio of the target words generated by the target-word generation module;
an acoustic-model generation module, which generates the standard acoustic model from the speech audio of the target words acquired by the audio acquisition module;
a confusable-character acquisition module, which acquires the confusable counterparts of the error-prone characters acquired by the error-prone-character acquisition module;
a confusion-word generation module, which substitutes the confusable counterparts acquired by the confusable-character acquisition module for the error-prone characters in the target words to generate confusion words;
the audio acquisition module, which also acquires the speech audio of the confusion words generated by the confusion-word generation module; and
the acoustic-model generation module, which also generates the error acoustic model from the speech audio of the confusion words acquired by the audio acquisition module.
Further, the system also includes:
the processing module, which, when the audio segment extracted by the extraction module matches a speech audio in the standard acoustic model, prompts the user that the error-prone character was pronounced correctly.
Further, the system also includes:
the processing module, which converts the audio segment into recognized text when the segment extracted by the extraction module matches no speech audio in either acoustic model (standard or error); and
a control module, which, if the target words contain the recognized text converted by the processing module, judges whether the audio segment is pronounced correctly and, if so, updates the standard acoustic model with the segment; otherwise it updates the error acoustic model with the segment, prompts the user that the error-prone character was mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table;
and, if the target words do not contain the recognized text, the control module updates the target words with the recognized text and the acoustic models with the audio segment.
Further, the control module specifically includes:
a target-word updating unit, which updates the target words with the recognized text when the target words do not contain the text converted by the processing module; and
a control unit, which, if the audio segment extracted by the extraction module is pronounced correctly, updates the standard acoustic model with the segment, updates the confusion words from the target words updated by the target-word updating unit, and then updates the error acoustic model with the speech audio of the updated confusion words;
the control unit likewise, if the extracted audio segment is mispronounced, acquires the correct speech audio of the recognized text, updates the standard acoustic model with that correct audio, updates the confusion words from the updated target words, and then updates the error acoustic model with the speech audio of the updated confusion words.
The method and system for correcting pronunciation errors in speech recognition can bring at least one of the following beneficial effects:
1. Establishing a mapping table between the standard acoustic model and the error acoustic model for the error-prone characters makes it straightforward to match the user's voice input later and to correct mispronunciations.
2. When matching the user's voice input, only the audio segments corresponding to words containing error-prone characters are extracted, which reduces the difficulty and workload of matching, speeds it up, and improves its accuracy.
Drawings
The above features, technical features, advantages and implementations of the method and system for correcting pronunciation errors in speech recognition are further described below, in a clearly understandable manner, through preferred embodiments taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of a first embodiment of a method for correcting pronunciation errors in speech recognition according to the present invention;
FIG. 2 is a flowchart of a second embodiment of a method for correcting pronunciation errors in speech recognition according to the present invention;
FIG. 3 is a flowchart of a third embodiment of a method for correcting pronunciation errors in speech recognition according to the present invention;
FIG. 4 is a schematic structural diagram of a fourth embodiment of a system for correcting pronunciation errors in speech recognition according to the present invention;
FIG. 5 is a schematic structural diagram of a fifth embodiment of a system for correcting pronunciation errors in speech recognition according to the present invention;
fig. 6 is a schematic structural diagram of a sixth embodiment of a system for correcting pronunciation errors in speech recognition according to the present invention.
Reference numbers:
1000 system for correcting pronunciation errors in speech recognition
1100 mapping-table establishing module
1200 acquisition module
1300 extraction module
1400 processing module
1500 error-prone-character acquisition module
1600 target-word generation module
1700 audio acquisition module
1800 acoustic-model generation module
1850 confusable-character acquisition module
1900 confusion-word generation module
1950 control module
1951 target-word updating unit
1952 control unit
Detailed Description
To illustrate the embodiments of the invention and the technical solutions in the prior art more clearly, the following description refers to the accompanying drawings. Obviously, the drawings described below are only some examples of the invention; a person skilled in the art can derive other drawings and embodiments from them without inventive effort.
For simplicity, the drawings schematically show only the parts relevant to the invention and do not represent the actual structure of a product. In addition, to keep the drawings concise and understandable, components with the same structure or function are in some drawings only schematically illustrated or labeled once. In this document, "one" means not only "only one" but also "more than one".
In a first embodiment of the invention, shown in fig. 1, a method for correcting pronunciation errors in speech recognition includes:
S100: establish a mapping table between the standard acoustic model and the error acoustic model for the error-prone characters.
Specifically, polyphonic characters and characters that have several similar-looking counterparts are taken as the error-prone characters. Their correct pronunciations form the standard acoustic model, and their confusable pronunciations form the error acoustic model. A mapping table between the two models is then established so that the corresponding correct pronunciation can later be found to correct a wrong one.
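As a concrete illustration of such a mapping table, the sketch below uses pinyin strings as placeholders for stored speech audio. The example word 银行 (yin2hang2, "bank") and its misreading yin2xing2 are illustrative choices, not data from the patent.

```python
# (word, error-model entry) -> standard-model entry; pinyin strings are
# placeholders for speech audio and the data is illustrative.
error_to_standard = {
    ("银行", "yin2xing2"): "yin2hang2",
}

def standard_audio_for(word, error_reading):
    # Look up the correct audio for a known misreading; None if unmapped.
    return error_to_standard.get((word, error_reading))
```

Given a segment that matched an error-model entry, the lookup directly yields the standard-model audio to play back to the user.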
S200: acquire the user's voice input.
Specifically, the voice input may be captured in real time, for example when the user wants pronunciation checking while reading aloud with the intelligent terminal, or when a pupil checks the result of learning new vocabulary through the terminal. It may also be pre-recorded audio, for example when checking whether the pronunciation in audio recorded by a student is accurate. Because a prompt is issued whenever a wrong pronunciation is detected, the user can choose whether the system as a whole, or an individual application, enables the mispronunciation-correction function.
S300: recognize the voice input and, when it contains an error-prone character, extract the audio segment corresponding to the word containing that character.
Specifically, the acquired voice input is first converted into text, and the text is checked for any of the error-prone characters. If one or more are present, word segmentation is used to analyze the sentence constituents and the parts of speech of the words, the words containing error-prone characters are marked, and finally the audio segments corresponding to the marked words are extracted from the voice input according to the marking result.
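The extraction step can be sketched as follows, assuming the recognizer supplies word-level timestamps (an assumption of this sketch; the patent only says that the audio segments of the marked words are extracted). The characters in `ERROR_PRONE_CHARS` are illustrative.

```python
ERROR_PRONE_CHARS = {"行", "血"}  # illustrative error-prone characters

def extract_error_prone_segments(word_timestamps, error_prone=ERROR_PRONE_CHARS):
    """word_timestamps: [(word, start_sec, end_sec)] from the recognizer.
    Keep only the segments of words containing an error-prone character."""
    return [(word, start, end)
            for word, start, end in word_timestamps
            if any(ch in error_prone for ch in word)]
```

Only the kept segments are passed on to acoustic matching, which is what shortens the audio the matcher has to handle.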
S400: when the audio segment matches a speech audio in the error acoustic model, prompt the user that the error-prone character was mispronounced and output the corresponding speech audio from the standard acoustic model according to the mapping table.
Specifically, the speech audios of the error-prone character in both the standard and error acoustic models are looked up from the error-prone character of the extracted audio segment, and the segment is then matched against those speech audios one by one.
If the segment matches a speech audio belonging to the error acoustic model, that part of the user's voice input mispronounces the error-prone character, so the user is prompted accordingly.
Since the same error-prone character may appear several times in the voice input but not be mispronounced everywhere, the prompt must use the character's context in the voice input to state clearly which occurrence is meant.
In addition, the correct speech audio of the word formed with the error-prone character at that position is found through the mapping table between the standard and error acoustic models and is output to the user, making it easy to correct the pronunciation.
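The matching decision of step S400 can be sketched like this. Exact string equality on pinyin placeholders stands in for real acoustic scoring, and all data names are illustrative.

```python
def match_segment(word, observed, standard_model, error_model, mapping):
    """Classify one extracted segment (step S400 sketch)."""
    if observed in standard_model.get(word, []):
        return "correct", None
    if observed in error_model.get(word, []):
        # mispronounced: fetch the correct audio through the mapping table
        return "mispronounced", mapping[(word, observed)]
    return "unmatched", None
```

The "unmatched" branch is where the model-update flow of the later further limitations takes over.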
In this embodiment, after the voice input is acquired, the error-prone characters it contains are recognized first, and then only the audio segments corresponding to the words containing them are extracted and matched against the speech audio of the standard and error acoustic models. Extracting only the relevant segments shortens the audio that has to be matched, lowers the demands on the matching capacity of the system, speeds up matching, and improves its accuracy.
A second embodiment of the present invention is an optimization of the first and, as shown in fig. 2, includes:
S010: acquire the error-prone characters and generate target words from them.
Specifically, polyphonic characters and characters with several similar-looking counterparts are taken as the error-prone characters, and target words are generated from them. A target word is a word containing an error-prone character, i.e. a character that is polyphonic or has several similar-looking counterparts, whose pronunciation within the word is easily confused, such as 'shen', 'downloading' and 'pressing'. There are also words, such as 'stubborn' and 'frame', whose mispronunciation does not hinder everyday communication but is nevertheless wrong; these should be corrected too, especially when pupils are just starting to learn new characters, so that correct pronunciation is formed from the start and does not have to be fixed later.
S020: acquire the speech audio of the target words and generate the standard acoustic model from it.
Specifically, the speech audio of each target word is acquired; a target word may correspond to one or more speech audios. For every target word the standard-Mandarin pronunciation audio is acquired, and when the user's pronunciation has to be corrected the system preferentially outputs that standard-Mandarin audio.
However, because of the user's gender, age, region, dialect accent, intonation and so on, a pronunciation may be correct yet fail to match the standard-Mandarin audio. Besides the standard-Mandarin audio, each target word should therefore also gather, as far as possible, correctly pronounced audio from people of different ages and regions, and the user can choose whether to be prompted and corrected later when a pronunciation is correct but non-standard.
The standard acoustic model is generated from the speech audio of the target words and classified by error-prone character; when an error-prone character yields many target words, or a target word has many speech audios, a further classification by target word is made under that error-prone character.
S030: acquire the confusable counterparts of each error-prone character and substitute them into the target words to generate confusion words.
Specifically, the confusable counterpart of an error-prone character is either another reading of the same polyphonic character or a similar-looking character. Substituting it for the error-prone character in a target word produces a confusion word, i.e. a combination of the target word with the alternative reading of the polyphone, or with the similar-looking character.
A confusion word corresponds to the speech a user may actually produce when misreading the target word; it does not follow the rules of word formation and does not actually exist as a word. Confusion words are used only to obtain speech audio for matching later.
S040: acquire the speech audio of the confusion words and generate the error acoustic model from it.
Specifically, the speech audio of each confusion word is acquired; as before, because of the user's gender, age, region, dialect accent, intonation and so on, several speech audios may correspond to the same confusion word. The error acoustic model is generated from this audio and classified by the corresponding error-prone character; when an error-prone character has many confusion words, or a confusion word has many speech audios, a further classification by confusion word is made under that error-prone character.
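The S010 to S040 pipeline can be condensed into one table-building pass. This is a minimal sketch: pinyin strings stand in for recorded speech audio, and the example entry (银行 read as yin2hang2, misread as yin2xing2) is an illustrative choice, not from the patent.

```python
def build_acoustic_tables(entries):
    """entries: {target_word: (standard_reading, [confused_readings])}.
    Builds the standard model, the error model, and the S100 mapping table."""
    standard, error, mapping = {}, {}, {}
    for word, (std_reading, confused) in entries.items():
        standard[word] = [std_reading]       # standard acoustic model entry
        error[word] = list(confused)         # error acoustic model entries
        for reading in confused:
            mapping[(word, reading)] = std_reading  # error -> standard
    return standard, error, mapping
```

The returned mapping is exactly the structure the matching and correction steps consult at runtime.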
S100: establish a mapping table between the standard acoustic model and the error acoustic model for the error-prone characters.
S200: acquire the user's voice input.
S300: recognize the voice input and, when it contains an error-prone character, extract the audio segment corresponding to the word containing that character.
S400: when the audio segment matches a speech audio in the error acoustic model, prompt the user that the error-prone character was mispronounced and output the corresponding speech audio from the standard acoustic model according to the mapping table.
S500, when the audio frequency segment is matched with the voice audio frequency matching result in the standard acoustic model, prompting the user that the pronunciation of the error-prone character is correct.
Specifically, the speech audio of the pronunciation error-prone character in both the standard acoustic model and the error acoustic model is found according to the pronunciation error-prone character corresponding to the extracted audio segment, and the extracted audio segment is then matched against the found speech audio one by one.
If the matched speech audio belongs to the standard acoustic model, the part of the user voice information concerning the pronunciation error-prone character is pronounced correctly. It must still be determined whether the matched speech audio is the standard Mandarin pronunciation audio: if so, the user's pronunciation is both correct and standard; if not, the user is prompted that the pronunciation error-prone character is pronounced correctly but not in standard form, and is asked whether the standard Mandarin pronunciation audio should be output.
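The matching decision described above (steps S400/S500, with the no-match case of the following embodiment as a third outcome) can be sketched in Python. This is a minimal illustration only: the feature-vector representation, the cosine-similarity measure, the threshold value and all names are assumptions for demonstration, not the patent's actual implementation.

```python
def match_segment(segment, standard_audio, error_audio, threshold=0.8):
    """Match an extracted audio segment (here a toy feature vector) against
    the speech audio stored for one pronunciation error-prone character.
    Returns "standard", "error", or "no_match"."""
    def similarity(a, b):
        # toy cosine similarity over fixed-length feature vectors
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    best_label, best_score = "no_match", threshold
    for label, bank in (("standard", standard_audio), ("error", error_audio)):
        for ref in bank:
            score = similarity(segment, ref)
            if score > best_score:
                best_label, best_score = label, score
    return best_label
```

A "standard" result triggers the correct-pronunciation prompt (S500), an "error" result triggers the error prompt plus playback of the mapped standard audio (S400), and "no_match" falls through to the recognition-text handling (S600).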
In this embodiment, a standard acoustic model of the correct pronunciations of the pronunciation error-prone characters is generated, together with an error acoustic model of the confusable mispronunciations formed from other readings of polyphones and from shape-similar characters. A mapping table of the correspondence between the two acoustic models is established, so that after the user voice information is subsequently recognized and matched, the places where pronunciation errors occur can be accurately identified and corrected.
A third embodiment of the present invention is a preferable embodiment of the first embodiment, and as shown in fig. 3, the third embodiment includes:
S100, establishing a mapping table between a standard acoustic model and an error acoustic model corresponding to the pronunciation error-prone characters.
S200, acquiring user voice information.
S300, recognizing the user voice information, and when the voice information contains a pronunciation error-prone character, extracting the audio segments corresponding to the words containing the pronunciation error-prone character from the user voice information.
S400, when the audio segment matches a speech audio in the error acoustic model, prompting the user that the pronunciation error-prone character is mispronounced, and outputting the corresponding speech audio from the standard acoustic model according to the mapping table.
S600, when the audio segment matches no speech audio in the acoustic models, converting the audio segment into a recognition text, where the acoustic models comprise the standard acoustic model and the error acoustic model.
Specifically, the speech audio of the pronunciation error-prone character in the acoustic models, namely the standard acoustic model and the error acoustic model, is found according to the pronunciation error-prone character corresponding to the extracted audio segment, and the extracted audio segment is then matched against the found speech audio one by one.
If the audio segment matches none of the speech audio in the acoustic models, the pronunciation corresponding to the audio segment is not yet included in them, so the target words, the confusing words and the acoustic models should be updated according to the audio segment. The audio segment is recognized and converted into a recognition text.
S610, if the target words contain the recognition text, judging whether the pronunciation of the audio segment is correct; if so, updating the standard acoustic model according to the audio segment; otherwise, updating the error acoustic model according to the audio segment, prompting the user that the pronunciation error-prone character is mispronounced, and outputting the corresponding speech audio from the standard acoustic model according to the mapping table.
Specifically, the recognition text is matched against the target words. If it is consistent with a certain target word, whether the pronunciation of the audio segment is correct is judged. A correct pronunciation may be absent from the standard acoustic model simply because the accent is uncommon, so the standard acoustic model is updated according to the audio segment, and the user is then prompted that the pronunciation error-prone character is pronounced correctly but not in standard form, and asked whether the standard Mandarin pronunciation audio should be output.
If the pronunciation of the audio segment is incorrect, the audio segment is a mispronunciation of a target word that is not yet included in the error acoustic model, so the error acoustic model is updated according to the audio segment; the user is then prompted that the pronunciation error-prone character is mispronounced, and the corresponding standard Mandarin pronunciation audio is output to correct the user.
S620, if the target words do not contain the recognition text, updating the target words according to the recognition text, and updating the acoustic models according to the audio segment.
Specifically, the recognition text is matched against the target words. If it matches none of them, the recognition text has not yet been included as a target word, so the target words are updated according to the recognition text, and the acoustic models are then updated according to the audio segment.
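The S610/S620 branching just described can be summarized in a short sketch. All names here (`handle_unmatched`, `pronunciation_ok`, the dict-based model stores) are hypothetical; in particular, the pronunciation-correctness judgement is passed in as a flag rather than computed, since the patent does not specify how it is made.

```python
def handle_unmatched(segment, recognition_text, pronunciation_ok,
                     target_words, standard_model, error_model):
    """Branch logic for an audio segment that matched no stored speech audio.
    target_words: set of known target words (mutated in place).
    standard_model / error_model: dicts mapping text -> list of audio segments."""
    if recognition_text in target_words:
        # S610: known word, decide by pronunciation correctness
        if pronunciation_ok:
            standard_model.setdefault(recognition_text, []).append(segment)
            return "prompt_correct_but_nonstandard"
        error_model.setdefault(recognition_text, []).append(segment)
        return "prompt_error_and_play_standard"
    # S620: unseen word - register it, then update the matching model
    target_words.add(recognition_text)
    if pronunciation_ok:
        standard_model.setdefault(recognition_text, []).append(segment)
    else:
        error_model.setdefault(recognition_text, []).append(segment)
    return "models_updated_for_new_word"
```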
Step S620, in which the target words are updated according to the recognition text and the acoustic models are updated according to the audio segment when the target words do not contain the recognition text, specifically includes:
S621, when the target words do not contain the recognition text, updating the target words according to the recognition text.
S622, if the pronunciation of the audio segment is correct, updating the standard acoustic model according to the audio segment, updating the confusing words according to the updated target words, and then updating the error acoustic model according to the speech audio of the updated confusing words.
Specifically, after the target words are updated according to the recognition text, the pronunciation of the audio segment is judged. If it is correct, the standard acoustic model is updated according to the audio segment, and further pronunciation audio of the recognition text may also be obtained to update the standard acoustic model.
The pronunciation error-prone characters in the updated target words are then replaced with the corresponding pronunciation-confusable characters to update the confusing words, the speech audio of the updated confusing words is re-acquired, and the error acoustic model is updated according to the re-acquired speech audio.
S623, if the pronunciation of the audio segment is incorrect, acquiring the correct speech audio of the recognition text, updating the standard acoustic model according to the correct speech audio, updating the confusing words according to the updated target words, and then updating the error acoustic model according to the speech audio of the updated confusing words.
Specifically, after the target words are updated according to the recognition text, the pronunciation of the audio segment is judged. If it is incorrect, the correct speech audio corresponding to the recognition text is obtained, including the standard Mandarin pronunciation audio and the pronunciation audio of other dialects and intonations, and the standard acoustic model is updated according to the obtained correct speech audio.
The pronunciation error-prone characters in the updated target words are then replaced with the corresponding pronunciation-confusable characters to update the confusing words, the speech audio of the updated confusing words is re-acquired, and the error acoustic model is updated according to the re-acquired speech audio.
In this embodiment, when the extracted audio segment matches no speech audio in the acoustic models, the audio segment is converted into a recognition text; the recognition text and the audio segment are then analyzed, the different cases are distinguished, the user's pronunciation is judged quickly and accurately, and the corresponding processing is performed.
A fourth embodiment of the present invention, as shown in fig. 4, is a system 1000 for correcting pronunciation errors in speech recognition, comprising:
The mapping table establishing module 1100 is configured to establish a mapping table between a standard acoustic model and an error acoustic model corresponding to the pronunciation error-prone characters.
Specifically, the mapping table establishing module 1100 takes polyphones and Chinese characters that have several shape-similar characters as pronunciation error-prone characters, builds the standard acoustic model from their correct pronunciations and the error acoustic model from the confusable pronunciations of the polyphones and shape-similar characters, and then establishes a mapping table of the correspondence between the two acoustic models, so that the correct pronunciation can be found to correct an erroneous one.
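A minimal illustration of how such a mapping table might be laid out, assuming each entry in the error acoustic model is keyed back to a standard-pronunciation audio; the identifiers are placeholders, not the patent's data format.

```python
# For each pronunciation error-prone character, every speech audio in the
# error acoustic model points back to the correct speech audio in the
# standard acoustic model (several errors may map to one correction).
mapping_table = {
    "char_A": {
        "error_audio_1": "standard_audio_1",
        "error_audio_2": "standard_audio_1",
    },
}

def correction_for(char, matched_error_audio):
    """Look up the standard pronunciation to play after an error match."""
    return mapping_table[char][matched_error_audio]
```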
The obtaining module 1200 obtains the user voice information.
Specifically, the obtaining module 1200 obtains the user voice information, which may be input by the user in real time, for example when the user reads aloud with the smart terminal and wants pronunciation correction, or when a pupil's learning of new vocabulary needs to be checked through the smart terminal. It may also be pre-recorded audio, for example when checking whether the pronunciation in audio recorded by a student is accurate. Since a prompt correction is issued whenever the system detects a wrong pronunciation, the user can choose autonomously whether the system, or a single application, enables the pronunciation-correction function.
The extracting module 1300 is configured to recognize the user voice information acquired by the acquiring module 1200, and when the voice information contains a pronunciation error-prone character, extract the audio segments corresponding to the words containing the pronunciation error-prone character from the user voice information.
Specifically, the extracting module 1300 converts the acquired user voice information into text information and checks whether the text information contains any pronunciation error-prone characters. If it does, the module analyzes the sentence components and the parts of speech of the words in the text information with a word-segmentation technique, marks the words containing the pronunciation error-prone characters, and finally extracts from the user voice information the one or more audio segments corresponding to the marked words.
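A sketch of this extraction step, assuming the recognizer and word segmenter already supply word boundaries as offsets into the audio; the character set, data shapes and function name are illustrative assumptions only.

```python
ERROR_PRONE = {"载", "轴"}  # illustrative pronunciation error-prone characters

def extract_segments(words, audio):
    """words: list of (word, start, end) tuples from ASR + word segmentation,
    with start/end as indices into the audio. audio: the full utterance,
    here simply a list of samples. Returns the (word, audio slice) pairs
    for words containing an error-prone character."""
    segments = []
    for word, start, end in words:
        if any(ch in ERROR_PRONE for ch in word):
            segments.append((word, audio[start:end]))
    return segments
```

Only these slices are passed on to the matching stage, which is what shortens the speech to be matched, as the surrounding text explains.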
The processing module 1400, when the audio segment extracted by the extraction module 1300 matches a speech audio in the error acoustic model in the mapping table establishing module 1100, prompts the user that the pronunciation error-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishing module 1100.
Specifically, the processing module 1400 finds the speech audio of the pronunciation error-prone character in both the standard acoustic model and the error acoustic model according to the pronunciation error-prone character corresponding to the extracted audio segment, and then matches the extracted audio segment against the found speech audio one by one. If the matched speech audio belongs to the error acoustic model, the pronunciation error-prone character is mispronounced in that part of the user voice information, so the user is prompted that the pronunciation error-prone character is mispronounced.
Since the same pronunciation error-prone character may appear several times in the user voice information but is not mispronounced everywhere, when the processing module 1400 prompts a pronunciation error it must state the specific position clearly, using the context of the pronunciation error-prone character in the user voice information.
In addition, the correct voice audio corresponding to the word formed by the pronouncing error-prone character at the position is found through a mapping table between the standard acoustic model and the error acoustic model, and the correct voice audio is output to the user, so that the user can correct pronouncing conveniently.
In this embodiment, after the user voice information is acquired, the pronunciation error-prone characters contained in it are first recognized, and the audio segments corresponding to the words containing those characters are then extracted and matched against the speech audio in the standard acoustic model and the error acoustic model. Only the relevant audio segments are extracted for matching, which shortens the speech that must be matched, lowers the demands on the system's matching capacity, speeds up matching, and improves matching accuracy.
A fifth embodiment of the present invention is a preferable embodiment of the fourth embodiment, and as shown in fig. 5, the fifth embodiment includes:
the error-prone character acquisition module 1500 acquires the pronunciation error-prone character.
The target word generation module 1600 generates a target word according to the pronunciation error-prone character acquired by the error-prone character acquisition module 1500.
Specifically, the error-prone character acquisition module 1500 takes polyphones and Chinese characters that have several shape-similar characters as pronunciation error-prone characters, and the target word generation module 1600 generates target words from them. A target word is a word containing a pronunciation error-prone character whose reading is easily confused, because the character is a polyphone or has several shape-similar characters; the original gives Chinese example words rendered literally here as "shen zi", "download" and "pressure axis". The mispronunciation of some words, rendered here as "stubborn" and "frame", does not hinder daily communication but is nonetheless wrong and should be corrected, especially for elementary school students who are just learning new words: correct pronunciation should be formed from the start to avoid later correction.
The audio obtaining module 1700 obtains the voice audio of the target word generated by the target word generating module 1600.
Specifically, the audio obtaining module 1700 obtains the speech audio of each target word; each pronunciation error-prone character may correspond to one or more speech audios. For every target word, the audio of the standard Mandarin pronunciation is obtained, and if the user's pronunciation later needs to be corrected, the system prefers the standard Mandarin pronunciation audio.
However, because of the user's gender, age, region, dialect and accent, a pronunciation may be correct yet fail to match the standard Mandarin pronunciation audio. Therefore, besides the standard Mandarin pronunciation audio, each target word should also gather, as far as possible, the correctly pronounced audio of people of different ages and regions, and the user can choose autonomously whether to be prompted and corrected when a pronunciation is correct but not standard.
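The stated preference — standard Mandarin audio is used for corrections, while dialect and regional variants are kept only for matching — might be selected as follows. The variant labels and function name are assumptions for illustration.

```python
def pick_correction_audio(candidates):
    """candidates: list of (audio_id, variant) pairs for one target word,
    e.g. variant in {"standard_mandarin", "dialect", "regional"}.
    Returns the standard Mandarin audio if present, else the first
    candidate, else None."""
    for audio_id, variant in candidates:
        if variant == "standard_mandarin":
            return audio_id
    return candidates[0][0] if candidates else None
```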
An acoustic model generating module 1800, configured to generate the standard acoustic model according to the speech audio of the target word acquired by the audio acquiring module 1700.
Specifically, the acoustic model generation module 1800 generates the standard acoustic model from the speech audio of the target words. The standard acoustic model is classified according to the pronunciation error-prone characters; when many target words are generated for one pronunciation error-prone character, or many speech audios correspond to one target word, the model can be sub-classified by target word under each pronunciation error-prone character.
The confusing character acquiring module 1850 acquires the pronunciation-confusable characters of the pronunciation error-prone characters acquired by the error-prone character acquisition module 1500.
The confusing word generating module 1900 replaces the pronunciation error-prone characters in the target words generated by the target word generation module 1600 with the pronunciation-confusable characters acquired by the confusing character acquiring module 1850 to generate the confusing words.
Specifically, the confusing character acquiring module 1850 acquires the pronunciation-confusable characters of a pronunciation error-prone character, namely another reading of the polyphone or a shape-similar character; the confusing word generating module 1900 then replaces the pronunciation error-prone character in the target word with each pronunciation-confusable character to generate the confusing words, that is, it combines another reading of the polyphone, or a shape-similar character, with the rest of the target word.
The confusing words correspond to the speech a user may wrongly produce when reading the target words. They need not follow word-formation rules and may not exist as actual words; they are used only to obtain speech audio for subsequent matching.
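Confusing-word generation as described — replacing the error-prone character in a target word with each confusable character — can be sketched in one line. The example characters below are shape-similar variants chosen purely for illustration.

```python
def generate_confusing_words(target_word, error_prone_char, confusable_chars):
    """Replace the error-prone character with each confusable character
    (another-reading marker or shape-similar character). The results need
    not be real words; they exist only to collect matching speech audio."""
    return [target_word.replace(error_prone_char, c) for c in confusable_chars]
```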
The audio obtaining module 1700 obtains the speech audio of the confusing word generated by the confusing word generating module 1900.
The acoustic model generation module 1800 generates the error acoustic model according to the speech audio of the confusing word acquired by the audio acquisition module 1700.
Specifically, the audio obtaining module 1700 obtains the speech audio of each confusing word; as before, several speech audios may correspond to the same confusing word because of the user's gender, age, region, dialect and accent. The acoustic model generation module 1800 generates the error acoustic model from the speech audio of the confusing words and classifies it according to the corresponding pronunciation error-prone characters; when many confusing words correspond to one pronunciation error-prone character, or many speech audios correspond to one confusing word, the model can be sub-classified by confusing word under each pronunciation error-prone character.
The mapping table establishing module 1100 establishes the mapping table between the standard acoustic model and the error acoustic model corresponding to the pronunciation error-prone characters according to the standard acoustic model and the error acoustic model generated by the acoustic model generation module 1800.
The obtaining module 1200 obtains the user voice information.
The extracting module 1300 is configured to identify the user voice information acquired by the acquiring module 1200, and when the voice information includes the pronunciation-prone character, extract an audio segment corresponding to a word including the pronunciation-prone character in the user voice information.
The processing module 1400, when the audio segment extracted by the extraction module 1300 matches a speech audio in the error acoustic model in the mapping table establishing module 1100, prompts the user that the pronunciation error-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishing module 1100.
The processing module 1400 prompts the user that the pronunciation error-prone character is pronounced correctly when the audio segment extracted by the extracting module 1300 matches a speech audio in the standard acoustic model in the mapping table establishing module 1100.
Specifically, the processing module 1400 finds the speech audio of the pronunciation error-prone character in both the standard acoustic model and the error acoustic model according to the pronunciation error-prone character corresponding to the extracted audio segment, and then matches the extracted audio segment against the found speech audio one by one.
If the matched speech audio belongs to the standard acoustic model, the part of the user voice information concerning the pronunciation error-prone character is pronounced correctly, but it must still be determined whether the matched speech audio is the standard Mandarin pronunciation audio. If so, the user's pronunciation is both correct and standard; if not, the user is prompted that the pronunciation error-prone character is pronounced correctly but not in standard form, and is asked whether the standard Mandarin pronunciation audio should be output.
In this embodiment, a standard acoustic model of the correct pronunciations of the pronunciation error-prone characters is generated, together with an error acoustic model of the confusable mispronunciations formed from other readings of polyphones and from shape-similar characters. A mapping table of the correspondence between the two acoustic models is established, so that after the user voice information is subsequently recognized and matched, the places where pronunciation errors occur can be accurately identified and corrected.
A sixth embodiment of the present invention is a preferable embodiment of the fourth embodiment, and as shown in fig. 6, the sixth embodiment includes:
The mapping table establishing module 1100 is configured to establish a mapping table between a standard acoustic model and an error acoustic model corresponding to the pronunciation error-prone characters.
The obtaining module 1200 obtains the user voice information.
The extracting module 1300 is configured to identify the user voice information acquired by the acquiring module 1200, and when the voice information includes the pronunciation-prone character, extract an audio segment corresponding to a word including the pronunciation-prone character in the user voice information.
The processing module 1400, when the audio segment extracted by the extraction module 1300 matches a speech audio in the error acoustic model in the mapping table establishing module 1100, prompts the user that the pronunciation error-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table established by the mapping table establishing module 1100.
The processing module 1400, when the audio segment extracted by the extracting module 1300 matches no speech audio in the acoustic models in the mapping table establishing module 1100, converts the audio segment into a recognition text, where the acoustic models comprise the standard acoustic model and the error acoustic model.
Specifically, the processing module 1400 finds the speech audio of the pronunciation error-prone character in the acoustic models, namely the standard acoustic model and the error acoustic model, according to the pronunciation error-prone character corresponding to the extracted audio segment, and then matches the extracted audio segment against the found speech audio one by one.
If the audio segment matches none of the speech audio in the acoustic models, the pronunciation corresponding to the audio segment is not yet included in them, so the target words, the confusing words and the acoustic models should be updated according to the audio segment. The audio segment is recognized and converted into a recognition text.
The control module 1950 judges whether the pronunciation of the audio segment is correct if the target words contain the recognition text converted by the processing module 1400; if so, it updates the standard acoustic model according to the audio segment; otherwise, it updates the error acoustic model according to the audio segment, prompts the user that the pronunciation error-prone character is mispronounced, and outputs the corresponding speech audio from the standard acoustic model according to the mapping table.
Specifically, the recognition text is matched against the target words. If it is consistent with a certain target word, whether the pronunciation of the audio segment is correct is judged. A correct pronunciation may be absent from the standard acoustic model simply because the accent is uncommon, so the standard acoustic model is updated according to the audio segment, and the user is then prompted that the pronunciation error-prone character is pronounced correctly but not in standard form, and asked whether the standard Mandarin pronunciation audio should be output.
If the pronunciation of the audio segment is incorrect, the audio segment is a mispronunciation of a target word that is not yet included in the error acoustic model, so the error acoustic model is updated according to the audio segment; the user is then prompted that the pronunciation error-prone character is mispronounced, and the corresponding standard Mandarin pronunciation audio is output to correct the user.
The control module 1950, if the target words do not contain the recognition text converted by the processing module 1400, updates the target words according to the recognition text and updates the acoustic models according to the audio segment.
Specifically, the recognition text is matched against the target words. If it matches none of them, the recognition text has not yet been included as a target word, so the target words are updated according to the recognition text, and the acoustic models are then updated according to the audio segment.
The control module 1950 specifically includes:
The target word updating unit 1951 updates the target words according to the recognition text when the target words do not contain the recognition text converted by the processing module 1400.
The control unit 1952, if the audio segment extracted by the extracting module 1300 is pronounced correctly, updates the standard acoustic model according to the audio segment, updates the confusing words according to the target words updated by the target word updating unit 1951, and then updates the error acoustic model according to the speech audio of the updated confusing words.
Specifically, after the target words are updated according to the recognition text, the control unit 1952 judges the pronunciation of the audio segment. If it is correct, the standard acoustic model is updated according to the audio segment, and further pronunciation audio of the recognition text may also be obtained to update the standard acoustic model.
The pronunciation error-prone characters in the updated target words are then replaced with the corresponding pronunciation-confusable characters to update the confusing words, the speech audio of the updated confusing words is re-acquired, and the error acoustic model is updated according to the re-acquired speech audio.
The control unit 1952, if the audio segment extracted by the extraction module 1300 is mispronounced, obtains the correct speech audio of the recognition text, updates the standard acoustic model according to the correct speech audio, updates the confusing words according to the target words updated by the target word updating unit 1951, and then updates the error acoustic model according to the speech audio of the updated confusing words.
Specifically, after the target words are updated according to the recognition text, the control unit 1952 judges the pronunciation of the audio segment. If it is incorrect, the correct speech audio corresponding to the recognition text is obtained, including the standard Mandarin pronunciation audio and the pronunciation audio of other dialects and intonations, and the standard acoustic model is updated according to the obtained correct speech audio.
The pronunciation error-prone characters in the updated target words are then replaced with the corresponding pronunciation-confusable characters to update the confusing words, the speech audio of the updated confusing words is re-acquired, and the error acoustic model is updated according to the re-acquired speech audio.
In this embodiment, when the extracted audio segment matches no speech audio in the acoustic models, the audio segment is converted into a recognition text; the recognition text and the audio segment are then analyzed, the different cases are distinguished, the user's pronunciation is judged quickly and accurately, and the corresponding processing is performed.
It should be noted that the above embodiments can be freely combined as needed. The foregoing is only a preferred embodiment of the present invention; those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.

Claims (8)

1. A method for correcting pronunciation errors in speech recognition, comprising:
establishing a mapping table between a standard acoustic model and an error acoustic model corresponding to pronunciation error-prone characters;
acquiring user voice information;
recognizing the user voice information, and when the pronunciation-prone wrong character is contained in the voice information, extracting an audio segment corresponding to a word containing the pronunciation-prone wrong character in the user voice information;
when the audio frequency fragment is matched with the voice audio frequency matching result in the error acoustic model, prompting a user that the pronunciation is easy to miss, and outputting the corresponding voice audio frequency in the standard acoustic model according to the mapping table;
wherein, before establishing the mapping table between the standard acoustic model and the error acoustic model corresponding to the pronunciation error-prone word, the method further comprises: acquiring the pronouncing error-prone characters, and generating target words according to the pronouncing error-prone characters;
when the audio segment does not accord with the voice audio matching result in the acoustic model, converting the audio segment into recognition text, wherein the acoustic model comprises the standard acoustic model and the error acoustic model;
if the target words contain the recognition texts, judging whether the pronunciation of the audio clips is correct, and if so, updating the standard acoustic model according to the audio clips; otherwise, updating the error acoustic model according to the audio fragment;
and if the target word does not contain the identification text, updating the target word according to the identification text, and updating the acoustic model according to the audio fragment.
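Read as a procedure, the matching-and-prompting core of claim 1 amounts to the control flow below. This is a minimal sketch: the `matches` acoustic-matching predicate, the `extract_segment` function, and the audio placeholders are assumptions, since the claim does not specify how matching is implemented.

```python
def correct_pronunciation(voice_audio, mapping, matches, extract_segment):
    """Minimal sketch of the method of claim 1.

    `mapping` maps each error-prone character to a pair
    (standard_audio, error_audio); `matches(segment, model_audio)` is an
    assumed acoustic-matching predicate; `extract_segment` cuts out the
    audio of the word containing the error-prone character.
    """
    for char, (standard_audio, error_audio) in mapping.items():
        segment = extract_segment(voice_audio, char)
        if segment is None:               # character not spoken in this utterance
            continue
        if matches(segment, error_audio):
            # Mispronunciation detected: prompt and play the standard audio.
            return ("wrong", standard_audio)
        if matches(segment, standard_audio):
            return ("correct", None)
        return ("unmatched", segment)     # handled by the model-update branch
    return ("no_error_prone_char", None)
```

With stub matching functions, the same utterance yields the "wrong", "correct", or "unmatched" outcome depending on which model the segment matches.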
2. The method for correcting pronunciation errors in speech recognition according to claim 1, wherein establishing the mapping table between the standard acoustic model and the error acoustic model corresponding to the pronunciation error-prone characters further comprises:
acquiring the speech audio of the target words, and generating the standard acoustic model according to the speech audio of the target words;
acquiring pronunciation confusion characters of the pronunciation error-prone characters, and replacing the pronunciation error-prone characters in the target words with the pronunciation confusion characters to generate confusion words;
and acquiring the speech audio of the confusion words, and generating the error acoustic model according to the speech audio of the confusion words.
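The model-construction step of claim 2 (substituting each error-prone character with its easily confused counterpart to generate confusion words, then recording both pronunciations) can be illustrated as below. The `get_audio` callback stands in for recording or synthesizing a word's speech audio, and the character substitution map is an assumed example; neither is specified by the patent.

```python
def build_models(confusion_of, target_words, get_audio):
    """Sketch of claim 2: build the standard and error acoustic models.

    `confusion_of` maps an error-prone character to its pronunciation
    confusion character; `get_audio(word)` is an assumed stand-in for
    acquiring the speech audio of a word.
    """
    standard_model, error_model = {}, {}
    for word in target_words:
        standard_model[word] = get_audio(word)         # correct pronunciation
        # Replace every error-prone character to form the confusion word.
        confusion_word = "".join(confusion_of.get(c, c) for c in word)
        error_model[word] = get_audio(confusion_word)  # typical mispronunciation
    # Mapping table: pair each word's standard audio with its error audio.
    mapping = {w: (standard_model[w], error_model[w]) for w in target_words}
    return standard_model, error_model, mapping
```

For example, with the common n/l confusion, the target word "nian" yields the confusion word "lial", and the mapping table pairs the two audio entries for later lookup.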
3. The method for correcting pronunciation errors in speech recognition according to claim 1 or 2, further comprising:
and when the audio segment matches the speech audio in the standard acoustic model, prompting the user that the pronunciation of the error-prone character is correct.
4. The method according to claim 2, wherein updating the target words according to the recognition text and updating the acoustic models according to the audio segment when the target words do not contain the recognition text specifically comprises:
when the target words do not contain the recognition text, updating the target words according to the recognition text;
if the pronunciation of the audio segment is correct, updating the standard acoustic model according to the audio segment, updating the confusion words according to the updated target words, and then updating the error acoustic model according to the speech audio of the updated confusion words;
and if the audio segment is mispronounced, acquiring the correct speech audio of the recognition text, updating the standard acoustic model according to the correct speech audio, updating the confusion words according to the updated target words, and then updating the error acoustic model according to the speech audio of the updated confusion words.
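The new-word update branch of claim 4 can be sketched as follows. The `get_audio` callback (used both for acquiring the correct speech audio of the recognition text and for the regenerated confusion word) and the substitution map are illustrative assumptions.

```python
def update_for_new_word(text, segment_audio, pronounced_correctly, get_audio,
                        confusion_of, target_words, standard_model, error_model):
    """Sketch of claim 4: the recognition text is a new target word."""
    target_words.add(text)                            # update the target words
    if pronounced_correctly:
        standard_model[text] = segment_audio          # reuse the user's own audio
    else:
        standard_model[text] = get_audio(text)        # acquire the correct audio
    # Regenerate the confusion word and refresh the error acoustic model
    # with the speech audio of the updated confusion word.
    confusion_word = "".join(confusion_of.get(c, c) for c in text)
    error_model[text] = get_audio(confusion_word)
    return confusion_word
```

The design point the claim makes is that a mispronounced new word must not be learned as-is: the standard model is refreshed from independently acquired correct audio, while the user's faulty segment never enters it.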
5. A system for correcting pronunciation errors in speech recognition, comprising:
a mapping table establishing module, configured to establish a mapping table between a standard acoustic model and an error acoustic model corresponding to pronunciation error-prone characters;
an acquisition module, configured to acquire user voice information;
an extraction module, configured to recognize the user voice information acquired by the acquisition module, and when the voice information contains a pronunciation error-prone character, extract the audio segment corresponding to the word containing the pronunciation error-prone character from the user voice information;
a processing module, configured to prompt the user that the error-prone character is mispronounced when the audio segment extracted by the extraction module matches the speech audio in the error acoustic model in the mapping table establishing module, and to output the corresponding speech audio in the standard acoustic model according to the mapping table established by the mapping table establishing module;
an error-prone character acquisition module, configured to acquire the pronunciation error-prone characters before the mapping table establishing module establishes the mapping table between the standard acoustic model and the error acoustic model corresponding to the pronunciation error-prone characters;
a target word generation module, configured to generate target words according to the pronunciation error-prone characters acquired by the error-prone character acquisition module;
the processing module being further configured to convert the audio segment into a recognition text when the audio segment extracted by the extraction module matches the speech audio in neither acoustic model in the mapping table establishing module, wherein the acoustic models comprise the standard acoustic model and the error acoustic model;
and a control module, configured to judge, if the target words contain the recognition text converted by the processing module, whether the pronunciation of the audio segment is correct, and if so, update the standard acoustic model according to the audio segment; otherwise, update the error acoustic model according to the audio segment;
wherein, if the target words do not contain the recognition text converted by the processing module, the control module updates the target words according to the recognition text and updates the acoustic models according to the audio segment.
6. The system for correcting pronunciation errors in speech recognition according to claim 5, further comprising:
an audio acquisition module, configured to acquire the speech audio of the target words generated by the target word generation module;
an acoustic model generation module, configured to generate the standard acoustic model according to the speech audio of the target words acquired by the audio acquisition module;
a confusion character acquisition module, configured to acquire the pronunciation confusion characters of the pronunciation error-prone characters acquired by the error-prone character acquisition module;
a confusion word generation module, configured to replace the pronunciation error-prone characters in the target words generated by the target word generation module with the pronunciation confusion characters acquired by the confusion character acquisition module, so as to generate confusion words;
the audio acquisition module being further configured to acquire the speech audio of the confusion words generated by the confusion word generation module;
and the acoustic model generation module being further configured to generate the error acoustic model according to the speech audio of the confusion words acquired by the audio acquisition module.
7. The system for correcting pronunciation errors in speech recognition according to claim 5 or 6, wherein:
the processing module prompts the user that the pronunciation of the error-prone character is correct when the audio segment extracted by the extraction module matches the speech audio in the standard acoustic model in the mapping table establishing module.
8. The system for correcting pronunciation errors in speech recognition according to claim 6, wherein the control module specifically comprises:
a target word updating unit, configured to update the target words according to the recognition text when the target words do not contain the recognition text converted by the processing module;
and a control unit, configured to, if the pronunciation of the audio segment extracted by the extraction module is correct, update the standard acoustic model according to the audio segment, update the confusion words according to the target words updated by the target word updating unit, and then update the error acoustic model according to the speech audio of the updated confusion words;
the control unit being further configured to, if the audio segment extracted by the extraction module is mispronounced, acquire the correct speech audio of the recognition text, update the standard acoustic model according to the correct speech audio, update the confusion words according to the target words updated by the target word updating unit, and then update the error acoustic model according to the speech audio of the updated confusion words.
CN201811239934.5A 2018-10-23 2018-10-23 Method and system for correcting pronunciation errors of voice recognition Active CN109461436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811239934.5A CN109461436B (en) 2018-10-23 2018-10-23 Method and system for correcting pronunciation errors of voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811239934.5A CN109461436B (en) 2018-10-23 2018-10-23 Method and system for correcting pronunciation errors of voice recognition

Publications (2)

Publication Number Publication Date
CN109461436A CN109461436A (en) 2019-03-12
CN109461436B true CN109461436B (en) 2020-12-15

Family

ID=65608234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811239934.5A Active CN109461436B (en) 2018-10-23 2018-10-23 Method and system for correcting pronunciation errors of voice recognition

Country Status (1)

Country Link
CN (1) CN109461436B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349576A (en) * 2019-05-16 2019-10-18 国网上海市电力公司 Power system operation instruction executing method, apparatus and system based on speech recognition
CN110415679B (en) * 2019-07-25 2021-12-17 北京百度网讯科技有限公司 Voice error correction method, device, equipment and storage medium
CN110600004A (en) * 2019-09-09 2019-12-20 腾讯科技(深圳)有限公司 Voice synthesis playing method and device and storage medium
CN111353066B (en) * 2020-02-20 2023-11-21 联想(北京)有限公司 Information processing method and electronic equipment
CN113920803B (en) * 2020-07-10 2024-05-10 上海流利说信息技术有限公司 Error feedback method, device, equipment and readable storage medium
CN112259092B (en) * 2020-10-15 2023-09-01 深圳市同行者科技有限公司 Voice broadcasting method and device and voice interaction equipment
CN112786052B (en) * 2020-12-30 2024-05-31 科大讯飞股份有限公司 Speech recognition method, electronic equipment and storage device
CN113672144A (en) * 2021-09-06 2021-11-19 北京搜狗科技发展有限公司 Data processing method and device
CN113938708B (en) * 2021-10-14 2024-04-09 咪咕文化科技有限公司 Live audio error correction method, device, computing equipment and storage medium
CN116894442B (en) * 2023-09-11 2023-12-05 临沂大学 Language translation method and system for correcting guide pronunciation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853702A (en) * 2012-12-06 2014-06-11 富士通株式会社 Device and method for correcting idiom error in linguistic data
CN105302795A (en) * 2015-11-11 2016-02-03 河海大学 Chinese text verification system and method based on Chinese vague pronunciation and voice recognition
CN105374356A (en) * 2014-08-29 2016-03-02 株式会社理光 Speech recognition method, speech assessment method, speech recognition system, and speech assessment system
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN107993653A (en) * 2017-11-30 2018-05-04 南京云游智能科技有限公司 The incorrect pronunciations of speech recognition apparatus correct update method and more new system automatically

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8494852B2 (en) * 2010-01-05 2013-07-23 Google Inc. Word-level correction of speech input



Similar Documents

Publication Publication Date Title
CN109461436B (en) Method and system for correcting pronunciation errors of voice recognition
CN108447486B (en) Voice translation method and device
KR102191425B1 (en) Apparatus and method for learning foreign language based on interactive character
US6424935B1 (en) Two-way speech recognition and dialect system
CN110648690B (en) Audio evaluation method and server
US8498857B2 (en) System and method for rapid prototyping of existing speech recognition solutions in different languages
CN109410664B (en) Pronunciation correction method and electronic equipment
US7529678B2 (en) Using a spoken utterance for disambiguation of spelling inputs into a speech recognition system
JP3520022B2 (en) Foreign language learning device, foreign language learning method and medium
CN108431883B (en) Language learning system and language learning program
KR20160122542A (en) Method and apparatus for measuring pronounciation similarity
CN108806719A (en) Interacting language learning system and its method
CN113793593B (en) Training data generation method and device suitable for speech recognition model
CN113362817A (en) Speech recognition error correction device, speech recognition error correction method, and speech recognition error correction program
CN109166569B (en) Detection method and device for phoneme mislabeling
JPH06110494A (en) Pronounciation learning device
CN113299266A (en) Data generating device, data generating method, and recording medium
Kruse et al. Alinha-pb: A phonetic aligner for brazilian portuguese
KR20150103809A (en) Method and apparatus for studying simillar pronounciation
CN111951827B (en) Continuous reading identification correction method, device, equipment and readable storage medium
KR20170056253A (en) Method of and system for scoring pronunciation of learner
KR20090109501A (en) System and Method for Rhythm Training in Language Learning
JP3378547B2 (en) Voice recognition method and apparatus
KR102405547B1 (en) Pronunciation evaluation system based on deep learning
KR20140068292A (en) Speaking Training System for Improving Fluency of Utterence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant