CN111369974B

CN111369974B - Dialect pronunciation marking method, language identification method and related device

Info

Publication number: CN111369974B
Application number: CN202010165255.9A
Authority: CN
Inventors: 王磊; 冯大航; 陈孝良; 常乐
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2024-01-19
Anticipated expiration: 2040-03-11
Also published as: CN111369974A

Abstract

The invention provides a dialect pronunciation marking method, a language identification method and a related device, wherein the dialect pronunciation marking method comprises the following steps: performing audio-text alignment on the obtained dialect training set to obtain word boundaries of each word in the dialect training set; performing voice-phoneme decoding on the dialect training set by using the mandarin voice recognition model to obtain a pronunciation phoneme sequence of each voice in the dialect training set; determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the decoded dialect training set and the word boundary of each word in the dialect training set; determining target words with multiple pronunciations according to the pronunciation phoneme sequence of each word in the dialect training set; and adding the target pronunciation of the target word into the Mandarin pronunciation dictionary to obtain the target pronunciation dictionary. The embodiment of the invention can automatically finish dialect pronunciation marking without relying on manpower, thereby saving manpower and time cost.

Description

Dialect pronunciation marking method, language identification method and related device

Technical Field

The present invention relates to the field of speech processing technologies, and in particular, to a dialect pronunciation labeling method, a language recognition method, and related devices.

Background

Language is one of the most direct and natural ways for humans to exchange information, and the dialects people speak differ from region to region. With the development of society and the popularization of artificial intelligence, speech recognition for dialects faces great challenges.

The pronunciation dictionary is the basis of speech recognition. Compared with Mandarin Chinese, some regional dialects exhibit accent variations in pronunciation; for example, the pronunciation of a Chinese character may change from the 2nd tone to the 3rd tone. A phrase may also be read continuously with sounds dropped when pronounced in a dialect; for example, the Mandarin "don't know <bu zhi dao>" becomes "<bu dao>" in the Northeastern dialect. These dialect pronunciations are not labeled in the pronunciation dictionary.

At present, dialect words are labeled with pronunciations mainly by hand, and dialect dictionaries are also constructed manually: the pronunciations of words with multiple pronunciations are summarized and added by hand. Because dialect pronunciations vary widely, purely manual labeling is time-consuming, labor-intensive, and inefficient.

Disclosure of Invention

The embodiment of the invention provides a dialect pronunciation marking method, a language identification method and a related device, which are used for solving the problems that the conventional dialect pronunciation marking method is labor-consuming and has low efficiency.

In order to solve the technical problems, the invention is realized as follows:

in a first aspect, an embodiment of the present invention provides a dialect pronunciation labeling method, including:

performing audio-text alignment on the obtained dialect training set to obtain word boundaries of each word in the dialect training set, wherein the word boundaries of each word are audio start frames and audio end frames of the word in the dialect training set, and the dialect training set comprises dialect voices and corresponding texts;

performing voice-phoneme decoding on the dialect training set by using a mandarin voice recognition model to obtain a pronunciation phoneme sequence of each voice in the dialect training set;

determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and the word boundary of each word in the dialect training set;

and labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

Optionally, after labeling the dialect pronunciation of each word in the dialect training set, the method further includes:

determining target words with a plurality of pronunciations according to the pronunciation phoneme sequence of each word in the dialect training set and combining the pronunciations marked by each word in the mandarin pronunciation dictionary;

and adding the target pronunciation of the target word into the mandarin pronunciation dictionary to obtain a target pronunciation dictionary.

Optionally, after the determining the target word with multiple pronunciations, before adding the target pronunciation of the target word to the mandarin chinese pronunciation dictionary to obtain a target pronunciation dictionary, the method further includes:

and determining target pronunciations of the target word based on the occurrence frequencies of the pronunciations of the target word in the dialect training set, wherein the target pronunciations are pronunciations of which the occurrence frequencies meet preset conditions.

Optionally, the performing audio-text alignment on the obtained dialect training set includes:

using an acoustic model as a voice recognition training model, using dialect voices and mandarin pronunciation dictionaries in the dialect training set as model inputs, using corresponding dialect words in the dialect training set as model outputs, and training to obtain a first voice recognition model;

and performing audio-text alignment on the acquired dialect training set by using the first voice recognition model.

Optionally, the performing speech-phoneme decoding on the dialect training set by using the mandarin chinese speech recognition model to obtain a pronunciation phoneme sequence of each speech in the dialect training set includes:

and performing voice-phoneme decoding on the dialect training set by using a mandarin acoustic model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each voice in the dialect training set, wherein the phoneme language model is trained by using a mandarin phoneme set, and the mandarin phoneme set comprises mandarin pronunciation phonemes.

Optionally, before the speech-to-phoneme decoding of the dialect training set using the mandarin chinese acoustic model and the phoneme language model, the method further includes:

and using a language model as a phoneme language training model, taking phonemes in the mandarin phone set and a phoneme pronunciation dictionary as model inputs, taking a corresponding pronunciation phoneme sequence conforming to the mandarin pronunciation rule as model output, and training to obtain the phoneme language model, wherein the phoneme pronunciation dictionary stores the correspondence between phonemes of words and pronunciation phoneme sequences.

In a second aspect, an embodiment of the present invention provides a method for voice recognition, including:

constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, wherein the dialect pronunciation of each word is labeled by using the dialect pronunciation labeling method described in the first aspect, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

generating a dialect voice recognition model by training with the dialect training set and the target pronunciation dictionary;

and performing dialect recognition by using the dialect voice recognition model.

In a third aspect, an embodiment of the present invention provides a dialect pronunciation marking device, including:

the alignment module is used for carrying out audio-text alignment on the acquired dialect training set to obtain word boundaries of each word in the dialect training set, wherein the word boundaries of each word are audio start frames and audio end frames of the word in the dialect training set, and the dialect training set comprises dialect voice and corresponding texts;

the decoding module is used for carrying out voice-phoneme decoding on the dialect training set by utilizing a mandarin voice recognition model to obtain a pronunciation phoneme sequence of each voice in the dialect training set;

the first determining module is used for determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and the word boundary of each word in the dialect training set;

and the labeling module is used for labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

Optionally, the dialect pronunciation marking device further includes:

the second determining module is used for determining target words with a plurality of pronunciations according to the pronunciation phoneme sequence of each word in the dialect training set and combining the pronunciations marked by each word in the mandarin pronunciation dictionary;

and the second dictionary construction module is used for adding the target pronunciation of the target word into the mandarin pronunciation dictionary to obtain a target pronunciation dictionary.

Optionally, the dialect pronunciation marking device further includes:

and the third determining module is used for determining target pronunciations of the target word based on the occurrence frequencies of the pronunciations of the target word in the dialect training set, wherein the target pronunciations are pronunciations of which the occurrence frequencies meet preset conditions.

Optionally, the alignment module includes:

the model training unit is used for using an acoustic model as a voice recognition training model, inputting dialect voice and mandarin pronunciation dictionary in the dialect training set as models, outputting corresponding dialect words in the dialect training set as models, and training to obtain a first voice recognition model;

and the alignment unit is used for carrying out audio-text alignment on the acquired dialect training set by utilizing the first voice recognition model.

Optionally, the decoding module is configured to perform speech-to-phoneme decoding on the dialect training set by using a mandarin acoustic model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each speech in the dialect training set, where the phoneme language model is obtained by training by using a mandarin phoneme set, and the mandarin phoneme set includes mandarin pronunciation phonemes.

Optionally, the dialect pronunciation marking device further includes:

the model training module is used for using a language model as a phoneme language training model, taking phonemes in the mandarin phone set and a phoneme pronunciation dictionary as model inputs, taking a corresponding pronunciation phoneme sequence conforming to the mandarin pronunciation rule as model output, and training to obtain the phoneme language model, wherein the phoneme pronunciation dictionary stores the corresponding relation between the phonemes of the word and the pronunciation phoneme sequence.

In a fourth aspect, an embodiment of the present invention provides a voice recognition apparatus, including:

the first dictionary construction module is used for constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the marked dialect training set, wherein the dialect pronunciation of each word is marked by using the dialect pronunciation marking method in the first aspect, and the target pronunciation dictionary is marked with the dialect pronunciation of each word;

the model generation module is used for generating a dialect voice recognition model by training with the dialect training set and the target pronunciation dictionary;

and the voice recognition module is used for performing dialect recognition by utilizing the dialect voice recognition model.

In a fifth aspect, an embodiment of the present invention provides a dialect pronunciation marking device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, where the computer program when executed by the processor implements the steps in the dialect pronunciation marking method described in the first aspect.

In a sixth aspect, an embodiment of the present invention provides a speech recognition device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program when executed by the processor implements the steps in the speech recognition method described in the second aspect.

In a seventh aspect, an embodiment of the present invention provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor implements the steps in the dialect pronunciation annotation method described in the first aspect above; alternatively, the computer program when executed by a processor implements the steps of the speech recognition method according to the second aspect.

In the embodiment of the invention, the word boundary of each word in the dialect training set is obtained by carrying out audio-text alignment on the dialect training set, and the correct pronunciation phoneme sequence of each voice in the dialect training set is obtained by carrying out voice-phoneme decoding on the dialect training set, so that the pronunciation phoneme sequence of each word in the dialect training set is determined according to the correct pronunciation phoneme sequence of each voice in the dialect training set and the word boundary of each word in the dialect training set, and the pronunciation marking of the dialect is completed based on the pronunciation phoneme sequence. Therefore, the automatic process can realize the pronunciation marking of the words, without relying on manpower, and saves labor and time cost. In addition, the target pronunciation dictionary marked with the dialect pronunciation can be further constructed by utilizing the dialect pronunciation marking method, and the dialect is identified by utilizing the target pronunciation dictionary, so that the accuracy of dialect identification can be improved, and the dialect identification effect is further improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a flowchart of a dialect pronunciation annotation method provided by an embodiment of the invention;

FIG. 2 is an exemplary flowchart of a dialect pronunciation annotation method provided by an embodiment of the present invention;

FIG. 3 is a flow chart of another speech recognition method provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a dialect pronunciation marking device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, FIG. 1 is a flowchart of a dialect pronunciation labeling method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step 101, performing audio-text alignment on the obtained dialect training set to obtain word boundaries of each word in the dialect training set, wherein the word boundaries of each word are audio start frames and audio end frames of the word in the dialect training set, and the dialect training set comprises dialect voice and corresponding text.

The dialect training set may comprise dialect voices and corresponding texts. The dialect voices may be obtained from prerecorded dialect voice data; to ensure a better training effect, more accurate dialect pronunciation labels, and even a more complete dialect pronunciation dictionary, a large number of dialect voices with standard pronunciation may be prerecorded. The corresponding texts may be obtained by labeling in advance the textual meaning of each dialect voice. For example, the dialect voice "dry yam <g a n sh a z ǐ>" may be pre-labeled with the corresponding text "dry yam" in the dialect training set.

It should be noted that, to target the pronunciation of one local dialect, the dialect training set may be recorded specifically for that dialect; for example, it may be a Northeastern dialect training set, a Sichuan dialect training set, or the like, so that the labeled dialect pronunciations, or the dialect pronunciation dictionary constructed from them, can be used specifically to recognize the dialect of that region. Of course, dialects of different regions may also be used together as the dialect training set so that the pronunciations of several dialects are labeled at once; a target pronunciation dictionary labeled with the pronunciations of multiple dialects can thus be constructed and used to recognize the dialects of different regions.

In this embodiment, after the dialect training set is obtained, voice alignment, that is, audio-text alignment, may be performed on it to obtain the word boundaries of each word in the dialect training set. By aligning each segment of dialect audio in the dialect training set with its corresponding text, the audio start frame and audio end frame of each word of the text within that segment of dialect audio, that is, the word boundaries of each word, can be determined. For example, the word boundaries of the word "dry yam" are the audio start frame at which "g a n" begins to be pronounced and the audio end frame at which "z ǐ" finishes being pronounced in the corresponding dialect audio.

The audio-text alignment of the dialect training set may be implemented using the Viterbi alignment algorithm, or may be performed automatically using existing alignment tools (for example, speech alignment or forced alignment tools).
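As an illustration of how Viterbi alignment can yield word boundaries, the following is a minimal sketch and not the implementation of this embodiment. It assumes that a per-frame log-likelihood for each word of the transcript is already available (in practice this would come from an acoustic model); the function name `force_align` and the toy score matrix are hypothetical.

```python
from typing import List, Tuple

def force_align(scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Align T audio frames to N transcript words in left-to-right order.

    scores[t][n] is an assumed log-likelihood that frame t belongs to word n.
    Returns one inclusive (start_frame, end_frame) pair per word.
    """
    T, N = len(scores), len(scores[0])
    if T < N:
        raise ValueError("need at least one frame per word")
    NEG = float("-inf")
    best = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = scores[0][0]  # the path must start in the first word
    for t in range(1, T):
        for n in range(N):
            stay = best[t - 1][n]                       # remain in word n
            move = best[t - 1][n - 1] if n > 0 else NEG  # advance from word n-1
            if move > stay:
                best[t][n], back[t][n] = move + scores[t][n], n - 1
            else:
                best[t][n], back[t][n] = stay + scores[t][n], n
    # Trace back from the last frame, which must belong to the last word.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    bounds = []
    for n in range(N):
        frames = [t for t, word in enumerate(path) if word == n]
        bounds.append((frames[0], frames[-1]))
    return bounds

# Toy scores for 6 frames and 3 words (invented values).
toy_scores = [
    [-0.1, -3.0, -3.0],
    [-0.2, -3.0, -3.0],
    [-3.0, -0.1, -3.0],
    [-3.0, -0.2, -3.0],
    [-3.0, -0.1, -3.0],
    [-3.0, -3.0, -0.1],
]
boundaries = force_align(toy_scores)
```

Each returned pair plays the role of the audio start frame and audio end frame of one word.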

In an alternative embodiment, the audio-text alignment of the dialect training set may be achieved by training a speech recognition model, and by continuously optimizing the trained speech recognition model, a better or even optimal audio-text alignment result may be ensured.

Specifically, an acoustic model may be selected as the speech recognition training model; for example, mainstream acoustic models are built using hidden Markov models, neural network models, and the like. The dialect voices in the dialect training set and a Mandarin pronunciation dictionary are then used as model inputs, the corresponding dialect words in the dialect training set are used as model outputs, and the speech recognition training model is trained to obtain a first speech recognition model capable of recognizing the dialect. The Mandarin pronunciation dictionary may be an existing general-purpose Mandarin pronunciation dictionary, which stores the correspondence between words and Mandarin pronunciation phonemes.

In the training process, the speech recognition training model may first be used to recognize the pronunciation phonemes of a dialect voice; the words corresponding to those pronunciation phonemes are then obtained by searching the Mandarin pronunciation dictionary; finally, it is verified whether those words are consistent with the corresponding dialect words in the dialect training set. If not, the structural parameters of the speech recognition training model may be readjusted and the training process repeated until the words output by the model are consistent with the dialect words in the dialect training set almost every time; the model structure finally determined is the trained first speech recognition model. To obtain a speech recognition model that is as good as possible, the structural parameters of the training model may be continuously optimized using an optimization function during training; alternatively, when the currently selected speech recognition training model is found to perform poorly, another speech recognition training model may be substituted and trained to obtain a first speech recognition model with a better recognition effect.

Then, the first speech recognition model may be used to perform audio-text alignment on the dialect training set: the dialect voices in the dialect training set are input into the first speech recognition model for speech recognition, each segment of dialect audio and its corresponding dialect words are obtained through recognition, and the word boundaries of each word in the dialect training set are obtained from the recognition results.

In this way, the training set of dialects is aligned by the trained first speech recognition model, so that not only can the audio-text alignment process be completed rapidly, but also a more accurate alignment result can be ensured.

Step 102, performing voice-phoneme decoding on the dialect training set by using a mandarin voice recognition model to obtain a pronunciation phoneme sequence of each voice in the dialect training set.

In this step, voice-phoneme decoding may be performed on the dialect training set to obtain the pronunciation phoneme sequence of each voice in the dialect training set. Specifically, each voice in the dialect training set may be decoded using the Mandarin speech recognition model to obtain its pronunciation phoneme sequence. Because every audio frame of each voice is decoded, several combinations of pronunciation phonemes may be possible for the decoded sequence; during decoding, the preceding and following voice information in the dialect training set, that is, the preceding and following audio frames, may therefore also be taken into account to determine the correct pronunciation phoneme sequence corresponding to each voice.

For example, suppose decoding with the Mandarin speech recognition model yields the candidate pronunciation phoneme sequence "g-e(i)-n", in which the middle phoneme may be "e" or "i". Considering the preceding and following audio frames, "gin" is not a possible pronunciation whereas "gen" is a correct one, so the final decoding result can be determined to be "gen".

The mandarin speech recognition model may be an existing speech recognition model for recognizing mandarin, and in order to ensure that a more accurate decoding result is obtained, a mandarin speech recognition model with a better recognition effect may be selected as far as possible, and the mandarin speech recognition model may also be obtained by training with a training algorithm of a neural network, such as a gradient descent algorithm.

It should be noted that, the execution timing of the step 101 and the step 102 is not limited, and may be executed in parallel, or may be executed sequentially, for example, one step is executed first, and then the other step is executed.

Optionally, the step 102 includes:

In an optional implementation of phoneme decoding, the dialect training set may be decoded using a Mandarin acoustic model in combination with a phoneme language model. Together, the two models output the most likely pronunciation phoneme sequence for each voice in the dialect training set, thereby yielding the pronunciation phoneme sequence corresponding to each voice. The Mandarin acoustic model decodes the audio features of each voice in the dialect training set, and the phoneme language model further decodes the pronunciation phoneme sequence corresponding to each voice.

In this embodiment, the phoneme language model may be trained using a Mandarin phoneme set, that is, the set of phonemes used in Mandarin pronunciation; for example, the 26 pinyin letters such as a, b, c, d, and e are phonemes of Mandarin pronunciation. The model is used to decode an input phoneme sequence into a correct pronunciation phoneme sequence, where a correct pronunciation phoneme sequence is one whose phonemes can be combined in Mandarin pronunciation; during decoding, the model corrects phoneme combinations that cannot occur.

Specifically, the phoneme language model may be obtained by using the Mandarin phoneme set as the training corpus and a language model used in language recognition, such as a weighted finite-state transducer (WFST) model, as the training model, and training that model to produce correct pronunciation phoneme combinations.

After the dialect training set is decoded to obtain the corresponding candidate pronunciation phonemes, the phoneme language model may decode the candidate pronunciation phonemes again to output a correct pronunciation phoneme sequence that conforms to the pronunciation rules of Mandarin words.

In this way, by training a phoneme language model and decoding the dialect training set with the Mandarin acoustic model in combination with the phoneme language model, the correct pronunciation phoneme sequence corresponding to each voice can be decoded quickly and accurately.
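The interplay of the two models can be illustrated with a small sketch. It is a simplification under stated assumptions, not the WFST decoder of this embodiment: the acoustic model is replaced by hand-written per-position candidate scores, the phoneme language model is replaced by a table of allowed phoneme bigrams, and the function name `decode` and all scores are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def decode(candidates: List[Dict[str, float]],
           bigram: Dict[Tuple[str, str], float]) -> List[str]:
    """Pick the best phoneme sequence from per-position candidates.

    candidates[i] maps phoneme -> assumed acoustic log-score at position i.
    bigram maps (previous, current) -> assumed language-model log-score;
    pairs missing from the table are treated as impossible in Mandarin.
    """
    NEG = -math.inf
    # paths: last phoneme -> (total score, phoneme sequence so far)
    paths = {p: (s, [p]) for p, s in candidates[0].items()}
    for cand in candidates[1:]:
        new_paths = {}
        for cur, acoustic in cand.items():
            best_score, best_seq = NEG, None
            for prev, (score, seq) in paths.items():
                total = score + bigram.get((prev, cur), NEG) + acoustic
                if total > best_score:
                    best_score, best_seq = total, seq + [cur]
            if best_seq is not None:  # drop phonemes no allowed path reaches
                new_paths[cur] = (best_score, best_seq)
        paths = new_paths
    if not paths:
        raise ValueError("no phoneme sequence is allowed by the language model")
    return max(paths.values(), key=lambda item: item[0])[1]

# The acoustic scores alone slightly prefer "i" over "e" in the middle position.
candidates = [{"g": -0.1}, {"e": -0.9, "i": -0.6}, {"n": -0.2}]
# Invented bigram table: "g" followed by "i" is absent, so "gin" is excluded.
bigram = {("g", "e"): -0.5, ("e", "n"): -0.4, ("i", "n"): -0.3}
result = decode(candidates, bigram)
```

Here the language-model constraint overrides the acoustic preference, which mirrors the "gen" versus "gin" example given earlier.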

Before the phoneme language model is used for carrying out phoneme decoding on the candidate pronunciation phonemes corresponding to each voice, the method further comprises the following steps:

In this embodiment, a Mandarin phoneme set and a phoneme pronunciation dictionary may be used to train a phoneme language model with a better phoneme decoding effect, so as to ensure the accuracy and reliability of the results obtained by decoding candidate pronunciation phonemes with the phoneme language model.

Specifically, a language model, such as a WFST model, may be selected as the phoneme language training model, and one of the three algorithms of composition, determinization, and minimization may be used as the training algorithm as required. The phonemes in the Mandarin phoneme set and the phoneme pronunciation dictionary are then used as model inputs, the corresponding pronunciation phoneme sequences conforming to the Mandarin pronunciation rules are used as model outputs, and the phoneme language training model is trained to obtain a phoneme language model capable of decoding correct pronunciation phoneme sequences. The phoneme pronunciation dictionary may be obtained by constructing the correspondence between the phonemes of words and their pronunciation phoneme sequences.

In the training process, the phonemes in the Mandarin phoneme set that make up the pronunciation of a word may be input into the phoneme language training model to obtain a phoneme combination composed of those phonemes; the pronunciation phoneme sequence of the word corresponding to that phoneme combination is then looked up in the phoneme pronunciation dictionary and output. The trained model can thus decode input pronunciation phonemes that correspond to a word and output the pronunciation phonemes of that word as a correct pronunciation phoneme sequence. For input pronunciation phonemes that do not correspond to any word, no corresponding pronunciation phoneme sequence can be found, so the combination is judged to be an incorrect pronunciation phoneme sequence and the candidate pronunciation phoneme combination is excluded.

For example, the phonemes in the Mandarin phoneme set may be "a b c d e …", and the phoneme pronunciation dictionary may include two columns: the first column holds the phonemes of each word, such as "ai bo cu di …", and the second column holds the corresponding pronunciation phoneme sequence of each word, such as "ai bo cu di …". In training, if the phonemes "a-i" are input, the training model outputs the pronunciation phoneme sequence "ai"; if the phonemes "g-e-n" are input, the training model outputs the corresponding pronunciation phoneme sequence "gen".

Therefore, by training the phoneme language model and using it to perform phoneme decoding on the candidate pronunciation phonemes corresponding to each voice, the phoneme decoding process can be completed rapidly and more accurate decoding results can be ensured.
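The lookup behaviour described for the phoneme pronunciation dictionary can be sketched as a plain mapping. This is only an illustration of accepting and rejecting phoneme combinations; it assumes a dictionary keyed by phoneme tuples, and the names `PHONEME_DICT` and `lookup` are hypothetical. The entries repeat the examples given in the text.

```python
from typing import Dict, Optional, Sequence, Tuple

# Assumed layout: phoneme combination -> pronunciation phoneme sequence.
PHONEME_DICT: Dict[Tuple[str, ...], str] = {
    ("a", "i"): "ai",
    ("b", "o"): "bo",
    ("c", "u"): "cu",
    ("d", "i"): "di",
    ("g", "e", "n"): "gen",
}

def lookup(phonemes: Sequence[str]) -> Optional[str]:
    """Return the pronunciation phoneme sequence for a phoneme combination.

    Returns None when the combination is not in the dictionary, in which
    case the candidate would be excluded as an incorrect sequence.
    """
    return PHONEME_DICT.get(tuple(phonemes))

accepted = lookup(["g", "e", "n"])
rejected = lookup(["g", "i", "n"])
```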

Step 103, determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and the word boundary of each word in the dialect training set.

After the correct pronunciation phoneme sequence of each voice and the word boundaries of each word in the dialect training set are obtained, the two may be combined to determine the pronunciation phoneme sequence of each word in the dialect training set. Specifically, the correct pronunciation phoneme sequence of each voice in the dialect training set may be segmented according to the word boundaries to obtain the pronunciation phoneme sequence of each word.

For example, suppose the pronunciation phoneme sequence of a dialect voice is "n ǐ z a i g a n sh a z ǐ". Segmenting this sequence according to the audio start frame and audio end frame of each word in the voice ("you", "in", and "dry yam") determines the pronunciation phoneme sequence of each word: that of "you" is "n ǐ", that of "in" is "z a i", and that of "dry yam" is "g a n sh a z ǐ".
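The segmentation in step 103 can be sketched as follows. The sketch assumes that the decoder also reports a frame span for every phoneme and assigns each phoneme to the word whose boundaries contain the midpoint of that span; this assignment rule, the function name `split_by_words`, and all frame numbers are assumptions made for illustration.

```python
from typing import List, Tuple

Phone = Tuple[str, int, int]  # (phoneme, start_frame, end_frame), inclusive
Word = Tuple[str, int, int]   # (word, start_frame, end_frame), inclusive

def split_by_words(phones: List[Phone],
                   words: List[Word]) -> List[Tuple[str, List[str]]]:
    """Group decoded phonemes under the word whose boundaries contain them."""
    result: List[Tuple[str, List[str]]] = [(w, []) for w, _, _ in words]
    for phoneme, p_start, p_end in phones:
        midpoint = (p_start + p_end) / 2
        for i, (_, w_start, w_end) in enumerate(words):
            if w_start <= midpoint <= w_end:
                result[i][1].append(phoneme)
                break  # each phoneme belongs to at most one word
    return result

# Invented frame spans for the example utterance.
phones = [
    ("n", 0, 3), ("ǐ", 4, 9),
    ("z", 10, 14), ("a", 15, 19), ("i", 20, 24),
    ("g", 25, 29), ("a", 30, 34), ("n", 35, 39),
    ("sh", 40, 44), ("a", 45, 49), ("z", 50, 54), ("ǐ", 55, 59),
]
words = [("you", 0, 9), ("in", 10, 24), ("dry yam", 25, 59)]
segments = split_by_words(phones, words)
```

Using the midpoint keeps a phoneme that slightly straddles a boundary from being counted twice; other assignment rules are equally possible.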

And 104, labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

After the pronunciation phoneme sequence of each word in the dialect training set is obtained, the dialect pronunciation of each word in the dialect training set can be determined based on the pronunciation phoneme sequence, and the determined dialect pronunciation of each word can be respectively marked as the dialect pronunciation of the corresponding word, so that the dialect pronunciation marking work is completed. For example, after determining that the pronunciation phoneme sequence of "dry yam" is "g-nsh-hakuh z ǐ", the pronunciation of the dialect "dry yam" can be labeled as "g-nsh-hakuh z ǐ".

Of course, the obtained pronunciation phoneme sequence of each word in the dialect training set may deviate from the actual pronunciation because of various factors, such as the dialect training set itself or the recognition and decoding models. Therefore, to reduce dialect pronunciation labeling errors as much as possible and improve labeling accuracy, the number of occurrences of each pronunciation of each word in the dialect training set may first be counted based on the obtained pronunciation phoneme sequences. Pronunciations that occur rarely are then removed, and pronunciations that occur more often are regarded as trusted pronunciations and labeled as the dialect pronunciations of the corresponding words.
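The counting step mentioned above can be sketched as follows. This is a minimal sketch, assuming that the per-word phoneme sequences are already available as (word, phoneme list) pairs; the function name `count_pronunciations` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def count_pronunciations(
    word_sequences: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Counter]:
    """Count how many times each pronunciation of each word occurs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, phonemes in word_sequences:
        # A pronunciation is stored as a space-joined phoneme string.
        counts[word][" ".join(phonemes)] += 1
    return counts


# Hypothetical per-word sequences collected over the whole training set.
observed = [
    ("dry yam", ["g", "an4", "sh", "a2", "z", "i3"]),
    ("dry yam", ["g", "an4", "sh", "a2", "z", "i3"]),
    ("dry yam", ["g", "a4", "s", "a2", "z", "i3"]),
    ("you", ["n", "i3"]),
]
for word, counter in count_pronunciations(observed).items():
    print(word, dict(counter))
```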

Optionally, the method further comprises:

In this embodiment, after the pronunciation phoneme sequence of each word in the dialect training set is obtained, the pronunciations of each word can be counted and summarized based on these sequences to determine all possible pronunciations of the word, and the target words with multiple pronunciations can then be determined from the statistical result. For example, if the decoded pronunciations of the dialect word "dry yam" are found to include "g_nsh_h_z ǐ" and "g_nsh_z ǐ", the word can be determined to be a target word with multiple pronunciations.

When the target words with multiple pronunciations are determined, the pronunciation labeled for each word in the Mandarin pronunciation dictionary can also be taken into account; that is, the target words with multiple pronunciations are determined according to both the dialect pronunciation and the Mandarin pronunciation of each word. For example, if the decoded pronunciation of the dialect word "dry yam" is "g atr nsh n z ǐ" and the Mandarin pronunciation of the word is "g atr nsh a z ǐ", the word can be determined to be a target word with multiple pronunciations.

Thus, by combining the mandarin pronunciation of each word in the dialect training set, a target word having multiple pronunciations can be more comprehensively determined.
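The determination of target words from both the dialect pronunciations and the Mandarin pronunciations can be sketched as follows. This is an illustrative sketch only; it assumes that both sources are available as mappings from a word to a set of space-joined phoneme strings, and the function name `find_target_words` and the sample pronunciations are hypothetical.

```python
from typing import Dict, Set


def find_target_words(
    dialect_prons: Dict[str, Set[str]], mandarin_dict: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return the words that have more than one distinct pronunciation
    once the decoded dialect pronunciations and the Mandarin dictionary
    pronunciations are combined."""
    targets = {}
    for word, prons in dialect_prons.items():
        combined = set(prons) | mandarin_dict.get(word, set())
        if len(combined) > 1:
            targets[word] = combined
    return targets


# Hypothetical data: one decoded dialect pronunciation per word.
dialect = {"dry yam": {"g a4 s a2 z i3"}, "you": {"n i3"}}
mandarin = {"dry yam": {"g an4 sh a2 z i3"}, "you": {"n i3"}}
print(find_target_words(dialect, mandarin))
```

In this example "dry yam" has only one decoded dialect pronunciation, but it still becomes a target word because that pronunciation differs from the one in the Mandarin dictionary.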

After a target word with multiple pronunciations is determined, the target pronunciation of the target word can be added to the Mandarin pronunciation dictionary to obtain a new pronunciation dictionary, namely the target pronunciation dictionary. The target pronunciation of the target word may be determined by screening the multiple pronunciations determined for the target word, so as to eliminate inaccurately labeled pronunciations caused by pronunciation errors in the dialect training set, model decoding errors and the like. The screening may be performed by manual verification or by elimination under a preset condition, for example by rejecting those pronunciations of the target word that occur rarely in the dialect training set and keeping only the remaining pronunciations as the target pronunciations of the word.

For example, when the pronunciation of the dialect word "dry yam" is determined to include "g a nsh a z ǐ" and this pronunciation occurs frequently in the dialect training set, the pronunciation can be added to the Mandarin pronunciation dictionary, so that in the new target pronunciation dictionary "dry yam" is a word with multiple pronunciations, including the two pronunciations "g a nsh a z ǐ" and "g a nsh b z ǐ".

In this way, the target pronunciation dictionary marked with mandarin pronunciation and dialect pronunciation can be automatically constructed by the embodiment, without relying on manpower, and the constructed target pronunciation dictionary can be used for dialect recognition.
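The construction of the target pronunciation dictionary can be sketched as follows. This is a minimal sketch, assuming that a pronunciation dictionary is a mapping from a word to a list of space-joined phoneme strings; the function name `build_target_dictionary` and the sample entries are hypothetical.

```python
from typing import Dict, List


def build_target_dictionary(
    mandarin_dict: Dict[str, List[str]], target_prons: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Copy the Mandarin dictionary and append each target pronunciation
    that is not already listed for its word."""
    target = {word: list(prons) for word, prons in mandarin_dict.items()}
    for word, prons in target_prons.items():
        entry = target.setdefault(word, [])
        for pron in prons:
            if pron not in entry:
                entry.append(pron)
    return target


# Hypothetical Mandarin entry and screened dialect pronunciation.
mandarin_dict = {"dry yam": ["g an4 sh a2 z i3"], "you": ["n i3"]}
target_prons = {"dry yam": ["g a4 s a2 z i3"]}
print(build_target_dictionary(mandarin_dict, target_prons))
```

The original Mandarin dictionary is left unmodified, so it can still be used for the alignment and decoding steps.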

Optionally, after the target word having the plurality of pronunciations is specified, before the target pronunciations of the target word are added to the mandarin pronunciation dictionary to obtain the target pronunciation dictionary, the method further includes:

In this embodiment, to ensure the accuracy of the constructed target pronunciation dictionary, after the multiple pronunciations of the target word are obtained, the target pronunciation may be determined according to the occurrence frequency of each of these pronunciations in the dialect training set. Specifically, a pronunciation whose occurrence frequency meets a preset condition may be determined as a target pronunciation. The preset condition may be that the frequency is higher than a preset frequency, that the pronunciation ranks in the top N by frequency (for example, the three most frequent pronunciations are selected), or that the pronunciation both ranks near the top and has a frequency higher than the preset frequency. The occurrence frequency may be represented by the number of occurrences. In this way, pronunciations with a lower occurrence frequency are treated as having lower confidence and are rejected, while pronunciations with a higher occurrence frequency are treated as having higher confidence and are retained.

In this embodiment, the target pronunciation of the target word is determined based on the occurrence frequency of multiple pronunciations of the target word in the dialect training set, so that the target pronunciation of the target word can be ensured to have higher reliability, and further, a target pronunciation dictionary constructed based on the target pronunciation of the target word also has higher accuracy.
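The three kinds of preset condition described in this embodiment can be sketched as follows. This is an illustrative sketch only; it assumes that the occurrence frequency is represented by the number of occurrences held in a `collections.Counter`, and the function name `select_target_pronunciations`, its parameters and the sample counts are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def select_target_pronunciations(
    counts: Counter, min_count: Optional[int] = None, top_n: Optional[int] = None
) -> List[str]:
    """Keep the pronunciations that satisfy the preset condition.

    min_count only: keep pronunciations occurring more than min_count times.
    top_n only:     keep the top_n most frequent pronunciations.
    both:           keep pronunciations that satisfy both conditions.
    """
    ranked = counts.most_common()  # sorted by occurrence count, descending
    if top_n is not None:
        ranked = ranked[:top_n]
    if min_count is not None:
        ranked = [(pron, n) for pron, n in ranked if n > min_count]
    return [pron for pron, _ in ranked]


# Hypothetical occurrence counts for one target word.
counts = Counter({"g an4 sh a2 z i3": 40, "g a4 s a2 z i3": 25, "g an sh z i": 1})
print(select_target_pronunciations(counts, min_count=5))
print(select_target_pronunciations(counts, top_n=1))
print(select_target_pronunciations(counts, min_count=30, top_n=3))
```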

According to the dialect pronunciation marking method in the embodiment, the word boundary of each word in the dialect training set is obtained by carrying out audio-text alignment on the dialect training set, and the correct pronunciation phoneme sequence of each voice in the dialect training set is obtained by carrying out voice-phoneme decoding on the dialect training set, so that the pronunciation phoneme sequence of each word in the dialect training set is determined according to the correct pronunciation phoneme sequence of each voice in the dialect training set and the word boundary of each word in the dialect training set, and pronunciation marking of the dialect is completed based on the pronunciation phoneme sequence. Therefore, the automatic process can realize the pronunciation marking of the words, without relying on manpower, and saves labor and time cost.

The implementation of the embodiment shown in fig. 1 is illustrated below with reference to fig. 2 by taking the construction of a dialect pronunciation dictionary as an example:

Step 201, training a speech recognition model by using the dialect training set and a Mandarin pronunciation dictionary, wherein the aim of the training is to obtain a speech recognition model that is as good as possible, so as to obtain the best possible alignment result.

Step 202, performing audio-text alignment on the dialect training set by using the trained voice recognition model to obtain word boundaries of each word in the dialect training set.

Step 203, constructing a phoneme language model according to the Mandarin phoneme set, wherein the phonemes in the Mandarin phoneme set are used as the training corpus and a phoneme pronunciation dictionary is used to train the phoneme language model.

Step 204, decoding the dialect training set by using the Mandarin acoustic model and the phoneme language model to obtain a pronunciation phoneme sequence of each voice in the dialect training set.

Step 205, according to the word boundary of each word in the dialect training set obtained in step 202 and the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding in step 204, obtaining the pronunciation of each dialect word in the dialect training set represented by Mandarin phonemes, namely its dialect pronunciation phoneme sequence. For example, the pronunciation of the northeast dialect word "dry yam" labeled in the original Mandarin pronunciation dictionary is "g-ansh-z ǐ", and the dialect pronunciation phoneme sequence "g-ansh-z ǐ" is obtained through decoding.

Step 206, determining from the foregoing steps that each dialect word may have multiple pronunciations, sorting these pronunciations according to their occurrence frequency in the dialect training set, selecting the pronunciations with higher frequency as target pronunciations, and adding the target pronunciations to the original Mandarin pronunciation dictionary to generate a new pronunciation dictionary for dialect recognition.
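As an illustration of the phoneme language model of step 203, the following sketch trains a bigram model over phoneme sequences. The embodiment does not specify the model order, the smoothing method or the toolkit, so the bigram order, the add-one smoothing, the class name `BigramPhonemeLM` and the sample phonemes are all assumptions made only for this example.

```python
import math
from collections import Counter
from typing import Iterable, List


class BigramPhonemeLM:
    """Bigram language model over a phoneme set with add-one smoothing."""

    def __init__(self, phone_set: Iterable[str]):
        self.vocab: List[str] = sorted(set(phone_set))
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()

    def train(self, sequences: Iterable[List[str]]) -> None:
        for seq in sequences:
            padded = ["<s>"] + list(seq) + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, prev: str, cur: str) -> float:
        # Possible next symbols are the phonemes plus the end marker.
        size = len(self.vocab) + 1
        num = self.bigram_counts[(prev, cur)] + 1
        return math.log(num / (self.context_counts[prev] + size))


# Hypothetical phoneme set and training sequences.
lm = BigramPhonemeLM(["n", "i3", "z", "ai4"])
lm.train([["n", "i3"], ["z", "ai4"], ["n", "i3"]])
print(lm.log_prob("n", "i3"), lm.log_prob("n", "ai4"))
```

Because the model is defined over phonemes rather than words, a decoder that uses it is free to output phoneme sequences that do not correspond to any Mandarin dictionary entry, which is what allows dialect pronunciations to be discovered.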

Referring to fig. 3, fig. 3 is a flowchart of a voice recognition method according to an embodiment of the present invention, as shown in fig. 3, the method includes the following steps:

Step 301, constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, wherein the dialect pronunciation of each word is labeled by using the dialect pronunciation labeling method in the method embodiment shown in fig. 1, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word.

In this embodiment, the dialect pronunciation of each word in the dialect training set, labeled by the method embodiment shown in fig. 1, may be used to construct a target pronunciation dictionary labeled with the dialect pronunciation of each word. For the specific construction manner, reference may be made to the description of the optional implementation of constructing the target pronunciation dictionary in the method embodiment shown in fig. 1, which is not repeated here.

Step 302, training and generating a dialect voice recognition model by using the dialect training set and the target pronunciation dictionary.

Dialect recognition is performed by using the constructed target pronunciation dictionary, which effectively addresses the problem that some local dialects are pronounced differently. Specifically, a large dialect training set, such as recorded dialect voices and the corresponding texts, can be obtained first; the dialect voices in the dialect training set are then used as the model input and the corresponding texts as the model output. The model to be trained can be an acoustic model commonly used in voice recognition systems, such as a hidden Markov model or a neural network model. The training process is similar to the training process of voice recognition models in the prior art; reference may be made to the description of training the voice recognition model in the foregoing embodiments. The target pronunciation dictionary is used in the training process: after the model being trained decodes the dialect voice to obtain a pronunciation phoneme sequence, the target pronunciation dictionary is queried to obtain the text corresponding to the pronunciation phoneme sequence, so that a dialect voice recognition model capable of recognizing the dialect is generated through the training.

Step 303, performing dialect recognition by using the dialect voice recognition model.

After the dialect voice recognition model is generated through training, the dialect voice recognition can be performed by utilizing the dialect voice recognition model generated through training, namely the dialect voice input by the user can be recognized, specifically, the received dialect voice is input into the model, and the model outputs the text meaning corresponding to the dialect voice.

In this way, the dialect voice recognition model is obtained based on training of the target pronunciation dictionary marked with the dialect pronunciation, so that the user dialect can be recognized more accurately, and the dialect recognition effect is improved.
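The query of the target pronunciation dictionary mentioned in the description of step 302 can be sketched as follows. This is a deliberately simplified sketch: it assumes a greedy longest-match lookup over the decoded phoneme sequence, whereas a practical recognizer would normally search over the dictionary together with a language model. The function name `phonemes_to_words`, the `<unk>` marker and the sample dictionary are hypothetical.

```python
from typing import Dict, List


def phonemes_to_words(
    phonemes: List[str], target_dict: Dict[str, List[str]]
) -> List[str]:
    """Map a phoneme sequence to words by greedy longest match
    against the pronunciations in the target pronunciation dictionary."""
    # Invert the dictionary: phoneme tuple -> word (first entry wins).
    inverse = {}
    for word, prons in target_dict.items():
        for pron in prons:
            inverse.setdefault(tuple(pron.split()), word)
    max_len = max((len(key) for key in inverse), default=0)

    words, i = [], 0
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            word = inverse.get(tuple(phonemes[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            # No pronunciation starts here; skip one phoneme.
            words.append("<unk>")
            i += 1
    return words


# Hypothetical target dictionary with a Mandarin and a dialect pronunciation.
target_dict = {
    "you": ["n i3"],
    "in": ["z ai4"],
    "dry yam": ["g an4 sh a2 z i3", "g a4 s a2 z i3"],
}
print(phonemes_to_words("n i3 z ai4 g a4 s a2 z i3".split(), target_dict))
```

Because the dialect pronunciation is listed next to the Mandarin one, both phoneme sequences resolve to the same word.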

According to the voice recognition method, the target pronunciation dictionary and the dialect training corpus obtained by the dialect pronunciation labeling method are used for training to generate the dialect voice recognition model, and the model is used for recognizing the dialect, so that the accuracy of dialect recognition can be improved, and the dialect recognition effect is further improved.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a dialect pronunciation marking device according to an embodiment of the present invention, and as shown in fig. 4, a dialect pronunciation marking device 400 includes:

an alignment module 401, configured to perform audio-text alignment on an obtained dialect training set to obtain word boundaries of each word in the dialect training set, where the word boundaries of each word are an audio start frame and an audio end frame of the word in the dialect training set, and the dialect training set includes dialect speech and a corresponding text;

a decoding module 402, configured to perform speech-to-phoneme decoding on the dialect training set by using a Mandarin speech recognition model, so as to obtain a pronunciation phoneme sequence of each speech in the dialect training set;

a first determining module 403, configured to determine a pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and a word boundary of each word in the dialect training set;

and the labeling module 404 is configured to label the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

Optionally, the dialect pronunciation annotation device 400 further includes:

Optionally, the alignment module 401 includes:

Optionally, the decoding module 402 is configured to perform speech-to-phoneme decoding on the dialect training set using a mandarin chinese speech recognition model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each speech in the dialect training set, where the phoneme language model is trained using a mandarin chinese phoneme set, and the mandarin chinese phoneme set includes mandarin chinese pronunciation phonemes.

Optionally, the dialect pronunciation annotation device 400 further includes:

The dialect pronunciation marking device 400 can implement each process in the method embodiments of fig. 1 and fig. 2, which is not described here again to avoid repetition. The dialect pronunciation marking device 400 of the embodiment of the invention performs audio-text alignment on the dialect training set to obtain the word boundary of each word in the dialect training set, and performs voice-phoneme decoding on the dialect training set to obtain the correct pronunciation phoneme sequence of each voice in the dialect training set. The pronunciation phoneme sequence of each word in the dialect training set is then determined according to the pronunciation phoneme sequence of each voice and the word boundary of each word, and the pronunciation labeling of the dialect is completed based on these sequences. The pronunciation labeling of the words is therefore carried out by an automatic process that does not rely on manual work, which saves labor and time cost.
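The cooperation of modules 401 to 404 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the alignment and decoding modules are represented by callables supplied by the caller, phoneme timing with exclusive end frames is assumed to be available, and the class name `DialectPronunciationAnnotator`, its method names and the stub data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple

Utterance = Tuple[str, str]            # (audio id, transcript), hypothetical
WordBoundary = Tuple[str, int, int]    # (word, start frame, end frame)
TimedPhoneme = Tuple[str, int, int]    # (phoneme, start frame, end frame)


@dataclass
class DialectPronunciationAnnotator:
    align: Callable[[Utterance], List[WordBoundary]]    # alignment module 401
    decode: Callable[[Utterance], List[TimedPhoneme]]   # decoding module 402

    def word_sequences(self, utt: Utterance) -> List[Tuple[str, List[str]]]:
        """First determining module 403: split phonemes by word boundary."""
        phonemes = self.decode(utt)
        return [
            (word, [p for p, s, e in phonemes if ws <= (s + e) / 2 < we])
            for word, ws, we in self.align(utt)
        ]

    def annotate(self, training_set: Iterable[Utterance]) -> Dict[str, Set[str]]:
        """Labeling module 404: collect the pronunciations seen for each word."""
        labels: Dict[str, Set[str]] = {}
        for utt in training_set:
            for word, seq in self.word_sequences(utt):
                labels.setdefault(word, set()).add(" ".join(seq))
        return labels


# Stub modules standing in for a real aligner and decoder.
annotator = DialectPronunciationAnnotator(
    align=lambda utt: [("you", 0, 12), ("dry yam", 12, 40)],
    decode=lambda utt: [
        ("n", 0, 5), ("i3", 5, 12), ("g", 12, 16), ("an4", 16, 24),
        ("sh", 24, 28), ("a2", 28, 34), ("z", 34, 36), ("i3", 36, 40),
    ],
)
print(annotator.annotate([("utt1", "transcript")]))
```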

Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention, and as shown in fig. 5, a voice recognition device 500 includes:

a first dictionary construction module 501, configured to construct a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, where the dialect pronunciation of each word is labeled by using the dialect pronunciation labeling method in the method embodiment shown in fig. 1, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

a model generation module 502, configured to train and generate a dialect speech recognition model by using the dialect training set and the target pronunciation dictionary;

and the voice recognition module 503 is configured to perform dialect recognition by using the dialect voice recognition model.

The speech recognition device 500 is capable of implementing the various processes in the method embodiment of fig. 3, and is not described here again to avoid repetition. According to the voice recognition device 500 provided by the embodiment of the invention, the target pronunciation dictionary is constructed through the dialect pronunciation obtained based on the dialect pronunciation labeling method, the constructed target pronunciation dictionary and the dialect training set are used for training to generate the dialect voice recognition model, and the model is used for recognizing the dialect, so that the accuracy of recognition of the dialect can be improved, and the recognition effect of the dialect is further improved.

The embodiment of the invention also provides a dialect pronunciation marking device, which comprises a processor, a memory and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the dialect pronunciation marking method embodiment shown in fig. 1 and can achieve the same technical effect; to avoid repetition, the description is omitted here.

The embodiment of the invention also provides a voice recognition device, which comprises a processor, a memory and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the voice recognition method embodiment shown in fig. 3 and can achieve the same technical effect; to avoid repetition, the description is omitted here.

The embodiment of the invention also provides a computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the dialect pronunciation marking method embodiment shown in fig. 1 and can achieve the same technical effect; alternatively, when executed by the processor, the computer program implements each process of the voice recognition method embodiment shown in fig. 3 and can achieve the same technical effect. To avoid repetition, the description is omitted here. The computer readable storage medium is selected from a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are to be protected by the present invention.

Claims

1. A method for labeling a pronunciation of a dialect, comprising:

labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set;

after labeling the dialect pronunciation of each word in the dialect training set, the method further comprises:

Adding target pronunciations of the target words into the Mandarin pronunciation dictionary to obtain a target pronunciation dictionary, wherein the target pronunciations of the target words are pronunciations determined after screening in a plurality of pronunciations of the target words, and the screening comprises rejecting pronunciations with lower occurrence times in the dialect training set in the plurality of pronunciations;

the method for performing voice-phoneme decoding on the dialect training set by using the mandarin voice recognition model to obtain a pronunciation phoneme sequence of each voice in the dialect training set comprises the following steps:

2. The method of claim 1, wherein after the determining the target word having the plurality of pronunciations, before adding the target pronunciation of the target word to the mandarin chinese pronunciation dictionary to obtain a target pronunciation dictionary, the method further comprises:

3. The method according to claim 1 or 2, wherein said audio-text alignment of the acquired dialect training set comprises:

4. The method of claim 1, wherein before performing speech-to-phoneme decoding on the dialect training set using the Mandarin acoustic model and the phoneme language model, the method further comprises:

5. A method of speech recognition, comprising:

Constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the annotated dialect training set, wherein the dialect pronunciation of each word is obtained by annotating by using the dialect pronunciation annotation method of any one of claims 1 to 4, and the target pronunciation dictionary is annotated with the dialect pronunciation of each word;

6. A dialect pronunciation annotation device, comprising:

The labeling module is used for labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set;

the dialect pronunciation marking device further comprises:

the second dictionary construction module is used for adding the target pronunciation of the target word into the Mandarin pronunciation dictionary to obtain a target pronunciation dictionary, wherein the target pronunciation of the target word is a pronunciation determined after screening in a plurality of pronunciations of the target word, and the screening comprises rejecting pronunciations with lower occurrence times in the dialect training set in the plurality of pronunciations;

the decoding module is configured to perform speech-to-phoneme decoding on the dialect training set by using a mandarin acoustic model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each speech in the dialect training set, where the phoneme language model is trained by using a mandarin phoneme set, and the mandarin phoneme set includes mandarin pronunciation phonemes.

7. A speech recognition apparatus, comprising:

a first dictionary construction module, configured to construct a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, where the dialect pronunciation of each word is labeled by using the dialect pronunciation labeling method of any one of claims 1 to 4, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

8. A dialect pronunciation annotation device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the dialect pronunciation annotation method as claimed in any one of claims 1 to 4.

9. A speech recognition device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor performs the steps in the speech recognition method as claimed in claim 5.

10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, which when executed by a processor, implements the steps in the dialect pronunciation annotation method as claimed in any one of claims 1 to 4; alternatively, the computer program is executed by a processor to implement the steps in the speech recognition method as claimed in claim 5.