CN111369974A

CN111369974A - Dialect pronunciation labeling method, language identification method and related device

Info

Publication number: CN111369974A
Application number: CN202010165255.9A
Authority: CN
Inventors: 王磊; 冯大航; 陈孝良; 常乐
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2020-07-03
Anticipated expiration: 2040-03-11
Also published as: CN111369974B

Abstract

The invention provides a dialect pronunciation labeling method, a language identification method and a related device, wherein the dialect pronunciation labeling method comprises the following steps: performing audio-text alignment on the obtained dialect training set to obtain a word boundary of each word in the dialect training set; performing speech-phoneme decoding on the dialect training set by using a mandarin speech recognition model to obtain a pronunciation phoneme sequence of each speech in the dialect training set; determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and the word boundary of each word in the dialect training set; determining a target word with a plurality of pronunciations according to the pronunciation phoneme sequence of each word in the dialect training set; and adding the target pronunciation of the target word into the Mandarin pronunciation dictionary to obtain a target pronunciation dictionary. The embodiment of the invention can automatically finish dialect pronunciation labeling without depending on manpower, and can save manpower and time cost.

Description

Dialect pronunciation labeling method, language identification method and related device

Technical Field

The invention relates to the technical field of voice processing, in particular to a dialect pronunciation labeling method, a language identification method and a related device.

Background

Language is one of the most direct and natural ways for humans to exchange information, and the dialects people use differ from region to region. With social development and the spread of artificial intelligence, speech recognition of dialects faces great challenges.

A pronunciation dictionary is the basis of speech recognition, but the accent of some regional dialects changes the pronunciation of words relative to Mandarin. For example, Mandarin "dry what < g a n sh a >" becomes "dry what < g a n sh a z ǐ >" in the Sichuan dialect, so that "what" goes from 2 sounds to 3 sounds. Phrases may also be run together, with sounds dropped, in dialect speech. For example, Mandarin "does not know < bu zhi dao >" becomes "does not know < bu dao >" in the Northeast dialect. These dialect pronunciations are not labeled in the pronunciation dictionary.

At present, the pronunciations of dialect words are mainly labeled manually, and dialect dictionaries are likewise built by hand: words with multiple pronunciations are collected and added manually. Because dialect pronunciations can vary widely, adding labels purely by hand is time-consuming, labor-intensive, and inefficient.

Disclosure of Invention

The embodiment of the invention provides a dialect pronunciation labeling method, a language identification method and a related device, which are used for solving the problems that the existing dialect pronunciation labeling method needs manpower, wastes time and labor and is low in efficiency.

In order to solve the technical problem, the invention is realized as follows:

in a first aspect, an embodiment of the present invention provides a dialect pronunciation labeling method, including:

performing audio-text alignment on the obtained dialect training set to obtain a word boundary of each word in the dialect training set, wherein the word boundary of each word is an audio starting frame and an audio ending frame of the word in the dialect training set, and the dialect training set comprises dialect voices and corresponding texts;

performing speech-phoneme decoding on the dialect training set by using a mandarin speech recognition model to obtain a pronunciation phoneme sequence of each speech in the dialect training set;

determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and the word boundary of each word in the dialect training set;

and labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

Optionally, after labeling the dialect pronunciation of each word in the dialect training set, the method further includes:

determining a target word with a plurality of pronunciations according to the pronunciation phoneme sequence of each word in the dialect training set and in combination with the pronunciations of each word labeled in the Mandarin pronunciation dictionary;

and adding the target pronunciation of the target word into the Mandarin pronunciation dictionary to obtain a target pronunciation dictionary.

Optionally, after determining the target word with multiple pronunciations, before adding the target pronunciation of the target word to the mandarin chinese pronunciation dictionary to obtain a target pronunciation dictionary, the method further includes:

and determining a target pronunciation of the target word based on the occurrence frequency of the pronunciations of the target word in the dialect training set, wherein the target pronunciation is a pronunciation of which the occurrence frequency meets a preset condition in the pronunciations of the target word.

Optionally, the performing audio-text alignment on the obtained dialect training set includes:

using an acoustic model as a speech recognition training model, taking the dialect speech in the dialect training set and a Mandarin pronunciation dictionary as model input, taking the corresponding dialect words in the dialect training set as model output, and training to obtain a first speech recognition model;

and carrying out audio-text alignment on the obtained dialect training set by utilizing the first speech recognition model.

Optionally, the performing speech-phoneme decoding on the dialect training set by using a mandarin chinese speech recognition model to obtain a pronunciation phoneme sequence of each speech in the dialect training set includes:

and carrying out voice-phoneme decoding on the dialect training set by utilizing a mandarin acoustic model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each voice in the dialect training set, wherein the phoneme language model is obtained by utilizing a mandarin phoneme set which comprises mandarin pronunciation phonemes.

Optionally, before performing speech-phoneme decoding on the dialect training set by using a mandarin chinese acoustic model and a phoneme language model, the method further includes:

using a language model as a phoneme language training model, taking the phonemes in the Mandarin phoneme set and a phoneme pronunciation dictionary as model input, taking the corresponding pronunciation phoneme sequence that conforms to the Mandarin pronunciation rules as model output, and training to obtain the phoneme language model, wherein the phoneme pronunciation dictionary stores the correspondence between the phonemes of words and pronunciation phoneme sequences.

In a second aspect, an embodiment of the present invention provides a speech recognition method, including:

constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, wherein the dialect pronunciation of each word is obtained by labeling by using the dialect pronunciation labeling method of the first aspect, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

training and generating a dialect voice recognition model by utilizing the dialect training set and the target pronunciation dictionary;

and carrying out dialect recognition by utilizing the dialect speech recognition model.

In a third aspect, an embodiment of the present invention provides a dialect pronunciation labeling apparatus, including:

the alignment module is used for performing audio-text alignment on the obtained dialect training set to obtain a word boundary of each word in the dialect training set, wherein the word boundary of each word is an audio starting frame and an audio ending frame of the word in the dialect training set, and the dialect training set comprises dialect voices and corresponding texts;

the decoding module is used for carrying out voice-phoneme decoding on the dialect training set by utilizing a Mandarin speech recognition model to obtain a pronunciation phoneme sequence of each voice in the dialect training set;

a first determining module, configured to determine a pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each speech in the dialect training set obtained through decoding and a word boundary of each word in the dialect training set;

and the marking module is used for marking the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

Optionally, the dialect pronunciation labeling device further includes:

the second determining module is used for determining a target word with a plurality of pronunciations according to the pronunciation phoneme sequence of each word in the dialect training set and in combination with the pronunciations of each word labeled in the Mandarin pronunciation dictionary;

and the second dictionary construction module is used for adding the target pronunciation of the target word into the Mandarin pronunciation dictionary to obtain a target pronunciation dictionary.

Optionally, the dialect pronunciation labeling device further includes:

and the third determining module is used for determining the target pronunciation of the target word based on the occurrence frequency of the pronunciations of the target word in the dialect training set, wherein the target pronunciation is the pronunciation of which the occurrence frequency meets a preset condition in the pronunciations of the target word.

Optionally, the alignment module includes:

the model training unit is used for using an acoustic model as a speech recognition training model, taking the dialect speech in the dialect training set and a Mandarin pronunciation dictionary as model input, taking the corresponding dialect words in the dialect training set as model output, and training to obtain a first speech recognition model;

and the alignment unit is used for carrying out audio-text alignment on the obtained dialect training set by utilizing the first voice recognition model.

Optionally, the decoding module is configured to perform speech-phoneme decoding on the dialect training set by using a mandarin chinese acoustic model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each piece of speech in the dialect training set, where the phoneme language model is obtained by using a mandarin chinese phoneme set, and the mandarin chinese phoneme set includes a mandarin chinese pronunciation phoneme.

Optionally, the dialect pronunciation labeling device further includes:

and the model training module is used for using a language model as a phoneme language training model, taking the phonemes in the Mandarin phoneme set and a phoneme pronunciation dictionary as model input, taking the corresponding pronunciation phoneme sequence that conforms to the Mandarin pronunciation rules as model output, and training to obtain the phoneme language model, wherein the phoneme pronunciation dictionary stores the correspondence between the phonemes of words and pronunciation phoneme sequences.

In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus, including:

a first dictionary construction module, configured to construct a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, where the dialect pronunciation of each word is obtained by labeling with the dialect pronunciation labeling method according to the first aspect, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

the model generation module is used for training and generating a dialect voice recognition model by utilizing the dialect training set and the target pronunciation dictionary;

and the speech recognition module is used for carrying out dialect recognition by utilizing the dialect speech recognition model.

In a fifth aspect, an embodiment of the present invention provides a dialect pronunciation labeling apparatus, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps in the dialect pronunciation labeling method according to the first aspect.

In a sixth aspect, an embodiment of the present invention provides a speech recognition apparatus, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps in the speech recognition method according to the second aspect.

In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements the steps in the dialect pronunciation labeling method according to the first aspect; alternatively, the computer program realizes the steps in the speech recognition method according to the second aspect when executed by a processor.

In the embodiment of the invention, audio-text alignment is performed on the dialect training set to obtain the word boundary of each word in the dialect training set, and speech-phoneme decoding is performed on the dialect training set to obtain the correct pronunciation phoneme sequence of each speech in the dialect training set. The pronunciation phoneme sequence of each word in the dialect training set is then determined from the correct pronunciation phoneme sequence of each speech and the word boundary of each word, and the pronunciation labeling of the dialect words is completed based on these pronunciation phoneme sequences. The pronunciation labeling of dialect words can therefore be carried out by an automatic process that does not depend on manual work, which saves labor and time. In addition, the dialect pronunciation labeling method can be used to construct a target pronunciation dictionary labeled with dialect pronunciations, and using this target pronunciation dictionary to recognize the dialect can improve the accuracy of dialect recognition.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a dialect pronunciation labeling method according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a dialect pronunciation labeling method according to an embodiment of the present invention;

Fig. 3 is a flowchart of another speech recognition method provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a dialect pronunciation labeling apparatus according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, which is a flowchart of a dialect pronunciation labeling method according to an embodiment of the present invention, the method includes the following steps:

Step 101, performing audio-text alignment on the obtained dialect training set to obtain a word boundary of each word in the dialect training set, where the word boundary of each word is an audio start frame and an audio end frame of the word in the dialect training set, and the dialect training set includes dialect voices and corresponding texts.

The dialect training set may include dialect speech and the corresponding texts. The dialect speech may be obtained from prerecorded dialect speech data; to ensure a better training effect, so that labeling yields more accurate dialect pronunciations and a more complete dialect pronunciation dictionary can be constructed, a large amount of dialect speech with standard pronunciation may be recorded in advance. The corresponding text may be labeled in advance according to the textual meaning of each dialect speech. For example, the Sichuan dialect speech "dry what < g a n sh a z ǐ >" may be pre-labeled with the corresponding text "dry what" in the dialect training set.

It should be noted that, to target the pronunciation of one local dialect, the corpus in the dialect training set may be recorded specifically for that dialect; for example, the dialect training set may be a Northeast dialect training set, a Sichuan dialect training set, or the like. The dialect pronunciations obtained in this way, or the dialect pronunciation dictionary constructed from them, can be used specifically to recognize the dialect of that region. Alternatively, dialects of different regions may be combined into one dialect training set so that the pronunciations of multiple dialects are labeled at one time and a target pronunciation dictionary with pronunciation labels for multiple dialects is constructed; a pronunciation dictionary constructed in this way can be used to recognize the dialects of different regions.

In this embodiment, after the dialect training set is obtained, speech alignment, that is, audio-text alignment, may be performed on it to obtain the word boundary of each word in the dialect training set. In other words, by aligning each segment of dialect audio in the dialect training set with its corresponding text, the audio start frame and audio end frame of each word of the text within that segment of dialect audio, that is, the word boundary of each word, can be determined. For example, the word boundary of the word "dry what" corresponds to the audio start frame at which "g a n" is pronounced and the audio end frame at which "z ǐ" is pronounced in the dialect audio.

When audio-text alignment is performed on the dialect training set, a Viterbi alignment algorithm may be used, or existing alignment tools (such as speech-alignment or forced-alignment tools) may be used to perform the alignment automatically.
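
As an illustration of Viterbi-style alignment, the following minimal sketch assumes that an acoustic model has already produced, for every audio frame, a log-score for each word of the transcript. The function name `align_words` and the toy scores are hypothetical and are not part of the described embodiment. The sketch finds the in-order segmentation with the highest total score and returns the audio start frame and audio end frame of each word.

```python
from typing import List, Tuple

def align_words(scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) for each word of the transcript.

    scores[t][w] is an assumed log-score that frame t belongs to word w.
    Words appear in transcript order and each word covers at least one frame.
    """
    n_frames, n_words = len(scores), len(scores[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")
    neg_inf = float("-inf")
    # best[t][w]: best total score of frames 0..t with frame t assigned to word w
    best = [[neg_inf] * n_words for _ in range(n_frames)]
    # came_from[t][w]: word assigned to frame t-1 on that best path
    came_from = [[0] * n_words for _ in range(n_frames)]
    best[0][0] = scores[0][0]
    for t in range(1, n_frames):
        for w in range(n_words):
            stay = best[t - 1][w]
            advance = best[t - 1][w - 1] if w > 0 else neg_inf
            if advance > stay:
                best[t][w], came_from[t][w] = advance + scores[t][w], w - 1
            else:
                best[t][w], came_from[t][w] = stay + scores[t][w], w
    # Backtrace from the last word at the last frame.
    path = [n_words - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(came_from[t][path[-1]])
    path.reverse()
    # Collapse the frame-level path into per-word boundaries.
    bounds = []
    for w in range(n_words):
        frames = [t for t, word in enumerate(path) if word == w]
        bounds.append((frames[0], frames[-1]))
    return bounds

# Toy example: 6 frames, 3 words (for instance "you", "in", "dry what").
toy_scores = [
    [-0.1, -3.0, -3.0],
    [-0.2, -2.5, -3.0],
    [-2.0, -0.3, -3.0],
    [-3.0, -0.2, -2.0],
    [-3.0, -2.0, -0.1],
    [-3.0, -3.0, -0.2],
]
word_boundaries = align_words(toy_scores)
```

A real aligner would work at the level of phoneme or HMM states rather than whole words; the word-level states here only keep the example short.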

In an alternative embodiment, the audio-text alignment of the dialect training set may be implemented by training a speech recognition model, and the trained speech recognition model may be continuously optimized to ensure a better or even optimal audio-text alignment result.

Specifically, an acoustic model may be selected as the speech recognition training model; mainstream acoustic models are mostly built with, for example, a hidden Markov model or a neural network model. The dialect speech in the dialect training set and a Mandarin pronunciation dictionary are then used as model input, the corresponding dialect words in the dialect training set are used as model output, and the speech recognition training model is trained to obtain a first speech recognition model capable of recognizing the dialect. The Mandarin pronunciation dictionary may be an existing general-purpose Mandarin pronunciation dictionary, which stores the correspondence between words and Mandarin pronunciation phonemes.

During training, the speech recognition training model may be used to recognize the pronunciation phonemes of the dialect speech, the Mandarin pronunciation dictionary is then searched for the word corresponding to those phonemes, and finally the word is checked against the corresponding dialect word in the dialect training set. If they are inconsistent, the structural parameters of the speech recognition training model may be readjusted and the training process repeated until the words output by the model are consistent with the dialect words in the dialect training set in almost every training pass; the model structure finally determined in this way is the first speech recognition model. To obtain the best possible speech recognition model, an optimization function may be used to keep optimizing the structural parameters of the training model during training, or, when the currently selected speech recognition training model is found to perform poorly, another speech recognition training model may be substituted so that a first speech recognition model with a better recognition effect is obtained.

Then, audio-text alignment may be performed on the dialect training set by using the first speech recognition model: the dialect speech in the dialect training set is input into the first speech recognition model for speech recognition, each segment of dialect audio and its corresponding dialect word are obtained through recognition, and the word boundary of each word in the dialect training set is obtained from the recognition result.

Therefore, the dialect training set is subjected to audio-text alignment through the trained first speech recognition model, so that the audio-text alignment process can be quickly completed, and a more accurate alignment result can be ensured.

Step 102, performing speech-phoneme decoding on the dialect training set by using a Mandarin speech recognition model to obtain a pronunciation phoneme sequence of each speech in the dialect training set.

In this step, speech-phoneme decoding may be performed on the dialect training set to obtain the pronunciation phoneme sequence of each speech in the dialect training set. Specifically, a Mandarin speech recognition model may be used to decode each speech in the dialect training set. Because every audio frame of a speech is decoded, the decoded phonemes may combine into several possible pronunciation phoneme sequences; during decoding, the correct pronunciation phoneme sequence corresponding to each speech may therefore be determined by also taking into account the preceding and following speech information, that is, the preceding and following audio frames.

For example, suppose the Mandarin speech recognition model decodes the candidate pronunciation phoneme sequence "g-e (i) -n", in which the middle phoneme may be "e" or "i". From the preceding and following audio frames it can be determined that "gin" is an impossible pronunciation and "gen" is the correct one, so the final decoding result is "gen".
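
The "gen" versus "gin" decision can be sketched as follows. The sketch assumes that the decoder exposes, for each phoneme position, a set of candidate phonemes with probabilities, and that a small inventory of legal syllables is available; `LEGAL_SYLLABLES`, `pick_syllable`, and the probabilities are hypothetical. It illustrates only the idea of discarding impossible combinations and is not the decoder of the embodiment.

```python
from itertools import product
from typing import Dict, List, Optional

# Hypothetical, deliberately tiny inventory of legal Mandarin syllables.
LEGAL_SYLLABLES = {"gen", "gan", "gai", "ni", "zai", "sha", "zi"}

def pick_syllable(candidates: List[Dict[str, float]]) -> Optional[str]:
    """Choose the best-scoring phoneme combination that forms a legal syllable.

    candidates[i] maps each phoneme hypothesised at position i to an assumed
    acoustic probability. Returns None if no combination is legal.
    """
    best_syllable, best_score = None, 0.0
    for combo in product(*(c.items() for c in candidates)):
        syllable = "".join(phoneme for phoneme, _ in combo)
        score = 1.0
        for _, prob in combo:
            score *= prob
        if syllable in LEGAL_SYLLABLES and score > best_score:
            best_syllable, best_score = syllable, score
    return best_syllable

# "i" is acoustically slightly more likely, but "gin" is not a legal syllable.
decoded = pick_syllable([{"g": 1.0}, {"e": 0.45, "i": 0.55}, {"n": 1.0}])
```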

The Mandarin speech recognition model may be an existing speech recognition model for recognizing Mandarin. To obtain a more accurate decoding result, a Mandarin speech recognition model with a good recognition effect should be selected where possible; the model may also be obtained by training a neural network with an algorithm such as gradient descent.

It should be noted that, the execution sequence of the step 101 and the step 102 is not limited, and the steps may be executed in parallel, or may be executed sequentially, for example, one step is executed first, and then the other step is executed.
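
Because neither step consumes the other's output, the two steps can be dispatched concurrently. The following sketch uses placeholder functions (`align_training_set` and `decode_training_set` are hypothetical stand-ins that return dummy data) purely to show the control flow.

```python
from concurrent.futures import ThreadPoolExecutor

def align_training_set(utterances):
    """Placeholder for step 101 (hypothetical): returns dummy word boundaries."""
    return {u: [("word", 0, 9)] for u in utterances}

def decode_training_set(utterances):
    """Placeholder for step 102 (hypothetical): returns dummy phoneme sequences."""
    return {u: ["n", "i"] for u in utterances}

def run_steps_101_and_102(utterances):
    # Both steps only read the training set, so they can run side by side.
    with ThreadPoolExecutor(max_workers=2) as pool:
        boundaries = pool.submit(align_training_set, utterances)
        phonemes = pool.submit(decode_training_set, utterances)
        return boundaries.result(), phonemes.result()
```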

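Optionally, step 102 includes:

performing speech-phoneme decoding on the dialect training set by using a Mandarin acoustic model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each speech in the dialect training set, wherein the phoneme language model is trained by using a Mandarin phoneme set that comprises the Mandarin pronunciation phonemes.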

In an alternative embodiment of phoneme decoding, the dialect training set may be decoded by combining a mandarin chinese acoustic model and a phoneme language model, and the two models may be used to output a most likely pronunciation phoneme sequence corresponding to each speech in the dialect training set, so as to obtain a pronunciation phoneme sequence corresponding to each speech in the dialect training set, where the mandarin chinese acoustic model may decode an audio feature of each speech in the dialect training set, and the phoneme language model may further decode a pronunciation phoneme sequence corresponding to each speech.

In this embodiment, the phoneme language model may be trained in advance by using a Mandarin phoneme set, that is, the set of phonemes used in Mandarin pronunciation; for example, the 26 pinyin letters such as a, b, c, d, and e are all phonemes in Mandarin pronunciation. The model decodes an input phoneme sequence and outputs a correct pronunciation phoneme sequence, where a correct pronunciation phoneme sequence is one whose phonemes can be combined into a pronunciation in Mandarin; phoneme combinations that cannot be pronounced are corrected during decoding.

Specifically, the phoneme language model may be trained with the Mandarin phoneme set as the training corpus and with a language model used in speech recognition as the training model, for example a commonly used Weighted Finite-State Transducer (WFST) model, the training objective being that the model outputs correct pronunciation phoneme combinations.

After the dialect training set is decoded into candidate pronunciation phonemes, the phoneme language model may decode the candidate pronunciation phonemes again and output a correct pronunciation phoneme sequence that conforms to the Mandarin pronunciation rules.

Therefore, by training the phoneme language model and decoding the dialect training set by combining the Mandarin Chinese acoustic model and the phoneme language model, the corresponding correct pronunciation phoneme sequence can be obtained by fast and accurately decoding.
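
The combination of an acoustic model with a phoneme language model can be sketched with a small Viterbi search. This is a simplified stand-in: the embodiment mentions a WFST-based language model, whereas the sketch below uses a plain bigram table, and all names (`decode_phonemes`, `toy_frames`, `toy_bigrams`) and scores are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def decode_phonemes(
    frame_logp: List[Dict[str, float]],
    bigram_logp: Dict[Tuple[str, str], float],
) -> List[str]:
    """Viterbi search over frames; returns the collapsed phoneme sequence.

    frame_logp[t][p]: assumed acoustic log-probability of phoneme p at frame t.
    bigram_logp[(p, q)]: assumed language model log-probability of q following
    p; pairs missing from the table are treated as impossible.
    """
    phonemes = sorted(frame_logp[0])
    best = {p: frame_logp[0][p] for p in phonemes}
    back: List[Dict[str, str]] = []
    for frame in frame_logp[1:]:
        new_best, pointers = {}, {}
        for q in phonemes:
            # Staying in the same phoneme is free; moving to a different
            # phoneme is scored by the phoneme language model.
            options = [
                (best[p] + (0.0 if p == q else bigram_logp.get((p, q), -math.inf)), p)
                for p in phonemes
            ]
            score, prev = max(options)
            new_best[q] = score + frame[q]
            pointers[q] = prev
        best = new_best
        back.append(pointers)
    # Backtrace from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Merge consecutive identical frame labels into one phoneme.
    return [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]

# Frames 1 and 2 acoustically prefer "i", but "g" followed by "i" is not allowed.
toy_frames = [
    {"g": -0.1, "e": -4.0, "i": -4.0, "n": -4.0},
    {"g": -4.0, "e": -0.9, "i": -0.6, "n": -4.0},
    {"g": -4.0, "e": -0.9, "i": -0.6, "n": -4.0},
    {"g": -4.0, "e": -4.0, "i": -4.0, "n": -0.1},
]
toy_bigrams = {("g", "e"): -0.1, ("e", "n"): -0.1, ("i", "n"): -0.1}
sequence = decode_phonemes(toy_frames, toy_bigrams)
```

Collapsing repeated frame labels also merges a phoneme that is genuinely pronounced twice in a row; a production decoder would model phoneme boundaries explicitly.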

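Before performing phoneme decoding on the candidate pronunciation phonemes corresponding to each piece of speech by using the phoneme language model, the method further includes:

using a language model as a phoneme language training model, taking the phonemes in the Mandarin phoneme set and a phoneme pronunciation dictionary as model input, taking the corresponding pronunciation phoneme sequence that conforms to the Mandarin pronunciation rules as model output, and training to obtain the phoneme language model, wherein the phoneme pronunciation dictionary stores the correspondence between the phonemes of words and pronunciation phoneme sequences.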

In this embodiment, a phone language model with a better phone decoding effect can be obtained by training the mandarin chinese phone set and the phone pronunciation dictionary, so as to ensure the accuracy and reliability of the decoding result obtained by decoding the candidate pronunciation phones by using the phone language model.

Specifically, a language model, such as a WFST model, may be selected as the phoneme language training model, and one of the three WFST operations (composition, determinization, and minimization) may be adopted as required. The phonemes in the Mandarin phoneme set and the phoneme pronunciation dictionary are then used as model input, the corresponding pronunciation phoneme sequences that conform to the Mandarin pronunciation rules are used as model output, and the phoneme language training model is trained to obtain a phoneme language model capable of decoding correct pronunciation phoneme sequences. The phoneme pronunciation dictionary may be constructed from the correspondence between the phonemes of each word and its pronunciation phoneme sequence.

In the training process, a plurality of phonemes of the pronunciation of the corresponding word in the mandarin chinese phoneme set can be input into the phoneme language training model to obtain a phoneme combination consisting of the phonemes, and then a pronunciation phoneme sequence of the word corresponding to the phoneme combination is searched from the phoneme pronunciation dictionary and is used as an output, so that the trained model can decode the input pronunciation phoneme capable of corresponding to the word and output the pronunciation phoneme corresponding to the word as a correct pronunciation phoneme sequence, while for the input pronunciation phoneme incapable of corresponding to the word, the pronunciation phoneme sequence cannot be searched and is judged as an incorrect pronunciation phoneme sequence, and such candidate pronunciation phoneme combinations are excluded.

For example, the phonemes in the Mandarin phoneme set may be "a b c d e …", and the phoneme pronunciation dictionary may include two columns: the first column holds the phonemes of a word, such as "ai bo cu di …", and the second column holds the corresponding pronunciation phoneme sequence of the word, such as "ai bo cu di …". During training, if the phonemes "a-i" are input, the pronunciation phoneme sequence to be output by the training model is "ai"; if the phonemes "g-e-n" are input, the training model outputs the corresponding pronunciation phoneme sequence "gen".
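
The lookup behaviour described for the phoneme pronunciation dictionary can be sketched as a simple mapping. The sketch assumes that the first column is stored as a tuple of separate phonemes; `PHONEME_DICT`, `lookup_sequence`, and the entries are hypothetical and only mirror the examples in the text.

```python
from typing import Dict, Optional, Tuple

# Hypothetical two-column phoneme pronunciation dictionary: the key is the
# phoneme combination of a word, the value is its pronunciation phoneme sequence.
PHONEME_DICT: Dict[Tuple[str, ...], str] = {
    ("a", "i"): "ai",
    ("b", "o"): "bo",
    ("c", "u"): "cu",
    ("d", "i"): "di",
    ("g", "e", "n"): "gen",
}

def lookup_sequence(phonemes: Tuple[str, ...]) -> Optional[str]:
    """Return the pronunciation phoneme sequence for a phoneme combination,
    or None when the combination matches no word and should be excluded."""
    return PHONEME_DICT.get(tuple(phonemes))
```

For instance, `lookup_sequence(("a", "i"))` yields "ai", while a combination such as `("g", "i", "n")` is absent from the dictionary and is rejected.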

Therefore, by training the phoneme language model and using it to perform phoneme decoding on the candidate pronunciation phonemes corresponding to each piece of speech, the phoneme decoding process can be completed quickly and a more accurate decoding result can be ensured.

Step 103, determining the pronunciation phoneme sequence of each word in the dialect training set according to the pronunciation phoneme sequence of each voice in the dialect training set obtained by decoding and the word boundary of each word in the dialect training set.

After obtaining the correct pronunciation phoneme sequence of each speech in the dialect training set and the word boundary of each word in the dialect training set, the pronunciation phoneme sequence of each word in the dialect training set may be determined by combining the two sequences, and specifically, the correct pronunciation phoneme sequence of each speech in the dialect training set may be segmented according to the word boundary of each word in the dialect training set to obtain the pronunciation phoneme sequence of each word.

For example, if the pronunciation phoneme sequence decoded from a dialect speech is "n ǐ z a i g a n sh a z ǐ", the sequence is segmented according to the audio start frame and audio end frame of each word ("you", "in" and "dry what") in the speech, so that the pronunciation phoneme sequence of each word can be determined: "n ǐ" for "you", "z a i" for "in", and "g a n sh a z ǐ" for "dry what".
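
The segmentation in this example can be sketched as follows, assuming that decoding also yields a start frame and end frame for every phoneme. The function `split_by_word` and all frame numbers are hypothetical. Each phoneme is assigned to the word whose boundary contains the phoneme's midpoint frame.

```python
from typing import List, Tuple

def split_by_word(
    phoneme_spans: List[Tuple[str, int, int]],
    word_bounds: List[Tuple[str, int, int]],
) -> List[Tuple[str, List[str]]]:
    """Group decoded phonemes by the word boundary that contains them.

    phoneme_spans: (phoneme, start_frame, end_frame) from decoding.
    word_bounds:   (word, start_frame, end_frame) from audio-text alignment.
    """
    result: List[Tuple[str, List[str]]] = [(word, []) for word, _, _ in word_bounds]
    for phoneme, start, end in phoneme_spans:
        mid = (start + end) / 2
        for i, (_, w_start, w_end) in enumerate(word_bounds):
            if w_start <= mid <= w_end:
                result[i][1].append(phoneme)
                break
    return result

# Illustrative frame numbers for the utterance "n ǐ z a i g a n sh a z ǐ".
spans = [
    ("n", 0, 4), ("ǐ", 5, 9),
    ("z", 10, 13), ("a", 14, 17), ("i", 18, 21),
    ("g", 22, 25), ("a", 26, 29), ("n", 30, 33),
    ("sh", 34, 38), ("a", 39, 43), ("z", 44, 47), ("ǐ", 48, 52),
]
bounds = [("you", 0, 9), ("in", 10, 21), ("dry what", 22, 52)]
per_word = split_by_word(spans, bounds)
```

A phoneme whose midpoint falls outside every word boundary is dropped in this sketch; handling such cases is left open here.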

Step 104, labeling the dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

After the pronunciation phoneme sequence of each word in the dialect training set is obtained, the dialect pronunciation of each word can be determined from it and labeled as the dialect pronunciation of the corresponding word, which completes the dialect pronunciation labeling. For example, after the pronunciation phoneme sequence of "dry what" is determined to be "g a nsh a z ǐ", the dialect word "dry what" can be labeled with the pronunciation "g a nsh a z ǐ".

Of course, the obtained pronunciation phoneme sequence of each word in the dialect training set may deviate from the actual pronunciation because of various factors, such as the quality of the dialect training set itself or errors of the recognition and decoding model. Therefore, to reduce dialect pronunciation labeling errors as much as possible and improve labeling accuracy, the occurrence frequency of each pronunciation of each word in the dialect training set may first be counted based on the obtained pronunciation phoneme sequences. Pronunciations that occur rarely are then removed, and pronunciations that occur frequently are regarded as credible and labeled as the dialect pronunciations of the corresponding words.
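The counting step can be sketched as below. This is only an illustration of tallying how often each pronunciation of each word occurs; the function name `count_pronunciations` and the sample observations are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Pron = Tuple[str, ...]


def count_pronunciations(
    observations: Iterable[Tuple[str, Sequence[str]]]
) -> Dict[str, Counter]:
    """Count, for every word, how many times each phoneme sequence
    was observed across the training set."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, phonemes in observations:
        counts[word][tuple(phonemes)] += 1
    return counts


# Made-up (word, decoded phoneme sequence) pairs collected over many utterances.
observed = [
    ("dry what", ["g", "a4", "h", "a2"]),
    ("dry what", ["g", "a4", "h", "a2"]),
    ("dry what", ["g", "a4", "h", "a2"]),
    ("dry what", ["g", "e4", "h", "a2"]),  # likely a decoding error
    ("you", ["n", "i3"]),
]
pron_counts = count_pronunciations(observed)
```

Storing each pronunciation as a tuple makes it hashable, so it can serve directly as a counter key.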

Optionally, the method further includes:

In this embodiment, after the pronunciation phoneme sequence of each word in the dialect training set is obtained, the pronunciations of each word can be counted and summarized to determine all possible pronunciations of the word, and the target words with multiple pronunciations can then be determined from the statistics. For example, if decoding yields two different pronunciation phoneme sequences for the dialect word "dry what", the word can be determined to be a target word with multiple pronunciations.

When determining the target words with multiple pronunciations, the pronunciation of each word labeled in the Mandarin pronunciation dictionary may additionally be taken into account, so that the target words are determined from both the dialect pronunciation and the Mandarin pronunciation of each word. For example, if the decoded pronunciation of the dialect word "dry what" differs from its Mandarin pronunciation labeled in the Mandarin pronunciation dictionary, the word can be determined to be a target word with multiple pronunciations.

In this way, by combining the mandarin pronunciation of each word in the dialect training set, a target word with multiple pronunciations can be more comprehensively determined.
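A minimal sketch of this check is shown below: a word is treated as a target word when the union of its decoded dialect pronunciations and its Mandarin dictionary pronunciations contains more than one entry. The function name `find_target_words`, the dictionary layout and the phoneme tokens are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Pron = Tuple[str, ...]


def find_target_words(
    dialect_prons: Dict[str, Set[Pron]],
    mandarin_dict: Dict[str, Set[Pron]],
) -> Dict[str, Set[Pron]]:
    """Return the words that have more than one distinct pronunciation
    once dialect and Mandarin pronunciations are combined."""
    targets = {}
    for word, prons in dialect_prons.items():
        combined = set(prons) | mandarin_dict.get(word, set())
        if len(combined) > 1:
            targets[word] = combined
    return targets


# Illustrative data only.
dialect = {
    "dry what": {("g", "a4", "h", "a2")},
    "you": {("n", "i3")},
}
mandarin = {
    "dry what": {("g", "an4", "sh", "a2")},
    "you": {("n", "i3")},
}
target_words = find_target_words(dialect, mandarin)
```

Here "you" is not a target word because its dialect and Mandarin pronunciations coincide, while "dry what" is one because the two differ.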

After a target word with multiple pronunciations is determined, the target pronunciation of the target word can be added to the Mandarin pronunciation dictionary to obtain a new pronunciation dictionary, namely the target pronunciation dictionary. The target pronunciation of the target word can be determined by screening the multiple pronunciations of the target word, so as to eliminate pronunciations that are inaccurately labeled because of pronunciation errors in the dialect training set or errors in model decoding. The screening may be manual verification or elimination according to a preset condition; for example, pronunciations of the target word that occur with low frequency in the dialect training set are eliminated, and only the remaining pronunciations are retained as the target pronunciations of the word.

For example, when it is determined that the pronunciation of the dialect word "dry what" includes "g a nsh a z ǐ" and the frequency of occurrence thereof in the dialect training set is high, the pronunciation may be added to the mandarin pronunciation dictionary, so that in the new target pronunciation dictionary, "dry what" is a polyphonic word including two pronunciations "g a nsh a z ǐ" and "g nsh a z ǐ".

Thus, the target pronunciation dictionary marked with Mandarin pronunciation and dialect pronunciation can be automatically constructed through the embodiment, and the constructed target pronunciation dictionary can be used for dialect recognition without depending on manual work.
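Adding target pronunciations to the Mandarin pronunciation dictionary might look like the following sketch. The dictionary is modelled as a mapping from each word to a list of phoneme tuples; this layout, the function name `build_target_dictionary` and the sample entries are assumptions, not the format used by the applicant.

```python
from typing import Dict, List, Tuple

Pron = Tuple[str, ...]


def build_target_dictionary(
    mandarin_dict: Dict[str, List[Pron]],
    target_prons: Dict[str, List[Pron]],
) -> Dict[str, List[Pron]]:
    """Return a new dictionary holding the Mandarin entries plus any
    target pronunciations that are not already present."""
    merged = {word: list(prons) for word, prons in mandarin_dict.items()}
    for word, prons in target_prons.items():
        entry = merged.setdefault(word, [])
        for pron in prons:
            if pron not in entry:
                entry.append(pron)
    return merged


# Illustrative entries only.
mandarin_lexicon = {
    "you": [("n", "i3")],
    "dry what": [("g", "an4", "sh", "a2")],
}
selected = {"dry what": [("g", "a4", "h", "a2")]}
target_lexicon = build_target_dictionary(mandarin_lexicon, selected)
```

The original Mandarin dictionary is copied rather than modified, so it remains available for comparison or for rebuilding the target dictionary with different screening conditions.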

Optionally, after the target word with multiple pronunciations is determined, before the target pronunciation of the target word is added into the mandarin chinese pronunciation dictionary to obtain a target pronunciation dictionary, the method further includes:

In this embodiment, to ensure the accuracy of the constructed target pronunciation dictionary, after the multiple pronunciations of the target word are obtained, the target pronunciation may be determined according to the occurrence frequencies of those pronunciations in the dialect training set. Specifically, a pronunciation whose occurrence frequency meets a preset condition may be determined as a target pronunciation. The preset condition may be that the frequency is higher than a preset frequency, that the pronunciation is among the first N pronunciations sorted by frequency (such as the three pronunciations with the highest frequency), or that the frequency is ranked first and is higher than the preset frequency. The occurrence frequency may be represented by the number of occurrences. Thus, pronunciations with a lower occurrence frequency are considered unreliable and are rejected, and pronunciations with a higher occurrence frequency are retained because they are more reliable.

In this embodiment, the target pronunciation of the target word is determined based on the frequency of occurrence of the plurality of pronunciations of the target word in the dialect training set, so that the target pronunciations of the target word are guaranteed to have high confidence, and the target pronunciation dictionary constructed based on the target pronunciation of the target word also has high accuracy.
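The preset conditions described above (a minimum frequency, the first N pronunciations by frequency, or both) can be sketched as a small selection function. The parameter names `min_count` and `top_n`, the function name and the sample counts are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pron = Tuple[str, ...]


def select_target_pronunciations(
    counts: Counter, min_count: int = 1, top_n: Optional[int] = None
) -> List[Pron]:
    """Keep pronunciations seen at least `min_count` times, ordered from
    most to least frequent, and optionally keep only the first `top_n`."""
    ranked = [pron for pron, n in counts.most_common() if n >= min_count]
    return ranked if top_n is None else ranked[:top_n]


# Made-up occurrence counts for one target word.
counts = Counter({
    ("g", "a4", "h", "a2"): 12,
    ("g", "an4", "sh", "a2"): 5,
    ("g", "e4", "h", "a2"): 1,
})

by_threshold = select_target_pronunciations(counts, min_count=3)
by_rank = select_target_pronunciations(counts, top_n=1)
by_both = select_target_pronunciations(counts, min_count=3, top_n=3)
```

Expressing the occurrence frequency as a raw count follows the text; a relative frequency could be substituted by dividing each count by the word's total.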

In the dialect pronunciation labeling method in this embodiment, a word boundary of each word in the dialect training set is obtained by performing audio-text alignment on the dialect training set, and a correct pronunciation phoneme sequence of each speech in the dialect training set is obtained by performing speech-phoneme decoding on the dialect training set, so that a pronunciation phoneme sequence of each word in the dialect training set is determined according to the correct pronunciation phoneme sequence of each speech and the word boundary of each word, and pronunciation labeling of the dialect words is completed based on the pronunciation phoneme sequence. Therefore, the pronunciation labeling of dialect words can be carried out automatically without relying on manual work, which saves labor and time costs.

The method embodiment shown in fig. 1 is illustrated below, with reference to fig. 2, by taking the construction of a dialect pronunciation dictionary as an example:

Step 201, training a speech recognition model by using the dialect training set and the Mandarin pronunciation dictionary, where the speech recognition model is trained to be as accurate as possible so as to obtain the best possible alignment result.

Step 202, performing audio-text alignment on the dialect training set by using the trained voice recognition model to obtain a word boundary of each word in the dialect training set.

Step 203, constructing a phoneme language model according to the mandarin chinese phoneme set, wherein phonemes in the mandarin chinese phoneme set are used as a training corpus, and a phoneme pronunciation dictionary is used for performing phoneme language model training.

Step 204, decoding the dialect training set by using the Mandarin acoustic model and the phoneme language model to obtain a pronunciation phoneme sequence of each speech in the dialect training set.

Step 205, obtaining, according to the word boundary of each word in the dialect training set obtained in step 202 and the decoded pronunciation phoneme sequence of each speech in the dialect training set, the pronunciation of each dialect word expressed in Mandarin phonemes, that is, its dialect pronunciation phoneme sequence. For example, the northeast dialect word "dry what" is labeled in the original Mandarin pronunciation dictionary as "g a nsh z ǐ", and decoding yields the dialect pronunciation phoneme sequence "g a nsh a z ǐ".

Step 206, since the foregoing steps may yield multiple pronunciations for each dialect word, sorting the pronunciations according to their occurrence frequencies in the dialect training set, selecting the pronunciations with higher frequency as the target pronunciations, and adding them to the original Mandarin pronunciation dictionary to generate a new pronunciation dictionary for dialect recognition.

Referring to fig. 3, fig. 3 is a flowchart of a speech recognition method according to an embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:

Step 301, constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, wherein the dialect pronunciation of each word is obtained by labeling with the dialect pronunciation labeling method in the method embodiment shown in fig. 1, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word.

In this embodiment, the dialect pronunciation of each word in the dialect training set, labeled by the method embodiment shown in fig. 1, may be used to construct a target pronunciation dictionary labeled with the dialect pronunciation of each word. For the specific construction manner, refer to the description of the optional implementation for constructing the target pronunciation dictionary in the method embodiment shown in fig. 1; to avoid repetition, details are not described here again.

Step 302, training and generating a dialect speech recognition model by using the dialect training set and the target pronunciation dictionary.

Dialect recognition is performed with the constructed target pronunciation dictionary, which effectively addresses the problem of differing pronunciations in some local dialects. Specifically, a large dialect training set, such as recorded dialect speech and the corresponding texts, may be obtained first; the dialect speech in the training set is then used as the model input and the corresponding text as the model output. The model to be trained may be an acoustic model commonly used in speech recognition systems, such as a hidden Markov model or a neural network model, and the training process is similar to the speech recognition model training process in the prior art; for details, refer to the description of the speech recognition model in the foregoing embodiment. The target pronunciation dictionary is used during training: after the model decodes the dialect speech to obtain a pronunciation phoneme sequence, the target pronunciation dictionary is queried to obtain the text corresponding to that pronunciation phoneme sequence, so that a dialect speech recognition model capable of recognizing the dialect is generated through training.
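Querying the target pronunciation dictionary for the text that corresponds to a decoded phoneme sequence can be illustrated with a reverse lookup. The sketch below uses greedy longest-match segmentation, which is a deliberate simplification and not necessarily how the described system searches the dictionary; the function names, the `<unk>` marker and the sample entries are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pron = Tuple[str, ...]


def build_reverse_index(target_dict: Dict[str, List[Pron]]) -> Dict[Pron, List[str]]:
    """Map every pronunciation to the words that carry it."""
    index: Dict[Pron, List[str]] = {}
    for word, prons in target_dict.items():
        for pron in prons:
            index.setdefault(tuple(pron), []).append(word)
    return index


def phonemes_to_words(phonemes: Sequence[str], index: Dict[Pron, List[str]]) -> List[str]:
    """Greedily match the longest dictionary pronunciation at each position."""
    max_len = max((len(key) for key in index), default=0)
    words, i = [], 0
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + n])
            if key in index:
                words.append(index[key][0])  # take the first matching word
                i += n
                break
        else:
            words.append("<unk>")  # no entry starts at this phoneme
            i += 1
    return words


# Illustrative target dictionary with a Mandarin and a dialect pronunciation.
lexicon = {
    "you": [("n", "i3")],
    "in": [("z", "ai4")],
    "dry what": [
        ("g", "an4", "sh", "a2", "z", "i3"),
        ("g", "a4", "h", "a2", "z", "i3"),
    ],
}
reverse = build_reverse_index(lexicon)
text = phonemes_to_words(
    ["n", "i3", "z", "ai4", "g", "a4", "h", "a2", "z", "i3"], reverse
)
```

Because both pronunciations of "dry what" map to the same word, the dialect phoneme sequence is resolved to the same text as the Mandarin one.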

Step 303, performing dialect recognition by using the dialect speech recognition model.

After the dialect speech recognition model is generated through training, it can be used to perform dialect recognition, that is, to recognize dialect speech input by a user. Specifically, the received dialect speech is input into the model, and the model outputs the text corresponding to the dialect speech.

Therefore, the dialect speech recognition model is obtained by training based on the target pronunciation dictionary marked with the dialect pronunciations, so that the dialect of the user can be recognized more accurately, and the dialect recognition effect is improved.

In the speech recognition method in this embodiment, the dialect speech recognition model is generated by training the target pronunciation dictionary and the dialect training corpus obtained by using the dialect pronunciation labeling method, and the dialect is recognized by using the model, so that the accuracy of dialect recognition can be improved, and the dialect recognition effect can be further improved.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a dialect pronunciation labeling apparatus according to an embodiment of the present invention, and as shown in fig. 4, the dialect pronunciation labeling apparatus 400 includes:

an alignment module 401, configured to perform audio-text alignment on the obtained dialect training set to obtain a word boundary of each word in the dialect training set, where the word boundary of each word is an audio start frame and an audio end frame of the word in the dialect training set, and the dialect training set includes dialect voices and corresponding texts;

a decoding module 402, configured to perform speech-phoneme decoding on the dialect training set by using a mandarin chinese speech recognition model to obtain a pronunciation phoneme sequence of each speech in the dialect training set;

a first determining module 403, configured to determine a pronunciation phoneme sequence of each word in the dialect training set according to the decoded pronunciation phoneme sequence of each speech in the dialect training set and a word boundary of each word in the dialect training set;

and a labeling module 404, configured to label a dialect pronunciation of each word in the dialect training set according to the pronunciation phoneme sequence of each word in the dialect training set.

Optionally, the dialect pronunciation labeling apparatus 400 further includes:

Optionally, the alignment module 401 includes:

Optionally, the decoding module 402 is configured to perform speech-phoneme decoding on the dialect training set by using a mandarin speech recognition model and a phoneme language model to obtain a pronunciation phoneme sequence corresponding to each speech in the dialect training set, where the phoneme language model is obtained by using a mandarin phoneme set, and the mandarin phoneme set includes a mandarin pronunciation phoneme.

Optionally, the dialect pronunciation labeling apparatus 400 further includes:

The dialect pronunciation labeling apparatus 400 can implement the processes in the method embodiments of fig. 1 and fig. 2, which are not described here again to avoid repetition. The dialect pronunciation labeling apparatus 400 of the embodiment of the invention obtains the word boundary of each word in the dialect training set by performing audio-text alignment on the dialect training set, and obtains the correct pronunciation phoneme sequence of each speech in the dialect training set by performing speech-phoneme decoding on the dialect training set, thereby determining the pronunciation phoneme sequence of each word in the dialect training set according to the correct pronunciation phoneme sequence of each speech and the word boundary of each word, and completing the pronunciation labeling of the dialect words based on the pronunciation phoneme sequence. Therefore, the pronunciation labeling of dialect words can be carried out automatically without relying on manual work, which saves labor and time costs.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention, and as shown in fig. 5, the speech recognition apparatus 500 includes:

a first dictionary construction module 501, configured to construct a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, where the dialect pronunciation of each word is obtained by labeling using a dialect pronunciation labeling method in the method embodiment shown in fig. 1, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

a model generation module 502, configured to train and generate a dialect speech recognition model by using the dialect training set and the target pronunciation dictionary;

and a speech recognition module 503, configured to perform dialect recognition by using the dialect speech recognition model.

The speech recognition apparatus 500 can implement the processes in the method embodiment of fig. 3, and is not described here again to avoid repetition. The speech recognition device 500 of the embodiment of the present invention constructs the target pronunciation dictionary from the dialect pronunciation obtained by the dialect pronunciation labeling method, generates the dialect speech recognition model by using the constructed target pronunciation dictionary and the dialect training set, and recognizes the dialect by using the model, thereby improving the accuracy of dialect recognition and further improving the dialect recognition effect.

The embodiment of the present invention further provides a dialect pronunciation labeling apparatus, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when being executed by the processor, the computer program implements each process of the dialect pronunciation labeling method embodiment shown in fig. 1, and can achieve the same technical effect, and is not described herein again to avoid repetition.

An embodiment of the present invention further provides a speech recognition apparatus, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, where the computer program, when executed by the processor, implements each process of the speech recognition method embodiment shown in fig. 3, and can achieve the same technical effect, and is not described herein again to avoid repetition.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the dialect pronunciation labeling method embodiment shown in fig. 1, and can achieve the same technical effect, and is not described herein again to avoid repetition; alternatively, the computer program is executed by the processor to implement the processes of the embodiment of the speech recognition method shown in fig. 3, and the same technical effects can be achieved, and are not described herein again to avoid repetition. The computer-readable storage medium may be a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A dialect pronunciation labeling method is characterized by comprising the following steps:

2. The method of claim 1, wherein after labeling the dialect pronunciation for each word in the dialect training set, the method further comprises:

3. The method of claim 2, wherein after determining the target word having a plurality of pronunciations, the method further comprises, before adding the target pronunciation of the target word to the Mandarin pronunciation dictionary to obtain a target pronunciation dictionary:

4. The method of any one of claims 1 to 3, wherein the audio-text aligning the obtained dialect training set comprises:

5. The method according to any one of claims 1 to 3, wherein said performing speech-phoneme decoding on said dialect training set using a Mandarin speech recognition model to obtain a pronunciation phoneme sequence of each speech in said dialect training set comprises:

6. The method of claim 5, wherein prior to speech-phoneme decoding the dialect training set using a Mandarin Chinese acoustic model and a phoneme language model, the method further comprises:

7. A speech recognition method, comprising:

constructing a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, wherein the dialect pronunciation of each word is labeled by using the dialect pronunciation labeling method of any one of claims 1 to 6, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

8. A dialect pronunciation labeling apparatus, comprising:

9. A speech recognition apparatus, comprising:

a first dictionary construction module, configured to construct a target pronunciation dictionary based on the dialect pronunciation of each word in the labeled dialect training set, where the dialect pronunciation of each word is labeled by using the dialect pronunciation labeling method according to any one of claims 1 to 6, and the target pronunciation dictionary is labeled with the dialect pronunciation of each word;

10. A dialect pronunciation labeling apparatus comprising a processor, a memory and a computer program stored on the memory and operable on the processor, wherein the computer program, when executed by the processor, implements the steps of the dialect pronunciation labeling method according to any one of claims 1 to 6.

11. A speech recognition apparatus comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps in the speech recognition method as claimed in claim 7.

12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the dialect pronunciation labeling method according to any one of claims 1 to 6; alternatively, the computer program realizes the steps in the speech recognition method as claimed in claim 7 when executed by a processor.