CN113936642A - Pronunciation dictionary construction method, voice recognition method and related device


Info

Publication number: CN113936642A
Application number: CN202111222208.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: phoneme, pronunciation, similar, phonemes, word
Inventors: Fang Xin (方昕), Liu Junhua (刘俊华)
Current and original assignee: iFlytek Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application filed by iFlytek Co Ltd
Priority to CN202111222208.4A
Publication of CN113936642A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/26 Speech to text systems
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a pronunciation dictionary construction method, a speech recognition method, and a related device. The pronunciation dictionary construction method comprises: extracting phonemes from target audio data to obtain a corresponding phoneme set, the target audio data being audio data that covers all phonemes; determining, from the phoneme set, similar phonemes for the phonemes in the phoneme label corresponding to the target audio data, according to the recognition probability of each phoneme in the extracted phoneme set; and constructing a multi-pronunciation dictionary by storing each word in correspondence with both its standard pronunciation and its similar pronunciations, where the standard pronunciation is composed of phonemes in the phoneme label and a similar pronunciation is composed of similar phonemes of those phonemes. The multi-pronunciation dictionary constructed by this scheme can improve the fault tolerance and robustness of speech recognition, and can therefore improve the speech recognition effect.

Description

Pronunciation dictionary construction method, voice recognition method and related device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular to a pronunciation dictionary construction method, a speech recognition method, and a related device, equipment and storage medium.
Background
Currently, mainstream commercial speech recognition systems are still built on a framework of joint decoding by an acoustic model and a language model: the acoustic model is mainly responsible for mapping speech features to phonemes, and the language model, combined with a pronunciation dictionary, converts phoneme strings into the corresponding text strings.
A pronunciation dictionary records the correspondence between pronunciations, each made up of phonemes, and text. When the pronunciation of a phoneme string produced by the acoustic model matches a pronunciation in the pronunciation dictionary, the text corresponding to that phoneme string can be determined from the pronunciation-to-text correspondence in the dictionary; that is, phoneme-to-text conversion is achieved.
Existing speech recognition schemes have poor fault tolerance: only a phoneme string that strictly matches a pronunciation in the pronunciation dictionary is recognized as the text corresponding to that pronunciation. Although this guarantees that whatever is recognized is exactly right, it reduces the robustness of speech recognition and works against improving the recognition effect.
Disclosure of Invention
Given this state of the art, the present application provides a pronunciation dictionary construction method, a speech recognition method, a device, equipment and a storage medium, which can improve the fault tolerance and robustness of speech recognition and thereby improve the speech recognition effect.
A pronunciation dictionary construction method comprises the following steps:
extracting phonemes from the target audio data to obtain a corresponding phoneme set; the target audio data is audio data covering all phonemes;
determining, from the phoneme set, similar phonemes for the phonemes in the phoneme label corresponding to the target audio data, according to the recognition probability of each phoneme in the extracted phoneme set; the similar phonemes of a phoneme in the phoneme label are a set number of phonemes, selected from the phoneme set, that correspond to that phoneme and have the highest recognition probabilities;
constructing a multi-pronunciation dictionary by storing each word in correspondence with its standard pronunciation and its similar pronunciations; a similar pronunciation is composed of similar phonemes of the phonemes in the phoneme label.
Optionally, constructing the multi-pronunciation dictionary by storing each word in correspondence with its standard pronunciation and similar pronunciations comprises:
determining the standard pronunciation corresponding to a word in a pronunciation dictionary, and determining the similar pronunciations corresponding to that word according to the similar phonemes of the phonemes in the phoneme label;
and storing the word in correspondence with its standard pronunciation and similar pronunciations to obtain the multi-pronunciation dictionary.
Optionally, determining the standard pronunciation corresponding to a word in the pronunciation dictionary, and determining the similar pronunciations corresponding to that word according to the similar phonemes of the phonemes in the phoneme label, comprises:
determining the standard pronunciations corresponding to the high-frequency error-prone words in the pronunciation dictionary, and determining the similar pronunciations corresponding to those words according to the similar phonemes of the phonemes in the phoneme label;
and storing the word in correspondence with its standard pronunciation and similar pronunciations to obtain the multi-pronunciation dictionary comprises:
storing the high-frequency error-prone words in the pronunciation dictionary in correspondence with their standard pronunciations and similar pronunciations to obtain the multi-pronunciation dictionary.
Optionally, storing the high-frequency error-prone words in the pronunciation dictionary in correspondence with their standard pronunciations and similar pronunciations to obtain the multi-pronunciation dictionary comprises:
calculating a score for each similar pronunciation of a high-frequency error-prone word according to that similar pronunciation and the recognition probability of each phoneme in the phoneme set;
selecting, according to these scores, the similar pronunciations whose scores are higher than a set score threshold as the target similar pronunciations;
and storing the high-frequency error-prone words in the pronunciation dictionary in correspondence with their standard pronunciations and target similar pronunciations to obtain the multi-pronunciation dictionary.
Optionally, extracting phonemes from the target audio data to obtain the corresponding phoneme set comprises:
inputting the target audio data into a pre-trained acoustic model for phoneme extraction to obtain the phoneme set corresponding to the target audio data.
A speech recognition method comprising:
acquiring a phoneme sequence of the speech to be recognized;
determining a speech recognition result of the speech to be recognized according to the phoneme sequence and a pre-constructed multi-pronunciation dictionary;
wherein each word in the multi-pronunciation dictionary is stored in correspondence with its standard pronunciation and its similar pronunciations; the similar pronunciations of a word are constructed from similar phonemes of the phonemes in the phoneme label corresponding to target audio data, and the target audio data contains audio data corresponding to the word.
Optionally, the multi-pronunciation dictionary is constructed according to the pronunciation dictionary construction method described above.
Optionally, acquiring the phoneme sequence of the speech to be recognized comprises:
inputting the speech to be recognized into a pre-trained acoustic model for phoneme extraction to obtain the phoneme sequence of the speech to be recognized.
Optionally, the acoustic model is obtained by training as follows:
inputting the audio features of training speech into an acoustic model to obtain phoneme information of the training speech;
inputting the phoneme information of the training speech, together with non-semantic information extracted from the audio features of the training speech, into an audio synthesis model to obtain an audio synthesis result;
and correcting the parameters of the acoustic model according to the speech recognition loss of the acoustic model and the audio synthesis loss of the audio synthesis model.
Optionally, inputting the phoneme information of the training speech and the non-semantic information extracted from the audio features of the training speech into an audio synthesis model to obtain an audio synthesis result comprises:
down-sampling the phoneme information of the training speech and the audio features of the training speech at the same scale;
extracting the non-semantic information of the training speech from the down-sampled audio features;
fusing the down-sampled phoneme information with the non-semantic information to obtain audio synthesis base information;
and inputting the audio synthesis base information into the audio synthesis model to obtain the audio synthesis result.
Optionally, the speech recognition loss of the acoustic model is determined by a cross entropy loss function, and the audio synthesis loss of the audio synthesis model is determined by a mean square error loss function.
A pronunciation dictionary construction apparatus comprises:
a phoneme extraction unit, configured to extract phonemes from target audio data to obtain a corresponding phoneme set, the target audio data being audio data covering all phonemes;
a phoneme screening unit, configured to determine, from the phoneme set, similar phonemes for the phonemes in the phoneme label corresponding to the target audio data according to the recognition probability of each phoneme in the extracted phoneme set, the similar phonemes of a labeled phoneme being a set number of phonemes, selected from the phoneme set, that correspond to that phoneme and have the highest recognition probabilities;
and a dictionary building unit, configured to construct a multi-pronunciation dictionary by storing each word in correspondence with its standard pronunciation and its similar pronunciations, a similar pronunciation being composed of similar phonemes of the phonemes in the phoneme label.
A speech recognition apparatus comprising:
a speech processing unit, configured to acquire a phoneme sequence of a speech to be recognized;
a speech recognition unit, configured to determine a speech recognition result of the speech to be recognized according to the phoneme sequence and a pre-constructed multi-pronunciation dictionary;
wherein each word in the multi-pronunciation dictionary is stored in correspondence with its standard pronunciation and its similar pronunciations; the similar pronunciations of a word are constructed from similar phonemes of the phonemes in the phoneme label corresponding to target audio data, and the target audio data contains audio data corresponding to the word.
An electronic device, comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor is configured to implement the pronunciation dictionary construction method or the speech recognition method described above by running the program in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the pronunciation dictionary construction method described above, or implements the speech recognition method described above.
In the multi-pronunciation dictionary constructed by the pronunciation dictionary construction method provided in the present application, one word corresponds to several pronunciations: the standard pronunciation of the word and the similar pronunciations of the word. When speech recognition is performed with this dictionary, even if phoneme extraction from the speech to be recognized is inaccurate, for example if the pronunciation of the speech is predicted as a similar pronunciation rather than the accurate pronunciation, the correct text can still be recognized. The multi-pronunciation dictionary constructed by this technical scheme can therefore improve the fault tolerance and robustness of speech recognition, improve the speech recognition effect, and obtain a better recognition effect even in complex scenes.
Drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in that description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a pronunciation dictionary construction method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an acoustic model training process according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a pronunciation dictionary construction apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiments of the present application is suitable for speech recognition application scenarios. With this technical scheme, the robustness of speech recognition can be improved, and the speech recognition effect can in turn be improved.
Speech recognition technology, also known as Automatic Speech Recognition (ASR), takes speech as its research object and aims to convert human voice signals into words or instructions. In today's era of rapidly developing artificial intelligence, speech recognition technology is the first step toward machines "understanding" human language.
As noted in the background above, mainstream commercial speech recognition systems still rely on joint decoding by an acoustic model and a language model, with a pronunciation dictionary recording the correspondence between pronunciations and text: a phoneme string produced by the acoustic model is converted into text only when its pronunciation matches a pronunciation recorded in the dictionary, and this strict matching gives existing schemes their poor fault tolerance.
For example, the word "prepare" in the pronunciation dictionary corresponds to the correct pronunciation "zhun3 bei4", but in a complex scene, because of speaking rate or the environment, the acoustic model may predict the phoneme string of the "prepare" speech as "zun3 bei4". Under the matching rule, since "zun3 bei4" does not match "zhun3 bei4" in the pronunciation dictionary, the phoneme string "zun3 bei4" cannot be recognized as "prepare"; that is, correct recognition cannot be achieved.
Existing speech recognition schemes therefore have poor fault tolerance and a poor recognition effect in complex scenes.
Given this state of the art, the embodiments of the present application provide a pronunciation dictionary construction method. With a pronunciation dictionary constructed by this technical scheme, the fault tolerance and robustness of speech recognition can be improved and the speech recognition effect can be improved, especially in complex scenes.
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments derived by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
The embodiment of the application provides a pronunciation dictionary construction method, and as shown in fig. 1, the method includes:
s101, extracting phonemes from the target audio data to obtain a corresponding phoneme set.
Specifically, the target audio data is audio data that covers all phonemes. "All phonemes" means all the phonemes of a given language, for example all phonemes of Chinese or all phonemes of English, as determined by the language for which the pronunciation dictionary is intended.
The target audio data is a certain amount of audio data. That it covers all phonemes means, concretely, that every phoneme appears in the pronunciation of at least one word in the target audio data. For example, for the Chinese phoneme "a", the target audio data is considered to cover "a" as long as "a" appears in the pronunciation of at least one word in the data.
What counts as a word differs by language. In Chinese, a word may be a single Chinese character or a word composed of several characters; in English, a word may be an English word or an English letter.
As an exemplary implementation, the target audio data, or its acoustic features, is input into a pre-trained acoustic model for phoneme extraction, which yields the phonemes corresponding to the input target audio data.
The acoustic model is a model trained in advance to extract phonemes from audio data; its training process is described in a later embodiment.
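As an illustration of this step, the following Python sketch collects a phoneme set with recognition probabilities from frame-level posteriors. The acoustic model is abstracted to the per-frame phoneme distributions it outputs; the interface and the toy phoneme names are assumptions for illustration, not the patent's own implementation.

```python
# A minimal sketch of step S101, assuming the acoustic model exposes, for each
# audio frame, a posterior distribution over phonemes (phoneme -> probability).
from typing import Dict, List, Tuple

def extract_phoneme_set(frame_posteriors: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """For each frame, keep the recognized phoneme together with its
    recognition probability (the posterior the model assigned to it)."""
    phoneme_set = []
    for posterior in frame_posteriors:
        best_phoneme = max(posterior, key=posterior.get)  # highest-probability phoneme
        phoneme_set.append((best_phoneme, posterior[best_phoneme]))
    return phoneme_set

# Toy posteriors for two frames of target audio data.
frames = [
    {"a1": 0.8, "a2": 0.15, "z1": 0.05},
    {"zh": 0.7, "z": 0.25, "c": 0.05},
]
print(extract_phoneme_set(frames))  # [('a1', 0.8), ('zh', 0.7)]
```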
S102, according to the recognition probability of each phoneme in the extracted phoneme set and the phoneme label corresponding to the target audio data, determining similar phonemes of the phonemes in the phoneme label from the phoneme set.
The recognition probability of a phoneme in the phoneme set is the probability with which an audio frame was recognized as that phoneme during phoneme extraction. For example, when the acoustic model above extracts phonemes from the target audio data, it outputs both the extracted phonemes and their recognition probabilities; if an audio frame is recognized as the phoneme "a" with probability 0.8, then the recognition probability of "a" is 0.8.
The recognition probability of a phoneme represents the probability that the phoneme was recognized correctly. Therefore, when the same audio frame is recognized as several different phonemes at once, the correct result can be selected by recognition probability, and in general it is the phoneme with the highest recognition probability among those recognized. For example, if a frame is recognized as "a" with probability 0.8 and as "b" with probability 0.4, the phoneme of that frame can be determined to be "a", since "a" has the higher recognition probability.
For further details of phoneme extraction and recognition probabilities, see the treatment of phoneme extraction and phoneme scores in conventional technical schemes.
The phoneme label is obtained by manually labeling the phonemes of the target audio data; the phoneme label of the target audio data contains the correct phoneme for each of its audio frames.
The similar phonemes of a phoneme in the phoneme label are a set number of phonemes, selected from the phoneme set, that correspond to that labeled phoneme and have the highest recognition probabilities.
Illustratively, the similar phonemes can be obtained by comparing, for the same audio frame, the phonemes in the phoneme set with the phoneme in the phoneme label, and selecting from the phoneme set the top-N phonemes by recognition probability corresponding to the labeled phoneme. Here N is the set number; in this embodiment N = 3, so the similar phonemes of a labeled phoneme are the 3 phonemes with the highest recognition probability selected from the phoneme set for that phoneme.
For example, suppose that for a certain audio frame A in the target audio data the phoneme label is "a", and that frame A occurs in several different audio sentences. When the target audio data is input into the pre-trained acoustic model for phoneme extraction, the occurrences of frame A are recognized as the phonemes "a1", "a2", "a3" and "a4", with recognition probabilities 0.8, 0.6, 0.7 and 0.3 respectively. Since "a1", "a2" and "a3" are the 3 phonemes with the highest recognition probability, they can be taken as the similar phonemes of the labeled phoneme "a".
Further, to make the similarity between each similar phoneme and the labeled phoneme easy to compare, this embodiment normalizes the recognition probabilities of the similar phonemes of a labeled phoneme and uses the results as the final recognition probabilities. The normalized recognition probability of a similar phoneme can represent the similarity between the labeled phoneme and that similar phoneme.
In addition, because of the diversity of the target audio data, the same audio frame is likely to appear several times, and because of differences in context it may be recognized as the same phoneme or as different phonemes. The phonemes in the phoneme set corresponding to the same audio frame may therefore be the same or different, and even identical phonemes may carry different recognition probabilities.
For example, for the audio frame A above, the 3 recognition results with the highest probabilities might be: the phoneme "a2" with recognition probability 0.5; the phoneme "a2" with recognition probability 0.8; and the phoneme "a4" with recognition probability 0.3. Although the 3 highest-probability results were selected, the similar phonemes of the label "a" comprise only 2 distinct phonemes. In this case the recognition probabilities of identical phonemes are merged and then all are normalized.
Specifically, the normalized recognition probability of "a2" is (0.5 + 0.8) / (0.5 + 0.8 + 0.3) ≈ 0.81, and that of "a4" is 0.3 / (0.5 + 0.8 + 0.3) ≈ 0.19.
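The merging and normalization just described can be made concrete with a short sketch; the candidate list reproduces the audio-frame-A example above, and the helper function is an illustrative assumption.

```python
# Merge the recognition probabilities of repeated similar phonemes, then
# normalize so the values can be read as similarities to the labeled phoneme.
from collections import defaultdict

def normalize_similar_phonemes(candidates):
    merged = defaultdict(float)
    for phoneme, prob in candidates:
        merged[phoneme] += prob  # merge duplicates, e.g. the two "a2" hits
    total = sum(merged.values())
    return {p: prob / total for p, prob in merged.items()}

# Top-3 results for audio frame A: a2 (0.5), a2 (0.8), a4 (0.3).
print(normalize_similar_phonemes([("a2", 0.5), ("a2", 0.8), ("a4", 0.3)]))
# {'a2': 0.8125, 'a4': 0.1875} -> roughly the 0.81 / 0.19 figures above
```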
Further, to record the relationship and similarity between labeled phonemes and their similar phonemes more clearly, this embodiment constructs, for the phonemes in the phoneme label of the target audio data and their similar phonemes, a phoneme confusion matrix in which each phoneme is a row and a column of the matrix and the similarities between phonemes are the matrix elements.
A phoneme confusion matrix is shown in Table 1 below:

TABLE 1

      a    b    c    d
 a  1.0  0.3  0.4  0.3
 b  0.3  1.0  0.5  0.2
 c  0.4  0.5  1.0  0.1
 d  0.3  0.2  0.1  1.0
Table 1 takes the phonemes a, b, c and d as an example to show the similarities between them. For instance, when the phoneme a is a phoneme in the phoneme label, its similar phonemes can be read off as b, c and d, whose similarities to a are 0.3, 0.4 and 0.3 respectively.
Following the description above, a phoneme confusion matrix like Table 1 can be constructed for each phoneme in the phoneme label of the target audio data, recording its similar phonemes and the similarity between the phoneme and each of them.
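A minimal sketch of how the confusion matrix of Table 1 could be held in code and queried for similar phonemes; the nested-dict layout and the helper name are illustrative assumptions.

```python
# The phoneme confusion matrix of Table 1: each phoneme is both a row and a
# column, and each entry is the similarity between the two phonemes.
confusion = {
    "a": {"a": 1.0, "b": 0.3, "c": 0.4, "d": 0.3},
    "b": {"a": 0.3, "b": 1.0, "c": 0.5, "d": 0.2},
    "c": {"a": 0.4, "b": 0.5, "c": 1.0, "d": 0.1},
    "d": {"a": 0.3, "b": 0.2, "c": 0.1, "d": 1.0},
}

def similar_phonemes(label_phoneme, top_n=3):
    """Return the top-N phonemes most similar to a labeled phoneme."""
    row = {p: s for p, s in confusion[label_phoneme].items() if p != label_phoneme}
    return sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(similar_phonemes("a"))  # [('c', 0.4), ('b', 0.3), ('d', 0.3)]
```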
S103, constructing a multi-pronunciation dictionary by storing each word in correspondence with its standard pronunciation and its similar pronunciations; the standard pronunciation is composed of phonemes in the phoneme label, and a similar pronunciation is composed of similar phonemes of the phonemes in the phoneme label.
Specifically, "word" here refers to each word in the pronunciation dictionary, and its concrete form differs with the language of the dictionary. For example, in a Chinese pronunciation dictionary a word may be a single Chinese character or a word composed of several characters; in an English pronunciation dictionary it may be an English word or an English letter.
In a conventional pronunciation dictionary, a word corresponds to only one pronunciation. During speech recognition, only when the pronunciation formed by the phonemes of the speech to be recognized strictly matches a pronunciation x in the dictionary can the text of that speech be confirmed as the text corresponding to x; otherwise the speech cannot be recognized as that text.
Unlike the conventional pronunciation dictionary, this embodiment expands the dictionary, constructing a multi-pronunciation dictionary in which words are stored together with their standard pronunciations and their similar pronunciations. In the multi-pronunciation dictionary a word may thus correspond to two or more pronunciations: the standard pronunciation of the word plus one or more similar pronunciations.
The standard pronunciation is a pronunciation composed of phonemes in the phoneme label of the target audio data. Since the target audio data covers all phonemes, the correct phonemes of any word in the pronunciation dictionary can be found in the phoneme label, and the standard pronunciation of the word can be composed from them.
A similar pronunciation is a pronunciation composed of similar phonemes of the phonemes in the phoneme label; for the similar phonemes themselves, see the description in the embodiments above.
Illustratively, for the word "prepare" in the pronunciation dictionary, the phoneme label of the target audio data and the phonemes in it give the pronunciation "zhun3 bei4", whose phonemes are, in order, "zh", "un", "b" and "ei". "zhun3 bei4" is therefore the standard pronunciation of the word "prepare".
Following the description above, if the phoneme "zh" has the similar phoneme "z", a similar pronunciation "zun3 bei4" of the word "prepare" can be formed from it; likewise, if the phoneme "ei" has the similar phoneme "en", another similar pronunciation "zhun3 ben4" can be constructed.
On this basis, the word "prepare" is stored in correspondence with its standard pronunciation "zhun3 bei4" and its similar pronunciations "zun3 bei4" and "zhun3 ben4"; in the multi-pronunciation dictionary the word "prepare" then corresponds to three pronunciations: "zhun3 bei4", "zun3 bei4" and "zhun3 ben4".
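The resulting dictionary entry can be pictured with a small sketch; the storage layout (a dict with "standard" and "similar" fields) is an assumption for illustration, not a format prescribed by the patent.

```python
# One multi-pronunciation dictionary entry: the word "prepare" mapped to its
# standard pronunciation plus the similar pronunciations derived above.
multi_pronunciation_dict = {
    "prepare": {
        "standard": "zhun3 bei4",
        "similar": ["zun3 bei4", "zhun3 ben4"],
    },
}

def pronunciations_of(word):
    entry = multi_pronunciation_dict[word]
    return [entry["standard"], *entry["similar"]]

print(pronunciations_of("prepare"))  # all three pronunciations map to "prepare"
```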
In a preferred embodiment, the similar pronunciations stored for a word in the multi-pronunciation dictionary may be restricted to those whose similarity to the word's standard pronunciation is greater than a set threshold. A word X may contain several phonemes, each of which has several similar phonemes, so combining the phonemes with their similar phonemes can produce many similar pronunciations of the standard pronunciation of X, far more than the number of phonemes. Because the phoneme-level similarities differ, however, the combined pronunciations differ in how similar they are to the standard pronunciation, and some are only weakly similar, for example when every substituted phoneme has low similarity to the corresponding standard phoneme.
If a large number of similar pronunciations with low similarity to the standard pronunciation were recorded in the multi-pronunciation dictionary, the speech recognition effect would be reduced. For example, a user may happen to say a word Y whose pronunciation coincides with one of these weakly similar pronunciations of a word X; Y would then be recognized as X, which is obviously wrong.
Therefore, to guarantee the speech recognition effect, screening conditions can be set for the similar pronunciations recorded in the multi-pronunciation dictionary. The conditions can be set flexibly according to actual needs, for example according to the required recognition precision or recognition rate. Specifically, the dictionary may record only similar pronunciations whose similarity exceeds a set similarity threshold, or record for each word only the few similar pronunciations most similar to its standard pronunciation. The concrete screening can be carried out flexibly under this technical idea.
After the multi-pronunciation dictionary is constructed, the decoding network used for speech recognition is rebuilt on its basis, so that speech decoding can make use of the multi-pronunciation dictionary.
It can be understood that with the multi-pronunciation dictionary, when phoneme extraction from the speech to be recognized is inaccurate, or when the speech itself is not clear enough because the speaker has an accent or the scene is complex and the accuracy of phoneme extraction suffers, the correct text can still be recognized even if the extracted phonemes differ from the standard pronunciation of the speech, for example when they form a similar pronunciation of it.
For example, suppose a sentence of the user's speech contains the word "prepare", but because the scene is complex or the performance of the acoustic model is limited, the phoneme string extracted for "prepare" is "zun3 bei4" rather than the standard "zhun3 bei4". Although the phoneme extraction is not accurate enough, the multi-pronunciation dictionary stores "zun3 bei4" as a pronunciation of "prepare", so the string is still recognized as "prepare" and the correct recognition result is obtained.
As the description above shows, in the multi-pronunciation dictionary constructed by the pronunciation dictionary construction method of this embodiment, one word corresponds to several pronunciations: the standard pronunciation of the word and the similar pronunciations of the word. When speech recognition is performed with this dictionary, even if phoneme extraction from the speech to be recognized is inaccurate, for example if the pronunciation of the speech is predicted as a similar pronunciation rather than the accurate pronunciation, the correct text can still be recognized. The multi-pronunciation dictionary constructed by this technical scheme can therefore improve the fault tolerance and robustness of speech recognition, improve the speech recognition effect, and obtain a better recognition effect even in complex scenes.
As an alternative embodiment, the step of "constructing a multi-pronunciation dictionary by storing each word in correspondence with its standard pronunciation and similar pronunciations" can be realized by performing the following steps A1 and A2:
A1, determining the standard pronunciation corresponding to a word in the pronunciation dictionary, and determining the similar pronunciations corresponding to that word according to the similar phonemes of the phonemes in the phoneme label.
Specifically, determining the standard pronunciation of a word in the pronunciation dictionary means determining the phonemes that make it up. Then, for the phonemes of the standard pronunciation, similar phonemes are selected from their similar phonemes to replace them, yielding similar pronunciations of the standard pronunciation, i.e. the similar pronunciations of the word.
For example, for the word "prepare", its standard pronunciation "zhun3 bei4" is determined first; then similar-phoneme substitution is performed on "zhun3 bei4" based on the similar phonemes of the phonemes in the phoneme label. Specifically, the similar phonemes of the phonemes in the standard pronunciation are determined from the similar phonemes of the labeled phonemes and used as replacements, which yields the similar pronunciations of the word. For instance, replacing the phoneme "zh" in "zhun3 bei4" with its similar phoneme "z" gives the similar pronunciation "zun3 bei4".
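The substitution procedure can be sketched as follows, assuming a mapping from phonemes to their similar phonemes; tone marks are omitted for brevity, and the helper name is hypothetical.

```python
# Derive candidate similar pronunciations by substituting one similar phoneme
# at a time into the standard pronunciation (step A1).
def generate_similar_pronunciations(standard, similar_map):
    """standard: list of phonemes; similar_map: phoneme -> similar phonemes."""
    candidates = []
    for i, phoneme in enumerate(standard):
        for substitute in similar_map.get(phoneme, []):
            variant = standard.copy()
            variant[i] = substitute  # replace a single phoneme
            candidates.append(" ".join(variant))
    return candidates

similar_map = {"zh": ["z"], "ei": ["en"]}  # illustrative similar phonemes
standard = ["zh", "un", "b", "ei"]         # phonemes of "zhun bei", tones omitted
print(generate_similar_pronunciations(standard, similar_map))
# ['z un b ei', 'zh un b en'] -> the "zun bei" / "zhun ben" variants above
```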
As a preferred embodiment, in order to minimize the negative effects of modifying the existing pronunciation dictionary, similar-pronunciation expansion may be performed only for the high-frequency error-prone words in the dictionary.
That is, when step A1 is executed, the standard pronunciations of the high-frequency error-prone words in the pronunciation dictionary are determined, and the similar pronunciations of those words are determined according to the similar phonemes of the phonemes in the phoneme label.
Specifically, a speech recognition baseline system is tested with a development set, and the words with a high recognition error rate, i.e. the high-frequency error-prone words, are screened out. The standard pronunciation and similar pronunciations of each screened word are then determined as introduced above.
As another alternative, when determining the similar pronunciations of a high-frequency error-prone word, the pronunciation of the word's erroneous recognition result may be used directly as a similar pronunciation of the word.
Specifically, the speech recognition baseline system is tested with a development set, and the high-frequency error word pairs, i.e. those with a high recognition error rate, are screened out. A high-frequency error word pair consists of a word that is recognized erroneously and the actual recognition result produced when it is misrecognized.
For example, suppose that when testing the baseline system, the word "prepare" is frequently recognized as a word pronounced "zhun3 ben4"; then "prepare" and that word form a high-frequency error word pair. When determining the similar pronunciations of "prepare", the pronunciation "zhun3 ben4" of the misrecognized result can be used directly as a similar pronunciation of "prepare".
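This pairing can be sketched directly; the error-pair list, the placeholder name for the misrecognized word, and the base dictionary are assumptions for illustration.

```python
# Take the pronunciation of a word's frequent misrecognition directly as one
# of its similar pronunciations (the alternative described above).
base_pronunciations = {"prepare": "zhun3 bei4", "wrong_word": "zhun3 ben4"}
error_pairs = [("prepare", "wrong_word")]  # (true word, frequent wrong result)

similar_from_errors = {}
for word, wrong in error_pairs:
    similar_from_errors.setdefault(word, []).append(base_pronunciations[wrong])

print(similar_from_errors)  # {'prepare': ['zhun3 ben4']}
```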
A2, storing the word in correspondence with its standard pronunciation and similar pronunciations to obtain the multi-pronunciation dictionary.
Specifically, each word in the pronunciation dictionary is stored in correspondence with its standard pronunciation and its similar pronunciations; this expands the pronunciation dictionary into the multi-pronunciation dictionary.
When step A1 determines the standard and similar pronunciations only for the high-frequency error-prone words in the pronunciation dictionary, step A2 correspondingly stores only those words with their standard and similar pronunciations to obtain the multi-pronunciation dictionary; that is, only the high-frequency error-prone words undergo pronunciation expansion.
Expanding only the high-frequency error-prone words improves, in a targeted way, the recognition of exactly those words, while avoiding negative effects of multi-pronunciation expansion on words whose recognition accuracy is already high.
As a more preferred embodiment, when the multi-pronunciation dictionary is built by storing the high-frequency error-prone words with their standard and similar pronunciations, the similar pronunciations can additionally be filtered, as in the following steps B1 to B3:
B1, calculating a score for each similar pronunciation of a high-frequency error-prone word according to that similar pronunciation and the recognition probabilities of the phonemes in the phoneme set.
Specifically, as described above, when phonemes are extracted from the target audio data, both the phoneme set and the recognition probability of each phoneme in it are obtained; from the correspondence between the phonemes in the phoneme set and the phonemes in the phoneme label, together with these recognition probabilities, the similar phonemes of each labeled phoneme and the similarity between a labeled phoneme and each of its similar phonemes can be determined.
On this basis, the score of a similar pronunciation of a high-frequency error-prone word can be calculated from the similarities between the phonemes of the similar pronunciation and the corresponding phonemes of the standard pronunciation.
For example, the arithmetic mean of the similarities between each phoneme of the similar pronunciation and the corresponding phoneme of the word's standard pronunciation is taken as the score of the similar pronunciation.
Taking the word "prepare" as an example, its standard pronunciation is "zhun3 bei4". For the similar pronunciation "zun3 bei4", suppose the similarity between the phoneme "z" and the phoneme "zh" is 0.8; the remaining phonemes of "zun3 bei4" are identical to those of "zhun3 bei4", so their similarities are 1. The score of the similar pronunciation "zun3 bei4" is then (0.8 + 1 + 1 + 1) / 4 = 0.95.
In the above manner, the score of each similar pronunciation of the high frequency error-prone word can be calculated.
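The scoring of step B1 and the threshold test of step B2 can be sketched as follows, reproducing the "prepare" arithmetic above; the similarity table and the threshold value are illustrative assumptions.

```python
# Score a similar pronunciation as the arithmetic mean of per-phoneme
# similarity to the standard pronunciation; identical phonemes count as 1.0.
def pronunciation_score(standard, similar, similarity):
    """standard/similar: phoneme lists of equal length;
    similarity: (label_phoneme, similar_phoneme) -> similarity in [0, 1]."""
    total = 0.0
    for s, p in zip(standard, similar):
        total += 1.0 if s == p else similarity.get((s, p), 0.0)
    return total / len(standard)

similarity = {("zh", "z"): 0.8}        # from the phoneme confusion matrix
standard = ["zh", "un", "b", "ei"]     # "zhun bei", tones omitted
candidate = ["z", "un", "b", "ei"]     # "zun bei"
score = pronunciation_score(standard, candidate, similarity)
print(score)         # (0.8 + 1 + 1 + 1) / 4 = 0.95
print(score > 0.9)   # True: passes an assumed score threshold of 0.9
```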
B2, selecting, according to the scores of the similar pronunciations corresponding to a high-frequency error-prone word, the similar pronunciations whose scores are higher than a set score threshold as the target similar pronunciations.
Specifically, the similar pronunciations with scores above the threshold are selected from the similar pronunciations of the high-frequency error-prone word and used as its target similar pronunciations.
B3, storing the high-frequency error-prone words in the pronunciation dictionary in correspondence with their standard pronunciations and target similar pronunciations to obtain the multi-pronunciation dictionary.
Specifically, when a high-frequency error-prone word is expanded, its standard pronunciation and its target similar pronunciations are stored in correspondence with the word.
Following the processing of B1 to B3, target similar pronunciations are selected and pronunciation expansion is performed for each high-frequency error-prone word in the pronunciation dictionary, yielding the final multi-pronunciation dictionary.
The processing of steps B1 to B3 screens the similar pronunciations of the high-frequency error-prone words. It guarantees that the similar pronunciations recorded in the multi-pronunciation dictionary are highly similar to the words' standard pronunciations, and prevents pronunciations that deviate too far from the standard pronunciation from harming the recognition of other, non-error-prone words.
Based on the pronunciation dictionary construction method above, another embodiment of the present application further provides a speech recognition method. As shown in FIG. 2, the method includes:
s201, acquiring a phoneme sequence of the speech to be recognized.
Specifically, a phoneme extraction process is performed on the speech to be recognized to obtain a phoneme sequence of the speech to be recognized.
For example, the speech to be recognized or the acoustic features of the speech to be recognized are input into a pre-trained acoustic model for phoneme extraction, so as to obtain a phoneme sequence of the speech to be recognized output by the acoustic model.
The above-mentioned acoustic model is a model trained in advance for performing phoneme extraction on audio data, and the specific training process thereof can be referred to in the description of the following embodiments.
S202, determining a voice recognition result of the voice to be recognized according to the phoneme sequence of the voice to be recognized and a pre-constructed multi-pronunciation dictionary.
The words in the multi-pronunciation dictionary are stored in correspondence with their standard pronunciations and their similar pronunciations; the standard pronunciation of a word is constructed from phonemes in the phoneme label corresponding to target audio data, the similar pronunciations of the word are constructed from similar phonemes of those phonemes, and the target audio data contains audio data corresponding to the word.
Specifically, the multi-pronunciation dictionary is constructed by the pronunciation dictionary construction method of any embodiment above; for the construction process and the contents of the dictionary, see the description of that method, which is not repeated here.
Based on the multi-pronunciation dictionary, once the phoneme sequence of the speech to be recognized is obtained, it is decoded against the dictionary to obtain the corresponding text. Illustratively, the phoneme sequence is matched against the pronunciations in the multi-pronunciation dictionary to determine the speech recognition result of the speech to be recognized.
For example, when a certain phoneme string in a phoneme sequence of a speech to be recognized matches a certain pronunciation in a multi-pronunciation dictionary, it may be determined that the text corresponding to the phoneme string is the text corresponding to the pronunciation in the multi-pronunciation dictionary.
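A toy sketch of this matching rule, assuming word boundaries are already known. A production system would compile the dictionary into a decoding network (for example a WFST) rather than use a flat lookup; the sketch only illustrates why the stored similar pronunciations add fault tolerance.

```python
# Map any stored pronunciation, standard or similar, back to its word.
def build_lookup(multi_dict):
    lookup = {}
    for word, entry in multi_dict.items():
        for pron in [entry["standard"], *entry["similar"]]:
            lookup[pron] = word
    return lookup

multi_dict = {"prepare": {"standard": "zhun3 bei4",
                          "similar": ["zun3 bei4", "zhun3 ben4"]}}
lookup = build_lookup(multi_dict)

# Even a slightly wrong phoneme string decodes to the right word.
print(lookup.get("zun3 bei4"))   # 'prepare'
print(lookup.get("zhun3 bei4"))  # 'prepare'
```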
Because a word in the multi-pronunciation dictionary corresponds not only to its standard pronunciation but also to its similar pronunciations, when phoneme extraction from the speech to be recognized is inaccurate, or when the speech is not clear enough because the speaker has an accent or the scene is complex and the accuracy of phoneme extraction suffers, the correct text can still be recognized even if the extracted phonemes differ from the standard pronunciation, for example when they form a similar pronunciation of the speech.
As the description above shows, the speech recognition method of this embodiment decodes the phoneme sequence of the speech to be recognized against a pre-constructed multi-pronunciation dictionary to obtain the recognition result. During this decoding, even if phoneme extraction is inaccurate, for example if the pronunciation of the speech is predicted as a similar pronunciation rather than the accurate pronunciation, the correct text can still be recognized. The multi-pronunciation-dictionary-based speech recognition method of this embodiment can therefore improve the fault tolerance and robustness of speech recognition, improve the recognition effect, and obtain a better recognition effect even in complex scenes.
Next, the above-described training process of the acoustic model will be described.
First, it should be noted that the acoustic model training scheme proposed in this embodiment applies to the acoustic model of any embodiment above: in particular, to the acoustic model used when the pronunciation dictionary construction method inputs target audio data into a pre-trained acoustic model for phoneme extraction, and equally to the acoustic model used when the speech recognition method inputs the speech to be recognized into a pre-trained acoustic model for phoneme extraction.
As a preferred embodiment, an acoustic model can be trained by the model training method of this embodiment and then used in both methods above: in the pronunciation dictionary construction method it extracts phonemes from the target audio data so that the multi-pronunciation dictionary can be built, and in the speech recognition method it extracts phonemes from the speech to be recognized so that the speech can be recognized.
Before introducing the acoustic model training scheme proposed in the embodiments of the present application, a general acoustic model training scheme is briefly introduced:
An acoustic model is generally modeled with a deep neural network, and the loss function adopted during training is a cross-entropy loss function of the following form:
$$L_{CE}=-\sum_{t=1}^{T}\log P(y_t\mid x)$$
where x represents the spectral feature vector of the speech (e.g., FilterBank, MFCC, etc.) and y_t represents the phoneme label corresponding to time t.
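For concreteness, the following sketch computes this frame-level cross-entropy loss with PyTorch; the frame count and phoneme inventory size are assumptions.

```python
import torch
import torch.nn.functional as F

T, num_phonemes = 100, 60                      # assumed sizes
logits = torch.randn(T, num_phonemes)          # per-frame acoustic model outputs
labels = torch.randint(0, num_phonemes, (T,))  # phoneme labels y_t
loss_ce = F.cross_entropy(logits, labels)      # mean over t of -log P(y_t | x)
print(loss_ce.item())
```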
To improve the classification accuracy of acoustic models in complex scenes, the current mainstream approach works on the training data: it collects a large amount of real audio from complex scenes, or generates by machine simulation a large amount of data matching the distribution of the target scene, and then adds these data to the training set for mixed training.
In addition, the cross-entropy training criterion itself limits acoustic model training. The cross-entropy criterion only cares whether the current predicted value matches the true target value, i.e., it is a maximum-likelihood training scheme: whenever the model's prediction disagrees with the true target, the training cost is the same no matter how it disagrees. Such a scheme offers no guidance for improving model robustness. For example, in Chinese acoustic modeling, when the true phoneme label is a1, predicting a2 and predicting z1 incur the same training cost; yet recognizing the target phoneme a1 as a2 generally causes the user little difficulty in understanding the sentence, while recognizing a1 as z1 generally distorts the sentence meaning badly. Therefore, an acoustic model training scheme based purely on a cross-entropy loss function greatly reduces the robustness of the model.
Referring to fig. 3, the acoustic model training scheme proposed in the embodiment of the present application mainly includes the following steps C1-C3:
C1, inputting the audio features of training speech into an acoustic model to obtain the phoneme information of the training speech.
Specifically, referring to fig. 3, the audio features x_{t=1,…,T} of a training speech are input into the acoustic model ASR-Net, which can be any type of acoustic model, such as LSTM, TDNN, DFSMN, etc.
The acoustic model extracts the phoneme information H_txt corresponding to the audio based on the input acoustic features. The phoneme information is the content output by the last hidden layer of the acoustic model; this content is mostly audio content information, while information such as the speaker and channel in the audio is weakened. The phoneme information extracted by the acoustic model can therefore generally be understood as semantic information of the audio, so the phoneme information H_txt extracted by the acoustic model ASR-Net can be treated as equivalent to semantic feature information of the training speech.
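As a hedged sketch of such an ASR-Net, the snippet below uses an LSTM encoder (one of the model types named above) whose last hidden layer output plays the role of H_txt; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ASRNet(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_phonemes=60):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(hidden, num_phonemes)

    def forward(self, x):                # x: (batch, T, feat_dim)
        h_txt, _ = self.encoder(x)       # last hidden layer output: H_txt
        logits = self.classifier(h_txt)  # per-frame phoneme logits
        return logits, h_txt

logits, h_txt = ASRNet()(torch.randn(1, 100, 80))
print(h_txt.shape)  # torch.Size([1, 100, 256])
```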
C2, inputting the phoneme information of the training speech and the non-semantic information extracted according to the audio features of the training speech into an audio synthesis model to obtain an audio synthesis result.
Specifically, the audio synthesis model is a Text-To-Speech (TTS) network TTS-Net, which can synthesize speech from the semantic information corresponding to the speech together with speaker information, environment information, and the like. For the function and structure of the TTS network, reference may be made to conventional TTS networks.
Based on the above audio synthesis model, in the embodiment of the present application, the phoneme information extracted from the training speech and the non-semantic information extracted according to the audio features of the training speech are input into the audio synthesis model to obtain the audio synthesis result output by the audio synthesis model.
The non-semantic information extracted according to the audio features of the training speech refers to non-semantic information, such as the speaker and environmental factors, that is contained in the training speech and extracted from its audio features.
As an exemplary implementation, referring to fig. 3, the audio features of the training speech are input into a Long Short-Term Memory (LSTM) network, from which the non-semantic information of the training speech can be extracted. In practical implementations, a convolutional neural network (CNN) or an attention-based neural network may also be used to extract the non-semantic information of the training speech.
The phoneme information of the training speech and the non-semantic information of the training speech are then input into the audio synthesis model TTS-Net to obtain the audio synthesis result output by the audio synthesis model.
It should be noted that the above audio synthesis model is assumed to be ideally effective; that is, in the ideal case, when the semantic information of the training speech (i.e., the phoneme information) and the non-semantic information of the training speech are input into the audio synthesis model, the model can accurately synthesize the training speech. In other words, the training speech can be accurately recovered from its semantic and non-semantic information.
As a preferred implementation, in order to prevent the audio synthesis model TTS-Net from directly copying the input semantic and non-semantic information, in the embodiment of the present application the phoneme information of the training speech and the audio features of the training speech are downsampled, and audio synthesis is then performed based on the downsampled information.
Specifically, referring to fig. 3, the phoneme information of the training speech and the audio features of the training speech are downsampled at the same scale. In the embodiment of the present application, R-frame downsampling is exemplarily performed on the phoneme information and the audio features respectively; that is, the average of every R frames of data is taken as one sampled value. For example, an average may be taken every 4 frames as a sampled value.
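A minimal sketch of this R-frame mean downsampling follows; dropping any incomplete trailing window is a simplifying assumption.

```python
import torch

def downsample_mean(x, R):
    """Average every R frames: (T, D) -> (T // R, D)."""
    T, D = x.shape
    T_r = T // R
    return x[: T_r * R].reshape(T_r, R, D).mean(dim=1)

feats = torch.randn(100, 80)            # e.g. 100 frames of FilterBank features
print(downsample_mean(feats, 4).shape)  # torch.Size([25, 80])
```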
In the single-frame prediction case, i.e., when R = 1, noise needs to be added to the audio features of the training speech that are input to the LSTM network, as shown in fig. 3.
Taking the phoneme information H_txt of the training speech as an example, it is downsampled by R frames: for every R frames of phoneme information h_{txt,1}, …, h_{txt,R}, a mean AVG(h_{txt,1}, …, h_{txt,R}) is calculated, where
$$\mathrm{AVG}(h_{txt,1},\ldots,h_{txt,R})=\frac{1}{R}\sum_{r=1}^{R}\sum_{k=1}^{K}\alpha_{r,k}W_{k}$$
where K denotes the number of top-K candidates, α_{r,k} denotes the posterior probability corresponding to the top-k label of the current frame r, and W_k denotes the embedding (connection weight) of the top-k node.
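The following sketch realizes this top-K weighted construction of per-frame phoneme information and its R-frame mean; the tensor shapes and the K and R values are assumptions.

```python
import torch

def soft_phoneme_embedding(posteriors, W, K):
    """One frame's phoneme information: sum over the K largest posterior
    probabilities alpha_k times the matching embedding rows W_k."""
    alpha, idx = posteriors.topk(K)
    return (alpha.unsqueeze(-1) * W[idx]).sum(dim=0)

def avg_window(posterior_frames, W, K):
    """AVG(h_txt,1, ..., h_txt,R): mean of per-frame embeddings over R frames."""
    h = [soft_phoneme_embedding(p, W, K) for p in posterior_frames]
    return torch.stack(h).mean(dim=0)

W = torch.randn(60, 256)  # one embedding row per phoneme (assumed sizes)
window = [torch.softmax(torch.randn(60), dim=0) for _ in range(4)]  # R = 4
print(avg_window(window, W, K=5).shape)  # torch.Size([256])
```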
Then, the non-semantic information of the training speech is extracted from the downsampled audio features of the training speech.
As an exemplary implementation, referring to fig. 3, the downsampled audio features of the training speech are input into a Long Short-Term Memory (LSTM) network, from which the non-semantic information of the training speech can be extracted. In practical implementations, a convolutional neural network (CNN) or an attention-based neural network may also be used to extract the non-semantic information of the training speech.
Finally, fusing the downsampled phoneme information and the non-semantic information to obtain audio synthesis basic information; and inputting the audio synthesis basic information into an audio synthesis model to obtain an audio synthesis result.
For example, referring to fig. 3, the non-semantic information of the training speech is first encoded, then concatenated (concat) with the downsampled phoneme information of the training speech, and the resulting concatenated information is used as the audio synthesis basic information. The audio synthesis basic information is then input into the audio synthesis model TTS-Net for audio synthesis to obtain the audio synthesis result.
As an optional implementation, the fusion of the downsampled phoneme information and the non-semantic information may also be implemented by network fusion or an attention mechanism, in addition to direct concatenation in the concat manner.
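The concat variant of this fusion step can be sketched as follows; all feature dimensions are assumptions.

```python
import torch

h_txt_ds = torch.randn(25, 256)  # downsampled phoneme information
nonsem = torch.randn(25, 64)     # encoded speaker/environment information
synth_in = torch.cat([h_txt_ds, nonsem], dim=-1)  # audio synthesis basic information
print(synth_in.shape)            # torch.Size([25, 320])
```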
C3, correcting parameters of the acoustic model according to the voice recognition loss of the acoustic model and the audio synthesis loss of the audio synthesis model.
Specifically, the speech recognition loss of the acoustic model may be the loss between the phonemes extracted by the acoustic model for the training speech and the phoneme labels of the training speech, or the loss between a speech recognition result determined from the acoustic model's phoneme extraction result and the text labels of the training speech. This loss may be determined, for example, by a cross-entropy loss function; that is, the speech recognition loss of the acoustic model is the cross-entropy loss CE-loss.
The audio synthesis loss of the audio synthesis model is specifically the loss between the audio synthesis result output by the audio synthesis model and the training speech. It may be determined, for example, by a mean-square-error loss function, i.e., the audio synthesis loss is the mean-square-error loss MSE-loss; alternatively, the audio synthesis loss may be MDN-loss.
Assume that the speech recognition loss of the acoustic model is L_CE and the audio synthesis loss of the audio synthesis model is L_TTS. The final acoustic model training loss L_SUM is the sum of the two, namely: L_SUM = L_CE + αL_TTS, where α is a hyperparameter.
The parameters of the acoustic model are then corrected based on the acoustic model training loss L_SUM described above.
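A self-contained sketch of computing this joint loss follows; the placeholder tensors stand in for real ASR-Net and TTS-Net outputs, and the α value is an assumed setting.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(100, 60, requires_grad=True)       # ASR-Net frame logits
labels = torch.randint(0, 60, (100,))                   # phoneme labels
synth_feats = torch.randn(100, 80, requires_grad=True)  # TTS-Net synthesis output
target_feats = torch.randn(100, 80)                     # training-speech features

alpha = 0.5                                       # hyperparameter (assumed value)
loss_ce = F.cross_entropy(logits, labels)         # speech recognition loss L_CE
loss_tts = F.mse_loss(synth_feats, target_feats)  # audio synthesis loss L_TTS
loss_sum = loss_ce + alpha * loss_tts             # L_SUM = L_CE + alpha * L_TTS
loss_sum.backward()  # gradients reach the acoustic model through both terms
```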
It can be understood that when the acoustic model is trained according to the above scheme, the training loss includes not only the loss of the acoustic model itself but also the loss of the audio synthesis model.
This design exploits the characteristics of the TTS-Net audio synthesis network: when synthesizing audio, the TTS network performs the synthesis from the text features of the speech (corresponding to H_txt in fig. 3) together with the speech code information (including speaker, environment, etc.). Because information such as the speaker and environmental factors is steady-state within a piece of speech, the information from the previous moment, or the previous N moments, can be directly reused when synthesizing the current speech frame, whereas the corresponding text information is time-varying and must correspond one-to-one with the current moment. As can be seen from fig. 3, the semantic features are the predicted outputs of the acoustic model ASR-Net network; for the synthesis to be more accurate, the output of the ASR-Net network itself needs to be more accurate.
Meanwhile, the TTS-Net network is trained in a probability-likelihood manner, so the accuracy of its output is linearly related to the accuracy of its input: the larger the difference between the input content and the standard content, the larger the difference between the output speech and the standard speech, i.e., the larger the loss; the smaller the input difference, the smaller the output difference and the loss.
Therefore, for the TTS-Net network, different types of errors yield different costs under the corresponding training criterion: the larger the difference between the prediction output of the ASR-Net network and the target label, the larger the loss cost of the TTS-Net network and, correspondingly, the larger the training loss of the acoustic model; conversely, the smaller that difference, the smaller the TTS-Net loss cost and the acoustic model training loss. This achieves differentiated error penalties on the prediction output of the ASR-Net network, so that after repeated iterative training converges, even when the acoustic model ASR-Net mispredicts a phoneme, the error falls on a phoneme similar to the current true label.
It can be seen that, by adding the audio synthesis model and the audio synthesis loss, the acoustic model training scheme provided in the embodiment of the present application penalizes model classification errors in a differentiated way: the closer a misclassification is in pronunciation to the target phoneme, the smaller the penalty, and the farther it is, the larger the penalty. With the audio synthesis model assisting acoustic model training, in complex-scene audio tests the acoustic model, even when it errs, tends to predict the same phoneme with at most a difference in tone. That is, an acoustic model trained with the training scheme provided in the embodiment of the present application makes essentially no wildly off-target errors, so the speech recognition effect can be improved.
With an acoustic model trained in the above manner, combined with the multi-pronunciation dictionary constructed by the pronunciation dictionary construction method provided in the above embodiment of the present application and the corresponding speech recognition method, the fault tolerance and robustness of speech recognition can be systematically and significantly improved, and the speech recognition effect markedly enhanced.
In accordance with the pronunciation dictionary construction method, another embodiment of the present application further provides a pronunciation dictionary construction apparatus, as shown in fig. 4, the apparatus includes:
the phoneme extracting unit 001 is used for extracting phonemes from the target audio data to obtain a corresponding phoneme set; the target audio data is audio data covering all phonemes;
a phoneme screening unit 002, configured to determine, from the phoneme set, similar phonemes of phonemes in the phoneme label according to the recognition probability of each phoneme in the extracted phoneme set and the phoneme label corresponding to the target audio data; the similar phonemes of the phonemes in the phoneme label refer to a set number of phonemes which are selected from the phoneme set, correspond to the phonemes in the phoneme label and have the highest recognition probability;
the dictionary construction unit 003 is used for constructing and obtaining a multi-pronunciation dictionary according to the rules that the words and the standard pronunciations and the similar pronunciations corresponding to the words are correspondingly stored; wherein the standard pronunciation is composed of phonemes in the phoneme label, and the similar pronunciation is composed of similar phonemes of the phonemes in the phoneme label.
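As a hedged illustration of the phoneme screening unit's selection rule, the following sketch picks, for a given label phoneme, the set number of phonemes with the highest average recognition probability; the function names, shapes, and toy numbers are assumptions.

```python
import numpy as np

def similar_phonemes(frame_posteriors, inventory, label, n=3):
    """Select the n phonemes with the highest average recognition
    probability over frames labeled `label`, excluding the label itself.
    frame_posteriors: (num_frames, num_phonemes) posterior array."""
    avg = np.asarray(frame_posteriors).mean(axis=0)
    order = np.argsort(avg)[::-1]
    return [inventory[i] for i in order if inventory[i] != label][:n]

inventory = ["a1", "a2", "z1", "o1"]          # toy phoneme inventory
posts = np.array([[0.60, 0.30, 0.04, 0.06],
                  [0.50, 0.40, 0.04, 0.06]])  # frames labeled "a1"
print(similar_phonemes(posts, inventory, "a1", n=2))  # ['a2', 'o1']
```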
As an alternative embodiment, the multi-pronunciation dictionary is constructed according to the rule that the words correspond to the standard pronunciations and similar pronunciations corresponding to the words, and the method comprises the following steps:
determining a standard pronunciation corresponding to a word in a pronunciation dictionary, and determining a similar pronunciation corresponding to the word in the pronunciation dictionary according to a similar phoneme of the phoneme in the phoneme label;
and obtaining the multi-pronunciation dictionary by correspondingly storing the standard pronunciation and the similar pronunciation corresponding to the word.
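A minimal sketch of this corresponding storage follows; the entry layout and example phonemes are assumptions for illustration, not the patent's storage format.

```python
multi_pron_dict = {}

def add_entry(word, standard_pron, similar_prons):
    """Store a word together with its standard and similar pronunciations."""
    multi_pron_dict[word] = {
        "standard": tuple(standard_pron),
        "similar": [tuple(p) for p in similar_prons],
    }

add_entry("hello", ("h", "e", "l", "ou"), [("h", "ə", "l", "ou")])
print(multi_pron_dict["hello"])
```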
As an alternative embodiment, determining a standard pronunciation corresponding to a word in a pronunciation dictionary, and determining a similar pronunciation corresponding to a word in a pronunciation dictionary based on a similar phoneme of a phoneme in the phoneme label, includes:
respectively determining standard pronunciations corresponding to high-frequency error-prone words in a pronunciation dictionary, and respectively determining similar pronunciations corresponding to the high-frequency error-prone words in the pronunciation dictionary according to similar phonemes of phonemes in the phoneme label;
obtaining a multi-pronunciation dictionary by storing a word corresponding to a standard pronunciation and a similar pronunciation corresponding to the word, comprising:
and correspondingly storing the high-frequency error-prone words in the pronunciation dictionary and the standard pronunciations and similar pronunciations corresponding to the high-frequency error-prone words to obtain the multi-pronunciation dictionary.
As an alternative embodiment, the obtaining a multi-pronunciation dictionary by storing the high-frequency error-prone word in the pronunciation dictionary and the standard pronunciation and similar pronunciation correspondence corresponding to the high-frequency error-prone word comprises:
calculating to obtain a score of the similar pronunciation corresponding to the high-frequency error-prone word according to the similar pronunciation corresponding to the high-frequency error-prone word in the pronunciation dictionary and the recognition probability of each phoneme in the phoneme set;
selecting similar pronunciations with the score higher than a set score threshold value from the similar pronunciations corresponding to the high-frequency error-prone words according to the scores of the similar pronunciations corresponding to the high-frequency error-prone words, and using the similar pronunciations as target similar pronunciations;
and correspondingly storing the high-frequency error-prone words in the pronunciation dictionary, the standard pronunciations corresponding to the high-frequency error-prone words and the target similar pronunciations to obtain a multi-pronunciation dictionary.
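The scoring and thresholding step can be sketched as below. The geometric mean of phoneme recognition probabilities is an assumed scoring rule chosen for illustration; the embodiment only specifies that the score is computed from the recognition probabilities of the phonemes in the phoneme set.

```python
import math

def score(pron, recog_prob):
    """Geometric mean of the recognition probabilities of the phonemes."""
    return math.exp(sum(math.log(recog_prob[p]) for p in pron) / len(pron))

def target_similar_prons(candidates, recog_prob, threshold=0.3):
    """Keep similar pronunciations whose score exceeds the set threshold."""
    return [p for p in candidates if score(p, recog_prob) > threshold]

recog_prob = {"h": 0.9, "e": 0.8, "ə": 0.4, "l": 0.95, "ou": 0.85}
print(target_similar_prons([("h", "ə", "l", "ou")], recog_prob))
```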
As an optional implementation manner, the extracting phonemes from the target audio data to obtain a corresponding phoneme set includes:
and inputting the target audio data into a pre-trained acoustic model for phoneme extraction to obtain a phoneme set corresponding to the target audio data.
As an alternative embodiment, the acoustic model is trained as follows:
inputting the audio features of training speech into an acoustic model to obtain phoneme information of the training speech;
inputting the phoneme information of the training voice and the non-semantic information extracted according to the audio features of the training voice into an audio synthesis model to obtain an audio synthesis result;
and performing parameter correction on the acoustic model according to the voice recognition loss of the acoustic model and the audio synthesis loss of the audio synthesis model.
As an optional implementation manner, inputting phoneme information of the training speech and non-semantic information extracted according to the audio feature of the training speech into an audio synthesis model to obtain an audio synthesis result, including:
carrying out down-sampling of the same scale on the phoneme information of the training voice and the audio features of the training voice;
extracting and obtaining non-semantic information of the training voice according to the audio features of the training voice after down sampling;
fusing the downsampled phoneme information and the non-semantic information to obtain audio synthesis basic information;
and inputting the audio synthesis basic information into an audio synthesis model to obtain an audio synthesis result.
As an alternative embodiment, the speech recognition loss of the acoustic model is determined by a cross-entropy loss function, and the audio synthesis loss of the audio synthesis model is determined by a mean-square-error loss function.
Specifically, the specific work content of each unit of the pronunciation dictionary construction device is referred to the processing content of the corresponding processing step of the pronunciation dictionary construction method, and is not repeated here.
In correspondence with the above-mentioned speech recognition method, another embodiment of the present application further provides a speech recognition apparatus, as shown in fig. 5, the apparatus includes:
the voice processing unit 010 is configured to obtain a phoneme sequence of a voice to be recognized;
the voice recognition unit 011 is used for determining a voice recognition result of the voice to be recognized according to the phoneme sequence of the voice to be recognized and a pre-constructed multi-pronunciation dictionary;
wherein, the words in the multi-pronunciation dictionary correspond to the standard pronunciation and the similar pronunciation corresponding to the words and are stored; the standard pronunciation corresponding to the word is constructed by the phoneme in the phoneme label corresponding to the target audio data, and the similar pronunciation corresponding to the word is constructed by the similar phoneme of the phoneme in the phoneme label; the target audio data contains audio data corresponding to the word.
As an alternative embodiment, the multi-pronunciation dictionary is constructed according to the pronunciation dictionary construction method.
As an optional implementation, the obtaining a phoneme sequence of a speech to be recognized includes:
and inputting the speech to be recognized into a pre-trained acoustic model for phoneme extraction to obtain a phoneme sequence of the speech to be recognized.
As an alternative embodiment, the acoustic model is trained as follows:
inputting the audio features of training speech into an acoustic model to obtain phoneme information of the training speech;
inputting the phoneme information of the training voice and the non-semantic information extracted according to the audio features of the training voice into an audio synthesis model to obtain an audio synthesis result;
and performing parameter correction on the acoustic model according to the voice recognition loss of the acoustic model and the audio synthesis loss of the audio synthesis model.
As an optional implementation manner, inputting phoneme information of the training speech and non-semantic information extracted according to the audio feature of the training speech into an audio synthesis model to obtain an audio synthesis result, including:
carrying out down-sampling of the same scale on the phoneme information of the training voice and the audio features of the training voice;
extracting and obtaining non-semantic information of the training voice according to the audio features of the training voice after down sampling;
fusing the downsampled phoneme information and the non-semantic information to obtain audio synthesis basic information;
and inputting the audio synthesis basic information into an audio synthesis model to obtain an audio synthesis result.
As an alternative embodiment, the speech recognition loss of the acoustic model is determined by a cross-entropy loss function, and the audio synthesis loss of the audio synthesis model is determined by a mean-square-error loss function.
Specifically, the specific working contents of each unit of the speech recognition apparatus and the specific contents of the acoustic model training process are referred to the processing contents of the corresponding processing steps of the speech recognition method, and are not repeated here.
Another embodiment of the present application further provides an electronic device, as shown in fig. 6, the electronic device including:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the pronunciation dictionary construction method disclosed in any one of the above embodiments or implement the speech recognition method disclosed in any one of the above embodiments by running the program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present invention. It may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any device that uses any transceiver or the like to communicate with other devices or communication networks, such as an ethernet network, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the program stored in the memory 200 and invokes other devices, which can be used to implement the steps of the pronunciation dictionary construction method provided in the above-mentioned embodiment of the present application or the steps of the speech recognition method provided in the above-mentioned embodiment of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the pronunciation dictionary construction method provided in the above-mentioned embodiment of the present application, or implements the steps of the speech recognition method provided in the above-mentioned embodiment of the present application.
Specifically, the specific work content of each part of the electronic device and the specific processing content of the computer program on the storage medium when being executed by the processor may refer to the content of each embodiment of the pronunciation dictionary construction method or the content of each embodiment of the speech recognition method, which are not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software cells may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A pronunciation dictionary construction method is characterized by comprising the following steps:
extracting phonemes from the target audio data to obtain a corresponding phoneme set; the target audio data is audio data covering all phonemes;
according to the recognition probability of each phoneme in the extracted phoneme set and the phoneme label corresponding to the target audio data, determining similar phonemes of the phonemes in the phoneme label from the phoneme set; the similar phonemes of the phonemes in the phoneme label refer to a set number of phonemes which are selected from the phoneme set, correspond to the phonemes in the phoneme label and have the highest recognition probability;
constructing a multi-pronunciation dictionary according to the standard pronunciation corresponding to the word and the rule correspondingly stored by the similar pronunciation; wherein the similar pronunciation is composed of similar phonemes of the phonemes in the phoneme label.
2. The method of claim 1, wherein constructing a multi-pronunciation dictionary according to rules stored corresponding to standard pronunciations and similar pronunciations of words corresponding to the words comprises:
determining a standard pronunciation corresponding to a word in a pronunciation dictionary, and determining a similar pronunciation corresponding to the word in the pronunciation dictionary according to a similar phoneme of the phoneme in the phoneme label;
and obtaining the multi-pronunciation dictionary by correspondingly storing the standard pronunciation and the similar pronunciation corresponding to the word.
3. The method of claim 2, wherein determining a standard pronunciation corresponding to a word in a pronunciation dictionary and determining a similar pronunciation corresponding to a word in a pronunciation dictionary based on a similar phoneme of a phoneme in the phoneme label comprises:
respectively determining standard pronunciations corresponding to high-frequency error-prone words in a pronunciation dictionary, and respectively determining similar pronunciations corresponding to the high-frequency error-prone words in the pronunciation dictionary according to similar phonemes of phonemes in the phoneme label;
obtaining a multi-pronunciation dictionary by storing a word corresponding to a standard pronunciation and a similar pronunciation corresponding to the word, comprising:
and correspondingly storing the high-frequency error-prone words in the pronunciation dictionary and the standard pronunciations and similar pronunciations corresponding to the high-frequency error-prone words to obtain the multi-pronunciation dictionary.
4. The method of claim 3, wherein obtaining a multi-pronunciation dictionary by storing the high-frequency error-prone word in the pronunciation dictionary with the standard pronunciation and similar pronunciation correspondences corresponding to the high-frequency error-prone word comprises:
calculating to obtain a score of the similar pronunciation corresponding to the high-frequency error-prone word according to the similar pronunciation corresponding to the high-frequency error-prone word in the pronunciation dictionary and the recognition probability of each phoneme in the phoneme set;
selecting similar pronunciations with the score higher than a set score threshold value from the similar pronunciations corresponding to the high-frequency error-prone words according to the scores of the similar pronunciations corresponding to the high-frequency error-prone words, and using the similar pronunciations as target similar pronunciations;
and correspondingly storing the high-frequency error-prone words in the pronunciation dictionary, the standard pronunciations corresponding to the high-frequency error-prone words and the target similar pronunciations to obtain a multi-pronunciation dictionary.
5. The method of claim 1, wherein the extracting phonemes from the target audio data to obtain a corresponding set of phonemes comprises:
and inputting the target audio data into a pre-trained acoustic model for phoneme extraction to obtain a phoneme set corresponding to the target audio data.
6. A speech recognition method, comprising:
acquiring a phoneme sequence of a voice to be recognized;
determining a voice recognition result of the voice to be recognized according to the phoneme sequence of the voice to be recognized and a pre-constructed polyphonic dictionary;
wherein, the words in the multi-pronunciation dictionary correspond to the standard pronunciation and the similar pronunciation corresponding to the words and are stored; the similar pronunciation corresponding to the word is constructed by the similar phoneme of the phoneme in the phoneme label corresponding to the target audio data; the target audio data contains audio data corresponding to the word.
7. The method according to claim 6, wherein the multi-pronunciation dictionary is constructed according to the pronunciation dictionary construction method of any one of claims 1 to 5.
8. The method of claim 6, wherein the obtaining the phoneme sequence of the speech to be recognized comprises:
and inputting the speech to be recognized into a pre-trained acoustic model for phoneme extraction to obtain a phoneme sequence of the speech to be recognized.
9. The method of claim 5 or 8, wherein the acoustic model is trained as follows:
inputting the audio features of training speech into an acoustic model to obtain phoneme information of the training speech;
inputting the phoneme information of the training voice and the non-semantic information extracted according to the audio features of the training voice into an audio synthesis model to obtain an audio synthesis result;
and performing parameter correction on the acoustic model according to the voice recognition loss of the acoustic model and the audio synthesis loss of the audio synthesis model.
10. The method of claim 9, wherein inputting the phoneme information of the training speech and the extracted non-semantic information according to the audio features of the training speech into an audio synthesis model to obtain an audio synthesis result comprises:
carrying out down-sampling of the same scale on the phoneme information of the training voice and the audio features of the training voice;
extracting and obtaining non-semantic information of the training voice according to the audio features of the training voice after down sampling;
fusing the downsampled phoneme information and the non-semantic information to obtain audio synthesis basic information;
and inputting the audio synthesis basic information into an audio synthesis model to obtain an audio synthesis result.
11. The method of claim 9, wherein the speech recognition penalty of the acoustic model is determined by a cross entropy penalty function and the audio synthesis penalty of the audio synthesis model is determined by a mean square error penalty function.
12. A pronunciation dictionary construction apparatus, comprising:
the phoneme extraction unit is used for extracting phonemes from the target audio data to obtain a corresponding phoneme set; the target audio data is audio data covering all phonemes;
a phoneme screening unit, configured to determine, according to the recognition probability of each phoneme in the extracted phoneme set and a phoneme label corresponding to the target audio data, a similar phoneme of a phoneme in the phoneme label from the phoneme set; the similar phonemes of the phonemes in the phoneme label refer to a set number of phonemes which are selected from the phoneme set, correspond to the phonemes in the phoneme label and have the highest recognition probability;
the dictionary building unit is used for building a multi-pronunciation dictionary according to the standard pronunciation corresponding to the word and the rule correspondingly stored by the similar pronunciation; wherein the similar pronunciation is composed of similar phonemes of the phonemes in the phoneme label.
13. A speech recognition apparatus, comprising:
the voice processing unit is used for acquiring a phoneme sequence of a voice to be recognized;
the voice recognition unit is used for determining a voice recognition result of the voice to be recognized according to the phoneme sequence of the voice to be recognized and a pre-constructed multi-pronunciation dictionary;
wherein, the words in the multi-pronunciation dictionary correspond to the standard pronunciation and the similar pronunciation corresponding to the words and are stored; the similar pronunciation corresponding to the word is constructed by the similar phoneme of the phoneme in the phoneme label corresponding to the target audio data; the target audio data contains audio data corresponding to the word.
14. An electronic device, comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor is configured to implement the pronunciation dictionary construction method according to any one of claims 1 to 5 or the speech recognition method according to any one of claims 6 to 11 by executing a program in the memory.
15. A storage medium having stored thereon a computer program which, when executed by a processor, implements the pronunciation dictionary construction method according to any one of claims 1 to 5 or implements the speech recognition method according to any one of claims 6 to 11.
CN202111222208.4A 2021-10-20 2021-10-20 Pronunciation dictionary construction method, voice recognition method and related device Pending CN113936642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111222208.4A CN113936642A (en) 2021-10-20 2021-10-20 Pronunciation dictionary construction method, voice recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111222208.4A CN113936642A (en) 2021-10-20 2021-10-20 Pronunciation dictionary construction method, voice recognition method and related device

Publications (1)

Publication Number Publication Date
CN113936642A true CN113936642A (en) 2022-01-14

Family

ID=79281065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111222208.4A Pending CN113936642A (en) 2021-10-20 2021-10-20 Pronunciation dictionary construction method, voice recognition method and related device

Country Status (1)

Country Link
CN (1) CN113936642A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492418A (en) * 2022-02-09 2022-05-13 西安讯飞超脑信息科技有限公司 Text conversion method and related device
CN115083437A (en) * 2022-05-17 2022-09-20 北京语言大学 Method and device for determining uncertainty of learner pronunciation


Similar Documents

Publication Publication Date Title
CN110263322B (en) Audio corpus screening method and device for speech recognition and computer equipment
Pratap et al. Scaling speech technology to 1,000+ languages
CN111369996B (en) Speech recognition text error correction method in specific field
CN113470662B (en) Generating and using text-to-speech data for keyword detection system and speaker adaptation in speech recognition system
CN108447486B (en) Voice translation method and device
CN108287858B (en) Semantic extraction method and device for natural language
US8185376B2 (en) Identifying language origin of words
JP6831343B2 (en) Learning equipment, learning methods and learning programs
CN109461438B (en) Voice recognition method, device, equipment and storage medium
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN111552777B (en) Audio identification method and device, electronic equipment and storage medium
CN113936642A (en) Pronunciation dictionary construction method, voice recognition method and related device
CN109979257B (en) Method for performing accurate splitting operation correction based on English reading automatic scoring
JP6941494B2 (en) End-to-end Japanese speech recognition model learning device and program
JP7111758B2 (en) Speech recognition error correction device, speech recognition error correction method and speech recognition error correction program
JP2018206262A (en) Word linking identification model learning device, word linking detection device, method and program
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
CN114333759A (en) Model training method, speech synthesis method, apparatus and computer program product
CN114298048A (en) Named entity identification method and device
JP6577900B2 (en) Phoneme error acquisition device, phoneme error acquisition method, and program
CN117292680A (en) Voice recognition method for power transmission operation detection based on small sample synthesis
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination