CN112530414B - Iterative large-scale pronunciation dictionary construction method and device - Google Patents

Iterative large-scale pronunciation dictionary construction method and device

Info

Publication number
CN112530414B
CN112530414B (application CN202110178948.6A)
Authority
CN
China
Prior art keywords
entry
phonetic symbol
binary group
matching degree
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110178948.6A
Other languages
Chinese (zh)
Other versions
CN112530414A (en)
Inventor
王治愚
王大亮
王丽媛
齐红威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datatang Beijing Technology Co ltd
Original Assignee
Datatang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datatang Beijing Technology Co ltd
Priority to CN202110178948.6A
Publication of CN112530414A
Application granted
Publication of CN112530414B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources


Abstract

The invention discloses an iterative large-scale pronunciation dictionary construction method and device. The method comprises: generating an entry sequence from raw text data; generating a phonetic symbol sequence from raw audio data; generating binary groups <entry, phonetic symbol> from the entry sequence with a G2P model; generating binary groups <phonetic symbol, entry> from the phonetic symbol sequence with a P2G model; calculating the matching degree between the two binary groups, comparing it with a preset matching degree, and extracting discriminative samples from the binary groups <entry, phonetic symbol> and <phonetic symbol, entry> whose matching degree is below the preset matching degree; and obtaining the domain experts' labeling and correction of the discriminative samples, then storing the labeled and corrected binary groups <entry, phonetic symbol> and <phonetic symbol, entry> in the multi-level large-scale pronunciation dictionary. The invention can quickly and effectively construct a large-scale pronunciation dictionary, improve the working efficiency of the speech recognition system and reduce labor cost.

Description

Iterative large-scale pronunciation dictionary construction method and device
Technical Field
The invention relates to the technical field of dictionary construction, in particular to an iterative large-scale pronunciation dictionary construction method and device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the continuous innovation of science and technology, the speech field has developed rapidly, driving continuous updating and iteration of speech recognition systems. A speech recognition system is composed of three parts: an acoustic model, a pronunciation dictionary and a language model. The pronunciation dictionary is a very important component, the bridge connecting the acoustic model and the language model. How to construct the pronunciation dictionary is therefore a significant and arduous task for a speech recognition system, and the scale of the constructed pronunciation dictionary directly constrains the accuracy of the whole system.
Existing pronunciation dictionary construction methods fall into three main types: rule-based, machine-learning-based and neural-network-based. Their model scale is limited, and for the task of constructing a large-scale pronunciation dictionary, manually collecting and constructing the dictionary's entry and phonetic symbol data consumes large amounts of manpower and material resources, so the finally generated pronunciation dictionary generally contains only on the order of tens of thousands of entries. A method for quickly and effectively constructing a large-scale pronunciation dictionary is therefore urgently needed, to improve the working efficiency of the speech recognition system and reduce labor cost.
Disclosure of Invention
The embodiment of the invention provides an iterative large-scale pronunciation dictionary construction method for quickly and effectively constructing a large-scale pronunciation dictionary, improving the working efficiency of a speech recognition system and reducing labor cost. The method comprises the following steps:
preprocessing input text raw data to generate an entry sequence;
preprocessing input audio data to generate a phonetic symbol sequence;
processing the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generating a binary group < entry, phonetic symbol >;
processing the phonetic symbol sequence based on a P2G model to obtain a term sequence, and generating a binary group < phonetic symbol, term >;
calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >;
comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples;
and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
The embodiment of the present invention further provides an iterative large-scale pronunciation dictionary construction device for quickly and effectively constructing a large-scale pronunciation dictionary, improving the working efficiency of a speech recognition system and reducing labor cost. The device comprises:
the data preprocessing module is used for preprocessing input text raw data to generate an entry sequence; preprocessing input audio data to generate a phonetic symbol sequence;
a G2P module, configured to process the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generate a binary group < entry, phonetic symbol >;
a P2G module, configured to process the phonetic symbol sequence based on the P2G model to obtain a vocabulary entry sequence, and generate a binary group < phonetic symbol, vocabulary entry >;
the active learning module is used for calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >; comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples; and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the iterative large-scale pronunciation dictionary construction method when executing the computer program.
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the iterative large-scale pronunciation dictionary construction method described above.
In the embodiment of the invention, compared with prior-art schemes (formulating pronunciation-dictionary rules from the phonetic symbol features of entries, training neural network models, and consuming large amounts of manpower and material resources to manually collect and construct the dictionary's entry and phonetic symbol data), the method generates an entry sequence by preprocessing input raw text data; generates a phonetic symbol sequence by preprocessing input audio data; processes the entry sequence with a G2P model to obtain a phonetic symbol sequence and generate binary groups <entry, phonetic symbol>; processes the phonetic symbol sequence with a P2G model to obtain an entry sequence and generate binary groups <phonetic symbol, entry>; calculates the matching degree between the binary groups and compares it with a preset matching degree; extracts discriminative samples from the binary groups whose matching degree is below the preset value; and obtains the domain experts' labeling and correction of those samples, storing the labeled and corrected binary groups <entry, phonetic symbol> and <phonetic symbol, entry> in the multi-level large-scale pronunciation dictionary. A large-scale pronunciation dictionary can thus be constructed quickly and effectively, improving the working efficiency of the speech recognition system and reducing labor cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a block diagram of an iterative large-scale pronunciation dictionary construction apparatus according to an embodiment of the present invention;
FIG. 2 is a flowchart of an iterative large-scale pronunciation dictionary construction method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating data preprocessing according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of G2P and P2G model inference in an embodiment of the invention;
FIG. 5 is a schematic diagram of the binary-group matching degree calculation in an embodiment of the present invention;
FIG. 6 is a schematic diagram of discriminative sample extraction in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Term definitions:
Speech recognition: converting the lexical content of human speech into computer-readable input, such as binary codes or character sequences, so that the human-computer interface becomes more natural and easier to use;
Iterative: a mode of training a model; a cyclic, repeating process that continuously uses constraint conditions to steer the model's training toward a preset optimization direction;
Pronunciation dictionary: a dictionary in a speech recognition system that describes each entry and its corresponding pronunciation, i.e., the mapping from words to phonetic symbols;
G2P (Grapheme-to-Phoneme): a generation model for the pronunciation dictionary whose output is an <entry, phonetic symbol> binary group, used to infer the pronunciation dictionary;
P2G (Phoneme-to-Grapheme): the counterpart of the G2P model; its output is a <phonetic symbol, entry> binary group, likewise used to infer the pronunciation dictionary;
Discriminative sample: the sample containing the largest amount of information, selected after the binary-group matching degree calculation;
Human in the loop: the process of domain expert labeling and expert correction of the knowledge base in the active learning module; specifically, the part in which experts participate within the loop of the whole pronunciation dictionary construction system.
The invention provides an iterative large-scale pronunciation dictionary construction device which, as shown in fig. 1, comprises the following modules:
the data preprocessing module is used for preprocessing input text raw data to generate an entry sequence; preprocessing input audio data to generate a phonetic symbol sequence;
a G2P module, configured to process the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generate a binary group < entry, phonetic symbol >;
a P2G module, configured to process the phonetic symbol sequence based on the P2G model to obtain a vocabulary entry sequence, and generate a binary group < phonetic symbol, vocabulary entry >;
the active learning module is used for calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >; comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples; and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
In the embodiment of the present invention, as shown in fig. 1, the data preprocessing module: the function of this module is to preprocess the input data, which includes the following two units.
The text preprocessing unit cleans the raw text data (removing network tags, stop words, erroneous symbols and the like), performs word-level segmentation on the cleaned text, removes entries already present in the pronunciation dictionary, and passes the result directly to the G2P model for inference.
The phoneme segmentation unit performs phoneme segmentation on the denoised audio data, extracts the phoneme sequence, converts it into a phonetic symbol sequence, and sends the phonetic symbol sequence to the inference unit of the P2G model for inference.
In an embodiment of the present invention, as shown in fig. 1, the G2P module: the function of this module is to train the G2P model and predict for a given sequence of terms its corresponding phonetic transcription sequence, which comprises a model training unit and a model inference unit.
Model (G2P) training unit: the unit first trains the initial model of G2P for cold start with a small amount of underlying pronunciation dictionary data and prepares for iterative training for later G2P model inference.
Model (G2P) inference unit: the unit uses the entry sequences output by the text preprocessing unit to generate their corresponding phonetic symbol sequences, the final sequence form being a binary group <entry, phonetic symbol>.
In the embodiment of the present invention, as shown in fig. 1, the P2G module: the function of this module is to train the P2G model and predict for a given phonetic symbol sequence its corresponding sequence of terms, which includes a model training unit and a model inference unit.
Model (P2G) training unit: the unit first trains the initial model of P2G for cold start using a small amount of basic pronunciation dictionary data and prepares for iterative training for later P2G model inference.
Model (P2G) inference unit: the unit uses the phonetic symbol sequences output by the phoneme segmentation unit to generate their corresponding entry sequences, the final sequence form being a binary group <phonetic symbol, entry>.
In the embodiment of the present invention, as shown in fig. 1, the active learning module: for the <entry, phonetic symbol> and <phonetic symbol, entry> binary groups generated by the G2P module and the P2G module, this module calculates and compares the binary-group matching degree, providing a novel human-in-the-loop training method for the two sequence models. The matching degree is compared with a preset matching degree (below the preset value is a low match; above it is a high match). High-matching binary groups are stored directly in the dictionary; for low-matching binary groups, the discriminative sample extraction unit extracts highly discriminative samples for further iterative training of the G2P and P2G models, and the two models' inference results are further analyzed and corrected. The sample correction operations serve as constraint conditions for training the two models. The module comprises the following four subunits:
a matching degree calculation unit: the unit is used for calculating the matching degree, carrying out the matching degree calculation between the < entry, phonetic symbol > duplet according to the output results of the G2P model inference unit and the P2G model inference unit, and dividing the matching degree into two matching conditions: high matching, namely, the data pair can be used as a final result of the pronunciation dictionary and is directly put in storage; if the match is low, the processing proceeds to the differential sample extraction unit.
Discriminative sample extraction unit: this unit processes the low-matching <entry, phonetic symbol> binary groups output by the matching degree calculation unit. Applying the entropy-maximization strategy, an uncertainty strategy from active learning, it extracts the word-sequence and phonetic-symbol-sequence pairs that have larger entropy and best represent the overall characteristics of the samples, and passes them as input to the domain expert labeling unit.
Domain expert correction unit: this unit performs domain expert labeling on the binary groups extracted by the discriminative sample extraction unit, forming the first part of the human in the loop. Experts manually check and mark problems of word formation, morphology and the like in the entries of the low-matching samples, and finally enter the labeled <entry, phonetic symbol> binary groups into the expert correction knowledge base.
Expert correction knowledge base: used for expert correction of the to-be-corrected binary groups processed by the domain expert labeling unit. As the second part of the human in the loop, correction operations are performed on the <entry, phonetic symbol> binary groups; all correction operations are then standardized and quantified to serve as constraint conditions for model training, while the corrected results are taken directly as data for the final pronunciation dictionary library.
In the embodiment of the present invention, as shown in fig. 1, the multi-level large-scale pronunciation dictionary library constructed by the invention is divided into the following four levels: the L1 core pronunciation dictionary; the L2 basic pronunciation dictionary; the L3 extended pronunciation dictionary; and the L4 rare-word pronunciation dictionary. The L1 core pronunciation dictionary and the L2 basic pronunciation dictionary serve as important cold-start data resources for training the models in the device.
L1 core pronunciation dictionary: the entries and their corresponding phonetic symbol data accumulated in standard dictionaries, such as the CMU dictionary, the Raman dictionary, etc.
L2 basic pronunciation dictionary: pronunciation dictionary data that is widely used in the speech recognition field and has high recognition accuracy, such as the LibriSpeech corpus and the WSJ0 corpus.
L3 extended pronunciation dictionary: proper nouns, compound words, abbreviations and the like from professional fields, derived from existing database resources and the models' inference results, forming the third-level pronunciation dictionary.
L4 rare-word pronunciation dictionary: the construction unit for the long-tail part of the dictionary, covering overlength words, new words, entity entries and slang vocabulary popular on the internet.
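As an illustration of this four-level organization, the sketch below shows one plausible data structure and lookup order for the dictionary library; the level keys and the L1-to-L4 search priority are assumptions based on the description above, not part of the patent.

```python
# Minimal sketch of a four-level pronunciation dictionary lookup.
# Level names and the L1 -> L4 search order are assumptions based on
# the description above, not the patent's reference implementation.
from typing import Dict, List, Optional

class MultiLevelPronunciationDict:
    LEVELS = ("L1_core", "L2_basic", "L3_extended", "L4_rare")

    def __init__(self) -> None:
        # Each level maps an entry (grapheme string) to one or more
        # phonetic-symbol sequences (polyphones yield several entries).
        self.levels: Dict[str, Dict[str, List[str]]] = {l: {} for l in self.LEVELS}

    def add(self, level: str, entry: str, phones: str) -> None:
        self.levels[level].setdefault(entry, []).append(phones)

    def lookup(self, entry: str) -> Optional[List[str]]:
        # Search from the most authoritative level (L1) outward.
        for level in self.LEVELS:
            if entry in self.levels[level]:
                return self.levels[level][entry]
        return None  # out-of-vocabulary: a candidate for G2P inference

d = MultiLevelPronunciationDict()
d.add("L1_core", "science", "S AY1 AH0 N S")  # CMUdict-style phones
print(d.lookup("science"))
```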
The embodiment of the invention also provides an iterative large-scale pronunciation dictionary construction method, described in the following embodiment. Because the principle by which the method solves the problem is similar to that of the iterative large-scale pronunciation dictionary construction device, the implementation of the method can refer to the implementation of the device, and repeated parts are not described again.
Fig. 2 is a flowchart of an iterative large-scale pronunciation dictionary construction method according to an embodiment of the present invention, and as shown in fig. 2, the method includes:
step 202: preprocessing input text raw data to generate an entry sequence;
step 204: preprocessing input audio data to generate a phonetic symbol sequence;
step 206: processing the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generating a binary group < entry, phonetic symbol >;
step 208: processing the phonetic symbol sequence based on a P2G model to obtain a term sequence, and generating a binary group < phonetic symbol, term >;
step 210: calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >;
step 212: comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples;
step 214: and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
The iterative large-scale pronunciation dictionary construction is explained in detail based on fig. 1 and 2 described above.
(1) Data pre-processing
The input of the present invention is divided into two parts: the first part is raw text data (e.g., the English text "I am a master of biological science" in fig. 3) and the second part is raw audio data (e.g., the waveform on the right side of fig. 3). The data may be obtained from an already collected corpus, or from audio and text collected with third-party tools; there is no limitation here. The raw text data is then cleaned and normalized in the text preprocessing unit to generate an entry sequence, such as {master, biological, science} in fig. 3. Meanwhile, the raw audio data enters the phoneme segmentation unit for audio denoising and phoneme segmentation, generating a phonetic symbol sequence such as {/ˈmæstər/, /ˌbaɪəˈlɑːdʒɪkəl/, /ˈsaɪəns/} in fig. 3.
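As an illustration of this preprocessing path, here is a minimal Python sketch of the text side (cleaning, word-level segmentation, stop-word and known-entry removal); the cleaning rules and the toy stop-word list are assumptions, not the patent's exact procedure:

```python
# Hedged sketch of the text-preprocessing unit: clean the raw text,
# segment it into word-level entries, and keep only entries that are
# not already present in the pronunciation dictionary.
import re

def preprocess_text(raw: str, known_entries: set) -> list:
    text = re.sub(r"<[^>]+>", " ", raw)        # strip network (HTML) tags
    text = re.sub(r"[^A-Za-z' ]+", " ", text)  # drop error symbols
    stop_words = {"i", "am", "a", "of", "the"} # toy stop-word list (assumed)
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in stop_words and t not in known_entries]

print(preprocess_text("I am a master of biological science", set()))
# -> ['master', 'biological', 'science']
```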
(2) G2P, P2G model inference
The G2P and P2G models are deep learning models based on the sequence-to-sequence approach. As shown in fig. 4, a bidirectional long short-term memory neural network encodes and learns vectors for the feature sequences, which are finally decoded into the required sequences while the similarity is calculated. In the initial state, the model inference units of the two models are pre-trained using the L1 core pronunciation dictionary and the L2 basic pronunciation dictionary, which serve as the models' cold-start parameters.
The model inference unit of the G2P model processes the output of the text preprocessing unit, i.e., the entry sequence {master, biological, science}. Using the pre-trained model inference unit, it generates the phonetic symbol sequence corresponding to the entry sequence: {<master, /ˈmæstər/, 100>, <biological, /ˌbaɪəˈlɑːdʒɪkəl/, 98>, <science, /ˈsaɪəns/, 99>}, where 100, 98 and 99 denote the similarity (i.e., the matching degree) between entry and phonetic symbol: 100 represents a complete match, while 98 and 99 indicate that the match between entry and phonetic symbol is not complete. The G2P model is formulated as follows:

$$(w_i, p_i) = \mathrm{G2P}(w_i), \quad i = 1, \dots, M$$

where $w_i$ denotes the $i$-th entry in the entry sequence and $M$ is the total number of entries in the current entry sequence. Encoding by G2P yields $(w_i, p_i)$, that is, an <entry, phonetic symbol> binary group.
The model inference unit of the P2G model processes the output of the phoneme segmentation unit, i.e., the phonetic symbol sequence {/ˈmæstər/, /ˌbaɪəˈlɑːdʒɪkəl/, /ˈsaɪəns/}. Using the pre-trained model inference unit, it generates the entry sequence corresponding to the phonetic symbol sequence: {</ˈmæstər/, master, 100>, </ˌbaɪəˈlɑːdʒɪkəl/, biological, 98>, </ˈsaɪəns/, science, 99>}, where 100, 98 and 99 denote the matching degree between phonetic symbol and entry: 100 represents a complete match, while 98 and 99 indicate that the match between phonetic symbol and entry is not complete. The P2G model is formulated as follows:

$$(p_i, w_i) = \mathrm{P2G}(p_i), \quad i = 1, \dots, M$$

where $p_i$ denotes the $i$-th phonetic symbol in the phonetic symbol sequence and $M$ is the total number of phonetic symbols in the current sequence. Decoding with P2G yields $(p_i, w_i)$, that is, a <phonetic symbol, entry> binary group.
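To make the sequence-to-sequence formulation above concrete, the sketch below shows a minimal PyTorch encoder-decoder in the spirit of fig. 4: a bidirectional LSTM encodes the input symbols and an LSTM decoder emits the output symbols. The vocabulary sizes, hidden dimension and state-merging step are illustrative assumptions rather than the patent's reference implementation.

```python
# Minimal PyTorch sketch of the seq2seq idea behind G2P/P2G: a
# bidirectional LSTM encodes the input characters (or phones) and a
# unidirectional LSTM decodes the output sequence.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, in_vocab: int, out_vocab: int, dim: int = 64):
        super().__init__()
        self.src_emb = nn.Embedding(in_vocab, dim)
        self.tgt_emb = nn.Embedding(out_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(dim, 2 * dim, batch_first=True)
        self.out = nn.Linear(2 * dim, out_vocab)

    def forward(self, src, tgt):
        _, (h, c) = self.encoder(self.src_emb(src))
        # Merge the two encoder directions into the decoder's initial state.
        h = h.transpose(0, 1).reshape(1, src.size(0), -1)
        c = c.transpose(0, 1).reshape(1, src.size(0), -1)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)  # per-step logits over output symbols

g2p = Seq2Seq(in_vocab=28, out_vocab=40)  # letters -> phones (assumed sizes)
logits = g2p(torch.randint(0, 28, (1, 6)), torch.randint(0, 40, (1, 7)))
print(logits.shape)  # torch.Size([1, 7, 40])
```

The same class with the vocabularies swapped (phones in, letters out) would play the role of the P2G model.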
(3) Binary matching
After the inference units of the G2P model and the P2G model have run, the two groups of inference results are input to the binary-group matching degree calculation unit, as shown in fig. 5, and are then classified by degree of match into two cases: high matching and low matching.
The measurement method used by the matching degree calculation unit is KL divergence: the KL distance measures the matching degree between the entry sequence and the phonetic symbol sequence. This choice is motivated by two properties of the problem: the <entry, phonetic symbol> mapping of a pronunciation dictionary in speech recognition is an asymmetric mapping, and because pronunciation dictionaries contain variant pronunciations, polyphones and homophones, the mapping between entries and phonetic symbols is many-to-many. KL divergence is a common measure of the difference between two asymmetric distributions, and can therefore be used here to calculate the matching degree between the <entry, phonetic symbol> and <phonetic symbol, entry> binary groups. The specific calculation formula is:

$$D_{\mathrm{KL}}(A \parallel B) = \sum_{i} P_A(\mathrm{phone}_i \mid \mathrm{word}_i)\,\log\frac{P_A(\mathrm{phone}_i \mid \mathrm{word}_i)}{P_B(\mathrm{word}_i \mid \mathrm{phone}_i)}$$

where A denotes the inference process of the G2P model (each input entry word generates a phonetic symbol phone) and B denotes the inference process of the P2G model (each input phonetic symbol phone generates an entry word); $D_{\mathrm{KL}}(A \parallel B)$ is the matching degree between the binary groups output by the G2P model and the P2G model; $P_A(\mathrm{phone} \mid \mathrm{word})$ is the probability that the input word generates phone in the G2P model; and $P_B(\mathrm{word} \mid \mathrm{phone})$ is the probability that the input phone generates word in the P2G model.
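As a worked illustration of the formula above, the following Python sketch computes the KL-based matching degree for one binary group; the toy probability values are assumptions chosen only to demonstrate the calculation.

```python
# Hedged sketch of the KL-divergence matching degree between the G2P
# output distribution P_A(phone | word) and the P2G output distribution
# P_B(word | phone). The toy distributions below are assumptions.
import math

def matching_degree(p_a: list, p_b: list, eps: float = 1e-12) -> float:
    # D_KL(A || B); a smaller divergence means a higher match between
    # the <entry, phonetic symbol> and <phonetic symbol, entry> pairs.
    return sum(pa * math.log((pa + eps) / (pb + eps)) for pa, pb in zip(p_a, p_b))

# Per-candidate probabilities for one entry, e.g. "science":
p_g2p = [0.90, 0.07, 0.03]  # G2P: P(phone_i | word)
p_p2g = [0.88, 0.08, 0.04]  # P2G: P(word | phone_i)
print(matching_degree(p_g2p, p_p2g))  # near 0 -> high matching
```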
(4) Discriminative sample extraction
For the <entry, phonetic symbol> and <phonetic symbol, entry> binary groups with low matching degree, discriminative samples are extracted with an active learning method according to the information entropy of each sample. Specifically, the information entropy of each binary group whose matching degree is smaller than the preset matching degree is calculated and compared with a preset information entropy threshold, and the binary groups whose information entropy is larger than the preset threshold are taken as discriminative samples for labeling. The selection process is shown in fig. 6: panel (a) shows the samples to be extracted, divided into two classes according to their features and drawn in gray and black respectively; in panel (b), the straight line represents a larger preset information entropy threshold: under this threshold, the gray or black samples closer to the line have larger information entropy, and the farther a sample point lies from the line, the lower its information entropy is considered; panel (c) shows that with a smaller threshold, more samples lie near the line, meaning more samples have large information entropy, i.e., more discriminative samples are obtained, so model training is more efficient.
The information entropy used in the discriminative sample extraction process is calculated by the following formula:

$$H(x_j) = -\,P_A(x_j)\log P_A(x_j) - P_B(x_j)\log P_B(x_j)$$

where A and B are the two inference results of the G2P model and the P2G model respectively; $x_j$ is the minimal unit corresponding to each model: for the G2P model, $x_j$ denotes an <entry, phonetic symbol> binary group, and for the P2G model, $x_j$ denotes a <phonetic symbol, entry> binary group; $P_A(x_j)$ is the combined probability of each group of phonetic symbol sequences in the G2P model; and $P_B(x_j)$ is the combined probability of each group of entry sequences corresponding to a phonetic symbol sequence in the P2G model. The index $j$ ranges from 1 to the number of low-matching <entry, phonetic symbol> and <phonetic symbol, entry> binary groups.
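The sketch below illustrates this selection rule: compute the entropy of each low-matching binary group from the two models' probabilities and keep those whose entropy exceeds the preset threshold. The threshold value and the sample records are assumed for illustration.

```python
# Hedged sketch of discriminative-sample extraction: score each
# low-matching pair by the entropy formula above and keep the pairs
# whose entropy exceeds a preset threshold (value assumed here).
import math

def pair_entropy(p_a: float, p_b: float) -> float:
    # H = -P_A log P_A - P_B log P_B, one term per model's inference result.
    return -(p_a * math.log(p_a) + p_b * math.log(p_b))

low_match_pairs = [
    {"entry": "biological", "phones": "/ˌbaɪəˈlɑːdʒɪkəl/", "p_a": 0.55, "p_b": 0.50},
    {"entry": "science",    "phones": "/ˈsaɪəns/",          "p_a": 0.95, "p_b": 0.93},
]
THRESHOLD = 0.5  # preset information-entropy threshold (assumed value)
discriminative = [p for p in low_match_pairs
                  if pair_entropy(p["p_a"], p["p_b"]) > THRESHOLD]
print([p["entry"] for p in discriminative])  # high-entropy pairs go to experts
```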
(5) Domain expert labeling and correction
The most representative samples extracted by the discriminative sample extraction unit are labeled by domain experts through the domain expert correction unit, marking a series of problems in the samples, such as repeatedly or erroneously recognized entries, wrong feature labeling of phonetic symbols, and wrong audio labeling of polyphonic words. Because the invention involves specific or proprietary vocabularies from different professional fields, experts from the corresponding domains are required to perform the labeling work.
The labeled sample data is then corrected by experts through the expert correction knowledge base, after which all correction operations are standardized and quantified to serve as constraint conditions for model training.
(6) Model iterative training
The labeled and corrected sample data serve as constraint conditions for the G2P model and the P2G model. Both models are trained iteratively: a sequence-to-sequence model is used in the training process, and the training of the two models is constrained through a loss function. The loss function for model training is defined by quantifying the correction operations into three operations with different costs (i.e., weights): an add (insert) operation costs 1, a delete operation costs 1, and a modify operation costs 2. The normalized overall training loss function Loss can then be defined by the following formulas:

$$\mathrm{Loss} = \mathrm{Loss}_{\mathrm{seq2seq}}(w, p) + \mathrm{Loss}_{\mathrm{corr}}(w, p)$$

$$\mathrm{Loss}_{\mathrm{corr}}(w, p) = \frac{\alpha\, n_{\mathrm{add}} + \beta\, n_{\mathrm{del}} + \gamma\, n_{\mathrm{mod}}}{L}$$

where Loss is the overall loss function; $\mathrm{Loss}_{\mathrm{seq2seq}}$ is the loss of the sequence-to-sequence model; $\mathrm{Loss}_{\mathrm{corr}}$ is the loss after the manual correction operations are quantified; $\alpha$ is the weight of the add operation, $\beta$ the weight of the delete operation, and $\gamma$ the weight of the modify operation of the loss function; $L$ is the length of each group of entry or phonetic symbol sequences; $w$ denotes an entry sequence and $p$ denotes a phonetic symbol sequence.
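A small sketch of this loss, under the same assumption stated with the formulas above that the quantified correction cost is normalized by sequence length:

```python
# Hedged sketch of the correction-constrained training loss: the seq2seq
# loss plus a term charging each expert correction with the stated costs
# (add = 1, delete = 1, modify = 2). Length normalization is an assumed
# reading of the "standard quantization" step.
def correction_loss(n_add: int, n_del: int, n_mod: int, seq_len: int) -> float:
    W_ADD, W_DEL, W_MOD = 1.0, 1.0, 2.0  # operation weights from the text
    return (W_ADD * n_add + W_DEL * n_del + W_MOD * n_mod) / max(seq_len, 1)

def total_loss(seq2seq_loss: float, n_add: int, n_del: int, n_mod: int,
               seq_len: int) -> float:
    return seq2seq_loss + correction_loss(n_add, n_del, n_mod, seq_len)

# One corrected <entry, phonetic symbol> pair: 1 phone modified in a
# 9-phone transcription, on top of a seq2seq loss of 0.42.
print(total_loss(0.42, n_add=0, n_del=0, n_mod=1, seq_len=9))
```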
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the iterative large-scale pronunciation dictionary construction method when executing the computer program.
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the iterative large-scale pronunciation dictionary construction method described above.
In the embodiment of the invention, compared with prior-art schemes (formulating pronunciation-dictionary rules from the phonetic symbol features of entries, training neural network models, and consuming large amounts of manpower and material resources to manually collect and construct the dictionary's entry and phonetic symbol data), the method generates an entry sequence by preprocessing input raw text data; generates a phonetic symbol sequence by preprocessing input audio data; processes the entry sequence with the G2P model to obtain a phonetic symbol sequence and generate binary groups <entry, phonetic symbol>; processes the phonetic symbol sequence with the P2G model to obtain an entry sequence and generate binary groups <phonetic symbol, entry>; calculates the matching degree between the binary groups and compares it with a preset matching degree; extracts discriminative samples from the binary groups whose matching degree is below the preset value; and obtains the domain experts' labeling and correction of those samples, storing the labeled and corrected binary groups in the multi-level large-scale pronunciation dictionary while standardizing the correction operations as constraint conditions for iterative model training. The whole construction process runs with a human in the loop, is fast and efficient, and repeats cyclically until the large-scale pronunciation dictionary is complete. A large-scale pronunciation dictionary can thus be constructed quickly and effectively, improving the working efficiency of the speech recognition system and reducing labor cost.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An iterative large-scale pronunciation dictionary construction method is characterized by comprising the following steps:
preprocessing input text raw data to generate an entry sequence;
preprocessing input audio data to generate a phonetic symbol sequence;
processing the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generating a binary group < entry, phonetic symbol >;
processing the phonetic symbol sequence based on a P2G model to obtain a term sequence, and generating a binary group < phonetic symbol, term >;
calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >;
comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples;
and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
2. The iterative large-scale pronunciation dictionary construction method according to claim 1, wherein preprocessing the input text raw data to generate a sequence of terms, comprises:
cleaning and standardizing input text raw data to generate a vocabulary entry sequence;
preprocessing input audio data to generate a phonetic symbol sequence, comprising:
and carrying out audio denoising and phoneme segmentation on the input audio raw data to generate a phonetic symbol sequence.
3. The iterative large-scale pronunciation dictionary construction method according to claim 1, further comprising: and if the matching degree is greater than the preset matching degree, storing the corresponding binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into the multi-level large-scale pronunciation dictionary.
4. The iterative large-scale pronunciation dictionary construction method according to claim 1, wherein performing discriminative sample extraction on the binary < vocabulary entry, phonetic symbol > and the binary < phonetic symbol, vocabulary entry > corresponding to the matching degree smaller than the preset matching degree to obtain discriminative samples comprises:
calculating the information entropy of the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >, which are corresponding to the matching degree smaller than the preset matching degree;
and comparing the information entropy with a preset information entropy threshold, and taking the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > corresponding to the information entropy larger than the preset information entropy threshold as discriminative samples.
5. The iterative large-scale pronunciation dictionary construction method according to claim 1, further comprising:
and iteratively training the G2P model and the P2G model by using the labeled and corrected binary < entry, phonetic symbol > and the binary < phonetic symbol, entry >.
6. The iterative large-scale pronunciation dictionary construction method of claim 1, wherein the G2P model and the P2G model are iteratively trained using labeled and corrected doublets < entry, phonetic symbol > and doublets < phonetic symbol, entry >, comprising:
quantifying the corrections of the labeled and corrected binary group <entry, phonetic symbol> and binary group <phonetic symbol, entry> into three operations with different weights: an add operation with weight 1, a delete operation with weight 1 and a modify operation with weight 2;
obtaining a Loss function Loss based on three operations with different weights;
and (5) performing iterative training by using a Loss function Loss constraint G2P model and a P2G model.
7. The iterative large-scale pronunciation dictionary construction method according to claim 1, wherein the Loss function Loss is obtained based on the three operations with different weights according to the following formula:

$$\mathrm{Loss} = \mathrm{Loss}_{\mathrm{seq2seq}}(w, p) + \frac{\alpha\, n_{\mathrm{add}} + \beta\, n_{\mathrm{del}} + \gamma\, n_{\mathrm{mod}}}{L}$$
8. an iterative large-scale pronunciation dictionary construction device, comprising:
the data preprocessing module is used for preprocessing input text raw data to generate an entry sequence; preprocessing input audio data to generate a phonetic symbol sequence;
a G2P module, configured to process the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generate a binary group < entry, phonetic symbol >;
a P2G module, configured to process the phonetic symbol sequence based on the P2G model to obtain a vocabulary entry sequence, and generate a binary group < phonetic symbol, vocabulary entry >;
the active learning module is used for calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >; comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples; and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the iterative large-scale pronunciation dictionary construction method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the iterative large-scale pronunciation dictionary construction method according to any one of claims 1 to 7.
CN202110178948.6A 2021-02-08 2021-02-08 Iterative large-scale pronunciation dictionary construction method and device Active CN112530414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110178948.6A CN112530414B (en) 2021-02-08 2021-02-08 Iterative large-scale pronunciation dictionary construction method and device


Publications (2)

Publication Number Publication Date
CN112530414A (en) 2021-03-19
CN112530414B (en) 2021-05-25

Family

ID=74975572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110178948.6A Active CN112530414B (en) 2021-02-08 2021-02-08 Iterative large-scale pronunciation dictionary construction method and device

Country Status (1)

Country Link
CN (1) CN112530414B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019208859A1 (en) * 2018-04-27 2019-10-31 주식회사 시스트란인터내셔널 Method for generating pronunciation dictionary and apparatus therefor
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111369974A (en) * 2020-03-11 2020-07-03 北京声智科技有限公司 Dialect pronunciation labeling method, language identification method and related device
CN112331207A (en) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 Service content monitoring method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6876543B2 (en) * 2017-06-29 2021-05-26 日本放送協会 Phoneme recognition dictionary generator and phoneme recognition device and their programs


Also Published As

Publication number Publication date
CN112530414A (en) 2021-03-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant