CN112530414B - Iterative large-scale pronunciation dictionary construction method and device - Google Patents

Iterative large-scale pronunciation dictionary construction method and device

Info

Publication number
CN112530414B
CN112530414B (application CN202110178948.6A)
Authority
CN
China
Prior art keywords
entry
phonetic symbol
binary group
matching degree
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110178948.6A
Other languages
Chinese (zh)
Other versions
CN112530414A (en)
Inventor
王治愚
王大亮
王丽媛
齐红威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datatang Beijing Technology Co ltd
Original Assignee
Datatang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datatang Beijing Technology Co ltd
Priority to CN202110178948.6A
Publication of CN112530414A
Application granted
Publication of CN112530414B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources


Abstract

The invention discloses an iterative large-scale pronunciation dictionary construction method and device. The method comprises: generating an entry sequence from raw text data; generating a phonetic symbol sequence from raw audio data; generating binary groups <entry, phonetic symbol> from the entry sequence with a G2P model; generating binary groups <phonetic symbol, entry> from the phonetic symbol sequence with a P2G model; calculating the matching degree between the two binary groups, comparing it with a preset matching degree, and extracting discriminative samples from the binary groups <entry, phonetic symbol> and <phonetic symbol, entry> whose matching degree is below the preset matching degree; and obtaining the domain experts' labeling and correction of the discriminative samples, then storing the labeled and corrected binary groups <entry, phonetic symbol> and <phonetic symbol, entry> in the multi-level large-scale pronunciation dictionary. The invention can quickly and effectively construct a large-scale pronunciation dictionary, improve the working efficiency of the speech recognition system and reduce labor cost.

Description

Iterative large-scale pronunciation dictionary construction method and device
Technical Field
The invention relates to the technical field of dictionary construction, in particular to an iterative large-scale pronunciation dictionary construction method and device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the continuous innovation of science and technology, the speech field has developed rapidly, driving continuous updating and iteration of speech recognition systems. A speech recognition system is composed of three parts: an acoustic model, a pronunciation dictionary and a language model. The pronunciation dictionary is a very important component, the bridge connecting the acoustic model and the language model. How to construct the pronunciation dictionary is therefore a significant and arduous task for a speech recognition system, and the scale of the constructed pronunciation dictionary directly constrains the accuracy of the whole system.
Existing pronunciation dictionary construction methods fall into three main types: rule-based, machine-learning-based and neural-network-based. Their model scale is limited, and for the task of constructing a large-scale pronunciation dictionary, manually collecting and constructing the dictionary's entry and phonetic symbol data consumes large amounts of manpower and material resources, so the finally generated pronunciation dictionary generally contains only on the order of tens of thousands of entries. A method for quickly and effectively constructing a large-scale pronunciation dictionary is therefore urgently needed, to improve the working efficiency of the speech recognition system and reduce labor cost.
Disclosure of Invention
The embodiment of the invention provides an iterative large-scale pronunciation dictionary construction method for quickly and effectively constructing a large-scale pronunciation dictionary, improving the working efficiency of a speech recognition system and reducing labor cost. The method comprises the following steps:
preprocessing input text raw data to generate an entry sequence;
preprocessing input audio data to generate a phonetic symbol sequence;
processing the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generating a binary group < entry, phonetic symbol >;
processing the phonetic symbol sequence based on a P2G model to obtain a term sequence, and generating a binary group < phonetic symbol, term >;
calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >;
comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples;
and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
The embodiment of the present invention further provides an iterative large-scale pronunciation dictionary construction device for quickly and effectively constructing a large-scale pronunciation dictionary, improving the working efficiency of a speech recognition system and reducing labor cost. The device comprises:
the data preprocessing module is used for preprocessing input text raw data to generate an entry sequence; preprocessing input audio data to generate a phonetic symbol sequence;
a G2P module, configured to process the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generate a binary group < entry, phonetic symbol >;
a P2G module, configured to process the phonetic symbol sequence based on the P2G model to obtain a vocabulary entry sequence, and generate a binary group < phonetic symbol, vocabulary entry >;
the active learning module is used for calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >; comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples; and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the iterative large-scale pronunciation dictionary construction method when executing the computer program.
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the iterative large-scale pronunciation dictionary construction method described above.
In the embodiment of the invention, compared with prior-art schemes (formulating pronunciation-dictionary rules from the phonetic symbol features of entries, training neural network models, and consuming large amounts of manpower and material resources to manually collect and construct the dictionary's entry and phonetic symbol data), the method generates an entry sequence by preprocessing input raw text data; generates a phonetic symbol sequence by preprocessing input audio data; processes the entry sequence with a G2P model to obtain a phonetic symbol sequence and generate binary groups <entry, phonetic symbol>; processes the phonetic symbol sequence with a P2G model to obtain an entry sequence and generate binary groups <phonetic symbol, entry>; calculates the matching degree between the binary groups and compares it with a preset matching degree; extracts discriminative samples from the binary groups whose matching degree is below the preset value; and obtains the domain experts' labeling and correction of those samples, storing the labeled and corrected binary groups <entry, phonetic symbol> and <phonetic symbol, entry> in the multi-level large-scale pronunciation dictionary. A large-scale pronunciation dictionary can thus be constructed quickly and effectively, improving the working efficiency of the speech recognition system and reducing labor cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a block diagram of an iterative large-scale pronunciation dictionary construction apparatus according to an embodiment of the present invention;
FIG. 2 is a flowchart of an iterative large-scale pronunciation dictionary construction method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating data preprocessing according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of G2P and P2G model inference in an embodiment of the invention;
FIG. 5 is a schematic diagram of the binary-group matching degree calculation in an embodiment of the present invention;
FIG. 6 is a schematic diagram of discriminative sample extraction in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Term definitions:
Speech recognition: converting the lexical content of human speech into computer-readable input, such as binary codes or character sequences, so that the human-computer interface becomes more natural and easier to use;
Iterative: a mode of training a model; a cyclic, repeating process that continuously uses constraint conditions to steer the model's training toward a preset optimization direction;
Pronunciation dictionary: a dictionary in a speech recognition system that describes each entry and its corresponding pronunciation, i.e., the mapping from words to phonetic symbols;
G2P (Grapheme-to-Phoneme): a generation model for the pronunciation dictionary whose output is an <entry, phonetic symbol> binary group, used to infer the pronunciation dictionary;
P2G (Phoneme-to-Grapheme): the counterpart of the G2P model; its output is a <phonetic symbol, entry> binary group, likewise used to infer the pronunciation dictionary;
Discriminative sample: the sample containing the largest amount of information, selected after the binary-group matching degree calculation;
Human in the loop: the process of domain expert labeling and expert correction of the knowledge base in the active learning module; specifically, the part in which experts participate within the loop of the whole pronunciation dictionary construction system.
The invention provides an iterative large-scale pronunciation dictionary construction device which, as shown in fig. 1, comprises the following modules:
the data preprocessing module is used for preprocessing input text raw data to generate an entry sequence; preprocessing input audio data to generate a phonetic symbol sequence;
a G2P module, configured to process the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generate a binary group < entry, phonetic symbol >;
a P2G module, configured to process the phonetic symbol sequence based on the P2G model to obtain a vocabulary entry sequence, and generate a binary group < phonetic symbol, vocabulary entry >;
the active learning module is used for calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >; comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples; and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
In the embodiment of the present invention, as shown in fig. 1, the data preprocessing module: the function of this module is to preprocess the input data, which includes the following two units.
The text preprocessing unit cleans the raw text data (removing network tags, stop words, erroneous symbols and the like), performs word-level segmentation on the cleaned text, removes entries already present in the pronunciation dictionary, and passes the result directly to the G2P model for inference.
The phoneme segmentation unit performs phoneme segmentation on the denoised audio data, extracts the phoneme sequence, converts it into a phonetic symbol sequence, and sends the phonetic symbol sequence to the inference unit of the P2G model for inference.
In an embodiment of the present invention, as shown in fig. 1, the G2P module: the function of this module is to train the G2P model and predict for a given sequence of terms its corresponding phonetic transcription sequence, which comprises a model training unit and a model inference unit.
Model (G2P) training unit: the unit first trains the initial model of G2P for cold start with a small amount of underlying pronunciation dictionary data and prepares for iterative training for later G2P model inference.
Model (G2P) inference unit: the unit uses the entry sequences output by the text preprocessing unit to generate their corresponding phonetic symbol sequences, the final sequence form being a binary group <entry, phonetic symbol>.
In the embodiment of the present invention, as shown in fig. 1, the P2G module: the function of this module is to train the P2G model and predict for a given phonetic symbol sequence its corresponding sequence of terms, which includes a model training unit and a model inference unit.
Model (P2G) training unit: the unit first trains the initial model of P2G for cold start using a small amount of basic pronunciation dictionary data and prepares for iterative training for later P2G model inference.
Model (P2G) inference unit: the unit uses the phonetic symbol sequences output by the phoneme segmentation unit to generate their corresponding entry sequences, the final sequence form being a binary group <phonetic symbol, entry>.
In the embodiment of the present invention, as shown in fig. 1, the active learning module: for the <entry, phonetic symbol> and <phonetic symbol, entry> binary groups generated by the G2P module and the P2G module, this module calculates and compares the binary-group matching degree, providing a novel human-in-the-loop training method for the two sequence models. The matching degree is compared with a preset matching degree (below the preset value is a low match; above it is a high match). High-matching binary groups are stored directly in the dictionary; for low-matching binary groups, the discriminative sample extraction unit extracts highly discriminative samples for further iterative training of the G2P and P2G models, and the two models' inference results are further analyzed and corrected. The sample correction operations serve as constraint conditions for training the two models. The module comprises the following four subunits:
a matching degree calculation unit: the unit is used for calculating the matching degree, carrying out the matching degree calculation between the < entry, phonetic symbol > duplet according to the output results of the G2P model inference unit and the P2G model inference unit, and dividing the matching degree into two matching conditions: high matching, namely, the data pair can be used as a final result of the pronunciation dictionary and is directly put in storage; if the match is low, the processing proceeds to the differential sample extraction unit.
Discriminative sample extraction unit: this unit processes the low-matching <entry, phonetic symbol> binary groups output by the matching degree calculation unit. Applying the entropy-maximization strategy, an uncertainty strategy from active learning, it extracts the word-sequence and phonetic-symbol-sequence pairs that have larger entropy and best represent the overall characteristics of the samples, and passes them as input to the domain expert labeling unit.
Domain expert correction unit: this unit performs domain expert labeling on the binary groups extracted by the discriminative sample extraction unit, forming the first part of the human in the loop. Experts manually check and mark problems of word formation, morphology and the like in the entries of the low-matching samples, and finally enter the labeled <entry, phonetic symbol> binary groups into the expert correction knowledge base.
Expert correction knowledge base: used for expert correction of the to-be-corrected binary groups processed by the domain expert labeling unit. As the second part of the human in the loop, correction operations are performed on the <entry, phonetic symbol> binary groups; all correction operations are then standardized and quantified to serve as constraint conditions for model training, while the corrected results are taken directly as data for the final pronunciation dictionary library.
In the embodiment of the present invention, as shown in fig. 1, the multi-level large-scale pronunciation dictionary library constructed by the invention is divided into the following four levels: the L1 core pronunciation dictionary; the L2 basic pronunciation dictionary; the L3 extended pronunciation dictionary; and the L4 rare-word pronunciation dictionary. The L1 core pronunciation dictionary and the L2 basic pronunciation dictionary serve as important cold-start data resources for training the models in the device.
L1 core pronunciation dictionary: the entries and their corresponding phonetic symbol data accumulated in standard dictionaries, such as the CMU dictionary, the Raman dictionary, etc.
L2 basic pronunciation dictionary: pronunciation dictionary data that is widely used in the speech recognition field and has high recognition accuracy, such as the LibriSpeech corpus and the WSJ0 corpus.
L3 extended pronunciation dictionary: proper nouns, compound words, abbreviations and the like from professional fields, derived from existing database resources and the models' inference results, forming the third-level pronunciation dictionary.
L4 rare-word pronunciation dictionary: the construction unit for the long-tail part of the dictionary, covering overlength words, new words, entity entries and slang vocabulary popular on the internet.
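As an illustration of this four-level organization, the sketch below shows one plausible data structure and lookup order for the dictionary library; the level keys and the L1-to-L4 search priority are assumptions based on the description above, not part of the patent.

```python
# Minimal sketch of a four-level pronunciation dictionary lookup.
# Level names and the L1 -> L4 search order are assumptions based on
# the description above, not the patent's reference implementation.
from typing import Dict, List, Optional

class MultiLevelPronunciationDict:
    LEVELS = ("L1_core", "L2_basic", "L3_extended", "L4_rare")

    def __init__(self) -> None:
        # Each level maps an entry (grapheme string) to one or more
        # phonetic-symbol sequences (polyphones yield several entries).
        self.levels: Dict[str, Dict[str, List[str]]] = {l: {} for l in self.LEVELS}

    def add(self, level: str, entry: str, phones: str) -> None:
        self.levels[level].setdefault(entry, []).append(phones)

    def lookup(self, entry: str) -> Optional[List[str]]:
        # Search from the most authoritative level (L1) outward.
        for level in self.LEVELS:
            if entry in self.levels[level]:
                return self.levels[level][entry]
        return None  # out-of-vocabulary: a candidate for G2P inference

d = MultiLevelPronunciationDict()
d.add("L1_core", "science", "S AY1 AH0 N S")  # CMUdict-style phones
print(d.lookup("science"))
```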
The embodiment of the invention also provides an iterative large-scale pronunciation dictionary construction method, described in the following embodiment. Because the principle by which the method solves the problem is similar to that of the iterative large-scale pronunciation dictionary construction device, the implementation of the method can refer to the implementation of the device, and repeated parts are not described again.
Fig. 2 is a flowchart of an iterative large-scale pronunciation dictionary construction method according to an embodiment of the present invention, and as shown in fig. 2, the method includes:
step 202: preprocessing input text raw data to generate an entry sequence;
step 204: preprocessing input audio data to generate a phonetic symbol sequence;
step 206: processing the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generating a binary group < entry, phonetic symbol >;
step 208: processing the phonetic symbol sequence based on a P2G model to obtain a term sequence, and generating a binary group < phonetic symbol, term >;
step 210: calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >;
step 212: comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples;
step 214: and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
The iterative large-scale pronunciation dictionary construction is explained in detail based on fig. 1 and 2 described above.
(1) Data pre-processing
The input of the present invention is divided into two parts: the first part is raw text data (e.g., the English text "I am a master of biological science" in fig. 3) and the second part is raw audio data (e.g., the waveform on the right side of fig. 3). The data may be obtained from an already collected corpus, or from audio and text collected with third-party tools; there is no limitation here. The raw text data is then cleaned and normalized in the text preprocessing unit to generate an entry sequence, such as {master, biological, science} in fig. 3. Meanwhile, the raw audio data enters the phoneme segmentation unit for audio denoising and phoneme segmentation, generating a phonetic symbol sequence such as {/ˈmæstər/, /ˌbaɪəˈlɑːdʒɪkəl/, /ˈsaɪəns/} in fig. 3.
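As an illustration of this preprocessing path, here is a minimal Python sketch of the text side (cleaning, word-level segmentation, stop-word and known-entry removal); the cleaning rules and the toy stop-word list are assumptions, not the patent's exact procedure:

```python
# Hedged sketch of the text-preprocessing unit: clean the raw text,
# segment it into word-level entries, and keep only entries that are
# not already present in the pronunciation dictionary.
import re

def preprocess_text(raw: str, known_entries: set) -> list:
    text = re.sub(r"<[^>]+>", " ", raw)        # strip network (HTML) tags
    text = re.sub(r"[^A-Za-z' ]+", " ", text)  # drop error symbols
    stop_words = {"i", "am", "a", "of", "the"} # toy stop-word list (assumed)
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in stop_words and t not in known_entries]

print(preprocess_text("I am a master of biological science", set()))
# -> ['master', 'biological', 'science']
```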
(2) G2P, P2G model inference
The G2P and P2G models are deep learning models based on the sequence-to-sequence approach. As shown in fig. 4, a bidirectional long short-term memory neural network encodes and learns vectors for the feature sequences, which are finally decoded into the required sequences while the similarity is calculated. In the initial state, the model inference units of the two models are pre-trained using the L1 core pronunciation dictionary and the L2 basic pronunciation dictionary, which serve as the models' cold-start parameters.
The model inference unit of the G2P model processes the output of the text preprocessing unit, i.e., the entry sequence {master, biological, science}. Using the pre-trained model inference unit, it generates the phonetic symbol sequence corresponding to the entry sequence: {<master, /ˈmæstər/, 100>, <biological, /ˌbaɪəˈlɑːdʒɪkəl/, 98>, <science, /ˈsaɪəns/, 99>}, where 100, 98 and 99 denote the similarity (i.e., the matching degree) between entry and phonetic symbol: 100 represents a complete match, while 98 and 99 indicate that the match between entry and phonetic symbol is not complete. The G2P model is formulated as follows:

$$(w_i, p_i) = \mathrm{G2P}(w_i), \quad i = 1, \dots, M$$

where $w_i$ denotes the $i$-th entry in the entry sequence and $M$ is the total number of entries in the current entry sequence. Encoding by G2P yields $(w_i, p_i)$, that is, an <entry, phonetic symbol> binary group.
The model inference unit of the P2G model processes the output of the phoneme segmentation unit, i.e., the phonetic symbol sequence {/ˈmæstər/, /ˌbaɪəˈlɑːdʒɪkəl/, /ˈsaɪəns/}. Using the pre-trained model inference unit, it generates the entry sequence corresponding to the phonetic symbol sequence: {</ˈmæstər/, master, 100>, </ˌbaɪəˈlɑːdʒɪkəl/, biological, 98>, </ˈsaɪəns/, science, 99>}, where 100, 98 and 99 denote the matching degree between phonetic symbol and entry: 100 represents a complete match, while 98 and 99 indicate that the match between phonetic symbol and entry is not complete. The P2G model is formulated as follows:

$$(p_i, w_i) = \mathrm{P2G}(p_i), \quad i = 1, \dots, M$$

where $p_i$ denotes the $i$-th phonetic symbol in the phonetic symbol sequence and $M$ is the total number of phonetic symbols in the current sequence. Decoding with P2G yields $(p_i, w_i)$, that is, a <phonetic symbol, entry> binary group.
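To make the sequence-to-sequence formulation above concrete, the sketch below shows a minimal PyTorch encoder-decoder in the spirit of fig. 4: a bidirectional LSTM encodes the input symbols and an LSTM decoder emits the output symbols. The vocabulary sizes, hidden dimension and state-merging step are illustrative assumptions rather than the patent's reference implementation.

```python
# Minimal PyTorch sketch of the seq2seq idea behind G2P/P2G: a
# bidirectional LSTM encodes the input characters (or phones) and a
# unidirectional LSTM decodes the output sequence.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, in_vocab: int, out_vocab: int, dim: int = 64):
        super().__init__()
        self.src_emb = nn.Embedding(in_vocab, dim)
        self.tgt_emb = nn.Embedding(out_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(dim, 2 * dim, batch_first=True)
        self.out = nn.Linear(2 * dim, out_vocab)

    def forward(self, src, tgt):
        _, (h, c) = self.encoder(self.src_emb(src))
        # Merge the two encoder directions into the decoder's initial state.
        h = h.transpose(0, 1).reshape(1, src.size(0), -1)
        c = c.transpose(0, 1).reshape(1, src.size(0), -1)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)  # per-step logits over output symbols

g2p = Seq2Seq(in_vocab=28, out_vocab=40)  # letters -> phones (assumed sizes)
logits = g2p(torch.randint(0, 28, (1, 6)), torch.randint(0, 40, (1, 7)))
print(logits.shape)  # torch.Size([1, 7, 40])
```

The same class with the vocabularies swapped (phones in, letters out) would play the role of the P2G model.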
(3) Binary matching
After the inference units of the G2P model and the P2G model have run, the two groups of inference results are input to the binary-group matching degree calculation unit, as shown in fig. 5, and are then classified by degree of match into two cases: high matching and low matching.
The measurement method used by the matching degree calculation unit is KL divergence: the KL distance measures the matching degree between the entry sequence and the phonetic symbol sequence. This choice is motivated by two properties of the problem: the <entry, phonetic symbol> mapping of a pronunciation dictionary in speech recognition is an asymmetric mapping, and because pronunciation dictionaries contain variant pronunciations, polyphones and homophones, the mapping between entries and phonetic symbols is many-to-many. KL divergence is a common measure of the difference between two asymmetric distributions, and can therefore be used here to calculate the matching degree between the <entry, phonetic symbol> and <phonetic symbol, entry> binary groups. The specific calculation formula is:

$$D_{\mathrm{KL}}(A \parallel B) = \sum_{i} P_A(\mathrm{phone}_i \mid \mathrm{word}_i)\,\log\frac{P_A(\mathrm{phone}_i \mid \mathrm{word}_i)}{P_B(\mathrm{word}_i \mid \mathrm{phone}_i)}$$

where A denotes the inference process of the G2P model (each input entry word generates a phonetic symbol phone) and B denotes the inference process of the P2G model (each input phonetic symbol phone generates an entry word); $D_{\mathrm{KL}}(A \parallel B)$ is the matching degree between the binary groups output by the G2P model and the P2G model; $P_A(\mathrm{phone} \mid \mathrm{word})$ is the probability that the input word generates phone in the G2P model; and $P_B(\mathrm{word} \mid \mathrm{phone})$ is the probability that the input phone generates word in the P2G model.
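As a worked illustration of the formula above, the following Python sketch computes the KL-based matching degree for one binary group; the toy probability values are assumptions chosen only to demonstrate the calculation.

```python
# Hedged sketch of the KL-divergence matching degree between the G2P
# output distribution P_A(phone | word) and the P2G output distribution
# P_B(word | phone). The toy distributions below are assumptions.
import math

def matching_degree(p_a: list, p_b: list, eps: float = 1e-12) -> float:
    # D_KL(A || B); a smaller divergence means a higher match between
    # the <entry, phonetic symbol> and <phonetic symbol, entry> pairs.
    return sum(pa * math.log((pa + eps) / (pb + eps)) for pa, pb in zip(p_a, p_b))

# Per-candidate probabilities for one entry, e.g. "science":
p_g2p = [0.90, 0.07, 0.03]  # G2P: P(phone_i | word)
p_p2g = [0.88, 0.08, 0.04]  # P2G: P(word | phone_i)
print(matching_degree(p_g2p, p_p2g))  # near 0 -> high matching
```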
(4) Discriminative sample extraction
For the <entry, phonetic symbol> and <phonetic symbol, entry> binary groups with low matching degree, discriminative samples are extracted with an active learning method according to the information entropy of each sample. Specifically, the information entropy of each binary group whose matching degree is smaller than the preset matching degree is calculated and compared with a preset information entropy threshold, and the binary groups whose information entropy is larger than the preset threshold are taken as discriminative samples for labeling. The selection process is shown in fig. 6: panel (a) shows the samples to be extracted, divided into two classes according to their features and drawn in gray and black respectively; in panel (b), the straight line represents a larger preset information entropy threshold: under this threshold, the gray or black samples closer to the line have larger information entropy, and the farther a sample point lies from the line, the lower its information entropy is considered; panel (c) shows that with a smaller threshold, more samples lie near the line, meaning more samples have large information entropy, i.e., more discriminative samples are obtained, so model training is more efficient.
The information entropy used in the discriminative sample extraction process is calculated by the following formula:

$$H(x_j) = -\,P_A(x_j)\log P_A(x_j) - P_B(x_j)\log P_B(x_j)$$

where A and B are the two inference results of the G2P model and the P2G model respectively; $x_j$ is the minimal unit corresponding to each model: for the G2P model, $x_j$ denotes an <entry, phonetic symbol> binary group, and for the P2G model, $x_j$ denotes a <phonetic symbol, entry> binary group; $P_A(x_j)$ is the combined probability of each group of phonetic symbol sequences in the G2P model; and $P_B(x_j)$ is the combined probability of each group of entry sequences corresponding to a phonetic symbol sequence in the P2G model. The index $j$ ranges from 1 to the number of low-matching <entry, phonetic symbol> and <phonetic symbol, entry> binary groups.
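The sketch below illustrates this selection rule: compute the entropy of each low-matching binary group from the two models' probabilities and keep those whose entropy exceeds the preset threshold. The threshold value and the sample records are assumed for illustration.

```python
# Hedged sketch of discriminative-sample extraction: score each
# low-matching pair by the entropy formula above and keep the pairs
# whose entropy exceeds a preset threshold (value assumed here).
import math

def pair_entropy(p_a: float, p_b: float) -> float:
    # H = -P_A log P_A - P_B log P_B, one term per model's inference result.
    return -(p_a * math.log(p_a) + p_b * math.log(p_b))

low_match_pairs = [
    {"entry": "biological", "phones": "/ˌbaɪəˈlɑːdʒɪkəl/", "p_a": 0.55, "p_b": 0.50},
    {"entry": "science",    "phones": "/ˈsaɪəns/",          "p_a": 0.95, "p_b": 0.93},
]
THRESHOLD = 0.5  # preset information-entropy threshold (assumed value)
discriminative = [p for p in low_match_pairs
                  if pair_entropy(p["p_a"], p["p_b"]) > THRESHOLD]
print([p["entry"] for p in discriminative])  # high-entropy pairs go to experts
```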
(5) Domain expert labeling and correction
The most representative samples extracted by the discriminative sample extraction unit are labeled by domain experts through the domain expert correction unit, marking a series of problems in the samples, such as repeatedly or erroneously recognized entries, wrong feature labeling of phonetic symbols, and wrong audio labeling of polyphonic words. Because the invention involves specific or proprietary vocabularies from different professional fields, experts from the corresponding domains are required to perform the labeling work.
The labeled sample data is then corrected by experts through the expert correction knowledge base, after which all correction operations are standardized and quantified to serve as constraint conditions for model training.
(6) Model iterative training
The labeled and corrected sample data serve as constraint conditions for the G2P model and the P2G model. Both models are trained iteratively: a sequence-to-sequence model is used in the training process, and the training of the two models is constrained through a loss function. The loss function for model training is defined by quantifying the correction operations into three operations with different costs (i.e., weights): an add (insert) operation costs 1, a delete operation costs 1, and a modify operation costs 2. The normalized overall training loss function Loss can then be defined by the following formulas:

$$\mathrm{Loss} = \mathrm{Loss}_{\mathrm{seq2seq}}(w, p) + \mathrm{Loss}_{\mathrm{corr}}(w, p)$$

$$\mathrm{Loss}_{\mathrm{corr}}(w, p) = \frac{\alpha\, n_{\mathrm{add}} + \beta\, n_{\mathrm{del}} + \gamma\, n_{\mathrm{mod}}}{L}$$

where Loss is the overall loss function; $\mathrm{Loss}_{\mathrm{seq2seq}}$ is the loss of the sequence-to-sequence model; $\mathrm{Loss}_{\mathrm{corr}}$ is the loss after the manual correction operations are quantified; $\alpha$ is the weight of the add operation, $\beta$ the weight of the delete operation, and $\gamma$ the weight of the modify operation of the loss function; $L$ is the length of each group of entry or phonetic symbol sequences; $w$ denotes an entry sequence and $p$ denotes a phonetic symbol sequence.
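A small sketch of this loss, under the same assumption stated with the formulas above that the quantified correction cost is normalized by sequence length:

```python
# Hedged sketch of the correction-constrained training loss: the seq2seq
# loss plus a term charging each expert correction with the stated costs
# (add = 1, delete = 1, modify = 2). Length normalization is an assumed
# reading of the "standard quantization" step.
def correction_loss(n_add: int, n_del: int, n_mod: int, seq_len: int) -> float:
    W_ADD, W_DEL, W_MOD = 1.0, 1.0, 2.0  # operation weights from the text
    return (W_ADD * n_add + W_DEL * n_del + W_MOD * n_mod) / max(seq_len, 1)

def total_loss(seq2seq_loss: float, n_add: int, n_del: int, n_mod: int,
               seq_len: int) -> float:
    return seq2seq_loss + correction_loss(n_add, n_del, n_mod, seq_len)

# One corrected <entry, phonetic symbol> pair: 1 phone modified in a
# 9-phone transcription, on top of a seq2seq loss of 0.42.
print(total_loss(0.42, n_add=0, n_del=0, n_mod=1, seq_len=9))
```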
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the iterative large-scale pronunciation dictionary construction method when executing the computer program.
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the iterative large-scale pronunciation dictionary construction method described above.
In the embodiment of the invention, compared with prior-art schemes (formulating pronunciation-dictionary rules from the phonetic symbol features of entries, training neural network models, and consuming large amounts of manpower and material resources to manually collect and construct the dictionary's entry and phonetic symbol data), the method generates an entry sequence by preprocessing input raw text data; generates a phonetic symbol sequence by preprocessing input audio data; processes the entry sequence with the G2P model to obtain a phonetic symbol sequence and generate binary groups <entry, phonetic symbol>; processes the phonetic symbol sequence with the P2G model to obtain an entry sequence and generate binary groups <phonetic symbol, entry>; calculates the matching degree between the binary groups and compares it with a preset matching degree; extracts discriminative samples from the binary groups whose matching degree is below the preset value; and obtains the domain experts' labeling and correction of those samples, storing the labeled and corrected binary groups in the multi-level large-scale pronunciation dictionary while standardizing the correction operations as constraint conditions for iterative model training. The whole construction process runs with a human in the loop, is fast and efficient, and repeats cyclically until the large-scale pronunciation dictionary is complete. A large-scale pronunciation dictionary can thus be constructed quickly and effectively, improving the working efficiency of the speech recognition system and reducing labor cost.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An iterative large-scale pronunciation dictionary construction method is characterized by comprising the following steps:
preprocessing input text raw data to generate an entry sequence;
preprocessing input audio data to generate a phonetic symbol sequence;
processing the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generating a binary group < entry, phonetic symbol >;
processing the phonetic symbol sequence based on a P2G model to obtain a term sequence, and generating a binary group < phonetic symbol, term >;
calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >;
comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples;
and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
2. The iterative large-scale pronunciation dictionary construction method according to claim 1, wherein preprocessing the input text raw data to generate a sequence of terms, comprises:
cleaning and standardizing input text raw data to generate a vocabulary entry sequence;
preprocessing input audio data to generate a phonetic symbol sequence, comprising:
and carrying out audio denoising and phoneme segmentation on the input audio raw data to generate a phonetic symbol sequence.
3. The iterative large-scale pronunciation dictionary construction method according to claim 1, further comprising: and if the matching degree is greater than the preset matching degree, storing the corresponding binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into the multi-level large-scale pronunciation dictionary.
4. The iterative large-scale pronunciation dictionary construction method according to claim 1, wherein performing discriminative sample extraction on the binary < vocabulary entry, phonetic symbol > and the binary < phonetic symbol, vocabulary entry > corresponding to the matching degree smaller than the preset matching degree to obtain discriminative samples comprises:
calculating the information entropy of the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >, which are corresponding to the matching degree smaller than the preset matching degree;
and comparing the information entropy with a preset information entropy threshold, and taking the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > corresponding to the information entropy larger than the preset information entropy threshold as discriminative samples.
5. The iterative large-scale pronunciation dictionary construction method according to claim 1, further comprising:
and iteratively training the G2P model and the P2G model by using the labeled and corrected binary < entry, phonetic symbol > and the binary < phonetic symbol, entry >.
6. The iterative large-scale pronunciation dictionary construction method of claim 1, wherein the G2P model and the P2G model are iteratively trained using labeled and corrected doublets < entry, phonetic symbol > and doublets < phonetic symbol, entry >, comprising:
quantifying the corrections of the labeled and corrected binary group <entry, phonetic symbol> and binary group <phonetic symbol, entry> into three operations with different weights: an add operation with weight 1, a delete operation with weight 1 and a modify operation with weight 2;
obtaining a Loss function Loss based on three operations with different weights;
and (5) performing iterative training by using a Loss function Loss constraint G2P model and a P2G model.
7. The iterative large-scale pronunciation dictionary construction method according to claim 1, wherein the Loss function Loss is obtained based on the three operations with different weights according to the following formula:

$$\mathrm{Loss} = \mathrm{Loss}_{\mathrm{seq2seq}}(w, p) + \frac{\alpha\, n_{\mathrm{add}} + \beta\, n_{\mathrm{del}} + \gamma\, n_{\mathrm{mod}}}{L}$$
8. an iterative large-scale pronunciation dictionary construction device, comprising:
the data preprocessing module is used for preprocessing input text raw data to generate an entry sequence; preprocessing input audio data to generate a phonetic symbol sequence;
a G2P module, configured to process the entry sequence based on a G2P model to obtain a phonetic symbol sequence, and generate a binary group < entry, phonetic symbol >;
a P2G module, configured to process the phonetic symbol sequence based on the P2G model to obtain a vocabulary entry sequence, and generate a binary group < phonetic symbol, vocabulary entry >;
the active learning module is used for calculating the matching degree between the binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry >; comparing the matching degree with a preset matching degree, and performing discriminative sample extraction on a binary group < entry, phonetic symbol > and a binary group < phonetic symbol, entry >, which correspond to the matching degree smaller than the preset matching degree, to obtain discriminative samples; and acquiring the label and correction of the discriminative sample by a domain expert, and storing the labeled and corrected binary group < entry, phonetic symbol > and the binary group < phonetic symbol, entry > into a multi-level large-scale pronunciation dictionary.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the iterative large-scale pronunciation dictionary construction method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the iterative large-scale pronunciation dictionary construction method according to any one of claims 1 to 7.
CN202110178948.6A 2021-02-08 2021-02-08 Iterative large-scale pronunciation dictionary construction method and device Active CN112530414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110178948.6A CN112530414B (en) 2021-02-08 2021-02-08 Iterative large-scale pronunciation dictionary construction method and device


Publications (2)

Publication Number Publication Date
CN112530414A (en) 2021-03-19
CN112530414B (en) 2021-05-25

Family

ID=74975572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110178948.6A Active CN112530414B (en) 2021-02-08 2021-02-08 Iterative large-scale pronunciation dictionary construction method and device

Country Status (1)

Country Link
CN (1) CN112530414B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019208859A1 (en) * 2018-04-27 2019-10-31 주식회사 시스트란인터내셔널 Method for generating pronunciation dictionary and apparatus therefor
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111369974A (en) * 2020-03-11 2020-07-03 北京声智科技有限公司 Dialect pronunciation labeling method, language identification method and related device
CN112331207A (en) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 Service content monitoring method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6876543B2 (en) * 2017-06-29 2021-05-26 日本放送協会 Phoneme recognition dictionary generator and phoneme recognition device and their programs


Also Published As

Publication number Publication date
CN112530414A (en) 2021-03-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant