CN117672181A - Method and device for determining pronunciation of polyphone, storage medium and electronic equipment - Google Patents

Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Info

Publication number
CN117672181A
CN117672181A (application number CN202311737873.6A)
Authority
CN
China
Prior art keywords
determining
polyphone
pronunciation
feature
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311737873.6A
Other languages
Chinese (zh)
Inventor
王锦阳 (Wang Jinyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202311737873.6A priority Critical patent/CN117672181A/en
Publication of CN117672181A publication Critical patent/CN117672181A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for determining the pronunciation of a polyphone, a storage medium, and an electronic device, wherein the method comprises the following steps: analyzing a corpus to be processed to determine a first polyphone in the corpus, wherein the first polyphone corresponds to a character feature; masking a sentence to be processed in the corpus by using a character encoding model to obtain a sentence embedding vector of the sentence; generating a feature vector according to the sentence embedding vector, the character feature, and the related-word features corresponding to the character feature; and inputting the feature vector into a trained deep learning model and determining the target pronunciation corresponding to the first polyphone according to the output result of the model. The technical scheme solves the problem of how to accurately determine the pronunciation of polyphones.

Description

Method and device for determining pronunciation of polyphone, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a method and apparatus for determining pronunciation of polyphones, a storage medium, and an electronic device.
Background
At present, when the text used by speech-synthesis technology is converted into speech, the correct pronunciation of each polyphone (a character with multiple readings) often needs to be determined in advance. In the prior art, either a rule-based method or a deep-learning-based method is generally used to determine the pronunciation of a polyphone in a text. The rule-based method requires manually defining word-to-pronunciation rules in a dictionary and then segmenting the text into words to match pronunciations in the dictionary; however, this approach requires maintaining complex rules and a dictionary, and cannot distinguish the correct pronunciation when a dictionary entry has multiple readings. The learning-based method requires a large amount of data to train a neural network model that predicts polyphone pronunciations, but existing data sets for polyphone disambiguation contain few samples and are unevenly distributed. Both methods therefore determine polyphone pronunciations with low accuracy.
Accordingly, the related art faces the problem of how to accurately determine the pronunciation of polyphones.
No effective solution to this problem has yet been proposed in the related art.
Disclosure of Invention
The embodiments of the application provide a method and a device for determining the pronunciation of a polyphone, a storage medium, and an electronic device, which at least solve the problem in the related art of how to accurately determine the pronunciation of polyphones.
According to one aspect of the embodiments of the present application, there is provided a method for determining the pronunciation of a polyphone, including: analyzing a corpus to be processed and determining a first polyphone in the corpus, wherein the first polyphone corresponds to a character feature; masking a sentence to be processed in the corpus by using a character encoding model to obtain a sentence embedding vector of the sentence; generating a feature vector according to the sentence embedding vector, the character feature, and the related-word features corresponding to the character feature; and inputting the feature vector into a trained deep learning model and determining the target pronunciation corresponding to the first polyphone according to the output result of the model, wherein the deep learning model is trained with the historical feature vectors corresponding to a second polyphone in a training corpus as input samples and the sample probability of each pronunciation of the second polyphone as output samples.
In one exemplary embodiment, before generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature, the method includes: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
In an exemplary embodiment, generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature includes: obtaining a coding result obtained by coding the sentence embedded vector, and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; calculating the correlation degree corresponding to the related word characteristics based on the coding result and the transformation result; and generating a feature vector according to the relevance and the character feature.
In an exemplary embodiment, calculating the relevance of the relevant word feature based on the encoding result and the transformation result includes: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
In one exemplary embodiment, generating a feature vector from the relevance and the character feature includes: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
In an exemplary embodiment, determining the pronunciation corresponding to the first polyphone according to the output result of the deep learning model includes: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
In an exemplary embodiment, the method further comprises: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
According to another embodiment of the embodiments of the present application, there is also provided a device for determining pronunciation of a polyphone, including: the analysis module is used for analyzing the corpus to be processed and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features; the processing module is used for carrying out mask processing on the sentences to be processed in the corpus to be processed by using the character coding model to obtain sentence embedded vectors of the sentences to be processed; the generation module is used for generating a feature vector according to the sentence embedding vector, the character features and the related word features corresponding to the character features; the determining module is used for inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
In an exemplary embodiment, the generating module is further configured to: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
In an exemplary embodiment, the generating module further includes: the acquisition unit is used for acquiring a coding result obtained by coding the sentence embedded vector and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; a calculating unit, configured to calculate a correlation degree corresponding to the related word feature based on the encoding result and the transformation result; and the generating unit is used for generating a feature vector according to the relevance and the character features.
In an exemplary embodiment, the above-mentioned calculation unit is further configured to: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
In an exemplary embodiment, the generating unit is further configured to: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
In an exemplary embodiment, the determining module is further configured to: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
In an exemplary embodiment, the determining module is further configured to: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described method of determining a polyphonic pronunciation when run.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the method for determining the pronunciation of the polyphonic word by the computer program.
In the embodiments of the application, the character feature corresponding to a first polyphone in the corpus to be processed is obtained by analysis; a character encoding model masks the sentence to be processed to obtain a sentence embedding vector; and the feature vector generated from the sentence embedding vector, the character feature, and the related-word features corresponding to the character feature is input into a deep learning model trained on the feature vectors of a second polyphone, yielding a probability for each pronunciation of the first polyphone, from which the pronunciation of the first polyphone is determined. This technical scheme solves the problem of how to accurately determine the pronunciation of polyphones and thereby achieves the effect of accurate determination.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of a method for determining pronunciation of a polyphone according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of determining the pronunciation of a polyphone in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of the principle structure of a deep learning model according to an embodiment of the present application;
FIG. 4 is a word feature adapter schematic diagram according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a semi-supervised learning framework according to an embodiment of the present application;
fig. 6 is a block diagram of a device for determining the pronunciation of a polyphone according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the technical solution in the embodiments of the present application is described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein may be implemented in orders other than those illustrated or described. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, and may comprise other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to more clearly understand the technical solutions of the present application, the following describes some terms related to the embodiments of the present application:
Grapheme-to-phoneme conversion (G2P, Grapheme to Phoneme): an important language-processing technique that helps computers understand and generate speech by converting the letters or glyphs of a language (called graphemes) into the phonemes of its pronunciation.
Semi-supervised learning (SSL, Semi-Supervised Learning): a learning method between supervised and unsupervised learning. In semi-supervised learning, the training dataset contains both labeled examples (known outputs) and unlabeled examples (unknown outputs), which enables the model to learn from limited labeled data and extend over more unlabeled data. The goal of semi-supervised learning is to improve the performance of the model by combining limited labeled data with a large amount of unlabeled data.
In the embodiments of the application, to ensure that the model behaves consistently on labeled and unlabeled data, the input labeled data is randomly augmented, and a loss function is then computed on the outputs for the differently augmented data. Training in this semi-supervised manner improves the reliability of the training results and makes the model robust when trained on unlabeled data. Taking labeled polyphone data as an example, the random augmentation segments a sentence into words, keeps only the word containing the polyphone and the polyphone's adjacent words, and uses Word2Vec (a neural-network model for generating word vectors) to replace the other words with similar words, producing a new sentence; the data generated this way is unlabeled.
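As a rough illustration (not code from the patent), the augmentation described above, keeping the polyphone word and its neighbours and substituting the rest, might be sketched as follows; `most_similar` is a hypothetical stand-in for a Word2Vec similarity lookup:

```python
def augment_sentence(words, poly_idx, most_similar):
    """Replace every word except the polyphone word and its immediate
    neighbours with a similar word, yielding a new unlabeled sample.

    words        -- sentence already segmented into words
    poly_idx     -- index of the word containing the polyphone
    most_similar -- callable word -> similar word (e.g. backed by Word2Vec);
                    an assumed interface, not one given by the patent
    """
    keep = {poly_idx - 1, poly_idx, poly_idx + 1}  # polyphone word + neighbours
    return [w if i in keep else most_similar(w) for i, w in enumerate(words)]
```

Any real implementation would back `most_similar` with a trained word-vector model; the sketch only fixes the keep-or-replace logic.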
To use such unlabeled data for model training, pseudo labels must be generated for it. Because extracting polyphone pronunciation data from a pinyin dictionary has the lowest cost, pseudo labels can be generated from dictionary words in which the polyphone's reading is uniquely determined, reducing the probability of wrong labels and speeding up model training.
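A minimal sketch of this pseudo-labeling idea, assuming a toy pinyin dictionary that maps each word to its per-character readings (the dictionary structure here is an assumption for illustration, not one specified by the patent):

```python
def pseudo_label(sentence_words, polyphone, word_pinyin):
    """Assign a pseudo pronunciation label when the polyphone occurs inside a
    dictionary word whose reading is unambiguous; otherwise return None.

    word_pinyin -- toy stand-in for a pinyin dictionary:
                   word -> list of per-character readings
    """
    for w in sentence_words:
        idx = w.find(polyphone)
        if idx >= 0 and w in word_pinyin:
            return word_pinyin[w][idx]  # reading fixed by the word context
    return None
```

Samples for which no dictionary word pins down the reading would simply remain unlabeled under this scheme.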
The method embodiments provided herein may be executed on a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, fig. 1 is a block diagram of the hardware structure of a computer terminal that executes the method for determining the pronunciation of a polyphone according to an embodiment. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1), which may include, but are not limited to, a microprocessor (Microprocessor Unit, abbreviated MPU) or a programmable logic device (Programmable Logic Device, abbreviated PLD), and a memory 104 configured to store data; in one exemplary embodiment, it may further include a transmission device 106 configured to communicate with an input/output device 108.
The memory 104 may be configured to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for determining a polyphonic pronunciation in an embodiment of the present invention, and the processor 102 performs various functional applications and determination of the polyphonic pronunciation by running the computer program stored in the memory 104, that is, implements the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is arranged to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate wirelessly with the internet.
In this embodiment, a method for determining a pronunciation of a polyphone is provided, which is applied to the server, and fig. 2 is a flowchart of a method for determining a pronunciation of a polyphone according to an embodiment of the present application, where the flowchart includes the following steps:
step S202, analyzing a corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
step S204, masking the sentence to be processed in the corpus to be processed by using a character coding model to obtain a sentence embedded vector of the sentence to be processed;
Alternatively, in the above step S204, RoBERTa-wwm may be used as the character encoder to mask the sentence to be processed in the corpus to be processed. RoBERTa is a language model based on bidirectional encoding and a self-attention mechanism, and wwm stands for Whole Word Masking, i.e., masking entire words; specifically, the corpus text is split into characters, and the split characters are input into the RoBERTa-wwm character encoder for processing.
For example, the chinese-roberta-wwm-ext model contains 12 Transformer layers, each with 12 self-attention heads and 768 hidden units. The character encoder processes the input sentence and generates a sentence embedding vector H_S = [h_1, h_2, …, h_n] corresponding to the character features, where n is the sentence length and h_1, h_2, …, h_n are the features corresponding to the characters in the input sentence.
Compared with the classic BERT character encoder, the RoBERTa-wwm character encoder masks whole words and can thus better capture context and semantic associations, improving its Chinese text-processing capability; its training data size and number of training steps are also larger, giving it stronger robustness.
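The whole-word-masking idea can be illustrated with a simplified sketch (this is not the actual RoBERTa-wwm tokenizer, only the character-splitting and whole-word-mask behaviour described above):

```python
def whole_word_mask(words, target_word, mask_token="[MASK]"):
    """Produce character-level tokens with whole-word masking: every
    character of the target word is replaced by the mask token together,
    while the remaining words are split into single characters."""
    tokens = []
    for w in words:
        if w == target_word:
            tokens.extend([mask_token] * len(w))  # mask the whole word
        else:
            tokens.extend(list(w))
    return tokens
```

Character-level BERT masking, by contrast, could mask a single character of a multi-character word, which is the behaviour whole-word masking avoids.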
Step S206, generating a feature vector according to the sentence embedded vector, the character features and the related word features corresponding to the character features;
step S208, inputting the feature vector to a trained deep learning model, and determining a target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training with a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and with a sample probability of each pronunciation of the second polyphone as an output sample.
Through the steps, the corpus to be processed is analyzed, and a first polyphone in the corpus to be processed is determined, wherein the first polyphone corresponds to character features; masking the to-be-processed sentences in the to-be-processed corpus by using a character coding model to obtain sentence embedded vectors of the to-be-processed sentences; generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature; and inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample. The problem of how to accurately determine the pronunciation of the polyphones is solved, and the effect of accurately determining the pronunciation of the polyphones is further achieved.
In one exemplary embodiment, before generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature, the method includes: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
Alternatively, in the above embodiment, the dimension of each pre-trained word vector is, for example, 200. Let x_ti be a related-word feature and denote all the x_ti as x_t = (x_t1, x_t2, …, x_tw), where w is the number of related words and x_t ∈ R^{w×200}. The pre-trained word vectors can be trained with the directional Skip-Gram algorithm, a word-vector training algorithm based on the Skip-Gram model that is characterized by considering both the co-occurrence relationship of word pairs and their positional relationship.
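A hedged sketch of assembling the related-word feature matrix x_t ∈ R^{w×200} from pre-trained word vectors; the plain-dict lookup and the zero-vector fallback for out-of-vocabulary words are illustrative assumptions, not details given in the text:

```python
import numpy as np

def related_word_features(neighbor_words, word_vectors, dim=200):
    """Stack the pre-trained vectors of the words related to the polyphone's
    neighbouring characters into a matrix of shape (w, dim).

    word_vectors -- plain dict used as a stand-in for the pre-trained
                    word-vector database; unknown words map to zeros
    """
    rows = [word_vectors.get(w, np.zeros(dim)) for w in neighbor_words]
    return np.stack(rows)  # x_t, shape (w, dim)
```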
In an exemplary embodiment, generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature includes: obtaining a coding result obtained by coding the sentence embedded vector, and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; calculating the correlation degree corresponding to the related word characteristics based on the coding result and the transformation result; and generating a feature vector according to the relevance and the character feature.
Alternatively, in the above embodiment, the encoding result obtained by encoding the sentence embedding vector may be obtained using formula (1):

H_e = ReLU(Conv1D(H_S))   (1)

where H_e is the encoding result, H_S is the sentence embedding vector, H_S ∈ R^{n×d}, H_e ∈ R^{1×d}, d is the hidden-layer dimension of RoBERTa-wwm, Conv1D is a one-dimensional convolution layer of the neural network, and ReLU is the linear rectification function.
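Under one plausible reading of the shapes in formula (1), in which the convolution kernel spans the whole sentence so that H_S ∈ R^{n×d} collapses to H_e ∈ R^{1×d} (the kernel size is not specified in the text), the step could be sketched as:

```python
import numpy as np

def encode_sentence(H_S, W_conv, b_conv):
    """Sketch of H_e = ReLU(Conv1D(H_S)) with a full-width kernel.

    H_S    -- sentence embedding vector, shape (n, d)
    W_conv -- assumed kernel of shape (n, d, d); b_conv -- bias, shape (d,)
    """
    n, d = H_S.shape
    # full-width convolution: a single output position, d output channels
    out = np.einsum('nd,nde->e', H_S, W_conv) + b_conv
    return np.maximum(out, 0.0).reshape(1, d)  # ReLU
```

Other kernel sizes followed by pooling would give the same output shape; the patent does not say which variant is used.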
The transformation result obtained by performing a nonlinear transformation on the related-word features can be determined using formula (2):

V_t = W_2 · tanh(W_1 · x_t^T + b_1) + b_2   (2)

where V_t is the transformation result, x_t is the related-word feature, W_1 ∈ R^{d×200}, W_2 ∈ R^{d×d}, and b_1 and b_2 are bias values.
In an exemplary embodiment, calculating the relevance of the relevant word feature based on the encoding result and the transformation result includes: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
The correlation degree corresponding to the related-word features may be calculated based on the encoding result and the transformation result using formula (3):

a_t = softmax(H_e · W_attn · V_t)   (3)

where a_t is the correlation degree, W_attn ∈ R^{d×d} is a preset weight matrix, and softmax is the activation function.
In one exemplary embodiment, generating a feature vector from the relevance and the character feature includes: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
A feature vector may be generated from the correlation degree and the character feature using formula (4):

h'_t = h_t + Σ_j a_tj · v_tj   (4)

where all the v_tj can be expressed as V_t = (v_t1, …, v_tw), h_t is the character feature corresponding to the first polyphone, and h'_t is the generated feature vector.
In an exemplary embodiment, determining the pronunciation corresponding to the first polyphone according to the output result of the deep learning model includes: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
Alternatively, in the above embodiment, the output vector may be normalized into a probability distribution by a Softmax layer, which is the last layer of the deep learning model and outputs the probability-distribution result.
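A minimal sketch of this final step, softmax normalization followed by taking the highest-probability reading as the target pronunciation:

```python
import numpy as np

def pick_pronunciation(logits, pronunciations):
    """Normalize the model's output into per-pronunciation probabilities
    with a softmax, then return the maximum-probability reading."""
    p = np.exp(logits - np.max(logits))  # shift for numerical stability
    p = p / p.sum()
    return pronunciations[int(np.argmax(p))], p
```

The candidate list `pronunciations` here is an illustrative input; in the patent's setting it would be the known readings of the first polyphone.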
In an exemplary embodiment, the method further comprises: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
Alternatively, in the above embodiment, the preset threshold may be a static threshold or a dynamic threshold. For example, the dynamic threshold may take the value min(T_max, T_min + ⌊epoch/s⌋), where T_max is the upper threshold bound, T_min is the lower threshold bound, s is used to adjust the update frequency of the threshold and reflects the amount of change per update, and epoch denotes the number of model iterations; if s is 2, the dynamic threshold is updated every two epochs.
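The dynamic threshold schedule can be sketched as below. The `step` increment per update is an assumption added for illustration; the patent only fixes the bounds T_min and T_max and the update frequency s.

```python
# Dynamic threshold min(T_max, T_min + step * (epoch // s)): with s = 2 the
# threshold steps up once every two epochs until it reaches T_max.
# The unit of the per-update increment `step` is an assumption.
def dynamic_threshold(epoch, t_min, t_max, s, step=1.0):
    return min(t_max, t_min + step * (epoch // s))

# s = 2: epochs 0-1 share a value, epochs 2-3 the next, capped at T_max
print([dynamic_threshold(e, t_min=0.5, t_max=2.0, s=2, step=0.5)
       for e in range(8)])  # [0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0]
```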
Optionally, in the above embodiment, a semi-supervised learning method is used to address the lack of effective polyphone samples: more training corpus is generated from a small amount of existing training corpus to serve as samples, improving the training efficiency of machine learning. Specifically, the pronunciation of a polyphone is usually determined by the words adjacent to it in the sentence, while words beyond a certain neighbouring range have little influence on the pronunciation. Labelled polyphone samples can therefore be amplified: the polyphone and its adjacent words in a sample are kept unchanged, the other words in the sentence are replaced with similar words, and the replaced sentence is used as a new sample to train the deep learning model.
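The label-preserving amplification described above can be sketched as follows: the polyphone and its neighbouring words stay fixed, the remaining words are swapped for near-synonyms, and the original pronunciation label is reused. The `synonyms` table stands in for a Word2Vec nearest-neighbour lookup; the sentence and replacements are illustrative.

```python
# Amplify a labelled polyphone sample: keep the polyphone and its immediate
# neighbours unchanged, replace other words with similar words.
def augment(words, polyphone_idx, synonyms):
    keep = {polyphone_idx - 1, polyphone_idx, polyphone_idx + 1}
    return [w if i in keep else synonyms.get(w, w)
            for i, w in enumerate(words)]

sentence = ["I", "really", "want", "to", "sleep"]   # "sleep" is the polyphone
synonyms = {"really": "truly", "I": "we"}           # stand-in for Word2Vec lookup
print(augment(sentence, polyphone_idx=4, synonyms=synonyms))
# ['we', 'truly', 'want', 'to', 'sleep']
```

Because the new sentence preserves the local context of the polyphone, it can inherit the original pronunciation label without manual annotation.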
Through this embodiment, the problem of lacking polyphone training samples can be alleviated: the model can learn from limited labelled data and extend to more unlabelled data, the accuracy of the deep learning model in polyphone pronunciation prediction is improved, and labelled data samples can be obtained without consuming a large amount of manpower for data annotation, thereby reducing costs and improving efficiency.
For a better understanding of the process of the method for determining polyphone pronunciation, the implementation flow of polyphone pronunciation determination is described below with reference to alternative embodiments, although it is not limited to the technical solution of the embodiments of the present application.
In an alternative embodiment, fig. 3 is a schematic structural diagram of a deep learning model according to an embodiment of the present application, specifically shown in fig. 3:
RoBERTa-wwm is used as the character encoder to mask the sentence to be processed in the corpus to be processed. The characters C_1 to C_n obtained after segmenting the sentence to be processed are input into the RoBERTa-wwm character encoder for processing, yielding character embedding vectors h_1 to h_n. For the target polyphone C_t among the segmented characters, its adjacent characters are determined, and the words in the Chinese word vector library that match the adjacent characters are taken as the related words x_t. The character embedding vector h_t corresponding to the target polyphone, the character embedding vectors H_S (h_1 to h_n) of the sentence to be processed, and the related words x_t are then input into the word feature matcher to obtain the feature vector h'_t.
For the obtained feature vector h'_t, as shown in fig. 4: the character embedding vectors H_S of the sentence to be processed are encoded using a one-dimensional convolution Conv1D, the related words x_t are subjected to a nonlinear transformation, and a (sentence-word) attention mechanism is then used to obtain the words most relevant to the sentence. The result is added to and fused with the character embedding vector h_t corresponding to the target polyphone to obtain the feature vector h'_t, which is normalized into a probability distribution by the Softmax layer in the deep learning model to obtain the probabilities of all possible pronunciations of the polyphone.
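The (sentence-word) attention step can be sketched as follows: each transformed related-word feature is scored against the encoded sentence representation through a weight matrix, the scores are softmax-normalized into correlation degrees, and the weighted features are added to the polyphone's character embedding. This is a simplified sketch: the Conv1D encoding of H_S is assumed to have already produced `enc_s`, and the identity weight matrix plus all dimensions are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def attend(h_t, enc_s, V_t, W):
    """Score each v_tj against the sentence encoding, normalize, and fuse with h_t."""
    scores = [dot(enc_s, matvec(W, v)) for v in V_t]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]        # correlation degrees
    fused = list(h_t)
    for a, v in zip(alphas, V_t):
        for i, x in enumerate(v):
            fused[i] += a * x                     # weighted fusion with h_t
    return fused, alphas

h_t = [0.1, 0.2]                  # character embedding of the target polyphone
enc_s = [1.0, 0.0]                # encoded sentence vector (assumed Conv1D output)
V_t = [[1.0, 0.0], [0.0, 1.0]]    # transformed related-word features
W = [[1.0, 0.0], [0.0, 1.0]]      # identity weight matrix, for the sketch only
fused, alphas = attend(h_t, enc_s, V_t, W)
print(round(sum(alphas), 6))      # 1.0
```

The first related word aligns with the sentence encoding, so it receives the larger correlation degree and dominates the fusion.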
In an alternative embodiment, the process of semi-supervised learning may be described in conjunction with FIG. 5, as shown in FIG. 5:
in semi-supervised learning, the model can learn from the existing training corpus and expand it into more training samples. Taking a training corpus containing polyphones as an example in the embodiment of the application, the method for randomly amplifying the training corpus is as follows: segment the sentence into words, retain only the word containing the polyphone and the words adjacent to it, and replace the other words with similar words via Word2Vec to generate a new sentence. For example, in fig. 5, the polyphone in one training corpus is "sleep", the retained words containing the polyphone and its adjacent words are "to sleep", and the amplified sentence is "I want to sleep"; the polyphone in another training corpus is "analysis", the retained words containing the polyphone are "to analyze", and the amplified sentence is "to analyze the question". The embedding vector of the amplified sentence is obtained through the character encoder and then input into the deep learning model (the polyphone disambiguation model in fig. 5) to generate the pronunciation of the polyphone; the accuracy of the generated result is calculated through a loss function, and the amplified training corpus is added to the training samples if the accuracy meets the standard.
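The acceptance step of this loop can be sketched as follows: an amplified sentence is only added to the training set if the disambiguation model's loss on it, measured against the inherited pronunciation label, is low enough. The cross-entropy form and the `max_loss` bar are illustrative assumptions; the patent only says the accuracy is checked through a loss function.

```python
import math

def accept_augmented(probs, label_idx, max_loss=0.5):
    """Keep an amplified sample only if the model's loss on the inherited label is small."""
    loss = -math.log(max(probs[label_idx], 1e-12))  # cross-entropy on the label
    return loss <= max_loss

print(accept_augmented([0.9, 0.05, 0.05], label_idx=0))  # True
print(accept_augmented([0.2, 0.4, 0.4], label_idx=0))    # False
```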
Through the embodiment, the character features of the polyphones and the related word features can be fused to improve the accuracy of determining the pronunciation of the polyphones, and the problem of lack of training samples in the process of training a deep learning model for determining the pronunciation of the polyphones is solved by adopting a semi-supervised learning method.
From the description of the above embodiments, it will be clear to those skilled in the art that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present application.
Fig. 6 is a structural block diagram of a polyphone pronunciation determining apparatus according to an embodiment of the present application; as shown in fig. 6, the apparatus includes:
The parsing module 62 is configured to parse a corpus to be processed to determine a first polyphone in the corpus to be processed, where the first polyphone corresponds to a character feature;
the processing module 64 is configured to perform mask processing on a to-be-processed sentence in the to-be-processed corpus by using a character encoding model, so as to obtain a sentence embedded vector of the to-be-processed sentence;
a generating module 66, configured to generate a feature vector according to the sentence embedding vector, the character feature and the related word feature corresponding to the character feature;
the determining module 68 is configured to input the feature vector to a trained deep learning model, and determine, according to an output result of the deep learning model, a target pronunciation corresponding to the first polyphone, where the deep learning model is obtained by training with a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and with a sample probability of each pronunciation of the second polyphone as an output sample.
Through the device, the first polyphones in the corpus to be processed are determined by analyzing the corpus to be processed, wherein the first polyphones correspond to character features; masking the to-be-processed sentences in the to-be-processed corpus by using a character coding model to obtain sentence embedded vectors of the to-be-processed sentences; generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature; and inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample. By adopting the technical scheme, the problem of how to accurately determine the pronunciation of the polyphones is solved, and the effect of accurately determining the pronunciation of the polyphones is further realized.
In one exemplary embodiment, the generation module 66 is further configured to: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
In one exemplary embodiment, the generating module 66 further includes: the acquisition unit is used for acquiring a coding result obtained by coding the sentence embedded vector and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; a calculating unit, configured to calculate a correlation degree corresponding to the related word feature based on the encoding result and the transformation result; and the generating unit is used for generating a feature vector according to the relevance and the character features.
In an exemplary embodiment, the above-mentioned calculation unit is further configured to: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
In an exemplary embodiment, the generating unit is further configured to: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
In one exemplary embodiment, the determination module 68 is further configured to: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
In one exemplary embodiment, the determination module 68 is further configured to: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
Embodiments of the present application also provide a storage medium including a stored program, wherein the program performs the method of any one of the above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
s1, analyzing a corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
s2, carrying out mask processing on the sentences to be processed in the corpus to be processed by using a character coding model to obtain sentence embedded vectors of the sentences to be processed;
s3, generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature;
s4, inputting the feature vector into a trained deep learning model, and determining a target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, analyzing a corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
s2, carrying out mask processing on the sentences to be processed in the corpus to be processed by using a character coding model to obtain sentence embedded vectors of the sentences to be processed;
s3, generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature;
s4, inputting the feature vector into a trained deep learning model, and determining a target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that given here, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing is merely a preferred embodiment of the present application. It should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application, and such modifications are intended to fall within the scope of protection of the present application.

Claims (10)

1. A method for determining pronunciation of a polyphone, comprising:
analyzing the corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
masking the to-be-processed sentences in the to-be-processed corpus by using a character coding model to obtain sentence embedded vectors of the to-be-processed sentences;
generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature;
and inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
2. The method of claim 1, wherein before generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature, comprising:
determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database;
obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words;
and determining the pre-training word vector as the related word characteristic.
3. The method for determining pronunciation of polyphone as claimed in claim 1, wherein generating feature vectors from the sentence embedding vectors, the character features and the related word features corresponding to the character features, comprises:
obtaining a coding result obtained by coding the sentence embedded vector, and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics;
calculating the correlation degree corresponding to the related word characteristics based on the coding result and the transformation result;
And generating a feature vector according to the relevance and the character feature.
4. A method of determining a pronunciation of a polyphone as claimed in claim 3, wherein calculating a degree of correlation corresponding to the relevant word feature based on the encoding result and the transformation result includes:
obtaining the product of the coding result, the transformation result and a preset weight matrix;
and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
5. A method of determining a pronunciation of a polyphone as claimed in claim 3, wherein generating a feature vector from the relevance and the character feature includes:
determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products;
and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
6. The method for determining pronunciation of a polyphone according to claim 1, wherein determining the pronunciation corresponding to the first polyphone according to the output result of the deep learning model comprises:
Determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model;
and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
7. The method for determining pronunciation of a polyphone as claimed in claim 6, further comprising: calculating information entropy corresponding to the probabilities of all the pronunciations;
under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus;
and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
8. A polyphone pronunciation determining apparatus, comprising:
the analysis module is used for analyzing the corpus to be processed and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
The processing module is used for carrying out mask processing on the sentences to be processed in the corpus to be processed by using the character coding model to obtain sentence embedded vectors of the sentences to be processed;
the generation module is used for generating a feature vector according to the sentence embedding vector, the character features and the related word features corresponding to the character features;
the determining module is used for inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202311737873.6A 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment Pending CN117672181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311737873.6A CN117672181A (en) 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311737873.6A CN117672181A (en) 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117672181A true CN117672181A (en) 2024-03-08

Family

ID=90078764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311737873.6A Pending CN117672181A (en) 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117672181A (en)

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN110851596B (en) Text classification method, apparatus and computer readable storage medium
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN110781276A (en) Text extraction method, device, equipment and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
JP2016513269A (en) Method and device for acoustic language model training
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN111241828A (en) Intelligent emotion recognition method and device and computer readable storage medium
CN114026556A (en) Semantic element prediction method, computer device and storage medium background
CN111859940B (en) Keyword extraction method and device, electronic equipment and storage medium
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112667782A (en) Text classification method, device, equipment and storage medium
CN111079432A (en) Text detection method and device, electronic equipment and storage medium
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN111695591A (en) AI-based interview corpus classification method, device, computer equipment and medium
CN114255096A (en) Data requirement matching method and device, electronic equipment and storage medium
CN116070632A (en) Informal text entity tag identification method and device
CN111091004A (en) Training method and training device for sentence entity labeling model and electronic equipment
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
CN112084769A (en) Dependency syntax model optimization method, device, equipment and readable storage medium
CN114492661A (en) Text data classification method and device, computer equipment and storage medium
US11893344B2 (en) Morpheme analysis learning device, morpheme analysis device, method, and program
CN111159405B (en) Irony detection method based on background knowledge
CN116483314A (en) Automatic intelligent activity diagram generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination