CN117672181A - Method and device for determining pronunciation of polyphone, storage medium and electronic equipment - Google Patents

Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Info

Publication number
CN117672181A
CN117672181A (application number CN202311737873.6A)
Authority
CN
China
Prior art keywords
determining
polyphone
pronunciation
feature
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311737873.6A
Other languages
Chinese (zh)
Inventor
王锦阳 (Wang Jinyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202311737873.6A priority Critical patent/CN117672181A/en
Publication of CN117672181A publication Critical patent/CN117672181A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for determining the pronunciation of a polyphone, a storage medium, and an electronic device, wherein the method comprises the following steps: analyzing a corpus to be processed to determine a first polyphone in the corpus, wherein the first polyphone corresponds to a character feature; masking a sentence to be processed in the corpus by using a character encoding model to obtain a sentence embedding vector of the sentence; generating a feature vector according to the sentence embedding vector, the character feature, and the related-word features corresponding to the character feature; and inputting the feature vector into a trained deep learning model and determining the target pronunciation corresponding to the first polyphone according to the output result of the model. The technical scheme solves the problem of how to accurately determine the pronunciation of polyphones.

Description

Method and device for determining pronunciation of polyphone, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a method and apparatus for determining pronunciation of polyphones, a storage medium, and an electronic device.
Background
At present, when the text used by speech-synthesis technology is converted into speech, the correct pronunciation of each polyphone (a character with multiple readings) often needs to be determined in advance. In the prior art, either a rule-based method or a deep-learning-based method is generally used to determine the pronunciation of a polyphone in a text. The rule-based method requires manually defining word-to-pronunciation rules in a dictionary and then segmenting the text into words to match pronunciations in the dictionary; however, this approach requires maintaining complex rules and a dictionary, and cannot distinguish the correct pronunciation when a dictionary entry has multiple readings. The learning-based method requires a large amount of data to train a neural network model that predicts polyphone pronunciations, but existing data sets for polyphone disambiguation contain few samples and are unevenly distributed. Both methods therefore determine polyphone pronunciations with low accuracy.
Accordingly, the related art faces the problem of how to accurately determine the pronunciation of polyphones.
No effective solution to this problem has yet been proposed in the related art.
Disclosure of Invention
The embodiments of the application provide a method and a device for determining the pronunciation of a polyphone, a storage medium, and an electronic device, which at least solve the problem in the related art of how to accurately determine the pronunciation of polyphones.
According to one aspect of the embodiments of the present application, there is provided a method for determining the pronunciation of a polyphone, including: analyzing a corpus to be processed and determining a first polyphone in the corpus, wherein the first polyphone corresponds to a character feature; masking a sentence to be processed in the corpus by using a character encoding model to obtain a sentence embedding vector of the sentence; generating a feature vector according to the sentence embedding vector, the character feature, and the related-word features corresponding to the character feature; and inputting the feature vector into a trained deep learning model and determining the target pronunciation corresponding to the first polyphone according to the output result of the model, wherein the deep learning model is trained with the historical feature vectors corresponding to a second polyphone in a training corpus as input samples and the sample probability of each pronunciation of the second polyphone as output samples.
In one exemplary embodiment, before generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature, the method includes: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
In an exemplary embodiment, generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature includes: obtaining a coding result obtained by coding the sentence embedded vector, and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; calculating the correlation degree corresponding to the related word characteristics based on the coding result and the transformation result; and generating a feature vector according to the relevance and the character feature.
In an exemplary embodiment, calculating the relevance of the relevant word feature based on the encoding result and the transformation result includes: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
In one exemplary embodiment, generating a feature vector from the relevance and the character feature includes: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
In an exemplary embodiment, determining the pronunciation corresponding to the first polyphone according to the output result of the deep learning model includes: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
In an exemplary embodiment, the method further comprises: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
According to another embodiment of the embodiments of the present application, there is also provided a device for determining pronunciation of a polyphone, including: the analysis module is used for analyzing the corpus to be processed and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features; the processing module is used for carrying out mask processing on the sentences to be processed in the corpus to be processed by using the character coding model to obtain sentence embedded vectors of the sentences to be processed; the generation module is used for generating a feature vector according to the sentence embedding vector, the character features and the related word features corresponding to the character features; the determining module is used for inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
In an exemplary embodiment, the generating module is further configured to: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
In an exemplary embodiment, the generating module further includes: the acquisition unit is used for acquiring a coding result obtained by coding the sentence embedded vector and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; a calculating unit, configured to calculate a correlation degree corresponding to the related word feature based on the encoding result and the transformation result; and the generating unit is used for generating a feature vector according to the relevance and the character features.
In an exemplary embodiment, the above-mentioned calculation unit is further configured to: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
In an exemplary embodiment, the generating unit is further configured to: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
In an exemplary embodiment, the determining module is further configured to: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
In an exemplary embodiment, the determining module is further configured to: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described method of determining a polyphonic pronunciation when run.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the method for determining the pronunciation of the polyphonic word by the computer program.
In the embodiments of the application, the character feature corresponding to a first polyphone in the corpus to be processed is obtained by analysis; a character encoding model masks the sentence to be processed to obtain a sentence embedding vector; and the feature vector generated from the sentence embedding vector, the character feature, and the related-word features corresponding to the character feature is input into a deep learning model trained on the feature vectors of a second polyphone, yielding a probability for each pronunciation of the first polyphone, from which the pronunciation of the first polyphone is determined. This technical scheme solves the problem of how to accurately determine the pronunciation of polyphones and thereby achieves the effect of accurate determination.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of a method for determining pronunciation of a polyphone according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of determining the pronunciation of a polyphone in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of the principle structure of a deep learning model according to an embodiment of the present application;
FIG. 4 is a word feature adapter schematic diagram according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a semi-supervised learning framework according to an embodiment of the present application;
fig. 6 is a block diagram of a device for determining the pronunciation of a polyphone according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the technical solution in the embodiments of the present application is described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein may be implemented in orders other than those illustrated or described. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, and may comprise other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to more clearly understand the technical solutions of the present application, the following describes some terms related to the embodiments of the present application:
Grapheme-to-phoneme conversion (G2P, Grapheme to Phoneme): an important language-processing technique that helps computers understand and generate speech by converting the letters or glyphs of a language (called graphemes) into the phonemes of its pronunciation.
Semi-supervised learning (SSL, Semi-Supervised Learning): a learning method between supervised and unsupervised learning. In semi-supervised learning, the training dataset contains both labeled examples (known outputs) and unlabeled examples (unknown outputs), which enables the model to learn from limited labeled data and extend over more unlabeled data. The goal of semi-supervised learning is to improve the performance of the model by combining limited labeled data with a large amount of unlabeled data.
In the embodiments of the application, to ensure that the model behaves consistently on labeled and unlabeled data, the input labeled data is randomly augmented, and a loss function is then computed on the outputs for the differently augmented data. Training in this semi-supervised manner improves the reliability of the training results and makes the model robust when trained on unlabeled data. Taking labeled polyphone data as an example, the random augmentation segments a sentence into words, keeps only the word containing the polyphone and the polyphone's adjacent words, and uses Word2Vec (a neural-network model for generating word vectors) to replace the other words with similar words, producing a new sentence; the data generated this way is unlabeled.
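As a rough illustration (not code from the patent), the augmentation described above, keeping the polyphone word and its neighbours and substituting the rest, might be sketched as follows; `most_similar` is a hypothetical stand-in for a Word2Vec similarity lookup:

```python
def augment_sentence(words, poly_idx, most_similar):
    """Replace every word except the polyphone word and its immediate
    neighbours with a similar word, yielding a new unlabeled sample.

    words        -- sentence already segmented into words
    poly_idx     -- index of the word containing the polyphone
    most_similar -- callable word -> similar word (e.g. backed by Word2Vec);
                    an assumed interface, not one given by the patent
    """
    keep = {poly_idx - 1, poly_idx, poly_idx + 1}  # polyphone word + neighbours
    return [w if i in keep else most_similar(w) for i, w in enumerate(words)]
```

Any real implementation would back `most_similar` with a trained word-vector model; the sketch only fixes the keep-or-replace logic.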
To use such unlabeled data for model training, pseudo labels must be generated for it. Because extracting polyphone pronunciation data from a pinyin dictionary has the lowest cost, pseudo labels can be generated from dictionary words in which the polyphone's reading is uniquely determined, reducing the probability of wrong labels and speeding up model training.
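A minimal sketch of this pseudo-labeling idea, assuming a toy pinyin dictionary that maps each word to its per-character readings (the dictionary structure here is an assumption for illustration, not one specified by the patent):

```python
def pseudo_label(sentence_words, polyphone, word_pinyin):
    """Assign a pseudo pronunciation label when the polyphone occurs inside a
    dictionary word whose reading is unambiguous; otherwise return None.

    word_pinyin -- toy stand-in for a pinyin dictionary:
                   word -> list of per-character readings
    """
    for w in sentence_words:
        idx = w.find(polyphone)
        if idx >= 0 and w in word_pinyin:
            return word_pinyin[w][idx]  # reading fixed by the word context
    return None
```

Samples for which no dictionary word pins down the reading would simply remain unlabeled under this scheme.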
The method embodiments provided herein may be executed on a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, fig. 1 is a block diagram of the hardware structure of a computer terminal that executes the method for determining the pronunciation of a polyphone according to an embodiment. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1), which may include, but are not limited to, a microprocessor (Microprocessor Unit, abbreviated MPU) or a programmable logic device (Programmable Logic Device, abbreviated PLD), and a memory 104 configured to store data; in one exemplary embodiment, it may further include a transmission device 106 configured to communicate with an input/output device 108.
The memory 104 may be configured to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for determining a polyphonic pronunciation in an embodiment of the present invention, and the processor 102 performs various functional applications and determination of the polyphonic pronunciation by running the computer program stored in the memory 104, that is, implements the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is arranged to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate wirelessly with the internet.
In this embodiment, a method for determining a pronunciation of a polyphone is provided, which is applied to the server, and fig. 2 is a flowchart of a method for determining a pronunciation of a polyphone according to an embodiment of the present application, where the flowchart includes the following steps:
step S202, analyzing a corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
step S204, masking the sentence to be processed in the corpus to be processed by using a character coding model to obtain a sentence embedded vector of the sentence to be processed;
Alternatively, in the above step S204, RoBERTa-wwm may be used as the character encoder to mask the sentence to be processed in the corpus to be processed. RoBERTa is a language model based on bidirectional encoding and a self-attention mechanism, and wwm stands for Whole Word Masking, i.e., masking entire words; specifically, the corpus text is split into characters, and the split characters are input into the RoBERTa-wwm character encoder for processing.
For example, the chinese-roberta-wwm-ext model contains 12 Transformer layers, each with 12 self-attention heads and 768 hidden units. The character encoder processes the input sentence and generates a sentence embedding vector H_S = [h_1, h_2, …, h_n] corresponding to the character features, where n is the sentence length and h_1, h_2, …, h_n are the features corresponding to the characters in the input sentence.
Compared with the classic BERT character encoder, the RoBERTa-wwm character encoder masks whole words and can thus better capture context and semantic associations, improving its Chinese text-processing capability; its training data size and number of training steps are also larger, giving it stronger robustness.
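The whole-word-masking idea can be illustrated with a simplified sketch (this is not the actual RoBERTa-wwm tokenizer, only the character-splitting and whole-word-mask behaviour described above):

```python
def whole_word_mask(words, target_word, mask_token="[MASK]"):
    """Produce character-level tokens with whole-word masking: every
    character of the target word is replaced by the mask token together,
    while the remaining words are split into single characters."""
    tokens = []
    for w in words:
        if w == target_word:
            tokens.extend([mask_token] * len(w))  # mask the whole word
        else:
            tokens.extend(list(w))
    return tokens
```

Character-level BERT masking, by contrast, could mask a single character of a multi-character word, which is the behaviour whole-word masking avoids.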
Step S206, generating a feature vector according to the sentence embedded vector, the character features and the related word features corresponding to the character features;
step S208, inputting the feature vector to a trained deep learning model, and determining a target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training with a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and with a sample probability of each pronunciation of the second polyphone as an output sample.
Through the steps, the corpus to be processed is analyzed, and a first polyphone in the corpus to be processed is determined, wherein the first polyphone corresponds to character features; masking the to-be-processed sentences in the to-be-processed corpus by using a character coding model to obtain sentence embedded vectors of the to-be-processed sentences; generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature; and inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample. The problem of how to accurately determine the pronunciation of the polyphones is solved, and the effect of accurately determining the pronunciation of the polyphones is further achieved.
In one exemplary embodiment, before generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature, the method includes: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
Alternatively, in the above embodiment, the dimension of each pre-trained word vector is, for example, 200. Let x_ti be a related-word feature and denote all the x_ti as x_t = (x_t1, x_t2, …, x_tw), where w is the number of related words and x_t ∈ R^{w×200}. The pre-trained word vectors can be trained with the directional Skip-Gram algorithm, a word-vector training algorithm based on the Skip-Gram model that is characterized by considering both the co-occurrence relationship of word pairs and their positional relationship.
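A hedged sketch of assembling the related-word feature matrix x_t ∈ R^{w×200} from pre-trained word vectors; the plain-dict lookup and the zero-vector fallback for out-of-vocabulary words are illustrative assumptions, not details given in the text:

```python
import numpy as np

def related_word_features(neighbor_words, word_vectors, dim=200):
    """Stack the pre-trained vectors of the words related to the polyphone's
    neighbouring characters into a matrix of shape (w, dim).

    word_vectors -- plain dict used as a stand-in for the pre-trained
                    word-vector database; unknown words map to zeros
    """
    rows = [word_vectors.get(w, np.zeros(dim)) for w in neighbor_words]
    return np.stack(rows)  # x_t, shape (w, dim)
```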
In an exemplary embodiment, generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature includes: obtaining a coding result obtained by coding the sentence embedded vector, and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; calculating the correlation degree corresponding to the related word characteristics based on the coding result and the transformation result; and generating a feature vector according to the relevance and the character feature.
Alternatively, in the above embodiment, the encoding result obtained by encoding the sentence embedding vector may be obtained using formula (1):

H_e = ReLU(Conv1D(H_S))   (1)

where H_e is the encoding result, H_S is the sentence embedding vector, H_S ∈ R^{n×d}, H_e ∈ R^{1×d}, d is the hidden-layer dimension of RoBERTa-wwm, Conv1D is a one-dimensional convolution layer of the neural network, and ReLU is the linear rectification function.
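Under one plausible reading of the shapes in formula (1), in which the convolution kernel spans the whole sentence so that H_S ∈ R^{n×d} collapses to H_e ∈ R^{1×d} (the kernel size is not specified in the text), the step could be sketched as:

```python
import numpy as np

def encode_sentence(H_S, W_conv, b_conv):
    """Sketch of H_e = ReLU(Conv1D(H_S)) with a full-width kernel.

    H_S    -- sentence embedding vector, shape (n, d)
    W_conv -- assumed kernel of shape (n, d, d); b_conv -- bias, shape (d,)
    """
    n, d = H_S.shape
    # full-width convolution: a single output position, d output channels
    out = np.einsum('nd,nde->e', H_S, W_conv) + b_conv
    return np.maximum(out, 0.0).reshape(1, d)  # ReLU
```

Other kernel sizes followed by pooling would give the same output shape; the patent does not say which variant is used.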
The transformation result obtained by performing a nonlinear transformation on the related-word features can be determined using formula (2):

V_t = W_2 · tanh(W_1 · x_t^T + b_1) + b_2   (2)

where V_t is the transformation result, x_t is the related-word feature, W_1 ∈ R^{d×200}, W_2 ∈ R^{d×d}, and b_1 and b_2 are bias values.
In an exemplary embodiment, calculating the relevance of the relevant word feature based on the encoding result and the transformation result includes: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
The correlation degree corresponding to the related-word features may be calculated based on the encoding result and the transformation result using formula (3):

a_t = softmax(H_e · W_attn · V_t)   (3)

where a_t is the correlation degree, W_attn ∈ R^{d×d} is a preset weight matrix, and softmax is the activation function.
In one exemplary embodiment, generating a feature vector from the relevance and the character feature includes: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
A feature vector may be generated from the correlation degree and the character feature using formula (4):

h'_t = h_t + Σ_j a_tj · v_tj   (4)

where all the v_tj can be expressed as V_t = (v_t1, …, v_tw), h_t is the character feature corresponding to the first polyphone, and h'_t is the generated feature vector.
In an exemplary embodiment, determining the pronunciation corresponding to the first polyphone according to the output result of the deep learning model includes: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
Alternatively, in the above embodiment, the output vector may be normalized into a probability distribution by a Softmax layer, which is the last layer of the deep learning model and outputs the probability-distribution result.
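A minimal sketch of this final step, softmax normalization followed by taking the highest-probability reading as the target pronunciation:

```python
import numpy as np

def pick_pronunciation(logits, pronunciations):
    """Normalize the model's output into per-pronunciation probabilities
    with a softmax, then return the maximum-probability reading."""
    p = np.exp(logits - np.max(logits))  # shift for numerical stability
    p = p / p.sum()
    return pronunciations[int(np.argmax(p))], p
```

The candidate list `pronunciations` here is an illustrative input; in the patent's setting it would be the known readings of the first polyphone.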
In an exemplary embodiment, the method further comprises: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
Alternatively, in the above embodiment, the preset threshold may be a static threshold or a dynamic threshold. For example, the dynamic threshold may take the value min(T_max, T_min + ⌊epoch/s⌋), where T_max is the upper threshold bound, T_min is the lower threshold bound, s is used to adjust the update frequency of the threshold and reflects the amount of change per update, and epoch denotes the number of model iterations; if s is 2, the dynamic threshold is updated every two epochs.
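The dynamic threshold schedule can be sketched as below. The `step` increment per update is an assumption added for illustration; the patent only fixes the bounds T_min and T_max and the update frequency s.

```python
# Dynamic threshold min(T_max, T_min + step * (epoch // s)): with s = 2 the
# threshold steps up once every two epochs until it reaches T_max.
# The unit of the per-update increment `step` is an assumption.
def dynamic_threshold(epoch, t_min, t_max, s, step=1.0):
    return min(t_max, t_min + step * (epoch // s))

# s = 2: epochs 0-1 share a value, epochs 2-3 the next, capped at T_max
print([dynamic_threshold(e, t_min=0.5, t_max=2.0, s=2, step=0.5)
       for e in range(8)])  # [0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0]
```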
Optionally, in the above embodiment, a semi-supervised learning method is used to address the lack of effective polyphone samples: more training corpus is generated from a small amount of existing training corpus to serve as samples, improving the training efficiency of machine learning. Specifically, the pronunciation of a polyphone is usually determined by the words adjacent to it in the sentence, while words beyond a certain neighbouring range have little influence on the pronunciation. Labelled polyphone samples can therefore be amplified: the polyphone and its adjacent words in a sample are kept unchanged, the other words in the sentence are replaced with similar words, and the replaced sentence is used as a new sample to train the deep learning model.
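The label-preserving amplification described above can be sketched as follows: the polyphone and its neighbouring words stay fixed, the remaining words are swapped for near-synonyms, and the original pronunciation label is reused. The `synonyms` table stands in for a Word2Vec nearest-neighbour lookup; the sentence and replacements are illustrative.

```python
# Amplify a labelled polyphone sample: keep the polyphone and its immediate
# neighbours unchanged, replace other words with similar words.
def augment(words, polyphone_idx, synonyms):
    keep = {polyphone_idx - 1, polyphone_idx, polyphone_idx + 1}
    return [w if i in keep else synonyms.get(w, w)
            for i, w in enumerate(words)]

sentence = ["I", "really", "want", "to", "sleep"]   # "sleep" is the polyphone
synonyms = {"really": "truly", "I": "we"}           # stand-in for Word2Vec lookup
print(augment(sentence, polyphone_idx=4, synonyms=synonyms))
# ['we', 'truly', 'want', 'to', 'sleep']
```

Because the new sentence preserves the local context of the polyphone, it can inherit the original pronunciation label without manual annotation.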
Through this embodiment, the problem of lacking polyphone training samples can be alleviated: the model can learn from limited labelled data and extend to more unlabelled data, the accuracy of the deep learning model in polyphone pronunciation prediction is improved, and labelled data samples can be obtained without consuming a large amount of manpower for data annotation, thereby reducing costs and improving efficiency.
For a better understanding of the process of the method for determining polyphone pronunciation, the implementation flow of polyphone pronunciation determination is described below with reference to alternative embodiments, although it is not limited to the technical solution of the embodiments of the present application.
In an alternative embodiment, fig. 3 is a schematic structural diagram of a deep learning model according to an embodiment of the present application, specifically shown in fig. 3:
RoBERTa-wwm is used as the character encoder to mask the sentence to be processed in the corpus to be processed. The characters C_1 to C_n obtained after segmenting the sentence to be processed are input into the RoBERTa-wwm character encoder for processing, yielding character embedding vectors h_1 to h_n. For the target polyphone C_t among the segmented characters, its adjacent characters are determined, and the words in the Chinese word vector library that match the adjacent characters are taken as the related words x_t. The character embedding vector h_t corresponding to the target polyphone, the character embedding vectors H_S (h_1 to h_n) of the sentence to be processed, and the related words x_t are then input into the word feature matcher to obtain the feature vector h'_t.
For the obtained feature vector h'_t, as shown in fig. 4: the character embedding vectors H_S of the sentence to be processed are encoded using a one-dimensional convolution Conv1D, the related words x_t are subjected to a nonlinear transformation, and a (sentence-word) attention mechanism is then used to obtain the words most relevant to the sentence. The result is added to and fused with the character embedding vector h_t corresponding to the target polyphone to obtain the feature vector h'_t, which is normalized into a probability distribution by the Softmax layer in the deep learning model to obtain the probabilities of all possible pronunciations of the polyphone.
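The (sentence-word) attention step can be sketched as follows: each transformed related-word feature is scored against the encoded sentence representation through a weight matrix, the scores are softmax-normalized into correlation degrees, and the weighted features are added to the polyphone's character embedding. This is a simplified sketch: the Conv1D encoding of H_S is assumed to have already produced `enc_s`, and the identity weight matrix plus all dimensions are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def attend(h_t, enc_s, V_t, W):
    """Score each v_tj against the sentence encoding, normalize, and fuse with h_t."""
    scores = [dot(enc_s, matvec(W, v)) for v in V_t]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]        # correlation degrees
    fused = list(h_t)
    for a, v in zip(alphas, V_t):
        for i, x in enumerate(v):
            fused[i] += a * x                     # weighted fusion with h_t
    return fused, alphas

h_t = [0.1, 0.2]                  # character embedding of the target polyphone
enc_s = [1.0, 0.0]                # encoded sentence vector (assumed Conv1D output)
V_t = [[1.0, 0.0], [0.0, 1.0]]    # transformed related-word features
W = [[1.0, 0.0], [0.0, 1.0]]      # identity weight matrix, for the sketch only
fused, alphas = attend(h_t, enc_s, V_t, W)
print(round(sum(alphas), 6))      # 1.0
```

The first related word aligns with the sentence encoding, so it receives the larger correlation degree and dominates the fusion.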
In an alternative embodiment, the process of semi-supervised learning may be described in conjunction with FIG. 5, as shown in FIG. 5:
in semi-supervised learning, the model can learn from the existing training corpus and expand it into more training samples. Taking a training corpus containing polyphones as an example in the embodiment of the application, the method for randomly amplifying the training corpus is as follows: segment the sentence into words, retain only the word containing the polyphone and the words adjacent to it, and replace the other words with similar words via Word2Vec to generate a new sentence. For example, in fig. 5, the polyphone in one training corpus is "sleep", the retained words containing the polyphone and its adjacent words are "to sleep", and the amplified sentence is "I want to sleep"; the polyphone in another training corpus is "analysis", the retained words containing the polyphone are "to analyze", and the amplified sentence is "to analyze the question". The embedding vector of the amplified sentence is obtained through the character encoder and then input into the deep learning model (the polyphone disambiguation model in fig. 5) to generate the pronunciation of the polyphone; the accuracy of the generated result is calculated through a loss function, and the amplified training corpus is added to the training samples if the accuracy meets the standard.
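The acceptance step of this loop can be sketched as follows: an amplified sentence is only added to the training set if the disambiguation model's loss on it, measured against the inherited pronunciation label, is low enough. The cross-entropy form and the `max_loss` bar are illustrative assumptions; the patent only says the accuracy is checked through a loss function.

```python
import math

def accept_augmented(probs, label_idx, max_loss=0.5):
    """Keep an amplified sample only if the model's loss on the inherited label is small."""
    loss = -math.log(max(probs[label_idx], 1e-12))  # cross-entropy on the label
    return loss <= max_loss

print(accept_augmented([0.9, 0.05, 0.05], label_idx=0))  # True
print(accept_augmented([0.2, 0.4, 0.4], label_idx=0))    # False
```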
Through the embodiment, the character features of the polyphones and the related word features can be fused to improve the accuracy of determining the pronunciation of the polyphones, and the problem of lack of training samples in the process of training a deep learning model for determining the pronunciation of the polyphones is solved by adopting a semi-supervised learning method.
From the description of the above embodiments, it will be clear to those skilled in the art that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present application.
Fig. 6 is a structural block diagram of a polyphone pronunciation determining apparatus according to an embodiment of the present application; as shown in fig. 6, the apparatus includes:
The parsing module 62 is configured to parse a corpus to be processed to determine a first polyphone in the corpus to be processed, where the first polyphone corresponds to a character feature;
the processing module 64 is configured to perform mask processing on a to-be-processed sentence in the to-be-processed corpus by using a character encoding model, so as to obtain a sentence embedded vector of the to-be-processed sentence;
a generating module 66, configured to generate a feature vector according to the sentence embedding vector, the character feature and the related word feature corresponding to the character feature;
the determining module 68 is configured to input the feature vector to a trained deep learning model, and determine, according to an output result of the deep learning model, a target pronunciation corresponding to the first polyphone, where the deep learning model is obtained by training with a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and with a sample probability of each pronunciation of the second polyphone as an output sample.
Through the device, the first polyphones in the corpus to be processed are determined by analyzing the corpus to be processed, wherein the first polyphones correspond to character features; masking the to-be-processed sentences in the to-be-processed corpus by using a character coding model to obtain sentence embedded vectors of the to-be-processed sentences; generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature; and inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample. By adopting the technical scheme, the problem of how to accurately determine the pronunciation of the polyphones is solved, and the effect of accurately determining the pronunciation of the polyphones is further realized.
In one exemplary embodiment, the generation module 66 is further configured to: determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database; obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words; and determining the pre-training word vector as the related word characteristic.
In one exemplary embodiment, the generating module 66 further includes: the acquisition unit is used for acquiring a coding result obtained by coding the sentence embedded vector and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics; a calculating unit, configured to calculate a correlation degree corresponding to the related word feature based on the encoding result and the transformation result; and the generating unit is used for generating a feature vector according to the relevance and the character features.
In an exemplary embodiment, the above-mentioned calculation unit is further configured to: obtaining the product of the coding result, the transformation result and a preset weight matrix; and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
In an exemplary embodiment, the generating unit is further configured to: determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products; and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
In one exemplary embodiment, the determination module 68 is further configured to: determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model; and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
In one exemplary embodiment, the determination module 68 is further configured to: calculating information entropy corresponding to the probabilities of all the pronunciations; under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus; and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
Embodiments of the present application also provide a storage medium including a stored program, wherein the program performs the method of any one of the above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for performing the steps of:
s1, analyzing a corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
s2, carrying out mask processing on the sentences to be processed in the corpus to be processed by using a character coding model to obtain sentence embedded vectors of the sentences to be processed;
s3, generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature;
s4, inputting the feature vector into a trained deep learning model, and determining a target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, analyzing a corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
s2, carrying out mask processing on the sentences to be processed in the corpus to be processed by using a character coding model to obtain sentence embedded vectors of the sentences to be processed;
s3, generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature;
s4, inputting the feature vector into a trained deep learning model, and determining a target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that given here, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing is merely a preferred embodiment of the present application. It should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application, and such modifications are intended to fall within the scope of protection of the present application.

Claims (10)

1. A method for determining pronunciation of a polyphone, comprising:
analyzing the corpus to be processed, and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
masking the to-be-processed sentences in the to-be-processed corpus by using a character coding model to obtain sentence embedded vectors of the to-be-processed sentences;
generating a feature vector according to the sentence embedded vector, the character feature and the related word feature corresponding to the character feature;
and inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
2. The method of claim 1, wherein before generating a feature vector from the sentence embedding vector, the character feature and the related word feature corresponding to the character feature, comprising:
determining adjacent characters of the first polyphones, and determining all words corresponding to the adjacent characters from a preset word vector database;
obtaining a pre-training word vector obtained by vectorizing all the words, wherein the dimension of the pre-training word vector is consistent with the number of the words;
and determining the pre-training word vector as the related word characteristic.
3. The method for determining pronunciation of polyphone as claimed in claim 1, wherein generating feature vectors from the sentence embedding vectors, the character features and the related word features corresponding to the character features, comprises:
obtaining a coding result obtained by coding the sentence embedded vector, and determining a transformation result obtained by carrying out nonlinear transformation on the related word characteristics;
calculating the correlation degree corresponding to the related word characteristics based on the coding result and the transformation result;
And generating a feature vector according to the relevance and the character feature.
4. A method of determining a pronunciation of a polyphone as claimed in claim 3, wherein calculating a degree of correlation corresponding to the relevant word feature based on the encoding result and the transformation result includes:
obtaining the product of the coding result, the transformation result and a preset weight matrix;
and carrying out normalization calculation on the product by using an activation function, and determining the result obtained by the normalization calculation as the correlation degree.
5. A method of determining a pronunciation of a polyphone as claimed in claim 3, wherein generating a feature vector from the relevance and the character feature includes:
determining products of target transformation results and target correlation degrees corresponding to all relevant word features according to all relevant word features to obtain a plurality of products;
and determining the sum of the products, and determining the sum value between the sum of the products and the character feature as the feature vector.
6. The method for determining pronunciation of a polyphone according to claim 1, wherein determining the pronunciation corresponding to the first polyphone according to the output result of the deep learning model comprises:
Determining the probability of all pronunciations of the first polyphones from the output result of the deep learning model;
and determining the maximum value from the probabilities of all the pronunciations, and determining the pronunciation corresponding to the maximum value as the pronunciation corresponding to the first polyphone.
7. The method for determining pronunciation of a polyphone as claimed in claim 6, further comprising: calculating information entropy corresponding to the probabilities of all the pronunciations;
under the condition that the information entropy meets a preset threshold value, determining other samples from the training corpus, wherein the other samples at least comprise adjacent word samples of the first polyphone sample and word samples except the first polyphone sample in the training corpus;
and training the deep learning model by taking the feature vectors of the other samples and the historical feature vectors corresponding to the second polyphones as input samples and taking the sample probability of each pronunciation of the second polyphones as output samples.
8. A polyphone pronunciation determining apparatus, comprising:
the analysis module is used for analyzing the corpus to be processed and determining a first polyphone in the corpus to be processed, wherein the first polyphone corresponds to character features;
The processing module is used for carrying out mask processing on the sentences to be processed in the corpus to be processed by using the character coding model to obtain sentence embedded vectors of the sentences to be processed;
the generation module is used for generating a feature vector according to the sentence embedding vector, the character features and the related word features corresponding to the character features;
the determining module is used for inputting the feature vector into a trained deep learning model, and determining target pronunciation corresponding to the first polyphone according to an output result of the deep learning model, wherein the deep learning model is obtained by training by taking a historical feature vector corresponding to a second polyphone in a training corpus as an input sample and taking a sample probability of each pronunciation of the second polyphone as an output sample.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202311737873.6A 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment Pending CN117672181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311737873.6A CN117672181A (en) 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311737873.6A CN117672181A (en) 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117672181A true CN117672181A (en) 2024-03-08

Family

ID=90078764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311737873.6A Pending CN117672181A (en) 2023-12-15 2023-12-15 Method and device for determining pronunciation of polyphone, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117672181A (en)

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN110851596B (en) Text classification method, apparatus and computer readable storage medium
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN110781276A (en) Text extraction method, device, equipment and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
JP2016513269A (en) Method and device for acoustic language model training
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN111241828A (en) Intelligent emotion recognition method and device and computer readable storage medium
CN114026556A (en) Semantic element prediction method, computer device and storage medium background
CN111859940B (en) Keyword extraction method and device, electronic equipment and storage medium
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112667782A (en) Text classification method, device, equipment and storage medium
CN111079432A (en) Text detection method and device, electronic equipment and storage medium
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN111695591A (en) AI-based interview corpus classification method, device, computer equipment and medium
CN114255096A (en) Data requirement matching method and device, electronic equipment and storage medium
CN116070632A (en) Informal text entity tag identification method and device
CN111091004A (en) Training method and training device for sentence entity labeling model and electronic equipment
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
CN112084769A (en) Dependency syntax model optimization method, device, equipment and readable storage medium
CN114492661A (en) Text data classification method and device, computer equipment and storage medium
US11893344B2 (en) Morpheme analysis learning device, morpheme analysis device, method, and program
CN111159405B (en) Irony detection method based on background knowledge
CN116483314A (en) Automatic intelligent activity diagram generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination