CN112528648A

CN112528648A - Method, device, equipment and storage medium for predicting polyphone pronunciation

Info

Publication number: CN112528648A
Application number: CN202011432585.6A
Authority: CN
Inventors: 李俊杰; 张志宇; 马骏; 王少军
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-03-19
Also published as: WO2022121166A1; JP7441864B2; JP2023509257A

Abstract

The invention relates to the technical field of artificial intelligence, and provides a method, a device, equipment and a storage medium for predicting polyphone pronunciations, which are used for improving the accuracy of predicting polyphone pronunciations. The method for predicting polyphone pronunciations comprises the following steps: acquiring a marked Chinese sentence to be processed, and acquiring a word representation vector set and a polyphone representation vector of the Chinese sentence to be processed, wherein the Chinese sentence to be processed comprises a target polyphone; performing word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and converting the word representation vector set into word-level feature representation vectors according to the target word segmentation; splicing the polyphone representation vector and the word-level feature representation vectors based on an attention mechanism to obtain a target vector; and calculating the target pinyin probability of the target vector through a preset linear layer, and determining the target pronunciation of the target polyphone according to the target pinyin probability. In addition, the invention also relates to blockchain technology, and the marked Chinese sentence to be processed can be stored in a blockchain.

Description

Method, device, equipment and storage medium for predicting polyphone pronunciation

Technical Field

The invention relates to the field of intelligent decision making of artificial intelligence, in particular to a method, a device, equipment and a storage medium for predicting polyphone pronunciations.

Background

Text-to-phoneme conversion is an important component of text-to-speech (TTS) systems. However, unlike in other languages, a Chinese character commonly has different pronunciations in different contexts, and many Chinese characters have more than three pronunciations. Therefore, the quality of the polyphone pronunciation labeling system greatly affects the quality of a Chinese speech synthesis system: if a pronunciation is wrongly labeled, the synthesized speech is audibly wrong. Currently, prediction methods for polyphone pronunciation usually use labeled data and a randomly initialized set of vectors to predict the polyphone pronunciation.

However, randomly initializing a set of vectors causes the following problem: when a word that was not labeled during model training is encountered while predicting polyphone pronunciations, the word cannot be identified. This is the unknown-word (out-of-vocabulary) problem, and it results in low accuracy of polyphone pronunciation prediction.

Disclosure of Invention

The invention provides a method, a device, equipment and a storage medium for predicting polyphone pronunciations, which are used for improving the accuracy of predicting the polyphone pronunciations.

The invention provides a method for predicting polyphone pronunciation, which comprises the following steps:

acquiring a marked Chinese sentence to be processed, and acquiring a word expression vector set and a polyphone expression vector of the Chinese sentence to be processed, wherein the Chinese sentence to be processed comprises a target polyphone;

performing word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and converting the word representation vector set into word level feature representation vectors according to the target word segmentation;

splicing the polyphone representation vectors and the word level feature representation vectors based on an attention mechanism to obtain target vectors;

and calculating the target pinyin probability of the target vector through a preset linear layer, and determining the target pronunciation of the target polyphone according to the target pinyin probability.

Optionally, in a first implementation manner of the first aspect of the present invention, the performing word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and converting the word representation vector set into word-level feature representation vectors according to the target word segmentation includes:

performing word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation;

dividing the word representation vector set according to the target word segmentation to obtain a representation vector group of each word;

and performing mixed pooling on the expression vector group of each word through a preset mixed pooling layer to obtain word-level feature expression vectors.

Optionally, in a second implementation manner of the first aspect of the present invention, the performing splicing processing based on an attention mechanism on the polyphonic expression vector and the word-level feature expression vector to obtain a target vector includes:

performing attention calculation on the polyphone representation vectors and the word level feature representation vectors through a preset feedforward attention mechanism to obtain attention vectors;

and splicing the attention vector and the polyphone expression vector to obtain a target vector.
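The two steps of this implementation manner can be sketched as follows. This is only an illustrative sketch, not the patented implementation: it assumes that "feedforward attention" means additive scoring of the form v · tanh(W_q q + W_k k_i), with the polyphone representation vector as the query and the word-level feature representation vectors as the keys. The function names, the parameters `w_q`, `w_k`, `v`, and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def feed_forward_attention(query, keys, w_q, w_k, v):
    """Score each key with v . tanh(W_q q + W_k k_i), then average the keys."""
    q_proj = matvec(w_q, query)
    scores = []
    for key in keys:
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, matvec(w_k, key))]
        scores.append(sum(a * b for a, b in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    attention_vec = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return attention_vec, weights


def build_target_vector(polyphone_vec, word_level_vecs, w_q, w_k, v):
    """Splice (concatenate) the attention vector with the polyphone vector."""
    attention_vec, _ = feed_forward_attention(polyphone_vec, word_level_vecs, w_q, w_k, v)
    return attention_vec + polyphone_vec


# Hypothetical 2-dimensional inputs and untrained (identity) parameters.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
polyphone_vec = [0.5, -0.5]
word_level_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_vec = build_target_vector(polyphone_vec, word_level_vecs, IDENTITY, IDENTITY, [1.0, 1.0])
```

With these inputs the target vector has four components: the first two are the attention-weighted average of the word-level vectors and the last two are the polyphone vector itself. In a trained model the parameters would be learned rather than fixed.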

Optionally, in a third implementation manner of the first aspect of the present invention, the calculating, by a preset linear layer, a target pinyin probability of the target vector, and determining a target pronunciation of the target polyphonic character according to the target pinyin probability includes:

calculating the probability of the target vector based on each pinyin through a preset linear layer to obtain a polyphone pinyin probability value set;

sequencing the polyphone pinyin probability values in the polyphone pinyin probability value set according to the sequence from large value to small value, and determining the polyphone pinyin probability value with the first sequencing as the target pinyin probability;

and determining the pinyin corresponding to the target pinyin probability as the target pronunciation of the target polyphone.
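A minimal sketch of these three steps is given here. It assumes that the linear layer's outputs are turned into probabilities with a softmax, which the text does not state explicitly; the labels `pinyin_1` to `pinyin_3`, the weight matrix and the bias are hypothetical placeholders rather than trained values.

```python
import math


def linear(vector, weight, bias):
    """Apply one linear layer: one output score per candidate pinyin."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


def predict_pronunciation(target_vec, weight, bias, pinyin_labels):
    """Return the top pinyin, its probability, and the full ranking."""
    logits = linear(target_vec, weight, bias)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort from the largest probability to the smallest; the first entry wins.
    ranked = sorted(zip(pinyin_labels, probs), key=lambda pair: pair[1], reverse=True)
    target_pinyin, target_prob = ranked[0]
    return target_pinyin, target_prob, ranked


# Hypothetical candidate pinyins and untrained parameters.
LABELS = ["pinyin_1", "pinyin_2", "pinyin_3"]
WEIGHT = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
BIAS = [0.0, 0.0, 0.0]
pinyin, prob, ranked = predict_pronunciation([1.0, 0.0], WEIGHT, BIAS, LABELS)
```

Here the linear scores are 2, 0 and 1, so `pinyin_1` is ranked first and is returned as the target pronunciation.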

Optionally, in a fourth implementation manner of the first aspect of the present invention, the acquiring a marked Chinese sentence to be processed, and acquiring a word representation vector set and a polyphone representation vector of the Chinese sentence to be processed, wherein the Chinese sentence to be processed comprises a target polyphone, includes:

acquiring an initial Chinese sentence, a target polyphone in the initial Chinese sentence and polyphone position information corresponding to the target polyphone;

marking a target polyphone in the initial Chinese sentence according to the polyphone position information to obtain a Chinese sentence to be processed;

and sequentially carrying out word vector coding and polyphone word vector extraction on the Chinese sentence to be processed to obtain a word expression vector set and a polyphone expression vector.

Optionally, in a fifth implementation manner of the first aspect of the present invention, the sequentially performing word vector coding and polyphone vector extraction on the Chinese sentence to be processed to obtain a word representation vector set and a polyphone representation vector includes:

coding each word in the Chinese sentence to be processed through a preset deep neural network coder to obtain a word representation vector set, wherein each word representation vector corresponds to one word;

and according to the polyphone position information, extracting the expression vector corresponding to the target polyphone from the word expression vector set to obtain the polyphone expression vector.

Optionally, in a sixth implementation manner of the first aspect of the present invention, after the calculating, by a preset linear layer, a target pinyin probability of the target vector and determining a target pronunciation of the target polyphonic character according to the target pinyin probability, the method further includes:

and acquiring an error value of the target pronunciation based on the labeled pronunciation, and optimizing an acquisition strategy of the target pronunciation according to the error value, wherein the acquisition strategy comprises an execution process, an algorithm and a network structure for acquiring the target pronunciation.
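The text does not say how the error value is computed. One common choice, shown here purely as an assumption, is the cross-entropy between the predicted pinyin probabilities and the labeled pronunciation, whose gradient can then drive the optimization. The function names and the probability values are hypothetical.

```python
import math


def cross_entropy_error(probs, labeled_index, eps=1e-12):
    """Error value: negative log-probability assigned to the labeled pinyin."""
    return -math.log(max(probs[labeled_index], eps))


def logit_gradient(probs, labeled_index):
    """Gradient of the error w.r.t. pre-softmax scores (valid for softmax outputs)."""
    return [p - (1.0 if i == labeled_index else 0.0) for i, p in enumerate(probs)]


# Hypothetical predicted probabilities; the labeled pronunciation is index 0.
probs = [0.7, 0.2, 0.1]
error = cross_entropy_error(probs, labeled_index=0)
grad = logit_gradient(probs, labeled_index=0)
```

The error shrinks toward zero as the probability of the labeled pronunciation approaches 1, and the gradient indicates how the scores should move to reduce it.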

A second aspect of the present invention provides a polyphonic pronunciation prediction apparatus, comprising:

an acquisition module, used for acquiring a marked Chinese sentence to be processed and acquiring a word representation vector set and a polyphone representation vector of the Chinese sentence to be processed, wherein the Chinese sentence to be processed comprises a target polyphone;

the conversion module is used for carrying out word segmentation processing on the Chinese sentence to be processed to obtain target word segmentation and converting the word representation vector set into word level feature representation vectors according to the target word segmentation;

the splicing module is used for splicing the polyphone representation vector and the word level feature representation vector based on an attention mechanism to obtain a target vector;

and the determining module is used for calculating the target pinyin probability of the target vector through a preset linear layer and determining the target pronunciation of the target polyphone according to the target pinyin probability.

Optionally, in a first implementation manner of the second aspect of the present invention, the conversion module is specifically configured to:

Optionally, in a second implementation manner of the second aspect of the present invention, the splicing module includes:

the calculation unit is used for performing attention calculation on the polyphone expression vectors and the word level characteristic expression vectors through a preset feedforward attention mechanism to obtain attention vectors;

and the splicing unit is used for splicing the attention vector and the polyphone expression vector to obtain a target vector.

Optionally, in a third implementation manner of the second aspect of the present invention, the determining module is specifically configured to:

Optionally, in a fourth implementation manner of the second aspect of the present invention, the obtaining module includes:

an acquisition unit, used for acquiring an initial Chinese sentence, a target polyphone in the initial Chinese sentence and polyphone position information corresponding to the target polyphone;

the marking unit is used for marking the target polyphone in the initial Chinese sentence according to the polyphone position information to obtain a Chinese sentence to be processed;

and the coding extraction unit is used for sequentially carrying out word vector coding and polyphone word vector extraction on the Chinese sentence to be processed to obtain a word expression vector set and a polyphone expression vector.

Optionally, in a fifth implementation manner of the second aspect of the present invention, the encoding extraction unit is specifically configured to:

Optionally, in a sixth implementation manner of the second aspect of the present invention, the apparatus for predicting polyphonic pronunciation further includes:

and the optimization module is used for acquiring an error value of the target pronunciation based on the labeled pronunciation and optimizing an acquisition strategy of the target pronunciation according to the error value, wherein the acquisition strategy comprises an execution process, an algorithm and a network structure for acquiring the target pronunciation.

A third aspect of the present invention provides a polyphonic pronunciation prediction apparatus comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the apparatus for predicting a polyphonic pronunciation to perform the method for predicting a polyphonic pronunciation described above.

A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-described method of predicting polyphonic pronunciations.

In the technical scheme provided by the invention, a marked Chinese sentence to be processed is acquired, and a word representation vector set and a polyphone representation vector of the Chinese sentence to be processed are acquired, wherein the Chinese sentence to be processed comprises a target polyphone; word segmentation processing is performed on the Chinese sentence to be processed to obtain a target word segmentation, and the word representation vector set is converted into word-level feature representation vectors according to the target word segmentation; the polyphone representation vector and the word-level feature representation vectors are spliced based on an attention mechanism to obtain a target vector; and the target pinyin probability of the target vector is calculated through a preset linear layer, and the target pronunciation of the target polyphone is determined according to the target pinyin probability. In the embodiment of the invention, converting the word representation vector set into word-level feature representation vectors according to the target word segmentation turns character features into word-level features and avoids the unknown-word problem, thereby effectively improving the accuracy of polyphone pronunciation prediction. Moreover, by splicing the polyphone representation vector and the word-level feature representation vectors based on an attention mechanism, calculating the target pinyin probability of the target vector through a preset linear layer, and determining the target pronunciation of the target polyphone according to the target pinyin probability, the target word segmentation and the attention mechanism are combined, so that the pronunciation of the target polyphone is predicted without any rules or manual feature design, the influence of labeling errors in word segmentation is reduced, the text semantic information of the Chinese sentence to be processed can be accurately captured, and the accuracy of polyphone pronunciation prediction is improved.

Drawings

FIG. 1 is a diagram of a method for predicting polyphonic pronunciations according to an embodiment of the present invention;

FIG. 2 is a diagram of another embodiment of a method for predicting polyphonic pronunciations according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of a device for predicting polyphonic pronunciations according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of another embodiment of the apparatus for predicting polyphonic pronunciation in accordance with the present invention;

FIG. 5 is a schematic diagram of an embodiment of a prediction apparatus for polyphonic pronunciation according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a method, a device, equipment and a storage medium for predicting polyphone pronunciations, which improve the accuracy of predicting the polyphone pronunciations.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For understanding, a detailed flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a method for predicting polyphonic pronunciations according to an embodiment of the present invention includes:

101. Acquire a marked Chinese sentence to be processed, and acquire a word representation vector set and a polyphone representation vector of the Chinese sentence to be processed, wherein the Chinese sentence to be processed comprises a target polyphone.

It is understood that the executing subject of the present invention may be a prediction device of polyphone pronunciation, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.

The server receives an initial Chinese sentence sent through a preset interface and performs data cleaning on it to obtain a candidate Chinese sentence. The server then acquires pre-created polyphone labels. A polyphone label may be an annotation created for polyphones based on at least one of a general dictionary, a business-domain dictionary and user portrait tags; labels drawn from multiple domains improve the generality and accuracy of polyphone labeling, and the interests and hobbies recorded in user portrait tags further improve the labeling accuracy. Each polyphone label comprises a polyphone and the pronunciation of that polyphone based on semantic information. The server identifies the business domain and user information of the candidate Chinese sentence, calls the corresponding polyphone labels based on the business domain and the user information, identifies the target polyphone in the candidate Chinese sentence through the polyphone labels, and labels the target polyphone to obtain the marked Chinese sentence to be processed.

After obtaining the marked Chinese sentence to be processed, the server calls a pre-trained word vector and a preset word vector conversion algorithm, carries out vector conversion on the word of the Chinese sentence to be processed to obtain a word expression vector set, and extracts an expression vector corresponding to the target polyphone in the word expression vector set according to the marked target polyphone so as to obtain the polyphone expression vector; or the server extracts the labeled target polyphone in the Chinese sentence to be processed, calls a pre-trained word vector and a preset word vector conversion algorithm, and respectively carries out vector conversion on the word of the Chinese sentence to be processed and the target polyphone to obtain a word expression vector set and a polyphone expression vector. Wherein, the number of the target polyphones comprises one or more than one.

102. Perform word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and convert the word representation vector set into word-level feature representation vectors according to the target word segmentation.

The server calls a preset jieba word segmentation tool, a HanLP word segmentation tool or another word segmentation tool to perform word segmentation, in the original sentence order, on the Chinese sentence to be processed to obtain initial word segmentations; or the server calls a preset dictionary-based or statistics-based Chinese word segmentation algorithm to perform word segmentation, in the original sentence order, on the Chinese sentence to be processed to obtain initial word segmentations. The server then splices the initial word segmentations according to a preset word splicing rule to obtain target word segmentations, wherein there are one or more initial word segmentations and one or more target word segmentations. The server classifies the word representation vectors in the word representation vector set according to the target word segmentations to obtain a word representation vector group corresponding to each target word segmentation, and splices the word representation vector group corresponding to each target word segmentation to obtain word-level feature representation vectors. There are one or more word-level feature representation vectors, and each target word segmentation corresponds to one word-level feature representation vector.

103. Splice the polyphone representation vector and the word-level feature representation vectors based on an attention mechanism to obtain a target vector.

The server can calculate the polyphone attention value of the polyphone expression vector through a preset attention mechanism, multiply the polyphone attention value and the polyphone expression vector to obtain a polyphone vector matrix, calculate the word attention value of the word level characteristic expression vector based on the polyphone expression vector to obtain a word vector matrix, and perform matrix addition or matrix multiplication on the polyphone vector matrix and the word vector matrix to obtain a target vector; or the server can also calculate a first attention value of the polyphone representation vector relative to the word level feature representation vector through a preset attention mechanism, calculate a second attention value of the word level feature representation vector relative to the polyphone representation vector, multiply the first attention value and the word level feature representation vector to obtain a first vector, multiply the second attention value and the polyphone representation vector to obtain a second vector, and perform matrix addition or matrix multiplication on the first vector and the second vector to obtain a target vector.

104. Calculate the target pinyin probability of the target vector through a preset linear layer, and determine the target pronunciation of the target polyphone according to the target pinyin probability.

The preset linear layer may comprise multiple layers, each corresponding to one classifier; that is, the linear layer comprises a plurality of classifiers. The server performs pinyin classification and probability calculation on the target vector through each of the classifiers to obtain, for each classifier, a probability for each candidate pinyin. The probabilities that the classifiers assign to the same pinyin are weighted and summed to obtain the initial pinyin probability of the target vector for that pinyin, so that there are one or more initial pinyin probabilities. The initial pinyin probabilities are compared with a preset threshold and with one another to obtain the target pinyin probability, and the pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphone.

For example, suppose the classifiers are classifier 1, classifier 2 and classifier 3. Classifier 1 performs pinyin classification and probability calculation on the target vector to obtain a probability A1 for pinyin 1 and a probability A2 for pinyin 2; classifier 2 does the same to obtain a probability B1 for pinyin 1 and a probability B2 for pinyin 2; and classifier 3 does the same to obtain a probability C1 for pinyin 1 and a probability C2 for pinyin 2. A1, B1 and C1 are weighted and summed to obtain initial pinyin probability 1 of the target vector for pinyin 1, and A2, B2 and C2 are weighted and summed to obtain initial pinyin probability 2 of the target vector for pinyin 2. If only one of initial pinyin probability 1 and initial pinyin probability 2 is greater than the preset threshold, that initial pinyin probability is determined as the target pinyin probability. If initial pinyin probability 1 and initial pinyin probability 2 are both greater than the preset threshold, the greater of the two is determined as the target pinyin probability. If initial pinyin probability 1 and initial pinyin probability 2 are both less than or equal to the preset threshold, the initial pinyin probabilities are recalculated. After obtaining the target pinyin probability, the server determines the pinyin corresponding to the target pinyin probability as the target pronunciation of the target polyphone.
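The weighted-sum and threshold rule of this example can be sketched as follows. The classifier outputs, the classifier weights and the threshold are hypothetical numbers chosen only for illustration; returning `None` stands in for "recalculate the initial pinyin probabilities", since the text does not describe how the recalculation is done.

```python
def combine_classifiers(per_classifier_probs, classifier_weights):
    """Weighted sum, per pinyin, of the probabilities from every classifier."""
    pinyins = per_classifier_probs[0].keys()
    return {
        p: sum(w * probs[p] for w, probs in zip(classifier_weights, per_classifier_probs))
        for p in pinyins
    }


def select_target(initial_probs, threshold):
    """Pick the largest initial probability above the threshold, else None."""
    above = {p: value for p, value in initial_probs.items() if value > threshold}
    if not above:
        return None  # caller is expected to recalculate
    best = max(above, key=above.get)
    return best, above[best]


# Hypothetical outputs of classifier 1, 2 and 3 (A1/A2, B1/B2, C1/C2).
outputs = [
    {"pinyin 1": 0.6, "pinyin 2": 0.4},
    {"pinyin 1": 0.7, "pinyin 2": 0.3},
    {"pinyin 1": 0.5, "pinyin 2": 0.5},
]
initial = combine_classifiers(outputs, classifier_weights=[0.5, 0.3, 0.2])
result = select_target(initial, threshold=0.5)
```

With these numbers, initial pinyin probability 1 is about 0.61 and initial pinyin probability 2 is about 0.39, so only pinyin 1 exceeds the threshold of 0.5 and it is selected.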

In another embodiment, the server matches initial historical polyphone information stored in a preset database according to a Chinese sentence to be processed and a target polyphone to obtain corresponding target historical polyphone information, wherein the target historical polyphone information comprises a target historical Chinese sentence, historical polyphones in the target historical Chinese sentence and pronunciations of the historical polyphones; calculating the similarity between the target pronunciation of the target polyphone and the pronunciation of the historical polyphone; and calculating the difference between the similarity and 1 to obtain a target value, judging whether the target value is smaller than a preset similarity value, if so, determining the target pronunciation of the target polyphone as the final target pronunciation, and if not, determining the pronunciation of the historical polyphone as the target pronunciation of the target polyphone.

In the embodiment of the invention, converting the word representation vector set into word-level feature representation vectors according to the target word segmentation turns character features into word-level features and avoids the unknown-word problem, thereby effectively improving the accuracy of polyphone pronunciation prediction. Moreover, by splicing the polyphone representation vector and the word-level feature representation vectors based on an attention mechanism, calculating the target pinyin probability of the target vector through a preset linear layer, and determining the target pronunciation of the target polyphone according to the target pinyin probability, the target word segmentation and the attention mechanism are combined, so that the pronunciation of the target polyphone is predicted without any rules or manual feature design, the influence of labeling errors in word segmentation is reduced, the text semantic information of the Chinese sentence to be processed can be accurately captured, and the accuracy of polyphone pronunciation prediction is improved.

Referring to fig. 2, another embodiment of the method for predicting polyphonic pronunciation according to the embodiment of the present invention includes:

201. Acquire a marked Chinese sentence to be processed, and acquire a word representation vector set and a polyphone representation vector of the Chinese sentence to be processed, wherein the Chinese sentence to be processed comprises a target polyphone.

Specifically, the server acquires an initial Chinese sentence, a target polyphone in the initial Chinese sentence and polyphone position information corresponding to the target polyphone; marking a target polyphone in the initial Chinese sentence according to the polyphone position information to obtain a Chinese sentence to be processed; and sequentially carrying out word vector coding and polyphone word vector extraction on the Chinese sentence to be processed to obtain a word expression vector set and a polyphone expression vector.

The server receives an initial Chinese sentence sent through a preset interface, calls a pre-created polyphone dictionary, and performs polyphone recognition on the initial Chinese sentence to obtain a target polyphone. The server then extracts the position information of the target polyphone in the initial Chinese sentence (namely the polyphone position information) and labels the target polyphone corresponding to the polyphone position information in the initial Chinese sentence, so as to obtain the Chinese sentence to be processed. The labeled content comprises the polyphone position information of the target polyphone, and may also comprise the pronunciation of the target polyphone taken from a Chinese sentence corresponding to the initial Chinese sentence. The Chinese sentence corresponding to the initial Chinese sentence can be matched by calculating a weighted sum of the semantic similarity, the emotion similarity and the sentence expression similarity.
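The dictionary lookup and marking described here can be illustrated with a short sketch. Everything concrete in it is an assumption: the one-entry dictionary, the bracket markers, and the sample sentence "所有商品都打折出售", which is a hypothetical sentence and not taken from the source text.

```python
# Hypothetical one-entry polyphone dictionary: character -> candidate pinyins.
POLYPHONE_DICT = {"折": ["zhe2", "she2", "zhe1"]}


def find_polyphones(sentence, dictionary):
    """Return (1-based position, character) for every polyphone in the dictionary."""
    return [(i + 1, ch) for i, ch in enumerate(sentence) if ch in dictionary]


def mark_polyphone(sentence, position, left="[", right="]"):
    """Wrap the character at the given 1-based position in marker symbols."""
    i = position - 1
    return sentence[:i] + left + sentence[i] + right + sentence[i + 1:]


sentence = "所有商品都打折出售"  # hypothetical initial Chinese sentence
positions = find_polyphones(sentence, POLYPHONE_DICT)
marked = mark_polyphone(sentence, positions[0][0])
```

Here the lookup reports the polyphone at position 7, and the marked sentence carries brackets around that character. A real system would use a full polyphone dictionary and whatever marking scheme its encoder expects.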

The server calls a preset supervised neural network encoder and/or an unsupervised pre-training network encoder, performs word vector encoding on the Chinese sentence to be processed to obtain a word expression vector set, and extracts polyphone expression vectors corresponding to the target polyphone from the word expression vector set.

Specifically, the server encodes each word in the Chinese sentence to be processed through a preset deep neural network encoder to obtain a word expression vector set, wherein one word expression vector corresponds to one word; and according to the polyphone position information, extracting the expression vector corresponding to the target polyphone from the word expression vector set to obtain the polyphone expression vector.

The server calls a deep neural network encoder from among the preset supervised neural network encoders. The deep neural network encoder may include, but is not limited to, at least one of a long short-term memory (LSTM) network model and a Bidirectional Encoder Representations from Transformers (BERT) model. Through the deep neural network encoder, the server encodes each word in the Chinese sentence to be processed, in sequence order and based on contextual semantic information, to obtain a representation vector of each word, i.e., the word representation vector set. The server then extracts from the word representation vector set the representation vector corresponding to the polyphone position information to obtain the polyphone representation vector. For example, if the Chinese sentence to be processed is 'all goods are sold at a discount' and the polyphone position information indicates the seventh character of the Chinese sentence to be processed, the seventh representation vector is extracted from the word representation vector set to obtain the polyphone representation vector corresponding to the target polyphone.
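The extraction step in this example can be sketched as follows. The function `toy_encode` is a hypothetical stand-in for the deep neural network encoder: it produces one fixed vector per character and is not contextual, so it only illustrates the data layout. The sentence "所有商品都打折出售" is an assumed back-translation of the example sentence, chosen because its seventh character is a polyphone.

```python
def toy_encode(sentence, dim=4):
    """Stand-in encoder: one vector per character, derived from its code point.

    A real encoder (for example an LSTM or BERT model) would produce
    context-dependent vectors; this placeholder does not.
    """
    return [[((ord(ch) >> (4 * d)) & 0xF) / 15.0 for d in range(dim)] for ch in sentence]


def extract_polyphone_vector(char_vectors, position):
    """Pick the vector at the given 1-based polyphone position."""
    return char_vectors[position - 1]


sentence = "所有商品都打折出售"  # assumed original of the example sentence
char_vectors = toy_encode(sentence)
polyphone_vec = extract_polyphone_vector(char_vectors, position=7)
```

The vector set contains one vector per character, and the polyphone representation vector is simply the seventh entry of that set.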

202. Perform word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and convert the word representation vector set into word-level feature representation vectors according to the target word segmentation.

Specifically, the server performs word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation; divides the word representation vector set according to the target word segmentation to obtain a representation vector group for each word; and performs mixed pooling on the representation vector group of each word through a preset mixed pooling layer to obtain word-level feature representation vectors.

The server calls a preset Chinese word segmentation algorithm and performs word segmentation processing on the Chinese sentence to be processed to obtain an initial word segmentation. The server then performs part-of-speech detection and phrase detection on the initial word segmentation and determines the initial word segmentation that passes detection as the target word segmentation. The Chinese word segmentation algorithm integrates an N-gram model and a bidirectional maximum matching (BM) model: the output of the N-gram model may be the input of the BM model, the output of the BM model may be the input of the N-gram model, or the N-gram model and the BM model may be connected in parallel.
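As an illustration of the bidirectional maximum matching component only, the following sketch segments a sentence with a small dictionary. The N-gram model, part-of-speech detection and phrase detection are omitted. The vocabulary, the function names, the tie-breaking rule and the Chinese example string (an assumed reconstruction of the example sentence) are all assumptions made for this sketch.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Greedy longest match, scanning left to right."""
    i, out = 0, []
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always match
                out.append(piece)
                i += size
                break
    return out


def backward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Greedy longest match, scanning right to left."""
    j, out = len(text), []
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                out.append(piece)
                j -= size
                break
    return out[::-1]


def bidirectional_max_match(text: str, vocab: Set[str],
                            max_len: int = 4) -> List[str]:
    """Prefer fewer words, then fewer single-character words (assumed rule)."""
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(words: List[str]) -> int:
        return sum(len(w) == 1 for w in words)

    return fwd if singles(fwd) < singles(bwd) else bwd


vocab = {"所有", "商品", "打折", "出售"}  # hypothetical dictionary
segments = bidirectional_max_match("所有商品都打折出售", vocab)
print(segments)
```

With this dictionary both scan directions agree, so the tie-breaking rule is not exercised; it matters only for ambiguous sentences.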

The server divides the word representation vector set according to the target word segmentation to obtain a representation vector group for each word. For example, if the Chinese sentence to be processed is 'all goods are sold at a discount', the corresponding target word segmentation is 'all', 'goods', 'all', 'discount' and 'sale'. Taking the word 'discount' as an example, its representation vector group comprises the representation vectors of the two characters that make up that word; the same applies to the other words.
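The division step can be sketched as slicing the per-character vector list by the length of each segmented word. This is a minimal illustration: the function name is hypothetical, the one-dimensional vectors are toy values, and the segmentation is an assumed reconstruction of the example (word lengths 2, 2, 1, 2, 2).

```python
from typing import List

Vector = List[float]


def group_by_segmentation(vectors: List[Vector],
                          segments: List[str]) -> List[List[Vector]]:
    """Split per-character vectors into one group per segmented word."""
    if sum(len(word) for word in segments) != len(vectors):
        raise ValueError("segmentation does not cover the sentence")
    groups, start = [], 0
    for word in segments:
        groups.append(vectors[start:start + len(word)])
        start += len(word)
    return groups


# Toy one-dimensional vectors: character k is represented as [k].
vectors = [[float(k)] for k in range(1, 10)]
segments = ["所有", "商品", "都", "打折", "出售"]  # assumed segmentation
groups = group_by_segmentation(vectors, segments)
print(groups[3])  # vectors of the sixth and seventh characters
```

The fourth group holds the vectors of the sixth and seventh characters, so it contains the vector of the target polyphone.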

The preset mixed pooling layer is a pooling layer that combines maximum pooling and average pooling. The server calls the preset mixed pooling layer and performs mixed pooling on the representation vector group of each word to obtain the word-level feature representation vectors. For example, the representation vectors of the two characters in the representation vector group of the word 'discount' are fused to obtain the word-level feature representation vector of 'discount'.

The mixed pooling may be carried out in any of the following ways. The server may perform maximum pooling on the representation vector group of each word through a maximum pooling convolution kernel or a maximum pooling layer in the mixed pooling layer to obtain a first word representation vector group, and then perform average pooling on the first word representation vector group through an average pooling convolution kernel or an average pooling layer in the mixed pooling layer to obtain the word-level feature representation vectors. Alternatively, the server performs maximum pooling on the representation vector group of each word through the maximum pooling convolution kernel or maximum pooling layer to obtain a first word representation vector group, performs average pooling on the representation vector group of each word through the average pooling convolution kernel or average pooling layer to obtain a second word representation vector group, and fuses the first word representation vector group and the second word representation vector group to obtain the word-level feature representation vectors. Alternatively, the server creates in advance a mixed pooling layer that fuses the maximum pooling convolution kernel and the average pooling convolution kernel, and performs pooling convolution on the representation vector group of each word to obtain the word-level feature representation vectors. There may be one or more word-level feature representation vectors, and each target word corresponds to one word-level feature representation vector.
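The variant that pools each group twice and then fuses the two results can be sketched as follows. The text does not specify how the two pooled vectors are fused, so the convex combination with weight `alpha` is an assumption, as are the function names.

```python
from typing import List

Vector = List[float]


def max_pool(group: List[Vector]) -> Vector:
    """Element-wise maximum over the character vectors of one word."""
    return [max(column) for column in zip(*group)]


def mean_pool(group: List[Vector]) -> Vector:
    """Element-wise average over the character vectors of one word."""
    return [sum(column) / len(column) for column in zip(*group)]


def mixed_pool(group: List[Vector], alpha: float = 0.5) -> Vector:
    """Fuse max and mean pooling; the weight alpha is a hypothetical choice."""
    pooled_max, pooled_mean = max_pool(group), mean_pool(group)
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(pooled_max, pooled_mean)]


# Two character vectors belonging to one two-character word (toy values).
group = [[1.0, 4.0], [3.0, 2.0]]
word_vector = mixed_pool(group)
print(word_vector)  # max is [3.0, 4.0], mean is [2.0, 3.0]
```

Whatever the group size, the result is a single vector per word, which is what lets words of different lengths share one attention step.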

203. Performing attention calculation on the polyphone expression vector and the word-level feature expression vector through a preset feed-forward attention mechanism to obtain an attention vector.

The server calculates an attention value between the polyphone expression vector and the word-level feature expression vector through a preset feed-forward attention mechanism, and performs weighted summation on the polyphone expression vector and the word-level feature expression vector using the attention value to obtain an attention vector. Alternatively, the server calculates the attention value of the polyphone expression vector relative to the word-level feature expression vector through the preset feed-forward attention mechanism, multiplies the attention value by the polyphone expression vector to obtain a polyphone expression vector matrix, and performs matrix addition or matrix multiplication on the polyphone expression vector matrix and the word-level feature expression vector to obtain the attention vector.
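One possible reading of the first variant is sketched below. The exact scoring function is not given in the text, so the form used here is an assumption: each word-level vector is concatenated with the polyphone vector, scored by a single weight vector `w` followed by tanh, normalized with softmax, and the word-level vectors are then summed with those weights. The parameter values are toy numbers rather than trained parameters.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feed_forward_attention(query: Vector, word_vectors: List[Vector],
                           w: Vector) -> Tuple[List[float], Vector]:
    """Score each word against the polyphone vector, then take a weighted sum."""
    scores = [math.tanh(sum(wi * xi for wi, xi in zip(w, query + h)))
              for h in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    context = [sum(a * h[k] for a, h in zip(weights, word_vectors))
               for k in range(dim)]
    return weights, context


query = [1.0, 0.0]                       # polyphone vector (toy)
word_vectors = [[1.0, 0.0], [0.0, 1.0]]  # word-level vectors (toy)
w = [0.5, 0.0, 1.0, -1.0]                # hypothetical scoring weights
weights, attention_vector = feed_forward_attention(query, word_vectors, w)
print(weights, attention_vector)
```

The attention weights sum to one, so the attention vector is a convex combination of the word-level vectors.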

204. Splicing the attention vector and the polyphone expression vector to obtain a target vector.

After obtaining the attention vector, the server performs matrix multiplication or matrix addition on the attention vector and the polyphone expression vector to obtain a target vector; or the server performs weighted summation on the attention vector and the polyphone expression vector to obtain a target vector. Because the target vector is obtained through the preset feed-forward attention mechanism, it reflects which words in the Chinese sentence to be processed are more important for the target polyphone, and those words receive larger weights, which improves the accuracy of fusing the contextual semantics of the target polyphone.
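Two of the combinations named in this step can be sketched in a few lines: splicing in the sense of concatenation, and the weighted summation alternative. The mixing weight `beta` and the function names are assumptions, and the input vectors are toy values.

```python
from typing import List

Vector = List[float]


def concatenate(attention_vector: Vector, polyphone_vector: Vector) -> Vector:
    """Splice the two vectors end to end."""
    return list(attention_vector) + list(polyphone_vector)


def weighted_sum(attention_vector: Vector, polyphone_vector: Vector,
                 beta: float = 0.5) -> Vector:
    """Alternative: blend two equal-length vectors; beta is hypothetical."""
    if len(attention_vector) != len(polyphone_vector):
        raise ValueError("vectors must have the same length")
    return [beta * a + (1.0 - beta) * p
            for a, p in zip(attention_vector, polyphone_vector)]


attention_vector = [0.5, 0.5]
polyphone_vector = [1.0, 0.0]
print(concatenate(attention_vector, polyphone_vector))
print(weighted_sum(attention_vector, polyphone_vector))
```

Concatenation doubles the dimension seen by the following linear layer, whereas the weighted sum keeps it unchanged.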

205. Calculating the target pinyin probability of the target vector through a preset linear layer, and determining the target pronunciation of the target polyphone according to the target pinyin probability.

Specifically, the server calculates, through a preset linear layer, the probability of the target vector for each pinyin to obtain a polyphone pinyin probability value set; sorts the polyphone pinyin probability values in the set in descending order and determines the first-ranked polyphone pinyin probability value as the target pinyin probability; and determines the pinyin corresponding to the target pinyin probability as the target pronunciation of the target polyphone.

For example, suppose there is one linear layer. The server inputs the target vector into the preset linear layer, which calculates the probability of the target vector for each pinyin to obtain the polyphone pinyin probability value set, consisting of polyphone pinyin probability value 1 and polyphone pinyin probability value 2. Sorting these values in descending order gives the sequence polyphone pinyin probability value 2, polyphone pinyin probability value 1. Polyphone pinyin probability value 2 is ranked first, so it is determined as the target pinyin probability, and the pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphone.
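The example above can be sketched with a single linear layer followed by a descending sort. The text does not say how the linear outputs are turned into probabilities, so the softmax normalization is an assumption; the weights, bias and candidate labels `pinyin_1` and `pinyin_2` are hypothetical toy values.

```python
import math
from typing import List, Tuple

Vector = List[float]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """One linear layer: one output score per candidate pinyin."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict_pinyin(target_vector: Vector, weights: List[Vector], bias: Vector,
                   candidates: List[str]) -> Tuple[str, List[Tuple[str, float]]]:
    """Return the top-ranked pinyin and the full descending ranking."""
    logits = linear(target_vector, weights, bias)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # softmax is an assumption
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(candidates, probs), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked


target_vector = [1.0, 0.0]
weights = [[0.0, 1.0], [2.0, 0.0]]   # hypothetical layer parameters
bias = [0.0, 0.0]
best, ranked = predict_pinyin(target_vector, weights, bias,
                              ["pinyin_1", "pinyin_2"])
print(best, ranked)
```

With these toy parameters the second candidate receives the larger probability and is ranked first, mirroring the ordering in the example.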

Specifically, after calculating the target pinyin probability of the target vector through the preset linear layer and determining the target pronunciation of the target polyphone according to the target pinyin probability, the server obtains an error value of the target pronunciation based on the labeled pronunciation and optimizes the acquisition strategy of the target pronunciation according to the error value, wherein the acquisition strategy comprises the execution process, the algorithm and the network structure for acquiring the target pronunciation.

The server obtains the labeled pronunciation of the target polyphone, that is, the pronunciation of the target polyphone that matches the semantics and emotion of the Chinese sentence to be processed. The labeled pronunciation may be labeled manually or by a pre-trained polyphone labeling model. The server calculates the pronunciation similarity between the target pronunciation and the labeled pronunciation of the target polyphone, and calculates the difference between the pronunciation similarity and 1 to obtain the error value of the target pronunciation based on the labeled pronunciation. The error value is used to adjust the execution process of acquiring the target pronunciation and to optimize the network structure used to acquire it. The network structure comprises a neural network structure and model parameters, and the corresponding processing functions may include generation of representation vectors, extraction of representation vectors, word segmentation, calculation of the pinyin probability by the linear layer, and the like. The error value is also used to add or delete algorithms used in acquiring the target pronunciation, or to adjust their execution order over the whole process. Optimizing the acquisition strategy of the target pronunciation according to the error value improves the accuracy of predicting polyphone pronunciations.
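The error value described above, namely the difference between 1 and the pronunciation similarity, can be sketched as follows. The text does not define the similarity measure, so the string similarity from the Python standard library used here is an assumption, and the pinyin strings are hypothetical examples written with tone numbers.

```python
from difflib import SequenceMatcher


def pronunciation_similarity(predicted: str, labeled: str) -> float:
    """Assumed similarity measure: character overlap ratio in [0, 1]."""
    return SequenceMatcher(None, predicted, labeled).ratio()


def error_value(predicted: str, labeled: str) -> float:
    """Error of the predicted pronunciation relative to the label."""
    return 1.0 - pronunciation_similarity(predicted, labeled)


# Hypothetical pinyin strings with tone numbers.
print(error_value("zhe2", "zhe2"))  # identical pronunciations give zero error
print(error_value("zhe2", "she2"))  # a mismatch gives a positive error
```

In a trained model this scalar would more commonly be replaced by a differentiable loss; the sketch only mirrors the arithmetic stated in the text.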

In the embodiment of the invention, the word expression vector set is converted into word-level feature expression vectors according to the target word segmentation, so that character features are converted into word-level features and the unknown-word problem is avoided, which effectively improves the accuracy of predicting polyphone pronunciations. In addition, the polyphone expression vector and the word-level feature expression vectors are spliced based on the attention mechanism, the target pinyin probability of the target vector is calculated through a preset linear layer, and the target pronunciation of the target polyphone is determined according to the target pinyin probability. By combining the target word segmentation with the attention mechanism, the pronunciation of the target polyphone is predicted without any rules or manual feature design, the influence of labeling errors in word segmentation is reduced, the text semantic information of the Chinese sentence to be processed is captured accurately, and the accuracy of predicting polyphone pronunciations is improved.

The method for predicting polyphonic pronunciations in the embodiment of the present invention has been described above; the apparatus for predicting polyphonic pronunciations in the embodiment of the present invention is described below. Referring to fig. 3, one embodiment of the apparatus for predicting polyphonic pronunciations in the embodiment of the present invention includes:

an obtaining module 301, configured to obtain a marked Chinese sentence to be processed, and obtain a word expression vector set and a polyphone expression vector of the Chinese sentence to be processed, where the Chinese sentence to be processed includes a target polyphone;

a conversion module 302, configured to perform word segmentation on the Chinese sentence to be processed to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

a splicing module 303, configured to perform splicing processing based on an attention mechanism on the polyphone representation vector and the word-level feature representation vector to obtain a target vector;

and a determining module 304, configured to calculate the target pinyin probability of the target vector through a preset linear layer and determine the target pronunciation of the target polyphone according to the target pinyin probability.

The function realization of each module in the device for predicting polyphone pronunciation corresponds to each step in the embodiment of the method for predicting polyphone pronunciation, and the function and the realization process are not described in detail herein.

Referring to fig. 4, another embodiment of the apparatus for predicting polyphonic pronunciation according to the embodiment of the present invention includes:

wherein, the splicing module 303 specifically includes:

a calculating unit 3031, configured to perform attention calculation on the polyphone representation vectors and the word-level feature representation vectors through a preset feedforward attention mechanism to obtain an attention vector;

the splicing unit 3032 is configured to splice the attention vector and the polyphone representation vector to obtain a target vector;

Optionally, the conversion module 302 may be further specifically configured to:

dividing a word representation vector set according to the target word segmentation to obtain a representation vector group of each word;

Optionally, the determining module 304 may be further specifically configured to:

sequencing polyphone pinyin probability values in the polyphone pinyin probability value set according to the sequence from large value to small value, and determining the polyphone pinyin probability value with the first sequencing as a target pinyin probability;

Optionally, the obtaining module 301 includes:

an obtaining unit 3011, configured to obtain an initial Chinese sentence, a target polyphone in the initial Chinese sentence, and polyphone position information corresponding to the target polyphone;

a labeling unit 3012, configured to label the target polyphone in the initial Chinese sentence according to the polyphone position information, to obtain a Chinese sentence to be processed;

and a code extraction unit 3013, configured to perform word vector coding and polyphone vector extraction on the Chinese sentence to be processed in sequence to obtain a word expression vector set and a polyphone expression vector.

Optionally, the code extracting unit 3013 may be further specifically configured to:

coding each word in the Chinese sentence to be processed through a preset deep neural network coder to obtain a word representation vector set, wherein one word representation vector corresponds to one word;

Optionally, the apparatus for predicting polyphonic pronunciation further comprises:

and the optimization module 305 is configured to obtain an error value of the target pronunciation based on the labeled pronunciation, and optimize an obtaining strategy of the target pronunciation according to the error value, where the obtaining strategy includes an execution process, an algorithm, and a network structure of obtaining the target pronunciation.

The function realization of each module and each unit in the polyphone pronunciation prediction device corresponds to each step in the polyphone pronunciation prediction method embodiment, and the function and realization process are not described in detail herein.

Fig. 3 and 4 describe the prediction apparatus of polyphonic pronunciation in the embodiment of the present invention in detail from the perspective of modular functional entities; the prediction apparatus of polyphonic pronunciation in the embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 5 is a schematic structural diagram of a polyphonic pronunciation prediction apparatus 500 according to an embodiment of the present invention, which may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the polyphonic pronunciation prediction apparatus 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the polyphonic pronunciation prediction apparatus 500, the series of instruction operations in the storage medium 530.

The apparatus 500 for predicting polyphonic pronunciations may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration of the predictive device for polyphonic pronunciation shown in FIG. 5 does not constitute a limitation of the predictive device for polyphonic pronunciation and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, which may also be a volatile computer readable storage medium, having stored therein instructions, which, when executed on a computer, cause the computer to perform the steps of the method for predicting polyphonic pronunciations.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, in which each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting polyphonic pronunciations, the method comprising:

2. The method for predicting polyphonic pronunciations according to claim 1, wherein the performing word segmentation on the Chinese sentence to be processed to obtain a target word segmentation, and converting the word representation vector set into word-level feature representation vectors according to the target word segmentation includes:

3. The method for predicting polyphonic pronunciations according to claim 1, wherein the performing an attention-based concatenation process on the polyphonic expression vector and the word-level feature expression vector to obtain a target vector comprises:

4. The method for predicting polyphonic pronunciation according to claim 1, wherein the calculating a target pinyin probability of the target vector through a preset linear layer and determining the target pronunciation of the target polyphonic character according to the target pinyin probability comprises:

5. The method for predicting polyphonic pronunciations according to claim 1, wherein the obtaining a labeled Chinese sentence to be processed, and obtaining the word representation vector set and polyphonic word representation vector of the Chinese sentence to be processed, the Chinese sentence to be processed including a target polyphonic word, comprises:

6. The method for predicting polyphone pronunciation according to claim 5, wherein the sequentially performing word vector encoding and polyphone vector extraction on the Chinese sentence to be processed to obtain a word expression vector set and a polyphone expression vector comprises:

7. The method for predicting polyphonic pronunciation according to any one of claims 1-6, wherein after calculating the target Pinyin probability of the target vector through a preset linear layer and determining the target pronunciation of the target polyphonic character according to the target Pinyin probability, the method further comprises:

8. A polyphonic pronunciation prediction device, comprising:

9. A polyphonic pronunciation prediction device, comprising: a memory and at least one processor, the memory having instructions stored therein;

the at least one processor invokes the instructions in the memory to cause the apparatus for predicting a polyphonic pronunciation to perform the method for predicting a polyphonic pronunciation according to any of claims 1-7.

10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement a method for predicting a polyphonic pronunciation according to any of claims 1-7.