WO2022121166A1

WO2022121166A1 - Method, apparatus and device for predicting heteronym pronunciation, and storage medium

Info

Publication number: WO2022121166A1
Application number: PCT/CN2021/083522
Authority: WO
Inventors: 李俊杰; 张志宇; 马骏; 王少军
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-10
Filing date: 2021-03-29
Publication date: 2022-06-16
Also published as: CN112528648A; JP7441864B2; JP2023509257A

Abstract

The present invention relates to the technical field of artificial intelligence, and provides a method, apparatus and device for predicting the heteronym pronunciation, and a storage medium, for use in improving the accuracy of predicting the heteronym pronunciation. The method for predicting the heteronym pronunciation comprises: acquiring an annotated Chinese statement to be processed, and acquiring a word representation vector set and a heteronym representation vector of said Chinese statement, said Chinese statement comprising a target heteronym (101); performing word segmentation processing on said Chinese statement to obtain a target word, and converting the word representation vector set into a word-level feature representation vector according to the target word (102); connecting the heteronym representation vector to the word-level feature representation vector on the basis of an attention mechanism to obtain a target vector (103); and calculating a target Pinyin probability of the target vector by means of a preset linear layer, and determining a target pronunciation of the target heteronym according to the target Pinyin probability (104). In addition, the present invention further relates to blockchain technology, and said annotated Chinese statement can be stored in a blockchain.

Description

Method, apparatus, device and storage medium for predicting the pronunciation of polyphonic words

This application claims priority to Chinese patent application No. 202011432585.6, filed on December 10, 2020 and titled "Method, Apparatus, Equipment and Storage Medium for Predicting the Pronunciation of Polyphones", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of intelligent decision-making in artificial intelligence, and in particular to a method, apparatus, device and storage medium for predicting the pronunciation of polyphonic words.

Background technique

Grapheme-to-phoneme conversion is an important component of text-to-speech systems. Unlike in other languages, it is very common for a Chinese character to have different pronunciations in different contexts, and many Chinese characters even have more than 3 pronunciations. The quality of the polyphone pronunciation labeling system therefore greatly affects the quality of a Chinese speech synthesis system: if a pronunciation is labeled wrongly, the synthesized speech contains obvious errors. At present, the pronunciation of polyphonic words is usually predicted from labeled data, with a set of vectors randomly initialized on that labeled data.

However, the inventors realized that, because this set of vectors is randomly initialized, words that were not labeled when the model was trained cannot be recognized at prediction time, that is, the problem of unregistered (out-of-vocabulary) words arises. As a result, the accuracy of predicting the pronunciation of polyphonic words is low.

SUMMARY OF THE INVENTION

The present application provides a method, apparatus, device and storage medium for predicting the pronunciation of polyphonic words, which are used to improve the accuracy of predicting the pronunciation of polyphonic words.

A first aspect of the present application provides a method for predicting the pronunciation of a polyphonic word, including:

acquiring a marked Chinese sentence to be processed, and acquiring a word representation vector set and a polyphonic word representation vector of the to-be-processed Chinese sentence, where the to-be-processed Chinese sentence includes a target polyphonic word;

performing word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation, and converting the word representation vector set into a word-level feature representation vector according to the target word segmentation;

performing splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector; and

calculating a target pinyin probability of the target vector through a preset linear layer, and determining a target pronunciation of the target polyphonic word according to the target pinyin probability.

A second aspect of the present application provides a device for predicting the pronunciation of polyphonic words, including:

an acquisition module, configured to acquire the marked Chinese sentences to be processed, and to acquire a word representation vector set and a polyphonic word representation vector of the to-be-processed Chinese sentences, where the to-be-processed Chinese sentences include target polyphonic words;

a conversion module, configured to perform word segmentation on the to-be-processed Chinese sentence to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

a splicing module, configured to perform splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector;

a determination module, configured to calculate the target pinyin probability of the target vector through a preset linear layer, and determine the target pronunciation of the target polyphonic word according to the target pinyin probability.

A third aspect of the present application provides a device for predicting the pronunciation of polyphonic words, including a memory and at least one processor, where instructions are stored in the memory; the at least one processor calls the instructions in the memory so that the device performs the method for predicting the pronunciation of polyphonic words described in the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer executes the method for predicting the pronunciation of polyphonic words described in the first aspect.

In the technical solution provided by this application, a marked Chinese sentence to be processed is acquired, together with its word representation vector set and polyphonic word representation vector, where the sentence includes a target polyphonic word; word segmentation is performed on the sentence to obtain a target word segmentation, and the word representation vector set is converted into a word-level feature representation vector according to the target word segmentation; the polyphonic word representation vector and the word-level feature representation vector are spliced based on an attention mechanism to obtain a target vector; and the target pinyin probability of the target vector is calculated through a preset linear layer, and the target pronunciation of the target polyphonic word is determined according to the target pinyin probability. In the embodiments of the present application, converting the word representation vector set into word-level feature representation vectors according to the target word segmentation turns character features into word-level features, which avoids the problem of unregistered words and effectively improves the accuracy of predicting the pronunciation of polyphonic words. By combining the target word segmentation with the attention mechanism, the pronunciation of the target polyphonic word is predicted without any rules or hand-crafted feature design, which alleviates the impact of labeling errors in word segmentation and accurately captures the textual semantic information of the Chinese sentence to be processed, thereby improving the accuracy of predicting the pronunciation of polyphonic words.

Description of drawings

FIG. 1 is a schematic diagram of an embodiment of the method for predicting the pronunciation of polyphonic words in the embodiments of the present application;

FIG. 2 is a schematic diagram of another embodiment of the method for predicting the pronunciation of polyphonic words in the embodiments of the present application;

FIG. 3 is a schematic diagram of an embodiment of a device for predicting the pronunciation of polyphonic words in the embodiments of the present application;

FIG. 4 is a schematic diagram of another embodiment of the device for predicting the pronunciation of polyphonic words in the embodiments of the present application;

FIG. 5 is a schematic diagram of an embodiment of a device for predicting the pronunciation of polyphonic words in the embodiments of the present application.

Detailed description

Embodiments of the present application provide a method, apparatus, device and storage medium for predicting the pronunciation of polyphonic words, which improve the accuracy of predicting the pronunciation of polyphonic words.

The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of the present application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that data so used may be interchanged under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover non-exclusive inclusion, for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those expressly listed steps or units, but may include other steps or units not expressly listed or inherent to these processes, methods, products or devices.

For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, an embodiment of the method for predicting the pronunciation of polyphonic words in the embodiments of the present application includes:

101. Acquire the marked Chinese sentences to be processed, and acquire a word representation vector set and a polyphonic word representation vector of the to-be-processed Chinese sentences, where the to-be-processed Chinese sentences include target polyphonic words.

It can be understood that the method of the present application may be executed by a device for predicting the pronunciation of polyphonic words, or by a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with the server as the executing entity.

The server receives the initial Chinese sentence sent through a preset interface, performs data cleaning on it to obtain a candidate Chinese sentence, and obtains pre-created polyphonic word labels. The polyphonic word labels may be labels created from the polyphonic words of at least one of a general dictionary, a business domain dictionary and user portrait tags. Labeling across multiple domains improves the generality and accuracy of the polyphonic words, and labeling based on the interests recorded in the user portrait tags improves the labeling accuracy. A polyphonic word label includes the polyphonic word and the pronunciation of the polyphonic word based on semantic information. The server identifies the business domain and user information of the candidate Chinese sentence, calls the corresponding polyphonic word labels according to the business domain and the user information, identifies the target polyphonic word in the candidate Chinese sentence through the polyphonic word labels, and marks the target polyphonic word, thereby obtaining the marked Chinese sentence to be processed.

After obtaining the marked Chinese sentence to be processed, the server calls pre-trained word vectors and a preset word vector conversion algorithm to perform vector conversion on the characters of the Chinese sentence to be processed, obtaining the word representation vector set, and then, according to the marked target polyphonic word, extracts the representation vector corresponding to the target polyphonic word from the word representation vector set, thereby obtaining the polyphonic word representation vector. Alternatively, the server extracts the target polyphonic word from the marked Chinese sentence to be processed, and calls the pre-trained word vectors and the preset word vector conversion algorithm to perform vector conversion separately on the characters of the Chinese sentence to be processed and on the target polyphonic word, obtaining the word representation vector set and the polyphonic word representation vector. There may be one or more target polyphonic words.

102. Perform word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation.

The server calls the preset jieba word segmentation tool, the HanLP Chinese language processing package, or another word segmentation tool to segment the Chinese sentence to be processed in the order of the original sentence, obtaining the initial word segmentation; or the server calls a preset dictionary-based or statistics-based Chinese word segmentation algorithm to segment the sentence in the order of the original sentence, obtaining the initial word segmentation. The initial word segmentation is then spliced according to preset word splicing rules to obtain the target word segmentation. There may be one or more initial word segments and one or more target word segments. The server classifies the word representation vectors in the word representation vector set according to the target word segmentation to obtain the word representation vector group corresponding to each target word segment, and splices the word representation vector group corresponding to each target word segment to obtain the word-level feature representation vectors. There may be one or more word-level feature representation vectors, and each target word segment corresponds to one word-level feature representation vector.

103. Perform a splicing process based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector.

The server may calculate, through a preset attention mechanism, the polyphone attention value of the polyphonic word representation vector and multiply it by the polyphonic word representation vector to obtain a polyphonic word vector matrix; calculate the word attention value of the word-level feature representation vector with respect to the polyphonic word representation vector to obtain a word vector matrix; and perform matrix addition or matrix multiplication on the polyphonic word vector matrix and the word vector matrix to obtain the target vector. Alternatively, the server may, through the preset attention mechanism, calculate a first attention value of the polyphonic word representation vector relative to the word-level feature representation vector and a second attention value of the word-level feature representation vector relative to the polyphonic word representation vector, multiply the first attention value by the word-level feature representation vector to obtain a first vector, multiply the second attention value by the polyphonic word representation vector to obtain a second vector, and perform matrix addition or matrix multiplication on the first vector and the second vector to obtain the target vector.

104. Calculate the target pinyin probability of the target vector through a preset linear layer, and determine the target pronunciation of the target polyphonic word according to the target pinyin probability.

There may be multiple preset linear layers, each corresponding to one classifier; that is, the linear layer includes multiple classifiers. The pinyin probabilities output by the classifiers are weighted and summed to obtain the initial pinyin probabilities of the target vector, of which there may be one or more. The initial pinyin probabilities are compared with a preset threshold and with one another to obtain the target pinyin probability, and the pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphonic word.

For example, suppose the classifiers are classifier 1, classifier 2 and classifier 3. Classifier 1 performs pinyin classification and probability calculation on the target vector and obtains probability A1 for pinyin 1 and probability A2 for pinyin 2; classifier 2 does the same and obtains probability B1 for pinyin 1 and probability B2 for pinyin 2; classifier 3 obtains probability C1 for pinyin 1 and probability C2 for pinyin 2. A1, B1 and C1 are weighted and summed to obtain initial pinyin probability 1 of the target vector for pinyin 1, and A2, B2 and C2 are weighted and summed to obtain initial pinyin probability 2 of the target vector for pinyin 2. If only one of initial pinyin probability 1 and initial pinyin probability 2 is greater than the preset threshold, that initial pinyin probability is determined as the target pinyin probability. If both are greater than the preset threshold, the larger of the two is determined as the target pinyin probability. If both are less than or equal to the preset threshold, the initial pinyin probabilities are recalculated. After obtaining the target pinyin probability, the server determines the pinyin corresponding to the target pinyin probability as the target pronunciation of the target polyphonic word.
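
The three-classifier example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_classifiers`, the classifier weights, the probability values and the threshold are all hypothetical, since the text does not specify them.

```python
def combine_classifiers(classifier_outputs, classifier_weights, threshold):
    """Weighted-sum the per-classifier pinyin probabilities, then apply the threshold rule.

    Returns (target_pinyin, initial_probabilities). target_pinyin is None when no
    initial probability exceeds the threshold, i.e. a recalculation is needed.
    """
    pinyins = list(classifier_outputs[0])
    initial = {
        pinyin: sum(w * output[pinyin]
                    for w, output in zip(classifier_weights, classifier_outputs))
        for pinyin in pinyins
    }
    # Keep only the initial probabilities that are strictly above the threshold.
    above = {p: v for p, v in initial.items() if v > threshold}
    if not above:
        return None, initial
    # One candidate: it wins. Several candidates: the largest wins.
    return max(above, key=above.get), initial


# Hypothetical values standing in for A1/A2, B1/B2 and C1/C2.
outputs = [
    {"pinyin1": 0.7, "pinyin2": 0.3},  # classifier 1
    {"pinyin1": 0.6, "pinyin2": 0.4},  # classifier 2
    {"pinyin1": 0.8, "pinyin2": 0.2},  # classifier 3
]
target, initial = combine_classifiers(outputs, [0.5, 0.3, 0.2], threshold=0.5)
```

Returning `None` leaves the recalculation step, which the text does not detail, to the caller.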

In another embodiment, the server matches the Chinese sentence to be processed and the target polyphonic word against the initial historical polyphonic word information stored in a preset database to obtain the corresponding target historical polyphonic word information, which includes a target historical Chinese sentence, the historical polyphonic word in the target historical Chinese sentence, and the pronunciation of the historical polyphonic word. The server calculates the similarity between the target pronunciation of the target polyphonic word and the pronunciation of the historical polyphonic word, calculates the difference between the similarity and 1 to obtain a target value, and judges whether the target value is less than a preset similarity value. If so, the target pronunciation of the target polyphonic word is determined as the final target pronunciation; if not, the pronunciation of the historical polyphonic word is determined as the target pronunciation of the target polyphonic word.
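
A minimal sketch of this history check follows. The text does not say how the similarity between two pronunciations is computed, so the sketch assumes a string similarity from Python's standard `difflib` module, reads "the difference between the similarity and 1" as 1 minus the similarity, and uses a hypothetical preset value of 0.2.

```python
from difflib import SequenceMatcher


def reconcile_with_history(predicted, historical, preset_value=0.2):
    """Keep the predicted pinyin if it is close enough to the historical one.

    Assumptions: similarity is difflib's ratio in [0, 1]; the target value is
    1 - similarity; preset_value is a hypothetical threshold.
    """
    similarity = SequenceMatcher(None, predicted, historical).ratio()
    target_value = 1.0 - similarity
    return predicted if target_value < preset_value else historical


same = reconcile_with_history("zhe2", "zhe2")       # identical readings
different = reconcile_with_history("she2", "zhe2")  # readings differ in the initial
```

With these assumptions, a prediction that disagrees with the stored reading falls back to the historical pronunciation.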

In the embodiments of the present application, converting the word representation vector set into word-level feature representation vectors according to the target word segmentation turns character features into word-level features, which avoids the problem of unregistered words and effectively improves the accuracy of predicting the pronunciation of polyphonic words. The polyphonic word representation vector and the word-level feature representation vector are spliced based on the attention mechanism to obtain the target vector, the target pinyin probability of the target vector is calculated through the preset linear layer, and the target pronunciation of the target polyphonic word is determined according to the target pinyin probability. By combining the target word segmentation with the attention mechanism, the pronunciation of the target polyphonic word is predicted without any rules or hand-crafted feature design, which alleviates the impact of labeling errors in word segmentation and accurately captures the textual semantic information of the Chinese sentence to be processed, thereby improving the accuracy of predicting the pronunciation of polyphonic words.

Referring to Fig. 2, another embodiment of the method for predicting the pronunciation of polyphonic words in the embodiment of the present application includes:

201. Acquire the marked Chinese sentences to be processed, and acquire a word representation vector set and a polyphonic word representation vector of the to-be-processed Chinese sentences, where the to-be-processed Chinese sentences include target polyphonic words.

Specifically, the server obtains the initial Chinese sentence, the target polyphonic word in the initial Chinese sentence, and the polyphone position information corresponding to the target polyphonic word; marks the target polyphonic word in the initial Chinese sentence according to the polyphone position information to obtain the Chinese sentence to be processed; and sequentially performs word vector encoding and polyphonic word vector extraction on the Chinese sentence to be processed to obtain the word representation vector set and the polyphonic word representation vector.

The server receives the initial Chinese sentence sent through a preset interface, calls a pre-created polyphonic word dictionary to perform polyphonic word recognition on the initial Chinese sentence, obtains the target polyphonic word, and extracts the position of the target polyphonic word in the initial Chinese sentence (that is, the polyphone position information). In the initial Chinese sentence, the server marks the target polyphonic word at the polyphone position information, thereby obtaining the Chinese sentence to be processed. The marked content includes the target polyphonic word and its polyphone position information, and may also include the pronunciation of the target polyphonic word in a Chinese sentence corresponding to the initial Chinese sentence, where the corresponding Chinese sentence can be matched by calculating a weighted sum of the semantic similarity, emotional similarity and sentence similarity.
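
The dictionary lookup and position marking can be sketched as follows. The dictionary contents and the function name `mark_polyphones` are hypothetical, and the Chinese string is an assumed back-translation of the example sentence "all products are sold at a discount", which the text gives only in English.

```python
# Hypothetical polyphone dictionary: character -> candidate pinyin readings.
POLYPHONE_DICT = {
    "折": ["zhe2", "she2", "zhe1"],
    "都": ["dou1", "du1"],
}


def mark_polyphones(sentence, polyphone_dict=POLYPHONE_DICT):
    """Return (character, 1-based position, candidate readings) for each polyphone."""
    marks = []
    for position, char in enumerate(sentence, start=1):
        if char in polyphone_dict:
            marks.append((char, position, polyphone_dict[char]))
    return marks


# Assumed Chinese form of the example sentence (not given in the text).
marked = mark_polyphones("所有商品都打折出售")
```

Positions are 1-based so that they match the way the description counts characters ("the seventh" character).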

The server invokes a preset supervised neural network encoder and/or an unsupervised pre-trained network encoder to perform word vector encoding on the Chinese sentence to be processed, obtaining the word representation vector set, and extracts from the word representation vector set the polyphonic word representation vector corresponding to the target polyphonic word.

Specifically, the server encodes each character of the Chinese sentence to be processed through a preset deep neural network encoder to obtain the word representation vector set, in which one word representation vector corresponds to one character; according to the polyphone position information, the server extracts the representation vector corresponding to the target polyphonic word from the word representation vector set to obtain the polyphonic word representation vector.

The server invokes the deep neural network encoder among the preset supervised neural network encoders. The deep neural network encoder may include, but is not limited to, at least one of a long short-term memory (LSTM) model and a bidirectional encoder representations from transformers (BERT) model. Through the deep neural network encoder, each character of the Chinese sentence to be processed is encoded with contextual semantic information, following the sequence order of the characters in the sentence, to obtain the representation vector of each character, that is, the word representation vector set. The representation vector at the polyphone position information is then extracted from the word representation vector set to obtain the polyphonic word representation vector. For example, if the Chinese sentence to be processed is "all products are sold at a discount" and the polyphone position information indicates the seventh character of the Chinese sentence to be processed, the seventh word representation vector is extracted from the word representation vector set to obtain the polyphonic word representation vector corresponding to the target polyphonic word.
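
The extraction step reduces to indexing the vector set by position. In the sketch below, `toy_encode` is a hypothetical stand-in for the LSTM or BERT encoder: it returns one pseudo-random vector per character and does not model context. The Chinese string is an assumed back-translation of the example sentence.

```python
import random


def toy_encode(sentence, dim=4, seed=0):
    """Stand-in for the encoder: one pseudo-random vector per character.

    A real LSTM/BERT encoder would produce context-dependent vectors; this
    placeholder only reproduces the shape of the output.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in sentence]


def extract_polyphone_vector(char_vectors, position):
    """position is 1-based, as in the description ("the seventh" character)."""
    if not 1 <= position <= len(char_vectors):
        raise IndexError("position lies outside the sentence")
    return char_vectors[position - 1]


sentence = "所有商品都打折出售"  # assumed Chinese form of the example sentence
char_vectors = toy_encode(sentence)
polyphone_vector = extract_polyphone_vector(char_vectors, 7)
```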

202. Perform word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation.

Specifically, the server performs word segmentation processing on the Chinese sentence to be processed to obtain the target word segmentation; divides the word representation vector set according to the target word segmentation to obtain the representation vector group of each word; and performs hybrid pooling on the representation vector group of each word through a preset hybrid pooling layer to obtain the word-level feature representation vectors.

The server calls the preset Chinese word segmentation algorithm to perform word segmentation processing on the Chinese sentence to be processed, obtaining the initial word segmentation, performs part-of-speech detection and phrase detection on the initial word segmentation, and determines the initial word segmentation that passes the detection as the target word segmentation. The Chinese word segmentation algorithm integrates an N-gram model and a bi-directional matching (BM) model; that is, the output of the N-gram model may be the input of the BM model, the output of the BM model may be the input of the N-gram model, or the N-gram model may be connected in parallel with the BM model.
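
As an illustration of the bi-directional matching part only (the N-gram model and the part-of-speech and phrase checks are omitted), the sketch below runs forward and backward maximum matching against a toy vocabulary. The vocabulary, the maximum word length and the tie-breaking rule (prefer fewer words, then the backward result) are assumptions, as is the Chinese form of the example sentence.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always taking the longest vocabulary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Scan right to left, always taking the longest vocabulary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_match(text, vocab, max_len=4):
    forward = forward_max_match(text, vocab, max_len)
    backward = backward_max_match(text, vocab, max_len)
    # Assumed heuristic: fewer words wins; on a tie, keep the backward result.
    return forward if len(forward) < len(backward) else backward


toy_vocab = {"所有", "商品", "打折", "出售"}  # hypothetical dictionary
segments = bidirectional_match("所有商品都打折出售", toy_vocab)
# segments == ["所有", "商品", "都", "打折", "出售"]
```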

The server divides the word representation vector set according to the target word segmentation to obtain the representation vector group of each word. For example, the Chinese sentence to be processed is "all products are sold at a discount", and the corresponding target word segments are "all", "commodity", "all", "discount" and "sale". Taking "discount" as an example, the representation vector group of the word "discount" contains the representation vectors of the two characters that make up the word (rendered "hit" and "discount" when translated character by character), and the same holds for the other words.
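
Dividing the vector set by the segmentation amounts to slicing the per-character vectors by word length. A minimal sketch, with the hypothetical helper `group_by_words` and placeholder two-dimensional vectors:

```python
def group_by_words(char_vectors, words):
    """Split per-character vectors into one group per segmented word."""
    if sum(len(word) for word in words) != len(char_vectors):
        raise ValueError("segmentation does not cover the sentence exactly")
    groups, start = [], 0
    for word in words:
        groups.append(char_vectors[start:start + len(word)])
        start += len(word)
    return groups


# Nine placeholder vectors, one per character of a nine-character sentence.
vectors = [[float(i), float(-i)] for i in range(9)]
# Word lengths 2, 2, 1, 2, 2, matching the five target word segments in the example.
words = ["ab", "cd", "e", "fg", "hi"]
groups = group_by_words(vectors, words)
```

The fourth group then holds the two character vectors of the word "discount" in the example.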

The preset hybrid pooling layer is a pooling layer that combines max pooling and average pooling. The server calls the preset hybrid pooling layer to perform hybrid pooling on the representation vector group of each word to obtain the word-level feature representation vectors; for example, the representation vectors of the two characters in the representation vector group of the word "discount" are fused to obtain the word-level feature representation vector of "discount". The hybrid pooling can be carried out in any of the following ways. The server may perform max pooling on the representation vector group of each word through the max pooling convolution kernel or max pooling layer in the hybrid pooling layer to obtain a first word representation vector group, and then perform average pooling on the first word representation vector group through the average pooling convolution kernel or average pooling layer in the hybrid pooling layer to obtain the word-level feature representation vectors. Alternatively, the server may perform max pooling on the representation vector group of each word through the max pooling convolution kernel or max pooling layer in the hybrid pooling layer to obtain a first word representation vector group, perform average pooling on the representation vector group of each word through the average pooling convolution kernel or average pooling layer in the hybrid pooling layer to obtain a second word representation vector group, and fuse the first word representation vector group and the second word representation vector group to obtain the word-level feature representation vectors. Alternatively, the server may pre-create a hybrid pooling layer that fuses a max pooling convolution kernel and an average pooling convolution kernel, and use it to perform pooling and convolution on the representation vector group of each word to obtain the word-level feature representation vectors. There may be one or more word-level feature representation vectors, and each target word segment corresponds to one word-level feature representation vector.
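
The second variant (max pooling and average pooling computed separately, then fused) can be sketched as below. The fusion rule is not specified in the text; the sketch assumes a convex combination with a hypothetical mixing weight `alpha`, which keeps the output the same size as a single character vector.

```python
def hybrid_pool(vector_group, alpha=0.5):
    """Fuse element-wise max pooling and average pooling over one word's vectors.

    alpha is an assumed mixing weight: alpha * max + (1 - alpha) * average.
    """
    columns = list(zip(*vector_group))  # one tuple per vector dimension
    max_pooled = [max(column) for column in columns]
    avg_pooled = [sum(column) / len(column) for column in columns]
    return [alpha * m + (1.0 - alpha) * a for m, a in zip(max_pooled, avg_pooled)]


# Two placeholder character vectors forming one word's representation vector group.
word_vector = hybrid_pool([[1.0, 4.0], [3.0, 2.0]])
# max pooling gives [3.0, 4.0], average pooling gives [2.0, 3.0]
```

Concatenating the two pooled vectors would be another reasonable reading of "fuse"; it doubles the dimension instead.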

203. Through a preset feedforward attention mechanism, perform attention calculation on the polyphonic word representation vector and the word-level feature representation vector to obtain an attention vector.

The server calculates the attention values between the polyphonic word representation vector and the word-level feature representation vectors through the preset feed-forward attention mechanism, and performs a weighted summation according to the attention values to obtain the attention vector; or the server calculates the attention value of the polyphonic word representation vector relative to the word-level feature representation vector through the preset feed-forward attention mechanism, multiplies the attention value by the polyphonic word representation vector to obtain a polyphonic word representation vector matrix, and performs matrix addition or matrix multiplication on the polyphonic word representation vector matrix and the word-level feature representation vector to obtain the attention vector.
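
The first variant (attention values followed by a weighted summation) can be sketched as follows. The scoring function is an assumption: a simplified additive score `v · tanh(q + h)` without learned projection matrices, normalized with a softmax. The parameter vector `v` and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feed_forward_attention(query, word_vectors, v):
    """Score each word vector against the polyphone vector, then sum by weight.

    query: polyphonic word representation vector.
    word_vectors: word-level feature representation vectors (same dimension).
    v: assumed scoring parameters of the same dimension.
    """
    scores = [
        sum(v_i * math.tanh(q_i + h_i) for v_i, q_i, h_i in zip(v, query, h))
        for h in word_vectors
    ]
    weights = softmax(scores)
    attention_vector = [
        sum(w * h[i] for w, h in zip(weights, word_vectors))
        for i in range(len(query))
    ]
    return weights, attention_vector


weights, attention_vector = feed_forward_attention(
    query=[0.0, 0.0],
    word_vectors=[[1.0, 1.0], [-1.0, -1.0]],
    v=[1.0, 1.0],
)
```

The weights show how much each word contributes to the attention vector; here the first word scores higher and therefore dominates.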

204. Splicing the attention vector and the polyphonic word representation vector to obtain a target vector.

After obtaining the attention vector, the server performs matrix multiplication or matrix addition on the attention vector and the polyphonic word representation vector to obtain the target vector; or the server performs a weighted summation of the attention vector and the polyphonic word representation vector to obtain the target vector. Because the target vector is obtained through the preset feed-forward attention mechanism, it indicates which words in the Chinese sentence to be processed are more important for the target polyphonic word and should receive greater weight, thereby improving the accuracy with which the contextual semantics of the target polyphonic word are fused.
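
Two simple ways of combining the vectors are sketched below: splicing read as concatenation (an assumption about what "splicing" means in step 204) and the weighted summation named in the text, with a hypothetical weight of 0.5.

```python
def splice(attention_vector, polyphone_vector):
    """Splicing read as concatenation: the result holds both vectors end to end."""
    return list(attention_vector) + list(polyphone_vector)


def weighted_sum(attention_vector, polyphone_vector, weight=0.5):
    """Element-wise weight * attention + (1 - weight) * polyphone."""
    if len(attention_vector) != len(polyphone_vector):
        raise ValueError("vectors must have the same dimension")
    return [
        weight * a + (1.0 - weight) * p
        for a, p in zip(attention_vector, polyphone_vector)
    ]


spliced = splice([1.0, 2.0], [3.0, 4.0])
summed = weighted_sum([1.0, 2.0], [3.0, 4.0])
```

Concatenation doubles the input size of the linear layer that follows, whereas the weighted sum keeps it unchanged.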

205. Calculate the target pinyin probability of the target vector through the preset linear layer, and determine the target pronunciation of the target polyphonic word according to the target pinyin probability.

Specifically, the server calculates, through the preset linear layer, the probability of the target vector for each pinyin to obtain a set of polyphonic word pinyin probability values; sorts the polyphonic word pinyin probability values in the set in descending order and determines the first-ranked value as the target pinyin probability; and determines the pinyin corresponding to the target pinyin probability as the target pronunciation of the target polyphonic word.

For example, the number of linear layers is one. The server inputs the target vector into the preset linear layer and calculates, through the linear layer, the probability of the target vector for each pinyin, obtaining a set of polyphonic word pinyin probability values consisting of polyphonic word pinyin probability value 1 and polyphonic word pinyin probability value 2. Sorting the two values in descending order gives the sequence "polyphonic word pinyin probability value 2, polyphonic word pinyin probability value 1". Since polyphonic word pinyin probability value 2 ranks first, it is the target pinyin probability, and the pinyin corresponding to it is determined as the target pronunciation of the target polyphonic word.
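The calculation in step 205 can be sketched as a single linear layer followed by a descending sort. The softmax used to turn the layer outputs into probabilities, the weights, and the example pinyin labels "xing2" and "hang2" are assumptions for illustration and are not taken from the application.

```python
import math

def predict_pinyin(target_vector, weights, bias, pinyin_labels):
    """Score each candidate pinyin with one linear layer and rank the results."""
    logits = [sum(w * x for w, x in zip(row, target_vector)) + b
              for row, b in zip(weights, bias)]
    # Softmax (assumed) converts the linear outputs into probabilities.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort the probability values from large to small; the first is the target.
    ranked = sorted(zip(pinyin_labels, probs), key=lambda pair: pair[1],
                    reverse=True)
    best_pinyin, best_prob = ranked[0]
    return best_pinyin, best_prob, ranked

# Illustrative values only: two candidate pronunciations of one character.
best_pinyin, best_prob, ranked = predict_pinyin(
    target_vector=[1.0, 2.0],
    weights=[[0.1, 0.2], [0.4, 0.3]],
    bias=[0.0, 0.0],
    pinyin_labels=["xing2", "hang2"],
)
```

Here the second candidate receives the larger linear output, so it ranks first after sorting and is returned as the target pronunciation, mirroring the two-value example above.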

Specifically, after the server calculates the target pinyin probability of the target vector through the preset linear layer and determines the target pronunciation of the target polyphonic word according to the target pinyin probability, it obtains the error value of the target pronunciation based on the marked pronunciation and optimizes the acquisition strategy of the target pronunciation according to the error value. The acquisition strategy includes the execution process, algorithm and network structure for acquiring the target pronunciation.

The server obtains the marked pronunciation of the target polyphonic word. The marked pronunciation is the pronunciation of the target polyphonic word obtained by labeling with a labeling model, based on the sentence pronunciation corresponding to the semantics and emotion of the Chinese sentence to be processed. The server calculates the pronunciation similarity between the target pronunciation of the target polyphonic word and the marked pronunciation, and calculates the difference between the pronunciation similarity and 1 to obtain the error value of the target pronunciation based on the marked pronunciation. According to the error value, the server adjusts the execution process for acquiring the target pronunciation (steps such as extraction, word segmentation and pinyin probability calculation by the linear layer), optimizes the network structure used to acquire the target pronunciation, and adds or deletes algorithms used to acquire the target pronunciation or adjusts their execution order. Optimizing the acquisition strategy of the target pronunciation according to the error value improves the accuracy of predicting the pronunciation of polyphonic words.
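The error value described above is the difference between 1 and the pronunciation similarity. The application does not specify how the similarity is computed; the sketch below assumes, purely for illustration, a string similarity over pinyin using Python's `difflib.SequenceMatcher`, and the function name is hypothetical.

```python
from difflib import SequenceMatcher

def pronunciation_error(target_pronunciation, marked_pronunciation):
    """Return 1 minus an assumed string similarity between two pinyin strings."""
    similarity = SequenceMatcher(None, target_pronunciation,
                                 marked_pronunciation).ratio()
    return 1.0 - similarity

# Illustrative values only.
error_match = pronunciation_error("hang2", "hang2")
error_mismatch = pronunciation_error("xing2", "hang2")
```

A prediction identical to the marked pronunciation gives an error of 0, and any mismatch gives a positive error that could drive the optimization of the acquisition strategy.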

In the embodiment of the present application, the word representation vector set is converted into a word-level feature representation vector according to the target word segmentation. Converting character-level features into word-level features avoids the problem of unregistered words and thereby effectively improves the accuracy of predicting the pronunciation of polyphonic words. The target vector is obtained by splicing the polyphonic word representation vector and the word-level feature representation vector based on the attention mechanism; the target pinyin probability of the target vector is calculated through the preset linear layer, and the target pronunciation of the target polyphonic word is determined according to the target pinyin probability. Combining word segmentation with the attention mechanism to predict the pronunciation of the target polyphonic word requires no rules or manual feature design, reduces the impact of labeling errors in word segmentation, and accurately captures the textual semantic information of the Chinese sentence to be processed, which improves the accuracy of predicting the pronunciation of polyphonic words.

The method for predicting the pronunciation of a polyphonic word in the embodiment of the present application has been described above. The following describes the device for predicting the pronunciation of a polyphonic word in the embodiment of the present application. Referring to FIG. 3, an embodiment of the device for predicting the pronunciation of a polyphonic word in the embodiment of the present application includes:

The obtaining module 301 is used to obtain the marked Chinese sentences to be processed, and to obtain a word representation vector set and a polyphonic word representation vector of the to-be-processed Chinese sentences, and the to-be-processed Chinese sentences include target polyphonic words;

The conversion module 302 is used to perform word segmentation processing on the Chinese sentence to be processed to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

The splicing module 303 is used to perform splicing processing based on the attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector;

The determining module 304 is configured to calculate the target pinyin probability of the target vector through a preset linear layer, and determine the target pronunciation of the target polyphonic word according to the target pinyin probability.

The function implementation of each module in the above-mentioned apparatus for predicting the pronunciation of a polyphonic word corresponds to each step in the above-mentioned embodiment of the method for predicting the pronunciation of a polyphonic word, and the functions and implementation process thereof will not be repeated here.

Referring to FIG. 4, another embodiment of the device for predicting the pronunciation of polyphonic words in the embodiment of the present application includes:

Wherein, the splicing module 303 specifically includes:

The calculation unit 3031 is used to perform attention calculation on the polyphonic word representation vector and the word-level feature representation vector through the preset feedforward attention mechanism, and obtain the attention vector;

The splicing unit 3032 is used for splicing the attention vector and the polyphonic word representation vector to obtain the target vector;

Optionally, the conversion module 302 can also be specifically used for:

Perform word segmentation processing on the Chinese sentence to be processed to obtain the target word segmentation;

Divide the word representation vector set according to the target word segmentation to obtain the representation vector group of each word;

Through the preset mixed pooling layer, the representation vector group of each word is mixed and pooled to obtain the word-level feature representation vector.
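The three operations listed above for the conversion module can be sketched as follows. Treating mixed pooling as a weighted blend of max pooling and average pooling is an assumption, as are the blend factor `lam`, the function names, the example sentence and all vector values; the word segmentation itself is taken as given.

```python
def mixed_pool(group, lam=0.5):
    """Blend max pooling and average pooling over one word's vector group."""
    dim = len(group[0])
    pooled = []
    for i in range(dim):
        column = [vec[i] for vec in group]
        pooled.append(lam * max(column) + (1.0 - lam) * sum(column) / len(column))
    return pooled

def word_level_features(char_vectors, segmented_words, lam=0.5):
    """Divide per-character vectors by the segmentation and pool each group."""
    if sum(len(word) for word in segmented_words) != len(char_vectors):
        raise ValueError("segmentation does not cover the character vectors")
    features, start = [], 0
    for word in segmented_words:
        group = char_vectors[start:start + len(word)]
        features.append(mixed_pool(group, lam))
        start += len(word)
    return features

# Illustrative values only: four characters segmented into two words.
char_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
features = word_level_features(char_vectors, ["银行", "开门"])
```

Each word yields one vector of the same dimension as the character vectors, regardless of how many characters the word contains.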

Optionally, the determining module 304 can also be specifically used for:

Through the preset linear layer, calculate the probability of the target vector for each pinyin to obtain a set of polyphonic word pinyin probability values;

Sort the polyphonic word pinyin probability values in the set in descending order, and determine the first-ranked polyphonic word pinyin probability value as the target pinyin probability;

The pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphonic word.

Optionally, the obtaining module 301 includes:

The obtaining unit 3011 is used to obtain the initial Chinese sentence, the target polyphonic word in the initial Chinese sentence, and the polyphonic word position information corresponding to the target polyphonic word;

The labeling unit 3012 is used to label the target polyphonic word in the initial Chinese sentence according to the position information of the polyphonic word to obtain the Chinese sentence to be processed;

The encoding and extracting unit 3013 is configured to sequentially perform word vector encoding and polyphonic word vector extraction on the Chinese sentence to be processed to obtain a word representation vector set and a polyphonic word representation vector.

Optionally, the encoding and extracting unit 3013 can also be specifically used for:

Through the preset deep neural network encoder, encode each word in the Chinese sentence to be processed, and obtain a word representation vector set, one word representation vector corresponds to one word;

According to the position information of the polyphonic word, the representation vector corresponding to the target polyphonic word is extracted from the word representation vector set to obtain the representation vector of the polyphonic word.
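The encoding and extraction performed by unit 3013 can be illustrated with the sketch below. The `toy_encode` function is only a placeholder standing in for the preset deep neural network encoder, which the application does not specify here; the function names, the example sentence and the position value are hypothetical.

```python
def toy_encode(sentence):
    """Placeholder encoder: one small vector per character (not a real model)."""
    return [[float(ord(ch) % 7), float(i)] for i, ch in enumerate(sentence)]

def extract_polyphone_vector(char_vectors, position):
    """Pick the vector of the target polyphonic word by its position."""
    if not 0 <= position < len(char_vectors):
        raise IndexError("polyphonic word position is outside the sentence")
    return char_vectors[position]

# Illustrative values only: the target character is at index 1.
sentence = "银行开门"
position = 1
char_vectors = toy_encode(sentence)
polyphone_vector = extract_polyphone_vector(char_vectors, position)
```

Because one representation vector corresponds to one character, the position information is enough to select the polyphonic word representation vector from the set.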

Optionally, the device for predicting the pronunciation of polyphonic words further includes:

The optimization module 305 is used to obtain the error value of the target pronunciation based on the marked pronunciation, and optimize the acquisition strategy of the target pronunciation according to the error value, and the acquisition strategy includes the execution process, algorithm and network structure of obtaining the target pronunciation.

The function implementation of each module and each unit in the above-mentioned polyphonic word pronunciation prediction apparatus corresponds to each step in the above-mentioned polyphonic word pronunciation prediction method embodiment, and the functions and implementation process thereof will not be repeated here.

Figures 3 and 4 above describe in detail the device for predicting the pronunciation of polyphones in the embodiment of the present application from the perspective of modular functional entities. The following describes the device for predicting the pronunciation of polyphones in the embodiment of the present application in detail from the perspective of hardware processing.

FIG. 5 is a schematic structural diagram of a device for predicting the pronunciation of a polyphonic word provided by an embodiment of the present application. The device 500 for predicting the pronunciation of a polyphonic word may vary greatly depending on configuration or performance, and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) that store application programs 533 or data 532. The memory 520 and the storage medium 530 may be transient storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the device 500 for predicting the pronunciation of polyphonic words. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the device 500 for predicting the pronunciation of polyphonic words, the series of instruction operations in the storage medium 530.

The device 500 for predicting the pronunciation of polyphonic words may also include one or more power sources 540, one or more wired or wireless network interfaces 550, one or more input and output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux and FreeBSD. Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the device for predicting the pronunciation of polyphonic words, which may include more or fewer components than those shown in the figure, combine some components, or use a different component arrangement.

The present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, they cause the computer to execute the steps of the method for predicting the pronunciation of polyphonic words.

Further, the computer-readable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required by at least one function, and the like; the stored data area may store data created during use, and the like.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage medium includes a U disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk and other media that can store program code.

The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the technical solutions described in the above embodiments can still be modified, or some of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for predicting the pronunciation of a polyphone, wherein the method for predicting the pronunciation of the polyphone comprises:

Obtain the marked Chinese sentences to be processed, and obtain the word representation vector set and the polyphonic word representation vector of the to-be-processed Chinese sentences, and the to-be-processed Chinese sentences include target polyphonic words;

Perform word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

performing splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector;

Through the preset linear layer, the target pinyin probability of the target vector is calculated, and the target pronunciation of the target polyphonic word is determined according to the target pinyin probability.
The method for predicting the pronunciation of a polyphonic word according to claim 1, wherein performing word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation, and converting the word representation vector set into a word-level feature representation vector according to the target word segmentation, includes:

Perform word segmentation processing on the to-be-processed Chinese sentences to obtain target word segmentation;

Divide the word representation vector set according to the target word segmentation to obtain a representation vector group of each word;

Through a preset mixed pooling layer, the representation vector group of each word is mixed and pooled to obtain a word-level feature representation vector.
The method for predicting the pronunciation of a polyphonic word according to claim 1, wherein the splicing process based on the attention mechanism is performed on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector, including:

Through the preset feedforward attention mechanism, the attention calculation is performed on the polyphonic word representation vector and the word-level feature representation vector to obtain the attention vector;

The attention vector is spliced with the polyphonic word representation vector to obtain a target vector.
The method for predicting the pronunciation of a polyphonic word according to claim 1, wherein calculating the target pinyin probability of the target vector through a preset linear layer, and determining the target pronunciation of the target polyphonic word according to the target pinyin probability, includes:

Through the preset linear layer, calculate the probability of the target vector based on each pinyin, and obtain the probability value set of the polyphonic word pinyin;

According to the order of values from large to small, sort the polyphonic word pinyin probability values in the polyphonic word pinyin probability value set, and determine the first polyphonic word pinyin probability value to be the target pinyin probability;

The pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphonic word.
The method for predicting the pronunciation of polyphonic words according to claim 1, wherein obtaining the marked Chinese sentence to be processed, and obtaining the word representation vector set and the polyphonic word representation vector of the Chinese sentence to be processed, the Chinese sentence to be processed including a target polyphonic word, includes:

Obtain the initial Chinese sentence, the target polyphone in the initial Chinese sentence, and the polyphone position information corresponding to the target polyphone;

According to the position information of the polyphonic word, the target polyphonic word in the initial Chinese sentence is marked to obtain the Chinese sentence to be processed;

The word vector encoding and polyphonic word vector extraction are sequentially performed on the to-be-processed Chinese sentences to obtain a word representation vector set and a polyphonic word representation vector.
The method for predicting the pronunciation of polyphonic words according to claim 5, wherein, the described Chinese sentences to be processed are sequentially subjected to word vector encoding and polyphonic word vector extraction to obtain a word representation vector set and a polyphonic word representation vector, including:

Through a preset deep neural network encoder, each word in the Chinese sentence to be processed is encoded to obtain a word representation vector set, and a word representation vector corresponds to a word;

According to the position information of the polyphonic word, a representation vector corresponding to the target polyphonic word is extracted from the set of word representation vectors to obtain a representation vector of the polyphonic word.
The method for predicting the pronunciation of a polyphonic word according to any one of claims 1-6, wherein, after calculating the target pinyin probability of the target vector through the preset linear layer and determining the target pronunciation of the target polyphonic word according to the target pinyin probability, the method further includes:

Obtaining the error value of the target pronunciation based on the marked pronunciation, and optimizing the acquisition strategy of the target pronunciation according to the error value, where the acquisition strategy includes an execution process, an algorithm and a network structure for acquiring the target pronunciation.
A prediction device for the pronunciation of a polyphonic word, wherein the prediction device for the pronunciation of the polyphonic word comprises:

an acquisition module, configured to acquire the marked Chinese sentences to be processed, and to acquire a word representation vector set and a polyphonic word representation vector of the to-be-processed Chinese sentences, where the to-be-processed Chinese sentences include target polyphonic words;

a conversion module, configured to perform word segmentation on the to-be-processed Chinese sentence to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

a splicing module, configured to perform splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector;

A determination module, configured to calculate the target pinyin probability of the target vector through a preset linear layer, and determine the target pronunciation of the target polyphonic word according to the target pinyin probability.
A device for predicting the pronunciation of a polyphonic word, wherein the device for predicting the pronunciation of a polyphonic word comprises: a memory and at least one processor, wherein an instruction is stored in the memory;

The at least one processor invokes the instructions in the memory, so that the prediction device for the pronunciation of the polyphonic word executes the prediction method for the pronunciation of the polyphonic word as described below:

Perform word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

performing splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector;

Through the preset linear layer, the target pinyin probability of the target vector is calculated, and the target pronunciation of the target polyphonic word is determined according to the target pinyin probability.
The device for predicting the pronunciation of a polyphonic word according to claim 9, wherein, when the processor executes the step of performing word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation and converting the word representation vector set into a word-level feature representation vector according to the target word segmentation, the step includes:

Perform word segmentation processing on the to-be-processed Chinese sentence to obtain target word segmentation;

Divide the word representation vector set according to the target word segmentation to obtain a representation vector group of each word;

Through a preset mixed pooling layer, the representation vector group of each word is mixed and pooled to obtain a word-level feature representation vector.
The device for predicting the pronunciation of a polyphonic word according to claim 9, wherein, when the processor executes the step of performing splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector, the step includes:

Through the preset feedforward attention mechanism, the attention calculation is performed on the polyphonic word representation vector and the word-level feature representation vector to obtain the attention vector;

The attention vector is spliced with the polyphonic word representation vector to obtain a target vector.
The device for predicting the pronunciation of a polyphonic word according to claim 9, wherein, when the processor executes the step of calculating the target pinyin probability of the target vector through a preset linear layer and determining the target pronunciation of the target polyphonic word according to the target pinyin probability, the step includes:

Through the preset linear layer, calculate the probability of the target vector based on each pinyin, and obtain the probability value set of the polyphonic word pinyin;

According to the order of values from large to small, sort the polyphonic word pinyin probability values in the polyphonic word pinyin probability value set, and determine the first polyphonic word pinyin probability value to be the target pinyin probability;

The pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphonic word.
The device for predicting the pronunciation of polyphonic words according to claim 9, wherein, when the processor executes the step of obtaining the marked Chinese sentence to be processed and obtaining the word representation vector set and the polyphonic word representation vector of the Chinese sentence to be processed, the Chinese sentence to be processed including a target polyphonic word, the step includes:

Obtain the initial Chinese sentence, the target polyphone in the initial Chinese sentence, and the polyphone position information corresponding to the target polyphone;

According to the position information of the polyphonic word, the target polyphonic word in the initial Chinese sentence is marked to obtain the Chinese sentence to be processed;

The word vector encoding and polyphonic word vector extraction are sequentially performed on the to-be-processed Chinese sentences to obtain a word representation vector set and a polyphonic word representation vector.
The device for predicting the pronunciation of a polyphonic word according to claim 13, wherein, when the processor executes the step of sequentially performing word vector encoding and polyphonic word vector extraction on the to-be-processed Chinese sentence to obtain the word representation vector set and the polyphonic word representation vector, the step includes:

Through a preset deep neural network encoder, each word in the Chinese sentence to be processed is encoded to obtain a word representation vector set, and a word representation vector corresponds to a word;

According to the position information of the polyphonic word, a representation vector corresponding to the target polyphonic word is extracted from the set of word representation vectors to obtain a representation vector of the polyphonic word.
The device for predicting the pronunciation of a polyphonic word according to any one of claims 9-14, wherein, after the processor executes the step of calculating the target pinyin probability of the target vector through the preset linear layer and determining the target pronunciation of the target polyphonic word according to the target pinyin probability, the processor further executes:

Obtaining the error value of the target pronunciation based on the marked pronunciation, and optimizing the acquisition strategy of the target pronunciation according to the error value, where the acquisition strategy includes an execution process, an algorithm and a network structure for acquiring the target pronunciation.
A computer-readable storage medium having instructions stored on the computer-readable storage medium, wherein, when the instructions are executed by a processor, the following method for predicting the pronunciation of polyphonic words is implemented:

Perform word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation, and convert the word representation vector set into a word-level feature representation vector according to the target word segmentation;

performing splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector;

Through the preset linear layer, the target pinyin probability of the target vector is calculated, and the target pronunciation of the target polyphonic word is determined according to the target pinyin probability.
The computer-readable storage medium according to claim 16, wherein, when the instructions for predicting the pronunciation of the polyphonic word are executed by the processor to perform the step of performing word segmentation processing on the to-be-processed Chinese sentence to obtain a target word segmentation and converting the word representation vector set into a word-level feature representation vector according to the target word segmentation, the step includes:

Perform word segmentation processing on the to-be-processed Chinese sentence to obtain target word segmentation;

Divide the word representation vector set according to the target word segmentation to obtain a representation vector group of each word;

Through a preset mixed pooling layer, the representation vector group of each word is mixed and pooled to obtain a word-level feature representation vector.
The computer-readable storage medium according to claim 16, wherein, when the instructions for predicting the pronunciation of the polyphonic word are executed by the processor to perform the step of performing splicing processing based on an attention mechanism on the polyphonic word representation vector and the word-level feature representation vector to obtain a target vector, the step includes:

Through the preset feedforward attention mechanism, the attention calculation is performed on the polyphonic word representation vector and the word-level feature representation vector to obtain the attention vector;

The attention vector is spliced with the polyphonic word representation vector to obtain a target vector.
The computer-readable storage medium according to claim 16, wherein, when the instructions for predicting the pronunciation of the polyphonic word are executed by the processor to perform the step of calculating the target pinyin probability of the target vector through the preset linear layer and determining the target pronunciation of the target polyphonic word according to the target pinyin probability, the step includes:

Through the preset linear layer, calculate the probability of the target vector based on each pinyin, and obtain the probability value set of the polyphonic word pinyin;

According to the order of values from large to small, sort the polyphonic word pinyin probability values in the polyphonic word pinyin probability value set, and determine the first polyphonic word pinyin probability value to be the target pinyin probability;

The pinyin corresponding to the target pinyin probability is determined as the target pronunciation of the target polyphonic word.
The computer-readable storage medium according to claim 16, wherein, when the instructions for predicting the pronunciation of the polyphonic word are executed by the processor to perform the step of obtaining the marked Chinese sentence to be processed and obtaining the word representation vector set and the polyphonic word representation vector of the Chinese sentence to be processed, the Chinese sentence to be processed including a target polyphonic word, the step includes:

Obtain the initial Chinese sentence, the target polyphone in the initial Chinese sentence, and the polyphone position information corresponding to the target polyphone;

According to the position information of the polyphonic word, the target polyphonic word in the initial Chinese sentence is marked to obtain the Chinese sentence to be processed;

The word vector encoding and polyphonic word vector extraction are sequentially performed on the to-be-processed Chinese sentences to obtain a word representation vector set and a polyphonic word representation vector.