CN116959407A - Pronunciation prediction method and device and related products


Info

Publication number
CN116959407A
Authority
CN
China
Prior art keywords
target
polyphones
pronunciation
text
pronunciation prediction
Prior art date
Legal status
Pending
Application number
CN202310279558.7A
Other languages
Chinese (zh)
Inventor
田彦培
胡海峰
孙钟前
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310279558.7A
Publication of CN116959407A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers


Abstract

The embodiment of the application discloses a pronunciation prediction method, a pronunciation prediction device and related products. Characters in a target text are converted into vector representations; the vector representation of a target polyphone and the vector representations of its associated characters are extracted from the vector representations of the plurality of characters obtained by the conversion; a pronunciation prediction model corresponding to the target polyphone is called from among a plurality of pronunciation prediction models; and, based on the vector representation of the target polyphone and the vector representations of its associated characters, the pronunciation of the target polyphone in the target text is predicted using the model corresponding to the target polyphone. Because a dedicated pronunciation prediction model performs the prediction for the target polyphone, its prediction capability can be improved independently without affecting the prediction of other polyphones; the pronunciation predictions of different polyphones are thus decoupled, and the accuracy of pronunciation prediction for the different polyphones in a text is improved.

Description

Pronunciation prediction method and device and related products
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a pronunciation prediction method, a pronunciation prediction device and related products.
Background
With the continuous development of internet technology, speech synthesis technology, by which a machine generates human speech, is widely applied in scenarios such as vehicle navigation, voice chat robots and audiobooks. When generating speech from text, the exact pronunciation of each character in the text usually needs to be determined, yet the text often contains polyphones. A polyphone has several different pronunciations, and the different pronunciations differ in meaning, in usage and sometimes in part of speech. To make the converted speech meet the usage requirements, the pronunciations of the polyphones in the text are generally predicted by a pronunciation prediction model.
In the related art, a text containing a polyphone is generally input into a single pronunciation prediction model to obtain the pronunciation prediction result for the polyphone in that text. Such a model undertakes pronunciation prediction for many polyphones, which causes the following problem: because one set of model parameters is shared across the predictions of multiple polyphones, improving the model's prediction capability for an individual polyphone cannot avoid influencing its capability for the others. That is, while the prediction capability for an individual polyphone is improved, the capability for other polyphones is likely to be affected, so that the model can no longer predict their pronunciations effectively, which harms the accuracy of polyphone pronunciation prediction.
Disclosure of Invention
The embodiment of the application provides a pronunciation prediction method, a pronunciation prediction device and related products, which aim to effectively predict the pronunciations of different polyphones in a text and thereby improve the accuracy of polyphone pronunciation prediction.
In view of this, a first aspect of the present application provides a pronunciation prediction method, including:
converting characters in the target text into vector representations; the target text comprises polyphones;
extracting the vector representation of a target polyphone and the vector representations of associated characters of the target polyphone from the vector representations of the plurality of characters obtained through conversion; the target polyphone is one of the polyphones in the target text;
calling a pronunciation prediction model corresponding to the target polyphone from among a plurality of pronunciation prediction models; the pronunciation prediction models are each used for predicting the pronunciation of their corresponding polyphone;
based on the vector representation of the target polyphone and the vector representations of the associated characters of the target polyphone, predicting the pronunciation of the target polyphone in the target text by using the pronunciation prediction model corresponding to the target polyphone.
The second aspect of the present application provides a pronunciation prediction apparatus, comprising:
a character conversion unit for converting characters in the target text into a vector representation; the target text comprises polyphones;
a vector extraction unit, configured to extract the vector representation of a target polyphone and the vector representations of the associated characters of the target polyphone from the vector representations of the plurality of characters obtained through conversion; the target polyphone is one of the polyphones in the target text;
the model calling unit is used for calling a pronunciation prediction model corresponding to the target polyphones in the pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones;
and the pronunciation prediction unit is used for predicting the pronunciation of the target polyphone in the target text by using a pronunciation prediction model corresponding to the target polyphone based on the vector representation of the target polyphone and the vector representation of the associated character of the target polyphone.
A third aspect of the present application provides a pronunciation prediction device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the steps of the pronunciation prediction method provided in the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer readable storage medium storing program code for performing the steps of the pronunciation prediction method provided in the first aspect.
A fifth aspect of the application provides a computer program product comprising a computer program or instructions which, when executed by a pronunciation prediction device, implement the steps of the pronunciation prediction method provided in the first aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
according to the pronunciation prediction method provided by the embodiment of the application, based on the vector representation of the target text comprising the polyphones, the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones are extracted, the associated information of the context of the target polyphones in the target text is reserved, rich and reliable input data are provided for the use of a subsequent model, and the pronunciation prediction effect of the model on the target polyphones is improved. And calling a pronunciation prediction model corresponding to the target polyphone, and predicting the pronunciation of the target polyphone in the target text based on the vector representation of the target polyphone and the vector representation of the associated character of the target polyphone. Because different pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones, the pronunciation prediction model corresponding to the target polyphones predicts the pronunciation of the target polyphones, and when the pronunciation prediction capacity of the pronunciation prediction model is improved, the pronunciation prediction capacity of the pronunciation prediction models corresponding to other polyphones is not affected. Therefore, the pronunciation prediction of different polyphones is decoupled, and the condition that the pronunciation prediction capacities of the unified pronunciation prediction model are mutually influenced for the different polyphones is avoided. According to the technical scheme, the pronunciation prediction model corresponding to the polyphones is adopted to predict the pronunciation of the polyphones in the text, so that training and improvement of the pronunciation prediction capability of different polyphones can be independently carried out, the pronunciation of different polyphones in the text can be effectively predicted, and the pronunciation prediction accuracy of the polyphones in the text is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that a person skilled in the art could obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of a pronunciation prediction method according to an embodiment of the present application;
Fig. 2 is a flowchart of a pronunciation prediction method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of converting target text into vector representations by a BERT model according to an embodiment of the present application;
Fig. 4 is a schematic diagram of pronunciation prediction training sentences for a target polyphone according to an embodiment of the present application;
Fig. 5 is a flowchart of another pronunciation prediction method according to an embodiment of the present application;
Fig. 6a is a schematic diagram of determining the associated characters of a target polyphone in a target text according to an embodiment of the present application;
Fig. 6b is a schematic diagram of predicting the pronunciations of multiple polyphones in a target text according to an embodiment of the present application;
Fig. 7 is a schematic diagram of processing the input data of a pronunciation prediction model into input data matching the input size of a convolution module according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a pronunciation prediction model according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a pronunciation prediction device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
At present, to predict the pronunciation of a polyphone in a text, the text including the polyphone needs to be input into a pronunciation prediction model to obtain the pronunciation prediction result for the polyphone in the text. However, research shows that when a unified pronunciation prediction model is used to predict the pronunciations of different polyphones in texts, the model must be responsible for the pronunciation prediction of a large number of polyphones. With too many polyphones under one model, the prediction capabilities for different polyphones become coupled: improving the unified model's capability for an individual polyphone easily affects its capability for the other polyphones. That is, the model's prediction capabilities for different polyphones are interrelated, which affects the accuracy of polyphone pronunciation prediction.
For example, suppose text A includes polyphone a and text B includes polyphone b. Because polyphone a is widely used in daily life, the model's prediction capability for polyphone a particularly needs to be improved; specifically, the model parameters of the pronunciation prediction model are trained on text samples containing a large number of instances of polyphone a, which effectively improves the trained model's prediction accuracy for polyphone a. However, for the model whose capability for polyphone a has been improved in this way, the trained parameters are now better suited to accurately predicting the pronunciation of polyphone a, and the changed parameters are not necessarily better suited to polyphone b; the model's prediction capability for polyphone b is easily reduced, so that the pronunciation of polyphone b can no longer be predicted accurately, which affects the model's accuracy of polyphone pronunciation prediction.
In view of the above problems, embodiments of the present application provide a pronunciation prediction method, apparatus and related products, in which the characters in a target text are converted into vector representations, the target text including polyphones; the vector representation of a target polyphone and the vector representations of its associated characters are extracted from the vector representations obtained by the conversion, the target polyphone being one of the polyphones in the target text; a pronunciation prediction model corresponding to the target polyphone is called from among a plurality of pronunciation prediction models, which are each used to predict the pronunciation of their corresponding polyphone; and, based on the vector representation of the target polyphone and the vector representations of its associated characters, the pronunciation of the target polyphone in the target text is predicted using the model corresponding to the target polyphone. The method extracts the vector representations of the target polyphone and its associated characters from the vector representation of the target text, retaining the contextual information of the target polyphone in the target text; it calls a pronunciation prediction model trained specifically for the target polyphone, so that the pronunciation predictions of different polyphones are decoupled and cannot influence one another; and based on the vector representations of the target polyphone and its associated characters, the called model can accurately predict the pronunciation of the target polyphone in the target text, improving the accuracy of polyphone pronunciation prediction.
The pronunciation prediction method, the pronunciation prediction device and the related products provided by the application can be applied to a server or terminal equipment with data processing capability. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing a cloud computing service, but is not limited thereto. Terminal devices include, but are not limited to, cell phones, tablets, computers, smart cameras, smart voice interaction devices, smart appliances, vehicle terminals, aircraft, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
In order to facilitate understanding of the technical scheme of the application, the pronunciation prediction method provided by the embodiment of the application is described below in connection with practical application scenes.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of a pronunciation prediction method according to an embodiment of the present application. The application scenario shown in Fig. 1 includes the terminal device 100.
The terminal device 100 converts the characters in the target text into vector representations; the target text includes polyphones. As an example, the target text is "text A", which includes "character 1", "character 2", "polyphone a", "character n" and so on, where "polyphone a" is a polyphone in "text A". The terminal device 100 may convert the characters in "text A" into the corresponding vector representation "vector representation 1", which includes the representations "v character 1", "v character 2", "v polyphone a", "v character n" and so on corresponding to the characters in "text A". It should be noted that in the embodiment of the present application, "text A" may also include polyphones other than "polyphone a", for example "polyphone b" and "polyphone c", which are not shown in Fig. 1.
The terminal device 100 extracts the vector representation of a target polyphone and the vector representations of the associated characters of the target polyphone from the converted vector representations of the plurality of characters; the target polyphone is one of the polyphones in the target text. As an example, the target polyphone is "polyphone a", and its associated characters are "associated character a"; specifically, the associated characters of "polyphone a" may be the several characters adjacent to "polyphone a" before and after it in the text. Based on the above example, the terminal device 100 may extract the vector representation of "polyphone a" and the vector representations of "associated character a" from "vector representation 1" corresponding to "text A", obtaining "vector representation 2", which may include several "v associated character" entries together with "v polyphone a"; the associated characters of the target polyphone can be selected according to a setting, and the application does not limit their number. In addition, compared with "vector representation 1", "vector representation 2" is not further processed or converted; it simply consists of the vector representations of some of the characters extracted from "vector representation 1".
The terminal device 100 invokes a pronunciation prediction model corresponding to the target polyphones among the plurality of pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones. As an example, the pronunciation prediction model corresponding to the target polyphone is "pronunciation prediction model a", the pronunciation prediction models corresponding to other polyphones are "pronunciation prediction model 1", "pronunciation prediction model 2", and "pronunciation prediction model n", etc., different pronunciation prediction models are used for predicting different polyphones, respectively, and the polyphone to be predicted and the pronunciation prediction model may be in one-to-one correspondence. Based on the above example, the "pronunciation prediction model a" is used to predict the pronunciation of the target polyphone "polyphone a", and the terminal device 100 may call the "pronunciation prediction model a" among the plurality of pronunciation prediction models.
The terminal device 100 predicts the pronunciation of the target polyphone in the target text using a pronunciation prediction model corresponding to the target polyphone based on the vector representation of the target polyphone and the vector representation of the associated character of the target polyphone. As an example, based on the above example, the terminal device 100 may predict the pronunciation of the "polyphone a" in the "text a", that is, "pronunciation a", using the "pronunciation prediction model a" based on the vector representation of the "polyphone a" and the vector representation of the "associated character a", that is, "vector representation 2".
It can be seen that, in this method, the terminal device 100 may extract the vector representation of "polyphone a" and the vector representations of its associated character "associated character a" based on "vector representation 1" of "text A", retaining the contextual information of "polyphone a" in "text A". It then calls the "pronunciation prediction model a" corresponding to "polyphone a" to predict the pronunciation of "polyphone a" in "text A". Different pronunciation prediction models each predict the pronunciation of their own polyphone, i.e., "pronunciation prediction model a" predicts the pronunciation of "polyphone a", and improving its prediction capability does not affect the pronunciation prediction models corresponding to other polyphones. The pronunciation predictions of different polyphones are thus decoupled, the mutual interference of a unified model's prediction capabilities for different polyphones in a text is avoided, and the accuracy of polyphone pronunciation prediction is improved.
The application provides a pronunciation prediction method and relates to the field of artificial intelligence. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation and other directions.
The pronunciation prediction method provided by the application mainly relates to the major directions of speech processing technology and machine learning/deep learning. The key technologies of speech technology include automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is becoming one of the most promising modes of human-computer interaction.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviour to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all areas of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
The pronunciation prediction method provided by the application can be applied in a variety of scenarios, including but not limited to vehicle navigation, voice chat robots, virtual anchors, game NPCs, audiobooks and the like.
Next, the pronunciation prediction method provided by the embodiment of the present application is described in detail below from the perspective of the terminal device described above.
Refer to Fig. 2, which is a flowchart of a pronunciation prediction method provided in an embodiment of the present application. As shown in Fig. 2, the pronunciation prediction method includes the following steps:
S201: converting characters in the target text into vector representations; the target text includes polyphones.
In the embodiment of the application, the target text refers to a text that includes polyphones whose pronunciations need to be predicted; the term distinguishes it from a text sample and does not designate a specific text, so the target text can be any text that includes polyphones. Characters refer to the Chinese characters, letters, symbols and other content included in the target text. As one example, the target text may be "friendship between them", which includes the polyphone "between" (间); the target text may also be "how to cope with such difficulty", which includes the three polyphones "should" (应), "species" (种) and "difficult" (难); the target text may also be "the bottle fell over and the water poured out", in which the polyphone "pour" (倒) appears twice with two different pronunciations. A polyphone is a character with two or more pronunciations; the different pronunciations differ in meaning and usage, and sometimes in part of speech. As an example, the polyphone "thin" (薄) has the three pronunciations "bao2", "bo2" and "bo4", where the digit denotes the tone (1-4 denote the first to fourth tones): read "bao2", "thin" means not thick; read "bo2", it is generally used in compound words such as "thick and thin" (厚薄); read "bo4", it appears in the proper noun "mint" (薄荷).
In practical applications, the pronunciation prediction model cannot directly take a target text containing polyphones as input. Before the pronunciations of the polyphones in the target text are predicted by the pronunciation prediction model, all the characters in the target text first need to be converted into corresponding vector representations; the polyphones in the target text can then be predicted by the pronunciation prediction model based on those vector representations. As an example, when the target text is "how to cope with such difficulty", the target text includes 8 characters, and these 8 characters may each be converted into a corresponding vector representation.
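As a concrete illustration of this conversion step, the following Python sketch obtains per-character vectors with a Chinese BERT encoder. This is a minimal sketch only: the patent does not name a checkpoint or library, so the HuggingFace transformers API and the public bert-base-chinese model are assumptions, and the string 如何应对这种困难 is merely a plausible stand-in for the translated running example "how to cope with such difficulty".

import torch
from transformers import BertTokenizer, BertModel

# assumed checkpoint; the embodiment only specifies a standard-scale BERT model
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

text = "如何应对这种困难"  # stand-in for "how to cope with such difficulty" (8 characters)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# last_hidden_state has shape (1, seq_len, 768); Chinese BERT tokenizes
# character by character, so positions 1..8 align with the 8 characters
# (position 0 is [CLS] and the final position is [SEP])
char_vectors = outputs.last_hidden_state[0, 1:-1]  # shape (8, 768)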
S202: extracting a vector representation of a target polyphone and a vector representation of an associated character of the target polyphone from the vector representations of the plurality of characters obtained through conversion; the target polyphone is one of the polyphones in the target text.
In the embodiment of the present application, the associated characters refer to the characters within a preset range around the target polyphone. As an example, if the target text is "how to cope with such difficulty" (each quoted single word below glosses one Chinese character of the example), the associated characters of the polyphone "should" (应) may be "pair" alone, or "how", "what" and "pair", or all the characters in the target text other than "should" itself. The target polyphone refers to the polyphone whose pronunciation needs to be predicted; the term distinguishes it from the other polyphones rather than designating a particular one, and it may be any polyphone. As an example, for the same target text, the target polyphone may be any one of the three polyphones "should", "species" and "difficult".
In practical applications, the pronunciation prediction model cannot predict the pronunciation of the target polyphone from the target polyphone alone: the specific pronunciation of a polyphone depends on its associated characters, and different associated characters may lead to different pronunciations. Therefore, after the vector representation of each character in the target text is obtained, the vector representation of the target polyphone and the vector representations of its associated characters need to be extracted; the pronunciation of the target polyphone in the target text can then be predicted on that basis by the pronunciation prediction model. As an example, the target polyphone may be "should" (应): when its associated character is "pair" (对), its pronunciation is "ying4", but when its associated character is "this", its pronunciation is "ying1".
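The extraction in S202 amounts to selecting rows of the character-vector matrix produced in S201. Below is a minimal sketch building on the char_vectors tensor from the earlier BERT sketch; the function name, the 0-based position and the two-neighbour default are illustrative, and the fixed-width padded variant used by the embodiment is sketched after S502 below.

def extract_polyphone_window(char_vectors, position, num_neighbors=2):
    # keep the target polyphone's vector plus up to num_neighbors characters
    # on each side; near a text boundary the window is simply shorter
    start = max(0, position - num_neighbors)
    end = min(len(char_vectors), position + num_neighbors + 1)
    return char_vectors[start:end]

# e.g. with the polyphone "should" at position 2 of the running example, this
# keeps the vectors of the characters glossed "how", "what", "should", "pair", "this"
window = extract_polyphone_window(char_vectors, position=2)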
S203: calling a pronunciation prediction model corresponding to the target polyphone from among a plurality of pronunciation prediction models; the pronunciation prediction models are each used to predict the pronunciation of their corresponding polyphone.
In the embodiment of the application, different polyphones correspond to different pronunciation prediction models; that is, each pronunciation prediction model predicts only its own polyphone and does not predict the pronunciations of other polyphones. As an example, if the target text is "how to cope with such difficulty" and the target polyphone is "should", the pronunciation prediction model corresponding to "should" is called from among the plurality of pronunciation prediction models; it predicts only the pronunciation of the polyphone "should" in the text and does not predict the pronunciations of the other polyphones "species" and "difficult".
Therefore, when the pronunciation of the target polyphone in the target text is predicted, only the corresponding pronunciation prediction model is called; even if the target text includes other polyphones, the called model predicts only the target polyphone. The pronunciation predictions of different polyphones are thus decoupled, avoiding mutual interference between the prediction capabilities for different polyphones.
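One natural realization of this per-polyphone dispatch is a registry keyed by the polyphone character. The sketch below is an assumption rather than the patent's stated implementation: a plain feed-forward head stands in for whatever architecture the embodiment actually uses (Figs. 7 and 8 mention a convolution module), and all names and sizes are illustrative.

import torch.nn as nn

class PolyphoneClassifier(nn.Module):
    """One small, independent classifier per polyphone; the output size is
    that polyphone's number of candidate pronunciations."""
    def __init__(self, num_pronunciations, hidden=768, window=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (window, hidden) -> (window * hidden)
            nn.Linear(hidden * window, 256),
            nn.ReLU(),
            nn.Linear(256, num_pronunciations),
        )

    def forward(self, x):                        # x: (batch, window, hidden)
        return self.net(x)

# one model per polyphone, so tuning one never touches the others
pronunciation_models = {
    "应": PolyphoneClassifier(num_pronunciations=2),  # ying1 / ying4
    "种": PolyphoneClassifier(num_pronunciations=2),  # zhong3 / zhong4
    "得": PolyphoneClassifier(num_pronunciations=3),  # de5 / dei3 / de2
}
model = pronunciation_models["应"]  # S203: dispatch on the target polyphone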
S204: predicting the pronunciation of the target polyphone in the target text by using the pronunciation prediction model corresponding to the target polyphone, based on the vector representation of the target polyphone and the vector representations of the associated characters of the target polyphone.
The vector representation of the target polyphone and the vector representations of its associated characters, extracted in the preceding steps, are input into the called pronunciation prediction model corresponding to the target polyphone; this model predicts the pronunciation of the target polyphone and outputs the pronunciation of the target polyphone in the target text.
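Continuing the sketches above, inference then reduces to one forward pass through the selected model and an argmax over that polyphone's candidate pronunciations; the label ordering below is an assumption.

import torch

PRONUNCIATIONS = {"应": ["ying1", "ying4"]}  # assumed label order

window = extract_polyphone_window(char_vectors, position=2)  # S202 output, shape (5, 768)
with torch.no_grad():
    logits = pronunciation_models["应"](window.unsqueeze(0))  # add a batch dimension
prediction = PRONUNCIATIONS["应"][logits.argmax(dim=-1).item()]
# a trained model would be expected to output "ying4" here, since "应" is followed by "对"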
In addition, considering that a unified pronunciation prediction model couples the prediction capabilities for multiple polyphones, improving its capability for the target polyphone alone may affect its capability for other polyphones. Therefore, the application trains a pronunciation prediction model dedicated to the target polyphone. For the training process of this model, the application provides a possible implementation in which the training specifically includes the following steps 1 to 5:
Step 1: acquiring a plurality of text samples; each text sample includes the target polyphone.
In the embodiment of the present application, a text sample refers to any text that includes the target polyphone; a text sample may include only the target polyphone or may also include other polyphones. To train a pronunciation prediction model specific to the target polyphone, each acquired text sample needs to carry the position information of the target polyphone in the sample and the correct pronunciation of the target polyphone in the sample, so that the model corresponding to the target polyphone can be trained on this position information and correct pronunciation in the subsequent steps.
Step 2: characters in the text sample are converted to a vector representation.
The specific implementation manner of step 2 may refer to the specific implementation manner of S201, and will not be described herein.
Step 3: constructing training data corresponding to the target polyphone from the vector representations obtained by converting the characters of the text sample, the position information of the target polyphone in the text sample, and the pronunciation label of the target polyphone; the pronunciation label is the correct pronunciation of the target polyphone in the text sample.
That is, a training sentence corresponding to the target polyphone is constructed from the vector representations obtained by converting a text sample that includes the target polyphone, the character position of the target polyphone in the text sample, and the correct pronunciation of the target polyphone in the text sample.
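A minimal sketch of this construction step under the same assumptions as the earlier sketches; the container type, the encode_fn hook and the label indexing are all illustrative, since the patent only requires that vectors, position and pronunciation label travel together.

from dataclasses import dataclass
import torch

@dataclass
class TrainingExample:
    inputs: torch.Tensor  # window of character vectors around the target polyphone
    label: int            # index into this polyphone's pronunciation list

def build_example(text, position, pronunciation, pron_list, encode_fn):
    """encode_fn maps a text to its (len(text), 768) character vectors,
    e.g. the BERT step sketched under S201."""
    char_vectors = encode_fn(text)
    window = extract_polyphone_window(char_vectors, position)
    return TrainingExample(window, pron_list.index(pronunciation))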
Step 4: training the model to be trained corresponding to the target polyphone with the plurality of training data corresponding to the target polyphone.
Through the construction step above, a plurality of training sentences corresponding to the target polyphone can be obtained, and the model to be trained is trained using these training sentences. The model to be trained includes, but is not limited to, a convolutional neural network, a recurrent neural network, an autoencoder network and the like, which the application does not limit.
As an example, refer to Fig. 4, which is a schematic diagram of pronunciation prediction training sentences for a target polyphone according to an embodiment of the present application. As shown in Fig. 4, the target polyphone is "obtained" (the Chinese character 得). The text sample of training sentence 1 is "seamless in the combination with the simple and stereoscopic modeling, so that the overall side sense is very natural"; the pronunciation label of the target polyphone in training sentence 1 is "de5", a neutral-tone reading, and its position information is 21, i.e., counting characters from left to right, the target polyphone is the 21st character. The text sample of training sentence 2 is rendered "get to get the audience"; the pronunciation label of the target polyphone in training sentence 2 is "dei3" and its position information is 5; this sentence contains two instances of the target polyphone, and it is the second instance that is labeled here. The text sample of training sentence 3 describes a person who, though young, unfortunately got cancer, spent more than ten thousand on treatment, and whose family is now destitute and deep in debt, a household impoverished by illness; the pronunciation label of the target polyphone in training sentence 3 is "de2", and its position information is 7.
Step 5: when the training cut-off condition is met, ending training to obtain the pronunciation prediction model corresponding to the target polyphone.
In the process of training the network to be trained with the plurality of training sentences corresponding to the target polyphone, training ends when the preset training cut-off condition is met, and the trained network is the pronunciation prediction model corresponding to the target polyphone.
The training cut-off conditions of the model to be trained include, but are not limited to: the pronunciation prediction accuracy for the target polyphone in the training sentences exceeding a preset prediction accuracy; reaching a preset number of training passes over all the training data corresponding to the target polyphone; and the value of the loss function of the network to be trained falling below a preset value.
In addition, on the same data set comprising multiple polyphones, both the method of this application and the open-source tool pypinyin were used to predict the pronunciations of the polyphones. Over 27996 texts containing polyphones, the polyphone pronunciation accuracy of pypinyin was 95.2%, while that of this application was 99.5%, showing that the application can effectively improve the accuracy of polyphone pronunciation prediction results.
In the training method of steps 1 to 5 above, only the training sentences corresponding to the target polyphone are used to train the model to be trained, and training sentences for other polyphones are not needed, which saves training cost; the trained pronunciation prediction model is also lightweight, which improves the efficiency of predicting the pronunciation of the target polyphone.
As an example, when training the model to be trained corresponding to the target polyphone on its training data, the training cut-off condition may specifically be training for 50 epochs, i.e., each training sentence corresponding to the target polyphone is passed over 50 times. The learning rate of the network to be trained may be set to 5e-4 and the batch size to 1024, where the batch size refers to the number of training samples processed in one iteration within an epoch. The application does not specifically limit the epoch count, batch size or learning rate.
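The following is a minimal training-loop sketch using the hyperparameters stated in this example (50 epochs, learning rate 5e-4, batch size 1024), layered on the earlier PolyphoneClassifier sketch. The optimizer, the loss and the stand-in tensors are assumptions, not details given by the patent.

import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in training data: one window of character vectors and one label index
# per training sentence for the polyphone "得" (3 pronunciations: de5/dei3/de2)
inputs = torch.randn(10000, 5, 768)
labels = torch.randint(0, 3, (10000,))

model = PolyphoneClassifier(num_pronunciations=3)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # optimizer choice is an assumption
loss_fn = torch.nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(inputs, labels), batch_size=1024, shuffle=True)

for epoch in range(50):  # cut-off condition: 50 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # the alternative cut-off conditions of step 5 could be checked here instead:
    # stop once prediction accuracy exceeds a preset threshold, or once the
    # loss falls below a preset value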
In practical applications, the habit of pronouncing a polyphone may differ across scenes such as games, daily life and film. As an example, the polyphone "one" (一) is subject to tone variation, i.e., a pronunciation variant derived from the base reading under certain grammatical and semantic conditions. In daily usage, "one" before a fourth-tone character becomes second tone: in "one time" (一次) and "certain" (一定), the polyphone "one" is pronounced "yi2". In a game scene, however, the "one" in "skill one" keeps its base pronunciation "yi1".
In order to solve the problem that polyphones may have different pronunciations in different scenes, the pronunciation prediction model also needs to consider the influence of the scene on the pronunciation of the target polyphone. Therefore, a plurality of text samples associated with a target scene can be acquired according to the requirement of predicting the target polyphone in that scene, and the requirements of the target scene are taken into account when annotating the pronunciation labels, ensuring that the labeled pronunciations conform to the usage habits of the target scene. In one possible embodiment of the application, step 1 may specifically be: acquiring a plurality of text samples associated with the target scene according to the requirement for predicting the pronunciation of the target polyphone in the target scene. Correspondingly, step 5 may specifically be: when the training cut-off condition is met, ending training to obtain a pronunciation prediction model corresponding to the target polyphone and oriented to the target scene. Correspondingly, before the corresponding pronunciation prediction model is called for the target polyphone, the usage scene of the target text may first be determined, and when the target text is associated with the target scene, the model trained for the target scene is called; step S203 may then specifically be: when the target text is determined to be a text associated with the target scene, calling the pronunciation prediction model corresponding to the target polyphone and oriented to the target scene. In this way, usage habits in different scenes are considered when predicting the pronunciation of the target polyphone in the target text, i.e., models are customized for scenes in which usage habits change, so that the pronunciation of the target polyphone can be predicted accurately even in different scenes, further improving the accuracy of polyphone pronunciation prediction.
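Under the assumptions of the earlier sketches, scene-oriented models can be expressed by simply extending the registry key from the polyphone alone to a (polyphone, scene) pair. Everything below, including the scene-detection hook, is illustrative rather than anything the patent specifies.

# registry keyed by (polyphone, scene); each entry is an independent model
scene_models = {
    ("一", "daily"): PolyphoneClassifier(num_pronunciations=2),  # yi1 / yi2
    ("一", "game"):  PolyphoneClassifier(num_pronunciations=2),
}

def detect_scene(text):
    # hypothetical stand-in for determining the usage scene of the target text
    return "game" if "技能" in text else "daily"

def model_for(polyphone, text):
    return scene_models[(polyphone, detect_scene(text))]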
In addition, in the embodiment of the application, as places, occasions and audiences change, a situation may arise in which a large number of instances of the target polyphone need to be recognized while the prediction accuracy of its pronunciation prediction model no longer meets the requirement; the performance of the model corresponding to the target polyphone then needs to be improved. Thus, in one possible embodiment of the application, the pronunciation prediction method may further include step 6: adjusting the parameters of the pronunciation prediction model corresponding to the target polyphone according to the requirement to improve the prediction accuracy for the target polyphone. In this way, improving the performance of this model does not affect the parameters of the pronunciation prediction models corresponding to other polyphones, so their prediction capabilities are not degraded; the prediction capabilities for different polyphones remain decoupled, and the capability for the target polyphone is improved without reducing that for the other polyphones.
In addition, in the embodiment of the present application, in order to determine which polyphones are included in the target text and their positions in the target text, before S201 is executed to convert the characters of the target text into vector representations, each character in the target text is matched against a polyphone library containing a large number of polyphones: a successfully matched character is determined to be a polyphone, an unmatched character is determined to be a monophone with only one pronunciation, and the positions of the successfully matched characters in the text are recorded. Thus, the application provides a possible implementation in which, before S201, the pronunciation prediction method further includes steps 7 to 9:
Step 7: matching the characters in the target text with a polyphone library; the polyphone library includes a plurality of polyphones.
In the embodiment of the application, the polyphone library refers to a character library that includes a plurality of polyphones. As an example, the polyphone library may consist of 177 commonly used polyphones; the application does not limit the number of characters in the polyphone library. Each character in the target text is matched against the polyphone library, i.e., it is looked up whether the character exists in the polyphone library. As an example, if the target text is "how to cope with such difficulty", the eight characters included in the target text are matched one by one against the polyphone library.
Step 8: and determining the character successfully matched with the polyphone library in the target text as a polyphone.
If a character in the target text is successfully matched against the polyphone library, the character is determined to be a polyphone; if not, the character is determined to be a monophone. As an example, the polyphone library includes common polyphones such as "should", "species" and "difficult", so when the target text is "how to cope with such difficulty", it can be determined that the target text includes the three polyphones "should", "species" and "difficult".
Step 9: recording character positions of polyphones in the target text.
In the subsequent steps of the application, the vector representations of the target polyphone and its associated characters need to be extracted, so the positions of the polyphones in the target text must be determined in advance. During the matching of the target text against the polyphone library, the position of the character currently being matched is known, so when a successfully matched character is determined to be a polyphone, its character position can be recorded. As an example, if the target text is "how to cope with such difficulty", after the third character "should" is successfully matched against the polyphone library, the character position of the polyphone "should" is recorded as the third character of the target text.
It can be seen that, before executing S201 to convert the characters of the target text into vector representations, the application determines the polyphones in the target text and their character positions through steps 7 to 9 above, so that the corresponding pronunciation prediction model can be called for each polyphone and the associated characters can be determined from the position of the polyphone in the target text.
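Steps 7 to 9 reduce to a set-membership scan over the text. A minimal sketch follows; the library contents are a tiny illustrative subset of the 177 common polyphones mentioned above, and 0-based positions are used, whereas the patent counts characters from 1.

POLYPHONE_LIBRARY = {"应", "种", "难", "得", "一", "和"}  # illustrative subset

def find_polyphones(text, library=POLYPHONE_LIBRARY):
    """Steps 7-9: return (character, position) for every polyphone in the text."""
    return [(ch, i) for i, ch in enumerate(text) if ch in library]

# e.g. find_polyphones("如何应对这种困难") -> [("应", 2), ("种", 5), ("难", 7)]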
Here it is considered that the different pronunciations of a polyphone are affected by its context: in different language environments the same character may have different pronunciations, which is what produces polyphones. The environment and state of the speaker also influence pronunciation in different ways, so the contextual information of a polyphone must be fully considered when predicting its pronunciation. The contextual information includes the linguistic context, i.e., the characters before and after the polyphone in the target text; as an example, the polyphone "and" (和) is pronounced "he2" in the text "Beijing and Shanghai" (北京和上海) and "he4" in the text rendered "one-record and". The contextual information also includes the situational context, composed of objective factors such as the place, the occasion, the audience and the situation of the utterance; when the linguistic context alone cannot settle the pronunciation, the situational context is considered as well. As an example, in the text rendered "good reading and poor reading", once the situational context is considered, the first occurrence of the polyphone "good" (好) is pronounced "hao3" and the second "hao4".
Thus, for the process in S201 of converting the characters in the target text into vector representations, the application provides a possible implementation: S201 may specifically be converting the characters in the target text into vector representations by a context information extraction model, where the converted vector representations contain both the semantic information of each character itself and the contextual information of the character within the target text. Converting the characters into vector representations containing contextual information fully accounts for the language environment of the polyphones in the target text; the resulting representations are rich in contextual information, their representational capability is further improved, and more accurate pronunciations of the target polyphone can be obtained. The context information extraction model is a model trained to extract the contextual information of each character in the target text, so that the vector representation corresponding to the target text contains, in addition to the semantic information of the target text, the contextual information of each character within it. The context information extraction model includes, but is not limited to, a BERT model, a bidirectional long short-term memory network and the like, which the application does not limit.
As an example, refer to Fig. 3, which is a schematic diagram of converting target text into vector representations by a BERT model according to an embodiment of the present application. As shown in Fig. 3, the target text is "how to cope with such difficulty", and the context information extraction model is a standard-scale BERT model comprising a word segmenter (Tokenizer), a vector conversion (Embedding) layer and a Transformer module, where the Transformer module includes 12 Transformer layers, namely Transformer 1, Transformer 2, …, Transformer 12. First, the Tokenizer in the BERT model separates each character of the text, yielding the eight characters glossed "how", "what", "should", "pair", "this", "species", "trapped" and "difficult". The Embedding layer then converts the eight characters into the vector representations "e how", "e what", "e should", "e pair", "e this", "e species", "e trapped" and "e difficult" that can be input into the Transformer module. Finally, the Transformer module with its 12 Transformer layers enriches the 8 vectors from the Embedding layer with the contextual information of each character in the text, yielding the vector representations "v how", "v what", "v should", "v pair", "v this", "v species", "v trapped" and "v difficult" corresponding to the characters of the text. This BERT model, with its 12 Transformer layers and 768 hidden units, has about 110 million parameters, so the width of the vector representation corresponding to each character of "how to cope with such difficulty" is 768.
In addition, on the basis of converting the characters of the target text into vector representations with the context information extraction model, a related word-vector model can be added to the context information extraction model to further improve its information extraction capability. As an example, a word2vec model may be added to the context information extraction model.
The application converts characters in the target text into vector representations based on the pronunciation prediction method provided by the above embodiments S201-S204; the target text comprises polyphones; extracting a vector representation of a target polyphone and a vector representation of an associated character of the target polyphone from the vector representations of the plurality of characters obtained through conversion; the target polyphones are one of the polyphones in the target text; calling a pronunciation prediction model corresponding to the target polyphones in the pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones; based on the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones, pronunciation of the target polyphones in the target text is predicted by using a pronunciation prediction model corresponding to the target polyphones.
It can be seen that the method extracts the vector representation of the target polyphone and the vector representations of its associated characters from the vector representations of the target text containing the polyphone, thereby retaining the context information of the target polyphone in the target text, and calls the pronunciation prediction model corresponding to the target polyphone to predict its pronunciation in the target text. Since different pronunciation prediction models are each dedicated to predicting the pronunciation of their own polyphone, the model corresponding to the target polyphone is trained specifically for that polyphone, and improving its prediction capability does not affect the prediction capability of the other models. The pronunciation predictions of different polyphones are therefore decoupled, the situation in which a single unified model's prediction capabilities for different polyphones interfere with one another is avoided, and the pronunciations of different polyphones in a text can be predicted effectively, thereby improving the accuracy of polyphone pronunciation prediction.
Next, still from the perspective of the terminal device, another pronunciation prediction method provided by an embodiment of the present application is described in detail below.
Referring to fig. 5, a flowchart of another pronunciation prediction method according to an embodiment of the present application is shown. Referring to fig. 5, the pronunciation prediction method includes the following steps:
S501: converting characters in the target text into vector representations; the target text includes polyphones.
The specific implementation of S501 may refer to the specific implementation of S201, and will not be described herein.
S502: determining the associated characters of the target polyphone in the target text according to the character position of the target polyphone in the target text and a preset window width parameter.
In the embodiment of the present application, the preset window width parameter refers to the preset total number of characters to be extracted, that is, the number of characters, comprising the target polyphone together with its preceding and following characters, that need to be extracted from the target text. The preset window width parameter may be set according to the length of the target text, or may be set to a fixed value, which is not limited in the present application. The character position refers to which character of the target text the target polyphone is. As an example, the target text is "seamless in the world combined with a simple and stereoscopic modeling, so that the overall sense of the side is very natural", the target polyphone is "obtained", and the character position of "obtained" is the 21st character counted from front to back in the target text. The character position may also be counted from back to front in the target text, which is not limited in the present application.
In addition, considering that the length of the target text is not fixed, when the associated characters of the target polyphone are determined according to the preset window width parameter, the length of the target text may be smaller than the preset window width parameter. In this case, the missing positions relative to the preset window width parameter may be filled so that the length of the target text matches the preset window width parameter. As an example, the missing positions of the target text may be padded with preset padding characters, which is not limited in the present application.
As an example, referring to fig. 6a, which is a schematic diagram of determining the associated characters of a target polyphone in the target text according to an embodiment of the present application. Referring to fig. 6a, the target text is "how to cope with such difficulty", the target polyphone is "should", the preset window width parameter is 12, and the number of characters of the target text is 8, so the length of the vector representation sequence of "how to cope with such difficulty" output by the context information extraction model is also 8, which is smaller than the preset window width parameter. In this case, the vector representation sequence corresponding to the target text can be padded to length 12 with 4 padding tokens, which may specifically be "vNull", so that the sequence output by the context information extraction model has length 12 and matches the preset window width parameter. The associated characters of the target polyphone determined according to the preset window width parameter are then the 4 characters before and the 7 characters after "should". A sketch of this windowing and padding procedure is given below.
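The following is a minimal sketch of the windowing just described, assuming, as in the example above, a window width of 12 with 4 positions before and 7 positions after the target polyphone, and zero vectors standing in for the "vNull" padding token; all function and variable names are illustrative.

```python
# Minimal sketch: assembling the associated-character window around a target
# polyphone, padding positions that fall outside the text.
import numpy as np

def extract_window(char_vectors, target_pos, before=4, after=7, dim=768):
    """char_vectors: [text_len, dim] array of contextual character vectors."""
    v_null = np.zeros(dim)  # assumed stand-in for the "vNull" padding token
    window = []
    for pos in range(target_pos - before, target_pos + after + 1):
        if 0 <= pos < len(char_vectors):
            window.append(char_vectors[pos])
        else:
            window.append(v_null)  # pad positions outside the text
    return np.stack(window)  # shape: [before + 1 + after, dim]

# 8-character text, target polyphone "should" at character position 2.
vectors = np.random.randn(8, 768)
window = extract_window(vectors, target_pos=2)
print(window.shape)  # (12, 768): 4 of the 12 rows are padding
```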
In addition, based on the above example, the target text further includes the polyphone "kind" and the polyphone "difficult". Therefore, when predicting the pronunciations of the polyphones in "how to cope with such difficulty", the pronunciation prediction models corresponding to the three polyphones "should", "kind" and "difficult" can be called respectively; the pronunciation prediction models corresponding to different polyphones may be called simultaneously or sequentially, which is not limited in the present application.
As an example, referring to fig. 6b, which is a schematic diagram of predicting the pronunciations of multiple polyphones in the target text according to an embodiment of the present application. As shown in fig. 6b, the target text is "how to cope with such difficulty" and includes the 3 polyphones "should", "kind" and "difficult". The preset window width parameter is set to 5, so the associated characters of each polyphone determined according to the preset window width parameter are the two characters before and the two characters after that polyphone. The vector representation of each target polyphone and the vector representations of its associated characters are input into the corresponding pronunciation prediction model to obtain the pronunciation of that polyphone in the target text, where the correct pronunciation of each polyphone in the target text may be marked by underlining. In some other implementations, the marking includes, but is not limited to, retaining only the correct pronunciation or marking the correct pronunciation with a color, which is not limited in the present application.
S503: extracting the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones from the vector representations of the plurality of characters obtained through conversion.
According to the character position of the target polyphone in the target text and the character positions of its associated characters determined in the above steps, the vector representation of the target polyphone and the vector representations of its associated characters are extracted from the vector representations obtained by converting the characters of the target text, so as to be input into the pronunciation prediction model corresponding to the target polyphone for pronunciation prediction. In this way, when the target text is very long, only the partial text containing the target polyphone and its associated characters needs to be retained, and the pronunciation prediction model performs feature extraction only on this retained part; characters weakly related to the target polyphone need not be processed, which saves computing resources and speeds up the prediction of the target polyphone's pronunciation.
S504: calling the pronunciation prediction model corresponding to the target polyphone among the plurality of pronunciation prediction models; the plurality of pronunciation prediction models are respectively used for predicting the pronunciations of their corresponding polyphones.
The specific implementation of S504 may refer to the specific implementation of S203, and will not be described herein.
S505: taking the vector representation of the target polyphone and the vector representations of the associated characters of the target polyphone as the input of the pronunciation prediction model corresponding to the target polyphone, and performing convolution operations respectively, through the plurality of convolution modules, based on the input content of the pronunciation prediction model corresponding to the target polyphone.
The pronunciation prediction model corresponding to the target polyphone includes a plurality of convolution modules, and the plurality of convolution modules have different window widths. In the embodiment of the present application, a convolution module refers to a component of a convolutional neural network used to extract features from input data; that is, the convolution modules perform feature extraction on the vector representation of the target polyphone and the vector representations of its associated characters.
If the convolution modules extracted features from the vector representations of the target polyphone and its associated characters with only a single fixed window width, the extracted feature information would be limited in scope, which would affect the pronunciation prediction result for the target polyphone. Therefore, the pronunciation prediction model in the embodiment of the present application includes a plurality of convolution modules for extracting feature information, and the window widths of these convolution modules differ. A convolution module with a smaller window width focuses on extracting feature information over a small range of the input data, so convolution modules with different window widths can extract feature information over different ranges, yielding richer feature information and further improving the accuracy of the pronunciation prediction for the target polyphone.
The input data of the pronunciation prediction model, namely the vector representation of the target polyphone and the vector representations of its associated characters obtained according to the preset window width parameter, has a fixed width, whereas the pronunciation prediction model includes a plurality of convolution modules with different window widths. If this fixed-width input were fed to all convolution modules directly, a convolution module with a small window width would receive input of an oversized length containing many associated characters; since such a module extracts only small-range feature information, its output would easily contain excessive feature information irrelevant to the target polyphone, which increases the consumption of computing resources and affects the efficiency of pronunciation prediction.
To solve the above problem, an input size may be configured for each convolution module according to its window width; the fixed-width data input to the pronunciation prediction model is then processed according to these input sizes to obtain input data better suited to each convolution module, and finally each convolution module performs its convolution operation on the processed data. Thus, the present application provides a possible implementation manner, and S505 may specifically include: configuring the input size of a convolution module according to the window width of the convolution module; processing the input content constructed from the vector representation of the target polyphone and the vector representations of its associated characters into input data matching the input size of the convolution module; and performing the convolution operation by the convolution module based on the input data matching its input size.
The input size of a convolution module refers to the size of that module's input data. The input data of the pronunciation prediction model, namely the vector representations of the target polyphone and its associated characters determined according to the preset window width parameter, has a fixed input size, but the plurality of convolution modules in the pronunciation prediction model have different window widths, so the size of the input data needs to be configured correspondingly to adapt to the window width of each convolution module.
As an example, the window width of a convolution module corresponds one-to-one with its configured input size: the larger the window width, the larger the configured input size, so the input size of a convolution module changes with its window width. As the window widths of the convolution modules increase in turn, their input sizes also increase in turn, so that different convolution modules can extract richer features from inputs of different sizes; at the same time, the input size of each convolution module is not excessively large relative to its window width, which saves computing resources and helps to obtain a more accurate pronunciation prediction result quickly.
As an example, referring to fig. 7, which is a schematic diagram of processing the input data of the pronunciation prediction model into input data matching the input sizes of the convolution modules according to an embodiment of the present application. Referring to fig. 7, the target text is "how to cope with such difficulty", the target polyphone is "should", and the preset window width parameter is 12. Convolution modules 1 to 4 in the pronunciation prediction model corresponding to the target polyphone "should" are 4 one-dimensional convolution kernels with window widths of 1, 2, 3 and 5 respectively, each with 32 output channels, and the input sizes of convolution modules 1 to 4 are configured as 3, 5, 7 and 11 respectively. The vector representation of "should" and the vector representations of its associated characters determined according to the preset window width parameter are then processed into input data matching the input sizes of convolution modules 1 to 4 respectively. Specifically, since the width of the vector representations output by the BERT model in the above example is 768, that is, the vector representation corresponding to each character in "how to cope with such difficulty" has a width of 768, the matrix formed by the vector representation of "should" and the vector representations of its associated characters has size [12, 768]. The input size of convolution module 1 is [3, 768], retaining the vector representations of the character before and the character after the target polyphone "should" in the target text; the input size of convolution module 2 is [5, 768], retaining the vector representations of the two characters before and the two characters after "should"; the input size of convolution module 3 is [7, 768], retaining the vector representations of the three characters before and the three characters after "should"; the input size of convolution module 4 is [11, 768], retaining the vector representations of the three characters before and the seven characters after "should" in the target text. For a one-dimensional convolution kernel of width m applied to an input of size n with stride 1, the output size is n - m + 1; with 32 output channels each, the output sizes of convolution modules 1 to 4 are therefore [3, 32], [4, 32], [5, 32] and [7, 32] respectively.
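To illustrate the input-size configuration concretely, the following is a minimal sketch that crops the fixed [12, 768] input down to each configured input size; the input sizes follow the example above, while the exact cropping policy, centering the target where possible, is an assumption and may differ from the figure at the window boundaries.

```python
# Minimal sketch: cropping the fixed-width window to each convolution
# module's configured input size, keeping the target polyphone centered
# where possible (assumed policy).
import torch

def crop_to_input_size(window, target_idx, input_size):
    """window: [12, 768] tensor; target_idx: row index of the target polyphone."""
    half = (input_size - 1) // 2
    start = max(0, min(target_idx - half, window.size(0) - input_size))
    return window[start:start + input_size]  # shape: [input_size, 768]

window = torch.randn(12, 768)  # "should" plus its associated characters
for size in (3, 5, 7, 11):     # configured input sizes of modules 1 to 4
    cropped = crop_to_input_size(window, target_idx=4, input_size=size)
    print(cropped.shape)        # [3, 768], [5, 768], [7, 768], [11, 768]
```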
As a processing method for the input data of the convolution modules, the target polyphone may first be placed in the middle position of the input-size length, with the vector representations of the same number of characters retained before and after it, so that the vector representation corresponding to the target polyphone is always located in the middle of the input data, which facilitates marking the specific position of the target polyphone. In some other possible implementations, the position of the target polyphone in the target text is not necessarily the middle of the text, and the number of characters retained before and after it may be chosen according to the number of characters actually preceding and following the target polyphone in the target text; in that case, however, the specific position of the target polyphone's vector representation in the processed input data needs to be marked.
In addition, it should be noted that in some other possible implementations, the convolution modules in the pronunciation prediction model may be replaced with an attention mechanism model, a long short-term memory network, or the like, to process the input data of the pronunciation prediction model and obtain the pronunciation of the target polyphone in the target text.
S506: performing splicing processing on the convolution operation results output by the plurality of convolution modules to obtain a spliced vector representation.
The convolution operation results output by the convolution modules contain feature information extracted over different ranges from inputs of different sizes. Splicing these results together allows the feature information of the target polyphone over the various ranges to be fully considered: the spliced vector representation contains feature information of every range, extracting the semantic information of the target polyphone and its associated characters at different granularities, which further helps to obtain a more accurate polyphone pronunciation prediction. As an example, based on the above example, the output sizes of convolution modules 1 to 4 are [3, 32], [4, 32], [5, 32] and [7, 32] respectively, so the spliced width is (3+4+5+7)×32, and the spliced vector representation is a one-dimensional vector of width 608.
S507: predicting the pronunciation of the target polyphone in the target text according to the spliced vector representation.
The spliced vector representation obtained in the above step combines the semantic information of the associated characters of the target polyphone at different granularities; since it contains both the semantic information of the target polyphone itself and the semantic information of its associated characters at different granularities, an accurate pronunciation of the target polyphone in the target text can be obtained from it.
As an example, referring to fig. 8, which is a schematic structural diagram of a pronunciation prediction model according to an embodiment of the present application. Referring to fig. 8, the pronunciation prediction model includes a convolution layer, a fully connected layer and an output layer. The convolution layer includes convolution modules 1 to 4, which perform convolution operations on their respective input data to obtain the corresponding convolution operation results 1 to 4; the convolution operation results 1, 2, 3 and 4 are then spliced, and finally the spliced vector representation is classified through the fully connected layer to obtain the pronunciation of the target polyphone in the target text.
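Consistent with this structure, the following is a minimal PyTorch sketch of such a per-polyphone pronunciation prediction model; the kernel widths, input sizes, 32 channels and spliced width of 608 follow the examples above, while the number of candidate pronunciations (two) and the flattening of the spliced features into the fully connected layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolyphoneModel(nn.Module):
    """One model per polyphone: 1-D convolutions with different window
    widths over contextual character vectors, spliced and classified."""

    def __init__(self, dim=768, channels=32, num_pron=2):
        super().__init__()
        # (kernel width, configured input size) pairs from fig. 7's example.
        self.specs = [(1, 3), (2, 5), (3, 7), (5, 11)]
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, channels, kernel_size=w) for w, _ in self.specs
        )
        # Spliced width: (3 + 4 + 5 + 7) positions x 32 channels = 608.
        self.fc = nn.Linear(608, num_pron)

    def forward(self, window, target_idx=4):  # window: [batch, 12, dim]
        feats = []
        for (w, size), conv in zip(self.specs, self.convs):
            # Crop to the module's input size, keeping the target polyphone
            # centered where possible (assumed policy).
            start = max(0, min(target_idx - (size - 1) // 2,
                               window.size(1) - size))
            x = window[:, start:start + size]   # [batch, size, dim]
            x = conv(x.transpose(1, 2))         # [batch, 32, size - w + 1]
            feats.append(x.flatten(1))
        spliced = torch.cat(feats, dim=1)       # [batch, 608]
        return self.fc(spliced)                 # pronunciation logits

model = PolyphoneModel()
logits = model(torch.randn(4, 12, 768))         # 4 windows of [12, 768]
print(logits.shape)                             # torch.Size([4, 2])
```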
Based on the pronunciation prediction method provided by the above embodiments S501-S507, the present application determines the associated characters of the target polyphone in the target text according to the preset window width parameter, retaining the contextual information of the target polyphone in the target text; calls the pronunciation prediction model corresponding to the target polyphone, where the model includes convolution modules with different window widths whose input sizes are configured according to those window widths, extracting feature information of the target polyphone over different ranges and obtaining the semantic information of the target polyphone and its associated characters at different granularities; and, when the pronunciation prediction capability of the model corresponding to the target polyphone is improved, the prediction capability of the other pronunciation prediction models is not affected, so that the pronunciation predictions of different polyphones are decoupled, the situation in which a unified model's prediction capabilities for different polyphones interfere with one another is avoided, and the pronunciations of different polyphones in a text can be predicted effectively, improving the accuracy of polyphone pronunciation prediction.
Based on the pronunciation prediction method provided by the embodiment, the application also correspondingly provides a pronunciation prediction device, and the pronunciation prediction device provided by the embodiment of the application is specifically described below.
Referring to fig. 9, the structure of a pronunciation prediction device according to an embodiment of the present application is shown. As shown in fig. 9, the pronunciation prediction device specifically includes:
a character conversion unit 91 for converting characters in the target text into a vector representation; the target text comprises polyphones;
a vector extraction unit 92, configured to extract a vector representation of a target polyphone and a vector representation of an associated character of the target polyphone from the vector representations of the plurality of characters obtained by conversion; the target polyphones are one of the polyphones in the target text;
a model calling unit 93 for calling a pronunciation prediction model corresponding to the target polyphones among a plurality of pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones;
the pronunciation prediction unit 94 is configured to predict a pronunciation of the target polyphone in the target text using a pronunciation prediction model corresponding to the target polyphone based on the vector representation of the target polyphone and the vector representation of the associated character of the target polyphone.
Alternatively, the vector extraction unit 92 specifically includes:
the associated character determining subunit is used for determining the associated characters of the target polyphones in the target text according to the character positions of the target polyphones in the target text and the preset window width parameters;
and the vector extraction subunit is used for extracting the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones from the vector representations of the plurality of characters obtained through conversion.
Optionally, the pronunciation prediction model corresponding to the target polyphone includes a plurality of convolution modules, where the plurality of convolution modules have different window widths;
accordingly, the pronunciation prediction unit 94 specifically includes:
a convolution operation subunit, configured to perform convolution operation by using the vector representation of the target polyphone and the vector representation of the associated character of the target polyphone as the input of the pronunciation prediction model corresponding to the target polyphone, based on the input content of the pronunciation prediction model corresponding to the target polyphone, through a plurality of convolution modules;
the vector splicing subunit is used for splicing the convolution operation results output by the convolution modules to obtain a spliced vector representation;
and the pronunciation prediction subunit is used for predicting the pronunciation of the target polyphones in the target text according to the spliced vector representation.
Optionally, the convolution operation subunit is specifically configured to configure an input size of the convolution module according to a window width of the convolution module; processing input content constructed from a vector representation of the target polyphones and a vector representation of associated characters of the target polyphones into input data matching an input size of the convolution module; and performing convolution operation by the convolution module based on the input data matched with the input size of the convolution module.
Alternatively, the character conversion unit 91 is specifically configured to convert characters in the target text into a vector representation by the context information extraction model; the converted vector representation contains semantic information of the character itself and context information of the character in the context of the target text.
Optionally, the pronunciation prediction device may further include:
the character matching unit is used for matching the characters in the target text with the polyphone library; the polyphone library comprises a plurality of polyphones;
the multi-tone character determining unit is used for determining characters successfully matched with the multi-tone character library in the target text as multi-tone characters;
and the position recording unit is used for recording the character positions of the polyphones in the target text.
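As an illustration of how these three units might cooperate, the following is a minimal sketch of locating polyphones in the target text by matching against a polyphone library; the library contents shown are illustrative assumptions.

```python
# Minimal sketch: matching characters against a polyphone library and
# recording the character positions of the matches.
POLYPHONE_LIBRARY = {"应", "种", "难"}  # e.g. "should", "kind", "difficult"

def find_polyphones(text):
    """Return (character, position) pairs for polyphones in the text."""
    return [(ch, i) for i, ch in enumerate(text) if ch in POLYPHONE_LIBRARY]

print(find_polyphones("如何应对这种困难"))  # [('应', 2), ('种', 5), ('难', 7)]
```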
Optionally, the pronunciation prediction device may further include:
A sample acquisition unit configured to acquire a plurality of text samples; the text sample includes a target polyphone;
a character conversion unit 91, which is further configured to convert characters in the text sample into a vector representation;
the training data construction unit is used for constructing training data corresponding to a target polyphone according to vector representation obtained by character conversion in a text sample, the position information of the target polyphone in the text sample and the pronunciation label of the target polyphone; the pronunciation tag is the correct pronunciation of the target polyphones in the text sample;
the model training unit is used for training a model to be trained corresponding to the target polyphones through training data corresponding to the target polyphones;
and the model obtaining unit is used for ending training to obtain the pronunciation prediction model corresponding to the target polyphones when the training cut-off condition is met.
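As an illustration of these training-related units, the following is a minimal sketch of training the model for one target polyphone, reusing the PolyphoneModel from the sketch above; the dataset shapes, the optimizer and the fixed-epoch cut-off condition are assumptions, since the application does not fix them.

```python
# Minimal sketch: training the pronunciation prediction model for a single
# target polyphone from windowed training data and pronunciation tags.
import torch
import torch.nn as nn

def train_polyphone_model(model, windows, labels, epochs=10, lr=1e-3):
    """windows: [N, 12, 768] context windows around the target polyphone;
    labels: [N] indices of the correct pronunciation (pronunciation tags)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):  # assumed cut-off condition: fixed epoch count
        optimizer.zero_grad()
        loss = loss_fn(model(windows), labels)
        loss.backward()
        optimizer.step()
    return model

model = train_polyphone_model(PolyphoneModel(),
                              torch.randn(64, 12, 768),
                              torch.randint(0, 2, (64,)))
```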
Optionally, the sample acquiring unit is specifically configured to acquire a plurality of text samples associated with the target scene according to a requirement for predicting pronunciation of the target polyphones in the target scene;
the model obtaining unit is specifically used for ending training to obtain a pronunciation prediction model corresponding to the target polyphones facing the target scene when the training cut-off condition is met;
The model calling unit 93 is specifically configured to call a pronunciation prediction model corresponding to a target polyphone facing the target scene when determining that the target text is a text associated with the target scene.
Optionally, the pronunciation prediction device may further include:
and the parameter adjustment unit is used for carrying out parameter adjustment on the pronunciation prediction model corresponding to the target polyphone according to the prediction accuracy improvement requirement on the target polyphone.
The structure of the pronunciation prediction device will be described below in terms of a server form and a terminal device form, respectively.
Fig. 10 is a schematic diagram of a server structure provided in an embodiment of the present application. The server 900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 922 (e.g., one or more processors), memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may be transitory or persistent storage. The programs stored in the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 922 may be arranged to communicate with the storage medium 930 and to execute, on the server 900, the series of instruction operations in the storage medium 930.
The server 900 may also include one or more power sources 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, e.g., Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The CPU 922 is configured to perform the following steps:
converting characters in the target text into vector representations; the target text includes polyphones;
extracting a vector representation of the target polyphones and a vector representation of associated characters of the target polyphones from the vector representations of the plurality of characters obtained through conversion; the target polyphones are one of the polyphones in the target text;
calling a pronunciation prediction model corresponding to the target polyphones in the pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones;
based on the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones, pronunciation of the target polyphones in the target text is predicted by using a pronunciation prediction model corresponding to the target polyphones.
The embodiment of the present application further provides another pronunciation prediction device. As shown in fig. 11, for convenience of explanation, only the parts related to the embodiment of the present application are shown; for specific technical details not disclosed, please refer to the method part of the embodiments of the present application. The terminal may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sales (POS) terminal, a vehicle-mounted computer and the like, taking a mobile phone as an example:
Fig. 11 is a block diagram showing part of the structure of a mobile phone related to a terminal provided by an embodiment of the present application. Referring to fig. 11, the mobile phone includes: radio frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. Those skilled in the art will appreciate that the handset configuration shown in fig. 11 does not limit the handset, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 11:
The RF circuit 1010 may be used for receiving and transmitting signals during a message or a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 1080 for processing, and it sends uplink data to the base station. Generally, the RF circuit 1010 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
The memory 1020 may be used to store software programs and modules that the processor 1080 performs various functional applications and data processing of the handset by executing the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state memory device.
The input unit 1030 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch panel 1031 or thereabout using any suitable object or accessory such as a finger, stylus, etc.), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch panel 1031 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 1080 and can receive commands from the processor 1080 and execute them. Further, the touch panel 1031 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1030 may include other input devices 1032 in addition to the touch panel 1031. In particular, other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a track ball, a mouse, a joystick, etc.
The display unit 1040 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 1040 may include a display panel 1041; optionally, the display panel 1041 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 1031 may overlay the display panel 1041; when the touch panel 1031 detects a touch operation on or near it, the operation is passed to the processor 1080 to determine the type of touch event, and the processor 1080 then provides a corresponding visual output on the display panel 1041 according to the type of touch event. Although in fig. 11 the touch panel 1031 and the display panel 1041 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1031 and the display panel 1041 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1050, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the handset are not described in detail herein.
The audio circuit 1060, speaker 1061, and microphone 1062 may provide an audio interface between the user and the mobile phone. The audio circuit 1060 may transmit the electrical signal converted from received audio data to the speaker 1061, which converts it into an audio signal for output; conversely, the microphone 1062 converts collected sound signals into electrical signals, which the audio circuit 1060 receives and converts into audio data; the audio data is then processed by the processor 1080 and sent, for example, to another mobile phone via the RF circuit 1010, or output to the memory 1020 for further processing.
WiFi belongs to a short-distance wireless transmission technology. Through the WiFi module 1070, the mobile phone can help the user send and receive emails, browse webpages, access streaming media and the like, providing the user with wireless broadband Internet access. Although fig. 11 shows the WiFi module 1070, it is understood that it is not an essential component of the handset and can be omitted as required without changing the essence of the invention.
The processor 1080 is the control center of the mobile phone; it connects the various parts of the entire handset using various interfaces and lines, and performs the various functions and data processing of the handset by running or executing the software programs and/or modules stored in the memory 1020 and invoking the data stored in the memory 1020, thereby monitoring the handset as a whole. Optionally, the processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which mainly handles the operating system, user interfaces, applications and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 1080.
The handset further includes a power source 1090 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 1080 by a power management system, such as to provide for managing charging, discharging, and power consumption by the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In an embodiment of the present application, the processor 1080 included in the terminal further has the following functions:
converting characters in the target text into vector representations; the target text includes polyphones;
extracting a vector representation of the target polyphones and a vector representation of associated characters of the target polyphones from the vector representations of the plurality of characters obtained through conversion; the target polyphones are one of the polyphones in the target text;
calling a pronunciation prediction model corresponding to the target polyphones in the pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones;
based on the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones, pronunciation of the target polyphones in the target text is predicted by using a pronunciation prediction model corresponding to the target polyphones.
The embodiments of the present application also provide a computer readable storage medium storing program code for executing any one of the implementations of the pronunciation prediction method described in the foregoing embodiments.
The embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any one of the foregoing implementations of a pronunciation prediction method as described in the various embodiments.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working processes of the above-described system and apparatus may refer to corresponding processes in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the system is merely a logical function division, and there may be additional divisions of a practical implementation, e.g., multiple systems may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The system described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: u disk, mobile hard disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A pronunciation prediction method, comprising:
converting characters in the target text into vector representations; the target text comprises polyphones;
extracting a vector representation of a target polyphone and a vector representation of an associated character of the target polyphone from the vector representations of the plurality of characters obtained through conversion; the target polyphones are one of the polyphones in the target text;
calling a pronunciation prediction model corresponding to the target polyphones in the pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones;
based on the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones, pronunciation of the target polyphones in the target text is predicted by using a pronunciation prediction model corresponding to the target polyphones.
2. The pronunciation prediction method according to claim 1, wherein extracting the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones from the vector representations of the plurality of characters obtained by conversion specifically comprises:
determining associated characters of the target polyphones in the target text according to character positions of the target polyphones in the target text and preset window width parameters;
extracting the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones from the vector representations of the plurality of characters obtained through conversion.
3. The pronunciation prediction method of claim 1, wherein the pronunciation prediction model corresponding to the target polyphones comprises a plurality of convolution modules, the plurality of convolution modules having different window widths; the method for predicting pronunciation of the target polyphones in the target text by using a pronunciation prediction model corresponding to the target polyphones based on the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones specifically comprises the following steps:
taking the vector representation of the target polyphones and the vector representation of the associated characters of the target polyphones as the input of the pronunciation prediction model corresponding to the target polyphones, and respectively carrying out convolution operation based on the input content of the pronunciation prediction model corresponding to the target polyphones through the plurality of convolution modules;
Splicing the convolution operation results output by the convolution modules to obtain a spliced vector representation;
and predicting the pronunciation of the target polyphones in the target text according to the spliced vector representation.
4. The pronunciation prediction method according to claim 3, wherein the performing, by the plurality of convolution modules, convolution operation based on input contents of a pronunciation prediction model corresponding to the target polyphones, respectively, specifically includes:
configuring the input size of the convolution module according to the window width of the convolution module;
processing input content constructed from a vector representation of the target polyphone and a vector representation of an associated character of the target polyphone into input data matching an input size of the convolution module;
and performing convolution operation by the convolution module based on the input data matched with the input size of the convolution module.
5. The pronunciation prediction method as claimed in claim 1, wherein the converting the characters in the target text into vector representations specifically comprises:
converting characters in the target text into vector representations through a context information extraction model; the converted vector representation contains semantic information of the character itself and context information of the character in the context of the target text.
6. The pronunciation prediction method of claim 1, further comprising:
matching the characters in the target text with a polyphone library; the polyphone library comprises a plurality of polyphones;
determining characters successfully matched with the polyphone library in the target text as polyphones;
recording character positions of polyphones in the target text.
7. The pronunciation prediction method of claim 1, wherein the pronunciation prediction model corresponding to the target polyphone is obtained through training by:
acquiring a plurality of text samples; the text sample includes the target polyphones;
converting characters in the text sample into a vector representation;
constructing training data corresponding to the target polyphones according to vector representations obtained by converting characters in the text samples, the position information of the target polyphones in the text samples and pronunciation labels of the target polyphones; the pronunciation tag is the correct pronunciation of the target polyphones in the text sample;
training a model to be trained corresponding to the target polyphones through a plurality of training data corresponding to the target polyphones;
And when the training cut-off condition is met, ending training to obtain the pronunciation prediction model corresponding to the target polyphone.
8. The pronunciation prediction method of claim 7, wherein the obtaining a plurality of text samples specifically comprises:
according to the requirement for predicting the pronunciation of the target polyphones in a target scene, acquiring a plurality of text samples associated with the target scene;
wherein the ending training to obtain the pronunciation prediction model corresponding to the target polyphone when the training cut-off condition is met specifically comprises:
when the training cut-off condition is met, ending training to obtain a pronunciation prediction model corresponding to the target polyphones facing the target scene;
wherein the calling a pronunciation prediction model corresponding to the target polyphone among a plurality of pronunciation prediction models specifically comprises:
and when the target text is determined to be the text associated with the target scene, calling a pronunciation prediction model corresponding to the target polyphone facing the target scene.
9. The pronunciation prediction method of claim 1, further comprising:
and according to the prediction accuracy improvement requirement of the target polyphones, carrying out parameter adjustment on the pronunciation prediction model corresponding to the target polyphones.
10. A pronunciation prediction device, comprising:
a character conversion unit for converting characters in the target text into a vector representation; the target text comprises polyphones;
a vector extraction unit, configured to extract a vector representation of a target polyphone and a vector representation of an associated character of the target polyphone from vector representations of the plurality of characters obtained by conversion; the target polyphones are one of the polyphones in the target text;
the model calling unit is used for calling a pronunciation prediction model corresponding to the target polyphones in the pronunciation prediction models; the pronunciation prediction models are respectively used for predicting the pronunciation of the corresponding polyphones;
and the pronunciation prediction unit is used for predicting the pronunciation of the target polyphone in the target text by using a pronunciation prediction model corresponding to the target polyphone based on the vector representation of the target polyphone and the vector representation of the associated character of the target polyphone.
11. A pronunciation prediction device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
The processor is configured to perform the steps of the pronunciation prediction method according to any one of claims 1 to 9 according to instructions in the program code.
12. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a program code for performing the steps of the pronunciation prediction method according to any one of claims 1 to 9.
13. A computer program product comprising computer programs or instructions which, when executed by a pronunciation prediction device, implement the steps of the pronunciation prediction method of any one of claims 1 to 9.