CN113011135A - Arabic vowel recovery method, device, equipment and storage medium - Google Patents

Arabic vowel recovery method, device, equipment and storage medium Download PDF

Info

Publication number
CN113011135A
CN113011135A CN202110234392.8A
Authority
CN
China
Prior art keywords
text
vowel
character
processed
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110234392.8A
Other languages
Chinese (zh)
Inventor
储银雪
高丽
祖漪清
江源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202110234392.8A priority Critical patent/CN113011135A/en
Publication of CN113011135A publication Critical patent/CN113011135A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, a device, equipment and a storage medium for recovering Arabic vowels. A language model is trained using Arabic training text without vowels; based on massive Arabic training texts, a language model with good generalization and representation capability for Arabic words can be obtained. Then, for an Arabic text to be processed, the feature representation of each word in the text is determined using the language model, the text feature of each character in the text is obtained, and the vowel labeling result corresponding to each character is determined based on the text feature of each character and the feature representation of the word to which each character belongs. When restoring vowels, the method refers both to the text feature of each character and to the feature representation of the word to which the character belongs, so the reference information is richer; at the same time, the language model's strong generalization capability for Arabic words improves the accuracy of the vowel predicted for each character.

Description

Arabic vowel recovery method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recovering Arabic vowels.
Background
Arabic has 28 consonant characters and 8 vowel characters. In conventional writing, except in special cases, the vowel information is generally omitted and only the consonant information is retained. However, reading aloud requires the vowel information to be realized, and the combination of a consonant with different vowels affects the semantics of a word. Therefore, a reader often needs to supply the corresponding vowels based on his or her own judgment after reading the consonant characters, and pronounce the consonants together with the vowels.
Because of the particularity of Arabic, there is no clear and detailed rule for adding vowels. After different vowel characters are added to the same consonant word, the meaning of the word may change, or the ending vowel of the same word may change with the word's position in the sentence, so the same consonant word may have multiple vowel recovery forms in different sentences. For example, for the word
Figure BDA0002960089920000011
possible vowel-marked forms include
Figure BDA0002960089920000012
and the like.
Since conventional Arabic text does not contain vowel characters, synthesizing speech directly from it degrades the synthesis result. Front-end processing is therefore needed before speech synthesis, namely recovering the vowel information of the Arabic text, so that the input at synthesis time is complete and accurate. It is thus necessary to provide an Arabic vowel restoration scheme to ensure the correctness and completeness of the semantics of Arabic text.
Disclosure of Invention
In view of the foregoing, the present application provides a method, an apparatus, a device and a storage medium for recovering Arabic vowels, so as to ensure the correctness and completeness of the semantics of Arabic texts. The specific scheme is as follows:
An Arabic vowel restoration method, comprising:
acquiring an Arabic text to be processed;
determining the feature representation of each word in the to-be-processed Arabic text by utilizing a pre-trained language model, wherein the language model is obtained by training on Arabic training text without vowel labels;
acquiring the text feature of each character in the to-be-processed Arabic text;
and determining a vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs.
Preferably, the training process of the language model includes:
obtaining Arabic training text without vowel labels;
randomly masking characters in the Arabic training text, and inputting the result into the language model;
and training the language model with the objective of predicting the masked characters in the Arabic training text.
Preferably, the language model is a masked language model based on the BERT structure.
Preferably, the Arabic training text comprises modern Arabic training text and/or classical Arabic training text.
Preferably, the determining the feature representation of each word in the to-be-processed Arabic text by using the pre-trained language model includes:
inputting the to-be-processed Arabic text into the language model to obtain the word vector feature of each word in the to-be-processed Arabic text output by the language model.
Preferably, the determining a vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs includes:
fusing the text feature of each character in the to-be-processed Arabic text with the feature representation of the word to which the character belongs, to obtain the fused feature representation of the to-be-processed Arabic text;
and determining a vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the fused feature representation of the to-be-processed Arabic text.
Preferably, the process of obtaining the text feature of each character in the to-be-processed Arabic text, and determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature and the feature representation of the word to which each character belongs, includes:
processing the to-be-processed Arabic text and the feature representation of each word therein by utilizing a pre-trained vowel recovery model, to obtain the vowel labeling result corresponding to each character in the to-be-processed Arabic text output by the vowel recovery model;
wherein the vowel recovery model is obtained by training with Arabic training text carrying vowel labeling results and the feature representation of each word in the Arabic training text as training data.
Preferably, the process of processing the to-be-processed Arabic text and the feature representation of each word therein by using the pre-trained vowel recovery model includes:
acquiring the text feature of each character in the to-be-processed Arabic text by utilizing the feature extraction layer of the vowel recovery model;
fusing the text feature of each character in the to-be-processed Arabic text with the feature representation of the word to which the character belongs by using the feature fusion layer of the vowel recovery model, to obtain the fused feature representation of the to-be-processed Arabic text;
and determining a vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the fused feature representation of the to-be-processed Arabic text by utilizing the classification layer of the vowel recovery model.
Preferably, the acquiring the text feature of each character in the to-be-processed Arabic text by using the feature extraction layer of the vowel recovery model includes:
acquiring the encoding feature of each character in the to-be-processed Arabic text by utilizing a first feature extraction layer of the vowel recovery model;
and acquiring the character features of each character in the to-be-processed Arabic text within a window of set length by utilizing a second feature extraction layer of the vowel recovery model, and combining the encoding feature and the character features into the text feature of the character.
Preferably, before the determining the feature representation of each word in the to-be-processed Arabic text by using the pre-trained language model, the method further comprises:
if it is detected that the to-be-processed Arabic text contains numeric symbols, converting the numeric symbols into Arabic pronunciation words, and labeling vowels on the characters of the converted Arabic pronunciation words other than the final character.
Preferably, the training process of the vowel recovery model includes:
training an initial vowel recovery model with classical Arabic training text carrying vowel labeling results and the feature representation of each word in the classical Arabic training text as training data;
and fine-tuning the initial vowel recovery model with modern Arabic training text carrying vowel labeling results and the feature representation of each word in the modern Arabic training text as training data, to obtain the final vowel recovery model.
Preferably, the process of acquiring the modern Arabic training text carrying vowel labeling results comprises:
acquiring a modern Arabic training text in which the characters other than final characters are labeled with vowels;
and converting the numeric symbols contained in the modern Arabic training text into Arabic pronunciation words, and labeling vowels on the characters of the converted Arabic pronunciation words other than the final character.
Preferably, the method further comprises:
correcting the determined vowel labeling result corresponding to each character in the to-be-processed Arabic text by referring to set vowel labeling rules.
Preferably, the vowel labeling rules comprise vowel labeling rules for a first type of words, and/or vowel labeling rules for a second type of words, and/or vowel labeling rules for a third type of words, wherein the first type of words are words with a fixed vowel labeling form, the second type of words are words whose final-character vowel is determined by the word's position in the sentence while the vowels of the other characters are fixed, and the third type of words are combined words formed by an article and a noun;
the correcting the determined vowel labeling result corresponding to each character in the to-be-processed Arabic text by referring to the set vowel labeling rules comprises:
detecting whether the to-be-processed Arabic text contains a first-type word, and if so, replacing the vowel labeling result of each character of the first-type word in the vowel labeling result of the to-be-processed Arabic text with the configured vowel labeling result corresponding to each character of the first-type word;
and/or,
detecting whether the to-be-processed Arabic text contains a second-type word, and if so, replacing the vowel labeling result of the corresponding characters of the second-type word in the vowel labeling result of the to-be-processed Arabic text with the configured vowel labeling result of each character of the second-type word other than the final character;
and/or,
detecting whether the to-be-processed Arabic text contains a combined word formed by an article and a noun, and if so, processing the to-be-processed Arabic text with reference to the vowel labeling rules of the first-type and second-type words, and, for the article, determining whether the combined word containing the article is in the middle of the sentence or at the beginning of the sentence;
if it is in the middle of the sentence, replacing the vowel labeling result of the corresponding characters of the article in the to-be-processed Arabic text with the configured first article vowel labeling form;
and if it is at the beginning of the sentence, replacing the vowel labeling result of the corresponding characters of the article in the to-be-processed Arabic text with the configured second article vowel labeling form.
An Arabic vowel restoration apparatus, comprising:
a text acquisition unit, configured to acquire an Arabic text to be processed;
a word feature representation determining unit, configured to determine the feature representation of each word in the to-be-processed Arabic text by utilizing a pre-trained language model, wherein the language model is obtained by training on Arabic training text without vowel labels;
a character text feature acquisition unit, configured to acquire the text feature of each character in the to-be-processed Arabic text;
and a vowel labeling result determining unit, configured to determine the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs.
An Arabic vowel restoration device comprising: a memory and a processor;
the memory is used for storing a program;
the processor is configured to execute the program to implement the steps of the Arabic vowel restoration method described above.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the Arabic vowel restoration method described above.
By means of the above technical scheme, the Arabic vowel recovery scheme of the present application can train a language model with a large amount of easily obtained Arabic training text without vowels. Because unlabeled Arabic training text can be easily obtained in large quantities, the language model can be trained on a massive corpus, so that the trained language model has a good generalization and representation capability for Arabic words. The feature representation of each word in the to-be-processed Arabic text is then determined using the language model, the text feature of each character in the to-be-processed Arabic text is obtained, and the vowel labeling result corresponding to each character is determined based on the text feature of each character and the feature representation of the word to which it belongs. When performing vowel recovery on the to-be-processed Arabic text, both the text feature of each character and the feature representation of the word to which the character belongs are consulted, so the reference information is richer; at the same time, the language model's strong generalization capability for Arabic words improves the accuracy of the vowel predicted for each character. The result is Arabic text labeled with vowels, ensuring the correctness and completeness of the semantics of the Arabic text.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart of an arabic vowel recovery method according to an embodiment of the present disclosure;
FIG. 2 is a processing schematic of a vowel recovery model;
FIG. 3 is a processing schematic of another vowel recovery model;
fig. 4 is a schematic structural diagram of an arabic vowel restoration apparatus disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an arabic vowel restoration apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides an Arabic vowel recovery scheme, which is suitable for performing vowel recovery on Arabic text consisting of consonant characters, namely, annotating the vowel character corresponding to each consonant character.
The scheme can be realized based on a terminal with data processing capacity, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
In order to recover Arabic vowels, the applicant first considered training a vowel recovery model with a large number of Arabic training texts carrying vowel labeling results, and then using the vowel recovery model to perform vowel recovery on the Arabic text to be processed.
However, in practice, Arabic training text data carrying vowel labeling results is scarce, and obtaining such text through manual expert annotation would consume substantial human resources. The present scheme therefore aims to provide an Arabic vowel recovery scheme for the low-resource scenario, where low resource means that Arabic training text data carrying vowel labeling results is scarce.
Next, as described in conjunction with fig. 1, the Arabic vowel recovery method may include the following steps:
Step S100, acquiring an Arabic text to be processed.
Specifically, the to-be-processed Arabic text is an Arabic text needing vowel restoration. The to-be-processed Arabic text is composed of consonant characters, or of consonant characters together with spaces, punctuation, non-Arabic characters and the like.
Step S110, determining the feature representation of each word in the to-be-processed Arabic text by utilizing a pre-trained language model.
The language model is obtained by training on Arabic training text without vowel labels. In the embodiment of the application, the language model can be pre-trained with a large batch of Arabic training texts without vowel labels, and such texts can be easily acquired, for example through web crawling or open-source data sets. The language model can therefore be trained on a large corpus of Arabic training text, so that the trained language model has a good generalization and representation capability for Arabic words; for the to-be-processed Arabic text, the feature representation of each word is then determined using this language model.
Step S120, acquiring the text feature of each character in the to-be-processed Arabic text.
Specifically, for the to-be-processed Arabic text, the text feature of each character can be acquired. The text feature represents the semantic information of a character in the to-be-processed Arabic text.
The text feature of each character may include the result of encoding the single character, or may include the result of encoding the character in conjunction with its context information.
The process of obtaining the text features of the characters can be implemented by a character embedding layer in a neural network structure, or by other methods.
It should be noted that the to-be-processed Arabic text is composed of consonant characters and other characters, such as spaces, punctuation, and the like. Therefore, in this step, corresponding text features can be determined for all the characters in the to-be-processed Arabic text.
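A character embedding layer as mentioned above can be sketched as a simple lookup table that maps every character, including spaces and punctuation, to a vector. The toy vocabulary, dimension and random initialization below are illustrative assumptions, not details from the patent.

```python
import numpy as np

# Hypothetical sketch of a character embedding layer: each character in the
# text (consonants, spaces, punctuation alike) is mapped to a vector via a
# lookup table. Vocabulary and dimension sizes are illustrative only.
rng = np.random.default_rng(0)
vocab = {ch: i for i, ch in enumerate([" ", ".", "b", "k", "t"])}  # toy vocab
embed_dim = 8
embedding_table = rng.normal(size=(len(vocab), embed_dim))

def char_features(text):
    """Return one embedding vector per character of the input text."""
    ids = [vocab[ch] for ch in text]
    return embedding_table[ids]

feats = char_features("kt b")
print(feats.shape)  # (4, 8): one feature vector per character
```

In a trained model the table entries would be learned parameters; context-aware text features would additionally pass these vectors through an encoder.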
Step S130, determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs.
Specifically, when performing vowel restoration on each character in the to-be-processed Arabic text, the text feature of each character is consulted, and at the same time the feature representation of the word to which each character belongs is used as an auxiliary reference feature to predict the vowel labeling result corresponding to each character in the to-be-processed Arabic text.
The feature representation of the word to which each character belongs can be obtained through step S110, that is, the feature representation of each word in the to-be-processed Arabic text determined by the language model is used. It will be appreciated that for multiple different characters belonging to the same word, the feature representation of the word to which they belong is the same.
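The alignment described above, where every character inherits the vector of its word, can be sketched as repeating each word-level vector once per character. The toy words and vectors are assumptions for illustration.

```python
import numpy as np

# Minimal sketch (assumed layout): align each word-level vector with the
# characters of that word, so every character receives the feature
# representation of the word it belongs to.
words = ["ktb", "al"]                                 # toy consonant-only words
word_vecs = np.arange(8, dtype=float).reshape(2, 4)   # one 4-dim vector per word

per_char = np.concatenate(
    [np.tile(vec, (len(w), 1)) for w, vec in zip(words, word_vecs)]
)
print(per_char.shape)  # (5, 4): 3 + 2 characters, each with its word's vector
```

Characters of the same word end up with identical rows, which is exactly the property the text notes.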
The Arabic vowel recovery method provided in the embodiments of the present application can train a language model with a large amount of easily obtained Arabic training text without vowel labels. Since such text can be easily obtained in large quantities, the language model can be trained on a massive corpus, so that the trained language model has a good generalization and representation capability for Arabic words. For the to-be-processed Arabic text, the language model is used to determine the feature representation of each word, the text feature of each character is obtained, and the vowel labeling result corresponding to each character is determined based on the text feature of each character and the feature representation of the word to which each character belongs. When performing vowel recovery, both the text feature of each character and the feature representation of the word to which the character belongs are consulted, so the reference information is richer; at the same time, the language model's strong generalization capability for Arabic words improves the accuracy of the vowel predicted for each character, and the resulting vowel-labeled Arabic text ensures the correctness and completeness of the semantics.
In some embodiments of the present application, the above language model is described in further detail.
To train the language model, training data may be collected first, i.e., Arabic training text without vowel labels may be obtained; this data may be crawled from the web or obtained from an existing open-source data set. After the Arabic training text is obtained, characters in the Arabic training text can be randomly masked and input into the language model, and the language model is trained with the objective of predicting the masked characters, i.e., a masked language model is trained.
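The random-masking step can be sketched as follows. The 15% masking rate and the `[MASK]` symbol follow common BERT convention and are assumptions here; the patent only states that characters are randomly masked.

```python
import random

# Sketch of the random-masking step: a fraction of the tokens (characters
# here) of an unlabeled training text is replaced by a [MASK] symbol, and
# the model is trained to predict the originals at those positions.
def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok          # prediction target for the model
        else:
            masked.append(tok)
    return masked, targets

tokens = list("ktb alqlm")            # toy consonant-only character sequence
masked, targets = mask_tokens(tokens)
print(len(masked) == len(tokens))     # masking preserves sequence length
```

The training loss is then computed only at the masked positions, comparing the model's predictions against `targets`.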
Considering that the BERT model performs excellently on natural language processing tasks and generalizes well to downstream tasks, the BERT model may be selected as the language model in this embodiment. The BERT model mainly comprises an embedding layer and a multi-layer bidirectional Transformer structure.
For an input Arabic training text, after passing through the language model, each word in the training text yields a word vector feature (word embedding), which is used as the feature representation of the word. Optionally, the word embedding may be 768-dimensional or of another dimension; the specific dimension can be adjusted according to the network structure of the language model.
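One common way to derive a per-word vector from a model that emits one hidden vector per character or subword is to pool (for example, average) the vectors of the word's pieces. The pooling choice and the toy numbers below are assumptions for illustration; the patent does not prescribe how the word embedding is pooled.

```python
import numpy as np

# Hedged sketch: assume the language model has produced one hidden vector
# per character; average the vectors inside each word's character span to
# obtain that word's feature representation.
hidden = np.arange(24, dtype=float).reshape(6, 4)  # 6 characters, dim 4
word_spans = [(0, 3), (3, 6)]                      # character spans of 2 words

word_embeddings = np.stack([hidden[s:e].mean(axis=0) for s, e in word_spans])
print(word_embeddings.shape)  # (2, 4): one vector per word
```

With a real BERT-style model the hidden dimension would typically be 768, matching the text above.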
It is further noted that the obtained Arabic training text without vowel labels may be one or both of modern Arabic training text and classical Arabic training text.
Classical Arabic is the written Arabic from the Umayyad dynasty to the Abbasid dynasty (7th to 9th century AD). Modern Arabic is its immediate descendant and is used throughout the Arabic-speaking world today for writing and formal speech. The modern and classical languages differ yet share similarities: the vocabulary and style of modern Arabic differ from classical Arabic, but the morphology and syntax are basically unchanged. The representative classic of classical Arabic is the Quran.
In this embodiment, considering the similarities and differences between modern Arabic and classical Arabic, and since the vowel recovery task mainly targets modern Arabic, the modern Arabic training text and the classical Arabic training text can both be used as Arabic training text for training the language model, so that the trained language model learns both the similarities and the differences between modern and classical Arabic and has a better generalization and representation capability for Arabic words.
Further, for step S130 of the foregoing embodiment, the process of determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character and the feature representation of the word to which each character belongs is introduced.
In this embodiment, when determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text, the text feature of each character and the feature representation of the word to which each character belongs are both consulted. Specifically, the text feature of each character may be fused with the feature representation of the word to which it belongs, to obtain the fused feature representation of the to-be-processed Arabic text. The vowel labeling result corresponding to each character is then determined based on this fused feature representation.
By fusing the text features of the characters with the feature representations of the words, the fused feature representation contains not only the text feature of each character but also a generalized representation of the character in its Arabic context. That is, the fused feature representation carries richer information, and the vowel labeling results determined from it for the characters in the to-be-processed Arabic text are more accurate.
In some embodiments of the present application, steps S120 and S130 may be implemented by means of a neural network model.

Specifically, a vowel restoration model may be trained in advance. During training, Arabic training texts carrying vowel labeling results, together with the feature representation of each word in those texts, serve as the training data from which the vowel restoration model is obtained.

The Arabic training texts carrying vowel labeling results may include modern Arabic training texts and classical Arabic training texts. The feature representation of each word in an Arabic training text can be produced by the language model trained above: the training text is input into the trained language model, which outputs the feature representation of each word. In this embodiment, an Arabic training text carrying vowel labeling results and the feature representation of each word in it are used as training data for the vowel restoration model. The vowel labeling result carried by the training text is the vowel label corresponding to each character in the text.
Specifically, taking Unicode encoding as an example, the possible vowel labels in Arabic may include: "\u064B", "\u064C", "\u064D", "\u064E", "\u064F", "\u0650", "\u0651", "\u0652", "\u0651\u064B", "\u0651\u064C", "\u0651\u064D", "\u0651\u064E", "\u0651\u064F", "\u0651\u0650", and the empty label — 15 cases in total. For non-Arabic characters in the Arabic training text, the vowel label can be uniformly set to empty. The output of the vowel restoration model therefore has 15 categories, and vowel restoration amounts to predicting the vowel label category corresponding to each character.
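For reference, the 15 label categories can be written out as a Python constant. The code points are those listed above; the grouping comments name the Arabic diacritics they conventionally denote, and the ordering is an assumption.

```python
# The 15 vowel-label categories: 14 diacritic sequences plus the empty label.
VOWEL_TAGS = [
    "\u064B", "\u064C", "\u064D",   # fathatan, dammatan, kasratan (tanween)
    "\u064E", "\u064F", "\u0650",   # fatha, damma, kasra (short vowels)
    "\u0651", "\u0652",             # shadda, sukun
    "\u0651\u064B", "\u0651\u064C", "\u0651\u064D",  # shadda + tanween
    "\u0651\u064E", "\u0651\u064F", "\u0651\u0650",  # shadda + short vowel
    "",                              # empty label (e.g. non-Arabic characters)
]
```

The model's classification layer then predicts an index into this list for every character.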
Further, considering that modern Arabic vowel annotation data is scarce in practice, and that classical and modern Arabic share both similarities and differences, this embodiment provides an optional training process for the vowel restoration model.

Specifically, open-source classical Arabic texts carrying vowel labeling results are plentiful and accurately annotated — the Quran is a classical example — so in this embodiment a large number of classical Arabic training texts carrying vowel labeling results, together with the feature representation of each word in them, can be used as training data to train an initial vowel restoration model. The feature representation of each word in the classical Arabic training text can be obtained from the trained language model.

Starting from the initial vowel restoration model trained on classical Arabic text, this embodiment then uses a small number of modern Arabic training texts carrying vowel labeling results, together with the feature representation of each word in them, as training data to fine-tune the initial model and obtain the final vowel restoration model.

The modern Arabic training texts carrying vowel labeling results can be obtained from open-source data sets or annotated manually.

In the training scheme provided by this embodiment, an initial vowel restoration model is first trained on easily obtained classical Arabic training texts carrying vowel labeling results, and is then fine-tuned on a small amount of modern Arabic training texts carrying vowel labeling results. This alleviates the scarcity of vowel-labeled modern Arabic training text, strengthens the model's generalization on small-sample data, and ensures that the final vowel restoration model performs well on modern Arabic.
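The two-stage scheme can be summarized schematically as below. This is only a control-flow sketch: `VowelModel`, the `fit` method, and the learning rates stand in for a real training framework and are not from the patent.

```python
# Schematic only: pre-train on classical Arabic, then fine-tune on modern
# Arabic with a smaller learning rate. `fit` stands in for a real optimizer.

class VowelModel:
    def __init__(self):
        self.stages = []            # records (dataset name, learning rate)

    def fit(self, dataset_name, lr):
        self.stages.append((dataset_name, lr))
        return self

def two_stage_training():
    model = VowelModel()
    model.fit("classical_arabic", lr=1e-3)   # stage 1: abundant labeled data
    model.fit("modern_arabic", lr=1e-4)      # stage 2: scarce data, fine-tune
    return model

model = two_stage_training()
```

The smaller second-stage step size reflects the usual fine-tuning practice of adapting without erasing what was learned in pre-training.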
Further, vowel restoration of Arabic text can serve speech synthesis of that text. During speech synthesis, numeric symbols must be converted into Arabic pronunciation words, and these words undergo grammatical case inflection. The case is usually reflected in the vowel of the word-final character; that is, the word may take different cases depending on the vowel form of its final character, covering six case forms or a sukun (silent-mark) form. Predicting the case inflection of a numeric symbol therefore requires the vowel restoration result.

To handle case inflection of numeric symbols, in the training process of the vowel restoration model, the embodiment of the present application can preprocess the acquired modern Arabic training text.
Specifically, the process of acquiring a modern Arabic training text carrying vowel labeling results may include:

S1, acquiring a modern Arabic training text.

All characters in the modern Arabic training text, other than numeric symbols, are labeled with vowels.

S2, converting the numeric symbols contained in the modern Arabic training text into Arabic pronunciation words, and labeling with vowels every character in each converted pronunciation word except the word-final character.

Specifically, since the case of a word is generally determined by the vowel form of its final character, in order for the vowel restoration model to accurately predict the case inflection of a numeric symbol, this step first converts the numeric symbol into an Arabic pronunciation word and labels every character except the final one with its vowel, so that the vowel restoration model can focus on restoring the vowel of the final character.
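Steps S1/S2 can be sketched as below. The digit-to-word table and the vowel labels are hypothetical placeholders (Latin transliterations, not real Arabic annotations); the point is only that every character except the word-final one receives a fixed label, leaving the final, case-bearing vowel for the model to predict.

```python
# Placeholder preprocessing sketch: expand a numeric symbol into its Arabic
# pronunciation word and label all characters except the final one.

DIGIT_WORDS = {"3": "thalatha"}                  # hypothetical transliteration
FIXED_VOWELS = {"thalatha": list("aaaaaaaa")}    # hypothetical per-char labels

def expand_digit(token):
    word = DIGIT_WORDS[token]
    labels = FIXED_VOWELS[word][:len(word)]
    labels[-1] = None    # word-final vowel (the case marker) left unlabeled
    return word, labels

word, labels = expand_digit("3")
```

In a real pipeline the table would map Arabic digits to fully diacritized Arabic words; the `None` at the end marks the one position the vowel restoration model must fill in.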
It can be understood that, once the trained vowel restoration model is obtained, it can process the to-be-processed Arabic text together with the feature representation of each word in it, and output the vowel labeling result corresponding to each character in the text.

The vowel restoration model obtains the text feature of each character in the to-be-processed Arabic text and determines the vowel labeling result for each character based on that text feature and the feature representation of the word to which the character belongs.
Next, the processing procedure of the vowel restoration model will be described with reference to the vowel restoration model structure illustrated in fig. 2.
FIG. 2 illustrates an alternative vowel restoration model structure, which may include a feature extraction layer, a feature fusion layer, and a classification layer.
Wherein:
the feature extraction layer acquires the text feature of each character in the to-be-processed Arabic text;

the feature fusion layer fuses the text feature of each character with the feature representation of the word to which the character belongs, to obtain the fused feature representation of the to-be-processed Arabic text;

the feature representation of the word to which each character belongs can be obtained from the pre-trained language model;

and the classification layer determines the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the fused feature representation.
Further, as shown in fig. 3, the feature extraction layer may include a first feature extraction layer and a second feature extraction layer.

The first feature extraction layer acquires the encoding feature of each character in the to-be-processed Arabic text.

The first feature extraction layer may be a character embedding layer that extracts the encoding feature of each character in the to-be-processed Arabic text.

The second feature extraction layer acquires, for each character in the to-be-processed Arabic text, the character features within a window of set length.

The second feature extraction layer may adopt a CNN structure, for example one or more layers of 1-dimensional convolutional neural network, to extract the information of character strings of a set length: a window of the set length slides over the character string of the to-be-processed Arabic text, and for each string falling within the window, the second feature extraction layer extracts its information features — that is, the character features within the window of set length.

The encoding feature and the character features are then combined into the text feature of the character and input to the feature fusion layer.
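What the second feature extraction layer consumes can be pictured with a plain sliding window (the convolution itself is omitted here). The window size and padding character are illustrative choices.

```python
# For each character, collect the fixed-length window of surrounding
# characters whose information a 1-D convolution would summarize.

def char_windows(text, size=3, pad="_"):
    half = size // 2
    padded = pad * half + text + pad * half   # pad so edge characters get a full window
    return [padded[i:i + size] for i in range(len(text))]

windows = char_windows("abcd", size=3)
```

Each character thus contributes one window centered on itself, which is the receptive field a 1-dimensional CNN with kernel size 3 would see.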
The feature fusion layer may adopt many different network architectures, such as a bidirectional LSTM network or a GRU network. In the example of fig. 3, the feature fusion layer includes two bidirectional LSTM networks, referred to as the first bidirectional LSTM network and the second bidirectional LSTM network.

The text features of the characters extracted by the feature extraction layer are input to the first bidirectional LSTM network, which can extract features over the whole sentence. The output of the first bidirectional LSTM network is connected to the second bidirectional LSTM network.

The feature representation of the word to which each character belongs, output by the pre-trained language model, is input to the second bidirectional LSTM network.

Optionally, the dimension of the word feature representation output by the language model may differ from the dimension of the feature representation output by the first bidirectional LSTM network. To help the second bidirectional LSTM network fuse the two features, a fully-connected network can be added between the language model and the second bidirectional LSTM network, adjusting the dimension of the word feature representation to match the output dimension of the first bidirectional LSTM network. In general, the language model's word representation has the higher dimension, so the fully-connected network performs dimension reduction on it.

Moreover, adding a fully-connected network further increases the learning capacity of the vowel restoration model.

The second bidirectional LSTM network fuses — for example by feature concatenation — the per-character feature representation output by the first bidirectional LSTM network with the word feature representation output by the language model, and finally outputs the fused feature representation of the to-be-processed Arabic text.
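The dimension-matching step described above amounts to one fully-connected (linear) projection before concatenation. The toy weights below project a 4-dimensional word vector down to 2 dimensions; all values are illustrative only.

```python
# A bias-free fully-connected layer: out[j] = sum_i vec[i] * W[i][j].

def linear(vec, weights):
    cols = len(weights[0])
    return [sum(vec[i] * weights[i][j] for i in range(len(vec)))
            for j in range(cols)]

word_vec = [1.0, 2.0, 3.0, 4.0]          # higher-dimensional word vector
W = [[1, 0], [0, 1], [0, 0], [0, 0]]     # toy 4x2 projection matrix
reduced = linear(word_vec, W)            # now matches a 2-dim LSTM output
fused = reduced + [0.5, 0.6]             # concatenation-style fusion
```

In practice the projection weights are learned jointly with the rest of the vowel restoration model, which is why the fully-connected layer also adds learning capacity.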
Optionally, to prevent overfitting during model training, a dropout layer may be added after the feature fusion layer — specifically, one after each of the first and second bidirectional LSTM networks.

The classification layer may consist of a fully-connected layer and a softmax classifier; it predicts the vowel labeling result corresponding to each character in the to-be-processed Arabic text from the fused feature representation.
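The classification step reduces to a softmax over the 15 vowel categories followed by an arg-max, as in this sketch (the logit values are made up, and the fully-connected layer that would produce them is omitted).

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0] * 15                      # one logit per vowel category
logits[3] = 5.0                          # suppose category 3 scores highest
probs = softmax(logits)
predicted_category = probs.index(max(probs))
```

The predicted index selects one of the 15 vowel labels (14 diacritic sequences plus the empty label) for the character.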
In some embodiments of the present application, an alternative implementation of the Arabic vowel restoration method is presented. Compared with the scheme of the embodiment corresponding to fig. 1, in order to handle case inflection of numeric symbols in the to-be-processed Arabic text, the following processing step may be added before determining, in step S110, the feature representation of each word using the pre-trained language model:

if the to-be-processed Arabic text is detected to contain numeric symbols, the numeric symbols are converted into Arabic pronunciation words, and every character in each converted pronunciation word except the word-final character is labeled with its vowel.

Specifically, because the case of a word is generally determined by the vowel form of its final character, in order to accurately predict the case inflection of a numeric symbol, this embodiment first converts the numeric symbols in the to-be-processed Arabic text into Arabic pronunciation words and labels every character except the final one with its vowel. The model then only needs to restore the vowel of the final character, so the case of the pronunciation word corresponding to the numeric symbol can be obtained more accurately, providing a basis for subsequent speech synthesis of the Arabic text.
In some embodiments of the present application, yet another Arabic vowel restoration method is provided.

On the basis of the foregoing embodiments, this embodiment may add a step of correcting the determined vowel labeling result corresponding to each character in the to-be-processed Arabic text according to set vowel labeling rules.

To avoid errors in the vowel labeling results restored for the to-be-processed Arabic text, the embodiment of the present application can summarize in advance the vowel labeling rules of certain specific words and then correct the vowel labeling results according to those rules.
This embodiment illustrates vowel labeling rules for three types of words, introduced as follows:

The first type of word: words with a fixed vowel labeling form.

The second type of word: words whose final character determines the case, and only the case varies.

The third type of word: compound words formed by combining an article with a noun.
Based on this, the process of correcting the determined vowel labeling results of the to-be-processed Arabic text with reference to the set vowel labeling rules may include any one or more of the following three ways:

1) Detect whether the to-be-processed Arabic text contains a word of the first type; if so, replace the vowel labeling result of each character of that word with the configured vowel labeling result corresponding to each character of the first-type word.

Specifically, for a first-type word with a fixed vowel restoration form, the configured vowel labeling result can directly replace the vowel labeling result obtained for each of its characters in the previous step.

2) Detect whether the to-be-processed Arabic text contains a word of the second type; if so, replace the vowel labeling results of the corresponding characters with the configured vowel labeling results of every character of the second-type word except the final character.

Specifically, for a second-type word — whose final character determines the case, and only the case varies — the vowel restoration form of all characters other than the final one is likewise fixed. The vowel labeling result of each non-final character can therefore be configured in advance, and for a second-type word appearing in the to-be-processed Arabic text, the configured labels replace the predicted labels of the corresponding characters.

3) Detect whether the to-be-processed Arabic text contains a compound word formed by combining an article with a noun; if so, process the article and the noun separately as follows:

for the noun, apply the vowel labeling rules for first-type and second-type words;

for the article, determine whether the compound word containing it is in the middle of the sentence or at the beginning of the sentence;

if it is in the middle of the sentence, replace the vowel labeling result of the article's corresponding characters in the to-be-processed Arabic text with the configured first article vowel labeling form;

and if it is at the beginning of the sentence, replace the vowel labeling result of the article's corresponding characters in the to-be-processed Arabic text with the configured second article vowel labeling form.
Specifically, the Arabic definite article is shown in the original as an Arabic-script image (Figure BDA0002960089920000161). The compound word containing the article takes different vowel labeling forms depending on its position: in the middle of a sentence it takes the first vowel labeling form (Figure BDA0002960089920000162), while at the beginning of a sentence it takes the second vowel labeling form (Figures BDA0002960089920000163 and BDA0002960089920000164).
Of course, besides the three cases exemplified above, other vowel labeling rules may be set according to the regularities of Arabic pronunciation, and the vowel labeling results of the to-be-processed Arabic text corrected according to them — for example, words in which the article combines with read-through characters, or words containing two silent (sukun) marks, can be handled by summarized vowel labeling rules — so as to further improve the accuracy of vowel restoration.
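Rule 1) above — overwriting the predictions for words whose vowel form is fixed — can be sketched as a lexicon lookup. The lexicon entry here is a hypothetical placeholder, not a real Arabic rule.

```python
# Hypothetical fixed-form lexicon: word -> configured per-character labels.
FIXED_WORD_VOWELS = {"foo": ["a", "u", None]}

def correct_first_type(words, predicted_labels):
    """Replace predicted labels with configured ones for fixed-form words;
    words not in the lexicon keep the model's predictions."""
    return [FIXED_WORD_VOWELS.get(w, labels)
            for w, labels in zip(words, predicted_labels)]

corrected = correct_first_type(
    ["foo", "bar"],
    [["i", "i", "i"], ["x", "y", "z"]],
)
```

Rules for the second and third types would follow the same pattern, but replace only the non-final characters or branch on the sentence position of the article.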
Further, since the vowel restoration result can be used in speech synthesis, and Arabic speakers generally have the habit of swallowing word endings, the scheme can additionally process the tail sound: for example, the vowel of the word-final character can be labeled directly as a silent mark. Although this treatment no longer reflects the case of the word, the actual pronunciation does not hinder understanding.
The Arabic vowel restoration apparatus provided in the embodiments of the present application is described below; the apparatus described below and the Arabic vowel restoration method described above may refer to each other correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an arabic vowel restoration apparatus disclosed in the embodiment of the present application.
As shown in fig. 4, the apparatus may include:
the text acquisition unit 11 is used for acquiring an Arabic text to be processed;

the word feature representation determining unit 12 is configured to determine the feature representation of each word in the to-be-processed Arabic text using a pre-trained language model, where the language model is trained on Arabic training text without vowel labels;

the character text feature obtaining unit 13 is configured to obtain the text feature of each character in the to-be-processed Arabic text;

and the vowel labeling result determining unit 14 is configured to determine the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character and the feature representation of the word to which each character belongs.
Optionally, the apparatus of the present application may further include a language model training unit, configured to train the language model; the training process may include:

obtaining an Arabic training text without vowel labels;

randomly masking characters in the Arabic training text and inputting it into the language model;

and training the language model with the objective of predicting the masked characters in the Arabic training text.
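The masking step above can be sketched as follows; the mask symbol, masking rate, and seeded RNG are illustrative choices, not the patent's parameters.

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[MASK]", seed=0):
    """Randomly replace tokens with a mask symbol; return the masked sequence
    and the original tokens at masked positions (the prediction targets)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append(mask)
            targets[i] = tok          # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = list("arabictext")
masked, targets = mask_tokens(tokens, rate=0.3, seed=1)
```

The language model is then trained to predict `targets[i]` at each masked position, which is the standard masked-language-model objective.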
Alternatively, the language model may be a masked language model based on a BERT structure.

Optionally, the Arabic training texts may include modern Arabic training texts and/or classical Arabic training texts.
Optionally, the process by which the word feature representation determining unit determines the feature representation of each word in the to-be-processed Arabic text using the pre-trained language model may include:

inputting the to-be-processed Arabic text into the language model to obtain the word vector, output by the language model, of each word in the text.

Optionally, the process by which the vowel labeling result determining unit determines the vowel labeling result corresponding to each character in the to-be-processed Arabic text, based on the text feature of each character and the feature representation of the word to which it belongs, may include:

fusing the text feature of each character in the to-be-processed Arabic text with the feature representation of the word to which the character belongs, to obtain the fused feature representation of the text;

and determining the vowel labeling result corresponding to each character based on the fused feature representation of the to-be-processed Arabic text.
Optionally, the functions of the character text feature obtaining unit and the vowel labeling result determining unit may be implemented by a model processing unit, where the model processing unit is configured to:

process the to-be-processed Arabic text and the feature representation of each word in it using a pre-trained vowel restoration model, to obtain the vowel labeling result, output by the vowel restoration model, corresponding to each character in the text;

the vowel restoration model is trained with Arabic training texts carrying vowel labeling results, and the feature representation of each word in those texts, as training data.

Optionally, the process by which the model processing unit processes the to-be-processed Arabic text and the feature representation of each word in it using the pre-trained vowel restoration model may include:

acquiring the text feature of each character in the to-be-processed Arabic text using the feature extraction layer of the vowel restoration model;

fusing the text feature of each character with the feature representation of the word to which the character belongs using the feature fusion layer of the vowel restoration model, to obtain the fused feature representation of the to-be-processed Arabic text;

and determining the vowel labeling result corresponding to each character based on the fused feature representation using the classification layer of the vowel restoration model.

Optionally, the process by which the model processing unit acquires the text feature of each character using the feature extraction layer of the vowel restoration model may include:

acquiring the encoding feature of each character in the to-be-processed Arabic text using the first feature extraction layer of the vowel restoration model;

and acquiring, using the second feature extraction layer of the vowel restoration model, the character features of each character within a window of set length, and combining the encoding feature and the character features into the text feature of the character.
Optionally, the apparatus of the present application may further include:

a numeric symbol processing unit, configured to perform the following step before the processing of the word feature representation determining unit: if the to-be-processed Arabic text is detected to contain numeric symbols, converting the numeric symbols into Arabic pronunciation words and labeling with vowels every character in each converted pronunciation word except the word-final character.
Optionally, the apparatus of the present application may further include: the process of obtaining the vowel recovery model by training with the vowel recovery model training unit may include:
training an initial vowel recovery model by using a classical Alphabet training text with vowel labeling results and representing the characteristics of each word in the classical Alphabet training text as training data;
and performing fine adjustment on the initial vowel recovery model by taking the modern aphasian training text with vowel marking results and the characteristic representation of each word in the modern aphasian training text as training data to obtain a final vowel recovery model.
Optionally, the process of acquiring a modern ananas training text with a vowel annotation result by the vowel restoration model training unit may include:
acquiring a modern Arabic training text, wherein other characters except for a tail character in the modern Arabic training text are marked with vowels;
and converting the digital symbols contained in the modern Alphasic training text into Alphasic pronunciation words, and labeling vowels of characters except the positions in the converted Alphasic pronunciation words.
Optionally, the apparatus of the present application may further include: and the rule correction unit is used for correcting the determined vowel marking result corresponding to each character in the to-be-processed bilingual text by referring to the set vowel marking rule.
Optionally, the vowel labeling rules may include vowel labeling rules for a first type of words, and/or vowel labeling rules for a second type of words, and/or vowel labeling rules for a third type of words, where the first type of words are words with fixed vowel labeling formats, the second type of words are words with end characters determining word positions and only the position is changed, and the third type of words are words formed by combining articles and nouns. Based on this, the process of the rule correcting unit referring to the set vowel labeling rule to correct the determined vowel labeling result corresponding to each character in the to-be-processed aloud text may include:
detecting whether the to-be-processed whisper text contains a first type word, if so, replacing the vowel labeling result of each character of the first type word in the vowel labeling result of the to-be-processed whisper text with a configured vowel labeling result corresponding to each character of the first type word;
and/or the presence of a gas in the gas,
detecting whether the to-be-processed whisper text contains a second type word, if so, replacing the vowel labeling result of the corresponding character in the second type word in the vowel labeling result of the to-be-processed whisper text by using the configured vowel labeling result of each character except the tail character in the second type word;
and/or the presence of a gas in the gas,
detecting whether the to-be-processed Arabic texts contain a combined word formed by articles and nouns, if so, processing the to-be-processed Arabic texts by referring to vowel labeling rules of the first type words and the second type words, and determining whether the combined word where the articles are located is in the middle of a sentence or in the head position of the sentence for the articles;
if the text is in the middle position of the sentence, replacing the vowel labeling result of the corresponding character of the article in the to-be-processed bilingual text by using the configured vowel labeling form of the first article;
and if the text is positioned at the beginning of the sentence, replacing the vowel marking result of the corresponding character of the article in the to-be-processed Arabic text by using the configured vowel marking form of the second article.
The Arabic vowel restoration apparatus provided in the embodiments of the present application may be applied to Arabic vowel restoration devices, such as terminals: mobile phones, computers, and the like. Optionally, fig. 5 shows a block diagram of the hardware structure of the Arabic vowel restoration device; referring to fig. 5, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;

in the embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;

the memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;

the memory stores a program, and the processor may call the program stored in the memory, the program being for:
acquiring an Arabic text to be processed;

determining the feature representation of each word in the to-be-processed Arabic text using a pre-trained language model, where the language model is trained on Arabic training text without vowel labels;

acquiring the text feature of each character in the to-be-processed Arabic text;

and determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character and the feature representation of the word to which each character belongs.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring an Arabic text to be processed;

determining the feature representation of each word in the to-be-processed Arabic text using a pre-trained language model, where the language model is trained on Arabic training text without vowel labels;

acquiring the text feature of each character in the to-be-processed Arabic text;

and determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character and the feature representation of the word to which each character belongs.
Alternatively, the detailed function and the extended function of the program may be as described above.
Further, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above Arabic vowel recovery method.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, the embodiments may be combined as needed, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. An Arabic vowel recovery method, comprising:
acquiring an Arabic text to be processed;
determining a feature representation of each word in the to-be-processed Arabic text by using a pre-trained language model, wherein the language model is obtained by training on Arabic training text without vowel labels;
acquiring a text feature of each character in the to-be-processed Arabic text;
and determining a vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs.
2. The method of claim 1, wherein the training process of the language model comprises:
obtaining Arabic training text without vowel labels;
randomly masking characters in the Arabic training text, and inputting the masked text into the language model;
and training the language model with the objective of predicting the masked characters in the Arabic training text.
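The masking step in the claim above can be sketched without the model itself. The mask token, mask rate, and seed below are assumptions for illustration (a BERT-style masked-prediction setup), not values fixed by the claim; the returned target map is what a training loop would score the model's predictions against.

```python
import random

def mask_characters(text, mask_rate=0.15, mask_token="_", seed=0):
    # Randomly replace non-space characters with a mask token and record the
    # original character at each masked position; a language model is then
    # trained to predict those originals from the masked text.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, c in enumerate(text):
        if c != " " and rng.random() < mask_rate:
            targets[i] = c
            masked.append(mask_token)
        else:
            masked.append(c)
    return "".join(masked), targets
```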
3. The method according to claim 1, wherein determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs comprises:
fusing the text feature of each character in the to-be-processed Arabic text with the feature representation of the word to which the character belongs, to obtain a fused feature representation of the to-be-processed Arabic text;
and determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the fused feature representation of the to-be-processed Arabic text.
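The claim above names a fusion of character-level and word-level features without fixing the operator. One common choice, assumed here purely for illustration, is concatenation: each character's vector is extended with its word's vector so the classifier sees both.

```python
def fuse(char_feats, word_feat):
    # Concatenate each character's feature vector with its word's feature
    # vector, yielding one fused vector per character.
    return [cf + word_feat for cf in char_feats]

def fuse_text(words, char_feats_per_word, word_feats):
    # Fuse an entire text: char_feats_per_word[i] holds the per-character
    # vectors of words[i]; word_feats[i] is that word's representation.
    fused = []
    for cfs, wf in zip(char_feats_per_word, word_feats):
        fused.extend(fuse(cfs, wf))
    return fused
```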
4. The method according to claim 1, wherein the process of acquiring the text feature of each character in the to-be-processed Arabic text, and determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text features and the feature representation of the word to which each character belongs, comprises:
processing the to-be-processed Arabic text and the feature representation of each word therein by using a pre-trained vowel recovery model, to obtain the vowel labeling result corresponding to each character in the to-be-processed Arabic text output by the vowel recovery model;
wherein the vowel recovery model is obtained by training with Arabic training text carrying vowel labeling results and the feature representation of each word in the Arabic training text as training data.
5. The method of claim 4, wherein the processing of the to-be-processed Arabic text and the feature representation of each word therein using the pre-trained vowel recovery model comprises:
acquiring the text feature of each character in the to-be-processed Arabic text by using a feature extraction layer of the vowel recovery model;
fusing the text feature of each character in the to-be-processed Arabic text with the feature representation of the word to which the character belongs by using a feature fusion layer of the vowel recovery model, to obtain a fused feature representation of the to-be-processed Arabic text;
and determining the vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the fused feature representation of the to-be-processed Arabic text by using a classification layer of the vowel recovery model.
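The three layers named in the claim above (feature extraction, feature fusion, classification) can be sketched structurally. Each layer here is a stand-in callable, not a trained network; the feature contents, label count, and stub classifier are all assumptions made only to show how the layers chain.

```python
class VowelRecoveryModel:
    # Structural sketch of the claimed three-layer model; not a real network.
    def __init__(self, n_labels=3):
        self.n_labels = n_labels

    def feature_extraction(self, chars):
        # Stub per-character text features.
        return [[ord(c) % 5, len(chars)] for c in chars]

    def feature_fusion(self, char_feats, word_feat):
        # Fuse each character feature with the word-level representation.
        return [cf + word_feat for cf in char_feats]

    def classify(self, fused):
        # Stub classification layer: map each fused vector to a label index.
        return [sum(v) % self.n_labels for v in fused]

    def __call__(self, chars, word_feat):
        feats = self.feature_extraction(chars)
        fused = self.feature_fusion(feats, word_feat)
        return self.classify(fused)
```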
6. The method according to claim 5, wherein acquiring the text feature of each character in the to-be-processed Arabic text by using the feature extraction layer of the vowel recovery model comprises:
acquiring an encoding feature of each character in the to-be-processed Arabic text by using a first feature extraction layer of the vowel recovery model;
and acquiring character features of each character in the to-be-processed Arabic text within a window of set length by using a second feature extraction layer of the vowel recovery model, and combining the encoding feature and the character features into the text feature of the character.
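The fixed-length window in the claim above can be sketched directly: for each character, gather its neighbors within the window (padding at the edges), which would then be combined with the per-character encoding features. The window size and padding symbol are illustrative assumptions.

```python
def window_features(chars, window=2, pad="#"):
    # For each character, collect the characters in a fixed-length window
    # centered on it, padding beyond the text boundaries.
    padded = [pad] * window + list(chars) + [pad] * window
    return [padded[i:i + 2 * window + 1] for i in range(len(chars))]
```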
7. The method of claim 1, wherein before determining the feature representation of each word in the to-be-processed Arabic text using the pre-trained language model, the method further comprises:
if it is detected that the to-be-processed Arabic text contains numeric symbols, converting the numeric symbols into Arabic pronunciation words, and labeling vowels on the characters of the converted Arabic pronunciation words other than the final character.
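The digit-conversion step in the claim above can be sketched as a substitution pass run before the text reaches the language model. The lookup table of pronunciation words is a hypothetical input (the vowel labeling of the converted word is omitted here), and the regex-based detection is an implementation assumption.

```python
import re

def normalize_digits(text, digit_words):
    # Replace each run of digits with its pronunciation word from a
    # (hypothetical) lookup table; runs missing from the table are kept.
    return re.sub(r"\d+", lambda m: digit_words.get(m.group(), m.group()), text)
```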
8. The method of claim 4, wherein the training process of the vowel recovery model comprises:
training an initial vowel recovery model with classical Arabic training text carrying vowel labeling results and the feature representation of each word in the classical Arabic training text as training data;
and fine-tuning the initial vowel recovery model with modern Arabic training text carrying vowel labeling results and the feature representation of each word in the modern Arabic training text as training data, to obtain a final vowel recovery model.
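The two-stage schedule in the claim above (train on classical Arabic, then fine-tune on modern Arabic) can be sketched as a training driver. The `model.fit` hook, epoch counts, and return value are hypothetical scaffolding, not part of the claim.

```python
def train_vowel_model(model, classical_data, modern_data,
                      pretrain_epochs=3, finetune_epochs=1):
    # Stage 1: train the initial model on classical Arabic training data.
    # Stage 2: fine-tune on modern Arabic training data.
    # `model.fit` is a hypothetical single-epoch update hook.
    history = []
    for _ in range(pretrain_epochs):
        history.append(("classical", model.fit(classical_data)))
    for _ in range(finetune_epochs):
        history.append(("modern", model.fit(modern_data)))
    return history
```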
9. The method of claim 8, wherein the process of obtaining the modern Arabic training text carrying the vowel labeling results comprises:
acquiring a modern Arabic training text in which the characters other than each word's final character are labeled with vowels;
and converting the numeric symbols contained in the modern Arabic training text into Arabic pronunciation words, and labeling vowels on the characters of the converted Arabic pronunciation words other than the final character.
10. The method according to any one of claims 1-9, further comprising:
correcting the determined vowel labeling result corresponding to each character in the to-be-processed Arabic text with reference to set vowel labeling rules.
11. The method of claim 10, wherein
the vowel labeling rules comprise vowel labeling rules for first-type words, and/or vowel labeling rules for second-type words, and/or vowel labeling rules for third-type words, wherein a first-type word is a word with a fixed vowel labeling form, a second-type word is a word whose final character's vowel is determined by the word's grammatical position and is the only vowel that varies, and a third-type word is a word formed by combining an article and a noun;
correcting the determined vowel labeling result corresponding to each character in the to-be-processed Arabic text with reference to the set vowel labeling rules comprises:
detecting whether the to-be-processed Arabic text contains a first-type word, and if so, replacing the vowel labeling result of each character of the first-type word in the vowel labeling result of the to-be-processed Arabic text with a configured vowel labeling result corresponding to each character of the first-type word;
and/or,
detecting whether the to-be-processed Arabic text contains a second-type word, and if so, replacing the vowel labeling result of the corresponding characters of the second-type word in the vowel labeling result of the to-be-processed Arabic text with a configured vowel labeling result of each character of the second-type word other than the final character;
and/or,
detecting whether the to-be-processed Arabic text contains a combined word formed by an article and a noun, and if so, processing the to-be-processed Arabic text with reference to the vowel labeling rules for the first-type and second-type words, and determining, for the article, whether the combined word in which the article is located is in the middle of a sentence or at the beginning of a sentence;
if it is in the middle of the sentence, replacing the vowel labeling result of the corresponding characters of the article in the to-be-processed Arabic text with a configured first article vowel labeling form;
and if it is at the beginning of the sentence, replacing the vowel labeling result of the corresponding characters of the article in the to-be-processed Arabic text with a configured second article vowel labeling form.
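The rule-based correction described in the claim above can be sketched for two of the word types: fixed-form words get their configured labels replaced wholesale, and an article+noun combination gets one of two configured article forms depending on sentence position. Everything here is a toy assumption: the rule tables are hypothetical inputs, the `"al"` prefix test is a stand-in for real article detection, and article labels are assumed to occupy the first two label slots.

```python
def apply_vowel_rules(words, labels, fixed_forms,
                      article_initial, article_medial):
    # fixed_forms: configured labels for first-type (fixed-voweling) words.
    # article_initial / article_medial: the two configured article label
    # forms for sentence-initial vs. sentence-medial position.
    corrected = list(labels)
    for i, w in enumerate(words):
        if w in fixed_forms:
            # First-type word: replace its labels with the configured form.
            corrected[i] = fixed_forms[w]
        elif w.startswith("al"):
            # Toy article+noun detection; pick the form by position and
            # replace only the article's label slots (assumed: first two).
            form = article_initial if i == 0 else article_medial
            corrected[i] = form + corrected[i][2:]
    return corrected
```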
12. An Arabic vowel recovery device, comprising:
a text acquisition unit, configured to acquire an Arabic text to be processed;
a word feature representation determining unit, configured to determine a feature representation of each word in the to-be-processed Arabic text by using a pre-trained language model, wherein the language model is obtained by training on Arabic training text without vowel labels;
a character text feature acquisition unit, configured to acquire a text feature of each character in the to-be-processed Arabic text;
and a vowel labeling result determining unit, configured to determine a vowel labeling result corresponding to each character in the to-be-processed Arabic text based on the text feature of each character in the to-be-processed Arabic text and the feature representation of the word to which each character belongs.
13. An Arabic vowel recovery device, comprising: a memory and a processor;
wherein the memory is configured to store a program;
and the processor is configured to execute the program to implement the steps of the Arabic vowel recovery method according to any one of claims 1 to 11.
14. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the Arabic vowel recovery method according to any one of claims 1 to 11.
CN202110234392.8A 2021-03-03 2021-03-03 Arabic vowel recovery method, device, equipment and storage medium Pending CN113011135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110234392.8A CN113011135A (en) 2021-03-03 2021-03-03 Arabic vowel recovery method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113011135A true CN113011135A (en) 2021-06-22

Family

ID=76403569


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005001711A2 (en) * 2003-06-25 2005-01-06 Centre National De La Recherche Scientifique (Cnrs) Method, computer device and computer program for assistance in adding vowels to words in arabic
US20050192807A1 (en) * 2004-02-26 2005-09-01 Ossama Emam Hierarchical approach for the statistical vowelization of Arabic text
US20060129380A1 (en) * 2004-12-10 2006-06-15 Hisham El-Shishiny System and method for disambiguating non diacritized arabic words in a text
CN101526856A (en) * 2009-02-17 2009-09-09 广东国笔科技股份有限公司 Arabic input system and Arabic input method
EP2447854A1 (en) * 2010-10-27 2012-05-02 King Abdulaziz City for Science and Technology Method and system of automatic diacritization of Arabic
CN111460809A (en) * 2020-03-30 2020-07-28 中国测绘科学研究院 Arabic place name proper name transliteration method and device, translation equipment and storage medium
CN111859951A (en) * 2020-06-19 2020-10-30 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111985229A (en) * 2019-05-21 2020-11-24 腾讯科技(深圳)有限公司 Sequence labeling method and device and computer equipment


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ILYES REBAI et al.: "Text-to-speech synthesis system with Arabic diacritic recognition system", COMPUTER SPEECH & LANGUAGE, vol. 34, no. 1, XP029225212, DOI: 10.1016/j.csl.2015.04.002 *
刘艳; 古丽拉·阿东别克; 伊力亚尔: "A Preliminary Study on Automatic Part-of-Speech Tagging for Kazakh" (哈萨克语词性自动标注研究初探), Computer Engineering and Applications (计算机工程与应用), no. 20 *
古丽尼格尔·阿不都外力 et al.: "A Uyghur Stem Extraction Method Based on Character-Sequence Labeling" (字符序列标注的维吾尔语词干提取方法), Modern Electronics Technique (现代电子技术), no. 12 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination