CN118471266B - Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium - Google Patents

Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium

Info

Publication number: CN118471266B
Application number: CN202410921678.7A
Authority: CN (China)
Prior art keywords: data, mode, predicted, pronunciation, label
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN118471266A
Inventors: 王冠博, 邵志明, 黄宇凯, 李科, 廖晓玲
Current Assignee: Beijing Speechocean Technology Co ltd
Original Assignee: Beijing Speechocean Technology Co ltd
Application filed by Beijing Speechocean Technology Co ltd
Priority to CN202410921678.7A; published as CN118471266A; application granted and published as CN118471266B

Landscapes

  • Machine Translation (AREA)

Abstract

The present disclosure relates to the field of speech processing in computer technology, and more particularly to a pronunciation prediction method, a pronunciation prediction device, an electronic apparatus, and a storage medium. To address the problem that single-modality data cannot handle complex pronunciations, the pronunciation prediction method includes the following steps: acquiring data to be predicted and analyzing the data modality of the data to be predicted; and performing pronunciation prediction on the data to be predicted based on a pronunciation prediction model and the data modality to obtain a pronunciation prediction result. The input of the pronunciation prediction model is multimodal data, and its output is a pronunciation prediction result. By feeding multimodal data into the model and performing pronunciation prediction on the data to be predicted, the method and device improve the accuracy of pronunciation prediction.

Description

Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a pronunciation prediction method, a pronunciation prediction device, an electronic apparatus, and a storage medium.
Background
In speech processing, pronunciation prediction typically relies on data input of a single modality. With single-modality input, pronunciation prediction for certain words remains challenging, and the same word may have multiple pronunciations in different contexts, which degrades the prediction effect. Therefore, how to implement pronunciation prediction based on multimodal data input is a problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a pronunciation prediction method, a pronunciation prediction device, an electronic apparatus, and a storage medium.
According to a first aspect of embodiments of the present disclosure, a pronunciation prediction method is provided, including: obtaining data to be predicted and analyzing a data modality of the data to be predicted; and performing pronunciation prediction on the data to be predicted based on a pronunciation prediction model and the data modality to obtain a pronunciation prediction result, wherein the input of the pronunciation prediction model is multimodal data and the output of the pronunciation prediction model is a pronunciation prediction result.
In one embodiment, performing pronunciation prediction on the data to be predicted based on a pronunciation prediction model and the data modality includes: if the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data, performing pronunciation prediction on the data to be predicted based on the data modalities corresponding to the multimodal data; if the data modality of the data to be predicted is inconsistent with the data modalities corresponding to the multimodal data, leaving the input position of a first data modality empty and performing pronunciation prediction on the data to be predicted based on a second data modality and the emptied first data modality. The first data modality is the data modality, among the data modalities corresponding to the multimodal data, that is inconsistent with the data modality of the data to be predicted; the second data modality is the data modality, among the data modalities corresponding to the multimodal data, that is consistent with the data modality of the data to be predicted.
In one embodiment, the pronunciation prediction model is a model with an automatic encoder-decoder architecture, and performing pronunciation prediction on the data to be predicted includes: constructing an input sequence of a decoder of the pronunciation prediction model based on the data to be predicted. During the first decoding, a first input sequence is constructed that sequentially comprises a first label, the data to be predicted, a second label, a third label, and a fourth label; pronunciation prediction is performed on the data to be predicted based on the first input sequence to obtain a first output sequence, which sequentially comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing one bit. During the i-th decoding, a second input sequence is constructed based on the previous output sequence and sequentially comprises the first label, the data to be predicted, the second label, the third label, the fourth label, and an (i-1)-bit pronunciation prediction result; pronunciation prediction is performed on the data to be predicted based on the second input sequence to obtain a second output sequence, which sequentially comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing i bits, where i is an integer greater than 1. The above process is repeated until the decoder of the pronunciation prediction model outputs an output sequence containing a fifth label, yielding the pronunciation prediction result; this output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label, the pronunciation prediction result, and the fifth label. The first label identifies the starting position of the data to be predicted, the second label identifies the start of pronunciation prediction, the third label identifies the language corresponding to the data to be predicted, the fourth label identifies the data modality corresponding to the data to be predicted, and the fifth label identifies the end of pronunciation prediction.
In one embodiment, the data modalities corresponding to the multimodal data include an audio modality and a text modality, and performing pronunciation prediction on the data to be predicted includes: taking the input sequence as the decoder prompt and the audio features as the encoder input to obtain a multimodal joint information encoding, wherein if the data modality of the data to be predicted includes the audio modality but not the text modality, the audio signal input to the encoder is the audio feature of the audio modality included in the data to be predicted and the positions of the first label and the data to be predicted in the decoder input sequence are left empty, and if the data modality of the data to be predicted includes the text modality but not the audio modality, the decoder input is the input sequence constructed from the text modality included in the data to be predicted and the encoder input position is left empty; and outputting the output sequence by the decoder based on the joint information encoding to obtain the pronunciation prediction result.
In one embodiment, the pronunciation prediction model is pre-trained by: acquiring a training sample data set, where the training sample data set includes training sample data pairs and each training sample data pair includes modal data of the multimodal data and a pronunciation result; extracting the data features corresponding to each piece of modal data to obtain multimodal data features; and taking a first modal data feature of the multimodal data features as the encoder input and/or a second modal data feature as the decoder prompt, taking the pronunciation result as the decoder output, and performing autoregressive training on a prediction model of the automatic encoder-decoder architecture to obtain the pronunciation prediction model.
According to a second aspect of embodiments of the present disclosure, there is provided a pronunciation prediction device, including: an obtaining unit configured to obtain data to be predicted and parse a data modality of the data to be predicted; and a prediction unit configured to perform pronunciation prediction on the data to be predicted based on a pronunciation prediction model and the data modality to obtain a pronunciation prediction result, wherein the input of the pronunciation prediction model is multimodal data and the output of the pronunciation prediction model is a pronunciation prediction result.
In one embodiment, the prediction unit performs pronunciation prediction on the data to be predicted based on a pronunciation prediction model and the data modality as follows: if the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data, performing pronunciation prediction on the data to be predicted based on the data modalities corresponding to the multimodal data; if the data modality of the data to be predicted is inconsistent with the data modalities corresponding to the multimodal data, leaving the input position of a first data modality empty and performing pronunciation prediction on the data to be predicted based on a second data modality and the emptied first data modality. The first data modality is the data modality, among the data modalities corresponding to the multimodal data, that is inconsistent with the data modality of the data to be predicted; the second data modality is the data modality, among the data modalities corresponding to the multimodal data, that is consistent with the data modality of the data to be predicted.
In one embodiment, the pronunciation prediction model is a model with an automatic encoder-decoder architecture, and the prediction unit performs pronunciation prediction on the data to be predicted as follows: an input sequence of the decoder of the pronunciation prediction model is constructed based on the data to be predicted. During the first decoding, a first input sequence is constructed that sequentially comprises a first label, the data to be predicted, a second label, a third label, and a fourth label; pronunciation prediction is performed on the data to be predicted based on the first input sequence to obtain a first output sequence, which sequentially comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing one bit. During the i-th decoding, a second input sequence is constructed based on the previous output sequence and sequentially comprises the first label, the data to be predicted, the second label, the third label, the fourth label, and an (i-1)-bit pronunciation prediction result; pronunciation prediction is performed on the data to be predicted based on the second input sequence to obtain a second output sequence, which sequentially comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing i bits, where i is an integer greater than 1. The above process is repeated until the decoder of the pronunciation prediction model outputs an output sequence containing a fifth label, yielding the pronunciation prediction result; this output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label, the pronunciation prediction result, and the fifth label. The first label identifies the starting position of the data to be predicted, the second label identifies the start of pronunciation prediction, the third label identifies the language corresponding to the data to be predicted, the fourth label identifies the data modality corresponding to the data to be predicted, and the fifth label identifies the end of pronunciation prediction.
In one embodiment, the data modalities corresponding to the multimodal data include an audio modality and a text modality, and the prediction unit performs pronunciation prediction on the data to be predicted as follows: taking the input sequence as the decoder prompt and the audio features as the encoder input to obtain a multimodal joint information encoding, wherein if the data modality of the data to be predicted includes the audio modality but not the text modality, the encoder input is the audio feature of the audio modality included in the data to be predicted and the positions of the first label and the data to be predicted in the decoder input sequence are left empty, and if the data modality of the data to be predicted includes the text modality but not the audio modality, the decoder input is the input sequence constructed from the text modality included in the data to be predicted and the encoder input position is left empty; and outputting the output sequence by the decoder based on the joint information encoding to obtain the pronunciation prediction result.
In one embodiment, the pronunciation prediction model is pre-trained by: acquiring a training sample data set, where the training sample data set includes training sample data pairs and each training sample data pair includes modal data of the multimodal data and a pronunciation result; extracting the data features corresponding to each piece of modal data to obtain multimodal data features; and taking a first modal data feature of the multimodal data features as the encoder input and/or a second modal data feature as the decoder prompt, taking the pronunciation result as the decoder output, and performing autoregressive training on a prediction model of the automatic encoder-decoder architecture to obtain the pronunciation prediction model.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the pronunciation prediction method described in the first aspect or any implementation manner of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having stored therein instructions which, when executed by a processor, enable the processor to perform the pronunciation prediction method as described in the first aspect or any one of the implementation manners of the first aspect.
The technical solution provided by the embodiments of the present disclosure may have the following beneficial effects: the data to be predicted is acquired, its data modality is analyzed, and pronunciation prediction is performed on the data to be predicted based on the data to be predicted and the data modality, where the input of the pronunciation prediction model is multimodal data. By feeding multimodal data into the model and performing pronunciation prediction on the data to be predicted, the method and device improve the accuracy of pronunciation prediction.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a pronunciation prediction method, according to an exemplary embodiment.
FIG. 2 is a flow chart illustrating a pronunciation prediction method, according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating a pronunciation prediction method, according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating an input sequence according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating a pronunciation prediction method, according to an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating pronunciation prediction based on an automatic encoder and decoder architecture, according to an exemplary embodiment.
FIG. 7 is a flowchart illustrating pronunciation prediction model training according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a pronunciation prediction device according to an exemplary embodiment.
FIG. 9 is a block diagram of an electronic device for pronunciation prediction, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure.
In the process of data annotation, pronunciation prediction typically relies on data input of a single modality, for example using only audio data or only text data. Pronunciation prediction methods based on audio data often suffer from noise, accents, and similar issues, which affect the prediction effect. For text-based pronunciation prediction methods, pronunciation prediction for certain words remains challenging in some cases, and the same word may have multiple pronunciations in different contexts. Single-modality data input typically operates in isolation, and the lack of effective information exchange and integration leads to limitations in processing complex relationships between speech and text.
In view of this, the present disclosure proposes a pronunciation prediction method that performs pronunciation prediction on data to be predicted based on multimodal data, thereby improving both the accuracy and the applicability of pronunciation prediction.
FIG. 1 is a flow chart illustrating a pronunciation prediction method, according to an exemplary embodiment. As shown in fig. 1, the method includes steps S11 to S12.
In step S11, data to be predicted is obtained, and a data modality of the data to be predicted is resolved.
In the embodiments of the present disclosure, the data modality of the data to be predicted may be understood as the data type or form of existence of the data to be predicted. For example, the data to be predicted may be of a text modality, an audio modality, or both a text modality and an audio modality.
In step S12, based on the pronunciation prediction model and the data modality, pronunciation prediction is performed on the data to be predicted, and a pronunciation prediction result is obtained.
In the embodiments of the present disclosure, the pronunciation prediction model may be understood as a model that performs pronunciation prediction on the data to be predicted based on its data modality. The input of the pronunciation prediction model is multimodal data; for example, if the data modality of the data to be predicted is a text modality and an audio modality, the multimodal data comprises text-modality data and audio-modality data, that is, the input of the pronunciation prediction model is the text-modality data and the audio-modality data. The output of the pronunciation prediction model is a pronunciation prediction result. In other words, after the data modality of the data to be processed is analyzed to obtain the multimodal data, the multimodal data is input into the pronunciation prediction model to obtain the pronunciation prediction result of the data to be predicted. The multimodal data may be audio-modality data, text-modality data, or data of other modalities such as images or video, which is not limited here.
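As a minimal illustration of how the data to be predicted and its data modality might be represented in code, the sketch below resolves a task label from whichever modalities are present; the container class, field names, and label values are assumptions for illustration and are not fixed by this disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PredictionInput:
    """Hypothetical container for the data to be predicted."""
    text: Optional[str] = None           # text modality, e.g. "你好"
    audio: Optional[List[float]] = None  # audio modality, e.g. waveform samples


def parse_modality(data: PredictionInput) -> str:
    """Resolve the data modality of the data to be predicted: text, audio, or both."""
    if data.text is not None and data.audio is not None:
        return "ST"  # audio-and-text prediction task
    if data.audio is not None:
        return "S"   # audio-only prediction task
    if data.text is not None:
        return "T"   # text-only prediction task
    raise ValueError("data to be predicted contains no supported modality")
```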
In the embodiments of the present disclosure, the pronunciation prediction result of the data to be processed is obtained by inputting the multimodal data, obtained after analyzing the data to be processed, into the pronunciation prediction model. The multimodal input improves both the pronunciation prediction result and the overall prediction effect for the data to be predicted.
In the embodiments of the present disclosure, different data are used to predict the data to be predicted depending on the relationship between the data modality of the data to be processed and the data modalities of the multimodal data.
FIG. 2 is a flow chart illustrating a pronunciation prediction method, according to an exemplary embodiment. As shown in fig. 2, the method includes steps S21 to S23b.
In step S21, pronunciation prediction is performed on the data to be predicted based on the pronunciation prediction model and the data modality.
In step S22a, the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data.
In the embodiments of the present disclosure, the data modality of the data to be predicted being consistent with the data modalities corresponding to the multimodal data can be understood as follows: if the data modalities corresponding to the multimodal data are a text modality and an audio modality, and analyzing the data to be predicted also yields a text modality and an audio modality, then the data types of the data to be predicted and of the multimodal data are the same, and the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data.
In step S23a, pronunciation prediction is performed on the data to be predicted based on the data modality corresponding to the multi-modality data.
In the embodiments of the present disclosure, if the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data, for example the text modality and the audio modality corresponding to the multimodal data, pronunciation prediction is performed on the data to be predicted based on the text modality and the audio modality.
In step S22b, the data modality of the data to be predicted is inconsistent with the data modalities corresponding to the multimodal data.
In the embodiments of the present disclosure, the data modality of the data to be predicted being inconsistent with the data modalities corresponding to the multimodal data can be understood as follows: if the data modalities of the multimodal data are a text modality and an audio modality while the data modality of the data to be predicted is only the text modality, then the data modality of the data to be predicted is inconsistent with the data modalities of the multimodal data.
In step S23b, the input position of the first data modality is nulled, and pronunciation prediction is performed on the data to be predicted based on the second data modality and the nulled first data modality.
In the embodiments of the present disclosure, the first data modality is the data modality, among the data modalities corresponding to the multimodal data, that is inconsistent with the data modality of the data to be predicted. In an example, if the multimodal data includes a text modality and an audio modality and the data modality of the data to be predicted is the text modality, the first data modality is the audio modality. The second data modality is the data modality, among the data modalities corresponding to the multimodal data, that is consistent with the data modality of the data to be predicted. In the same example, the second data modality is the text modality.
In an example, if the data modalities of the multimodal data are a text modality and an audio modality and the data modality of the data to be predicted is the text modality, then the first data modality is the audio modality and the second data modality is the text modality. When predicting the data to be processed, the first data modality is set to empty, and pronunciation prediction is performed on the data to be predicted based on the second data modality, namely the text modality, and the emptied first data modality. Here, empty can be understood as a state set to "null" or "no value", or as ignoring the first data modality.
In an example, if the data modalities of the multimodal data are a text modality and an audio modality and the data modality of the data to be predicted is the audio modality, then the first data modality is the text modality and the second data modality is the audio modality. When predicting the data to be processed, the first data modality is set to empty, and pronunciation prediction is performed on the data to be predicted based on the second data modality, namely the audio modality, and the emptied first data modality.
In the embodiments of the present disclosure, pronunciation prediction is performed on the data to be predicted based on the pronunciation prediction model and the data modality. The input of the pronunciation prediction model is multimodal data, while the data to be predicted can be either multimodal or single-modality data. Whether the data to be predicted is multimodal or single-modality, pronunciation prediction can be performed on it based on the pronunciation prediction model, so that multiple input formats are supported and the pronunciation prediction model has wide practicability.
In an embodiment of the disclosure, an output sequence characterizing pronunciation predictions is obtained based on an input sequence of a pronunciation prediction model. FIG. 3 is a flowchart illustrating a pronunciation prediction method, according to an exemplary embodiment. As shown in fig. 3, the method includes steps S31 to S32.
In step S31, an input sequence of a decoder of the pronunciation prediction model is constructed based on the data to be predicted.
In an embodiment of the present disclosure, the pronunciation prediction model may be a model with an automatic encoder-decoder (AED) architecture.
In the embodiments of the present disclosure, during the first decoding, a first input sequence is constructed that sequentially comprises a first label, the data to be predicted, a second label, a third label, and a fourth label. The first label identifies the starting position of the data to be predicted and may be, for example, a start-of-text (SOT) label. The data to be predicted is the data obtained when pronunciation prediction is performed. The second label identifies the beginning of pronunciation prediction and may be, for example, a start-of-pronunciation (SOP) label. The third label identifies the language corresponding to the data to be predicted, for example a target language label, which may be a Chinese (ZH) label, an English (EN) label, and so on. The fourth label identifies the data modality corresponding to the data to be predicted, for example a task label (Task Tag), where the task label can be understood as the data modality of the data to be predicted; the fourth label may indicate predicting pronunciation from audio (S), predicting pronunciation from text (T), predicting pronunciation from audio and text (ST), and so on. Pronunciation prediction is performed on the data to be predicted based on the first input sequence to obtain a first output sequence, which comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing 1 bit. Illustratively, at the first decoding there is no pronunciation prediction result in the first input sequence of the decoder, and the first output sequence contains the first bit of the pronunciation prediction result, i.e., a 1-bit pronunciation prediction result.
In the embodiments of the present disclosure, during the i-th decoding, a second input sequence is constructed based on the output sequence of the (i-1)-th decoding, taking the (i-1)-bit pronunciation prediction result of that output sequence as part of the input. The second input sequence comprises the first label, the data to be predicted, the second label, the third label, the fourth label, and the (i-1)-bit pronunciation prediction result. Pronunciation prediction is performed on the data to be predicted based on the second input sequence to obtain a second output sequence, which comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing i bits. Illustratively, at the second decoding, i.e., i = 2, the second output sequence of the decoder contains the second bit of the pronunciation prediction result, where i is an integer greater than 1.
In step S32, an output sequence representing the pronunciation prediction result is obtained based on the input sequence and the pronunciation prediction model.
In an embodiment of the present disclosure, the above process is repeated until the decoder of the pronunciation prediction model outputs an output sequence that includes a fifth label. The fifth label identifies the end of pronunciation prediction and may be, for example, an end-of-pronunciation (EOP) label. This output sequence comprises the data to be predicted, the second label, the third label, the fourth label, the pronunciation prediction result, and the fifth label. The pronunciation prediction result is the pronunciation corresponding to the data to be predicted.
In embodiments of the present disclosure, an output sequence characterizing a pronunciation prediction result may be obtained based on the input sequence and a pronunciation prediction model.
In one example, FIG. 4 is a schematic diagram illustrating an input sequence according to an exemplary embodiment. As shown in FIG. 4, the first label corresponds to SOT and identifies the beginning of the text. The data to be predicted is the data acquired when pronunciation prediction is performed and may include a text modality and an audio modality. The second label corresponds to SOP and identifies the beginning of the pronunciation. The third label is the target language label, which may be Chinese, English, or another language; it may be a target language selected by the user, or the model may determine the target language from the language of the data to be predicted. The fourth label is the task label; if the currently input data to be predicted is in a text modality and an audio modality, the fourth label may be the ST label, i.e., predicting pronunciation from audio and text, where the task label may be selected by the user based on the data modality of the data to be predicted. The pronunciation prediction result can be understood as the predicted pronunciation of the original text data. The fifth label marks the end of the pronunciation.
In an example, if the data to be predicted only includes an audio modality, the input sequence omits the first label and the data to be predicted, i.e., the SOT and the original text, or in other words these positions are ignored. The input sequence then starts from the second label, the start-of-pronunciation position, and the target language and task are selected to obtain the pronunciation prediction result of the current data to be predicted until the pronunciation ends.
In the embodiment of the disclosure, the input sequence of the pronunciation prediction model is constructed based on the data to be predicted, and the output sequence representing the pronunciation prediction result is obtained based on the input sequence and the pronunciation prediction model, so that the accuracy of pronunciation prediction is enhanced. Meanwhile, based on the input of the multi-modal data, the multi-modal data are mutually supervised, and the robustness of pronunciation prediction is improved.
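The decoding procedure described above can be sketched as a simple autoregressive loop. In the sketch below, `decoder_step` stands in for one forward pass of the decoder of the pronunciation prediction model (cross-attention over the encoder output is assumed to happen inside it); the function names and token spellings such as `<SOT>`, `<SOP>`, `<ZH>`, `<ST>`, and `<EOP>` are illustrative assumptions that follow the labels defined in this embodiment.

```python
def predict_pronunciation(decoder_step, text_tokens, language_tag="<ZH>",
                          task_tag="<ST>", max_steps=100):
    """Autoregressive decoding: grow the pronunciation result one token per step.

    decoder_step(sequence) -> next token; a hypothetical wrapper around the
    decoder of the pronunciation prediction model.
    """
    # First input sequence: first label (SOT), data to be predicted,
    # second label (SOP), third label (language), fourth label (task).
    sequence = ["<SOT>"] + list(text_tokens) + ["<SOP>", language_tag, task_tag]

    pronunciation = []
    for _ in range(max_steps):
        next_token = decoder_step(sequence)   # i-th decoding step
        if next_token == "<EOP>":             # fifth label: end of prediction
            break
        pronunciation.append(next_token)      # one more bit of the result
        sequence = sequence + [next_token]    # next input sequence
    return pronunciation


# Toy usage: a stub decoder that spells out a fixed pronunciation.
_canned = iter(["ni", "hao", "<EOP>"])
print(predict_pronunciation(lambda seq: next(_canned), ["你", "好"]))  # ['ni', 'hao']
```

In practice the loop would operate on token IDs rather than strings and would stop either at the EOP label or at a maximum output length.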
In the embodiments of the present disclosure, the data modalities corresponding to the multimodal data include an audio modality and a text modality. FIG. 5 is a flowchart illustrating a pronunciation prediction method according to an exemplary embodiment. As shown in FIG. 5, the method includes steps S41 to S42.
In step S41, the input sequence is used as a decoder prompt word, and the audio feature is used as an encoder input, so as to obtain the multi-mode joint information code.
In the embodiments of the present disclosure, the data modalities of the data to be predicted include an audio modality and a text modality, and the data modalities corresponding to the multimodal data include an audio modality and a text modality. An input sequence constructed based on the data to be predicted can be used as the decoder prompt, and the audio features can be used as the encoder input; based on the input sequence and the audio features, the multimodal joint information encoding is obtained in the AED architecture.
In the embodiments of the present disclosure, if the data modality of the data to be predicted includes the audio modality and does not include the text modality, the audio signal input to the encoder is the audio feature of the audio data included in the data to be predicted, and the positions of the first label and the data to be predicted in the decoder input are left empty. If the data modality of the data to be predicted includes the text modality and does not include the audio modality, the decoder input is the input sequence constructed based on the text modality, and the encoder input position is left empty.
In the above embodiments of the present disclosure, if the data modalities of the data to be predicted include both the audio modality and the text modality, this can be understood as dual-supervised prediction: the audio features are used as the encoder input and the input sequence containing the text modality is used as the decoder prompt to obtain the multimodal joint information encoding. If the data to be predicted only includes the audio modality, this can be understood as audio-only prediction: the audio features are used as the encoder input, the decoder prompt is emptied, i.e., the positions of the first label and the data to be predicted in the decoder input sequence are left empty, and the corresponding pronunciation is the decoder output. If the data to be predicted only contains the text modality, this can be understood as text-only prediction: the input sequence containing the text modality is used as the decoder prompt, the encoder input is emptied, and the corresponding pronunciation is the decoder output.
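The three cases above can be summarised in a small routing function; this is a sketch assuming the model exposes separate encoder and decoder input positions, and the dictionary keys and default label tokens are illustrative assumptions rather than names defined by this disclosure.

```python
def build_model_inputs(audio_features=None, text_sequence=None):
    """Route each available modality to its input position; missing ones stay empty.

    audio_features: encoder input (e.g. fbank frames), or None.
    text_sequence:  decoder prompt (SOT + data to be predicted + SOP + tags), or None.
    """
    if audio_features is not None and text_sequence is not None:
        # Dual-supervised prediction: both modalities are used jointly.
        return {"encoder_input": audio_features, "decoder_prompt": text_sequence}
    if audio_features is not None:
        # Audio-only: the SOT label and data-to-be-predicted positions stay empty,
        # so the prompt starts from the pronunciation-start, language, and task labels.
        return {"encoder_input": audio_features,
                "decoder_prompt": ["<SOP>", "<ZH>", "<S>"]}
    if text_sequence is not None:
        # Text-only: the encoder input position stays empty.
        return {"encoder_input": None, "decoder_prompt": text_sequence}
    raise ValueError("at least one modality is required")
```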
In step S42, the output sequence is obtained by the decoder based on the joint information encoding, and the pronunciation prediction result is obtained.
In the disclosed embodiment, the joint information encoding is fed into the AED architecture, and the decoder outputs the output sequence corresponding to the input sequence to obtain the pronunciation prediction result corresponding to the data to be predicted.
In one example, FIG. 6 is a schematic diagram illustrating pronunciation prediction based on an automatic encoder-decoder architecture according to an exemplary embodiment. As shown in FIG. 6, the left side of FIG. 6 is the encoder and the right side is the decoder. The encoder comprises a stack of convolutional layers, positional encoding, and N encoder blocks each consisting of a self-attention layer and a feed-forward layer. The decoder comprises the input sequence, positional encoding, N decoder blocks each consisting of a self-attention layer, a cross-attention layer, and a feed-forward layer, and the output sequence.
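A structural sketch of this encoder-decoder stack is given below in PyTorch; the use of torch.nn.Transformer building blocks, the learned positional encoding, the class name, and all hyperparameter values are assumptions made for illustration rather than details fixed by FIG. 6.

```python
import torch
import torch.nn as nn


class AEDPronunciationModel(nn.Module):
    """Sketch of an encoder-decoder (AED) pronunciation prediction model."""

    def __init__(self, vocab_size=8000, n_mels=80, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Encoder: convolutional subsampling of audio features + positional encoding.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.pos_enc = nn.Embedding(4096, d_model)  # learned positional encoding (assumed)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Decoder: token embedding + self-attention + cross-attention + feed-forward.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, token_ids):
        # audio_feats: (batch, frames, n_mels); token_ids: (batch, seq_len)
        x = self.subsample(audio_feats.transpose(1, 2)).transpose(1, 2)
        x = x + self.pos_enc(torch.arange(x.size(1), device=x.device))
        memory = self.encoder(x)                      # multimodal joint information encoding
        y = self.embed(token_ids)
        y = y + self.pos_enc(torch.arange(y.size(1), device=y.device))
        causal = nn.Transformer.generate_square_subsequent_mask(y.size(1)).to(y.device)
        h = self.decoder(y, memory, tgt_mask=causal)  # cross-attention over encoder output
        return self.out(h)                            # next-token logits for the output sequence
```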
For example, if the data modalities of the data to be predicted include an audio modality and a text modality, this can be understood as dual-supervised prediction: the audio features are used as the encoder input and the input sequence containing the text modality is used as the decoder prompt to obtain the multimodal joint information encoding; based on the joint information encoding, the decoder outputs the corresponding output sequence to obtain the pronunciation prediction result. The sequence below the decoder portion is the input sequence and the sequence above it is the output sequence. Illustratively, as shown in FIG. 6, the input sequence includes the SOT text start label, the data to be predicted "你好" (hello), the SOP pronunciation start label, the target language label ZH, the task label ST, and the already predicted pronunciation "ni". The input sequence is fed to the decoder to obtain the output sequence. The output sequence includes the data to be predicted "你好", the SOP pronunciation start label, the target language label ZH, the task label ST, and the pronunciation prediction result "nihao"; that is, the pronunciation prediction result "nihao" of the data to be predicted "你好" can be obtained through the AED architecture.
For example, if the data to be predicted only includes an audio modality, this can be understood as audio-only prediction: the audio features are used as the encoder input, the decoder prompt is left blank, and the corresponding pronunciation is the decoder output. If the data to be predicted only contains a text modality, this can be understood as text-only prediction: the input sequence containing the text modality is used as the decoder prompt, the encoder input position is left empty, and the corresponding pronunciation is the decoder output.
In the embodiments of the present disclosure, using the AED architecture as the pronunciation prediction model makes it possible to effectively fuse audio-modality and text-modality information and improve the accuracy of pronunciation prediction. Meanwhile, the multimodal data input makes full use of the correlation between the audio modality and the text modality and improves the robustness of the AED architecture. The model can also accept single-modality data input and perform pronunciation prediction on the data to be processed, so it can cope with a variety of complex conditions.
In the embodiment of the disclosure, the pronunciation prediction model is trained in the following manner. FIG. 7 is a flowchart illustrating pronunciation prediction model training according to an exemplary embodiment. As shown in fig. 7, the method includes steps S51 to S53.
In step S51, a training sample data set is acquired, where the training sample data set includes training sample data pairs, each including modal data of the multimodal data and a pronunciation result.
In an embodiment of the present disclosure, the training sample data set includes training sample data pairs, and each training sample data pair includes modal data of the multimodal data and a pronunciation result. Taking an audio modality and a text modality as an example, the training sample data set may include audio-modality data and the corresponding pronunciation result, text-modality data and the corresponding pronunciation result, or audio-modality data, the corresponding text-modality data, and the pronunciation result. The training sample data set may be derived from corpora of various audio recognition, audio synthesis, and text processing tasks. For example, data on the order of a million hours may be used to train a large model at the 1B (billion-parameter) level, which helps to improve the modeling capability of the model; covering multiple languages, accents, dialects, foreign words, noise, polyphonic characters, homophones, and so on helps the pronunciation prediction model trained on the training sample data set handle multiple languages and dialects as well as new words it has not seen.
In step S52, data features corresponding to the respective modal data are extracted, respectively, to obtain multi-modal data features.
In the embodiments of the present disclosure, features can be extracted from the audio modality in the data to be predicted to obtain audio features. Spectral features such as Mel-frequency cepstral coefficients or filter banks (fbank) are extracted from the audio modality, and the resulting audio features are input into the encoder. For the text modality, text features such as word vectors or statistics-based features are extracted using natural language processing or other language processing techniques. The data features corresponding to each piece of modal data are extracted separately to obtain the multimodal data features.
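For the audio modality, a filter-bank (fbank) front end of the kind mentioned above might be computed as in the following sketch; the use of torchaudio and the specific frame parameters are assumptions for illustration, not choices fixed by this disclosure.

```python
import torch
import torchaudio


def extract_fbank(wav_path: str, n_mels: int = 80) -> torch.Tensor:
    """Load a waveform and compute log-Mel filter-bank (fbank) features."""
    waveform, sample_rate = torchaudio.load(wav_path)        # (channels, samples)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=n_mels, sample_frequency=sample_rate,
        frame_length=25.0, frame_shift=10.0)                  # (frames, n_mels)
    return fbank
```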
In step S53, the first modal data feature is used as the encoder input and/or the second modal data feature is used as the decoder prompt, the pronunciation result is used as the decoder output, and the prediction model of the automatic encoder-decoder architecture is trained autoregressively to obtain the pronunciation prediction model.
In the embodiments of the present disclosure, a first modal data feature of the multimodal data features is used as the encoder input and/or a second modal data feature of the multimodal data features is used as the decoder prompt; the first modal data and/or the second modal data are fed into the AED architecture, and the pronunciation result serves as the decoder output. The prediction model of the AED architecture is trained autoregressively to obtain the pronunciation prediction model.
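A single teacher-forced training step consistent with this autoregressive training could look as follows; it reuses the `AEDPronunciationModel` sketch from the architecture description above, and the cross-entropy loss, padding handling, and optimizer interface are assumptions rather than choices stated in this disclosure.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, audio_feats, input_ids, target_ids, pad_id=0):
    """One autoregressive (teacher-forced) update of the AED prediction model.

    input_ids:  decoder prompt plus pronunciation tokens, shifted right.
    target_ids: the same sequence shifted left, ending with the EOP label.
    """
    model.train()
    logits = model(audio_feats, input_ids)                    # (batch, seq_len, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1),
        ignore_index=pad_id)                                   # ignore padded positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```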
In the embodiments of the present disclosure, the rich training sample data set improves the modeling capability of the model and allows it to cope with various complex conditions, while the AED architecture, which shows excellent performance in end-to-end audio recognition, supports multimodal data as well as multilingual information. The model's autoregressive capability is fully exploited in combination with the text information to enhance the accuracy of pronunciation prediction.
The following describes a pronunciation prediction method according to an embodiment of the present disclosure. When pronunciation prediction is performed, the data to be predicted is obtained, the data modality of the data to be predicted is analyzed, and pronunciation prediction is performed on the data to be predicted based on the pronunciation prediction model and the data modality to obtain a pronunciation prediction result. If the multimodal data comprises a text modality and an audio modality and the data to be predicted is consistent with the data modalities corresponding to the multimodal data, then, based on the pronunciation prediction model, the audio features of the audio modality are input into the encoder of the pronunciation prediction model, the input sequence of the text modality is input into the decoder of the pronunciation prediction model, the output sequence representing the pronunciation prediction result is obtained from the decoder, and the pronunciation prediction result of the data to be predicted is determined. The pronunciation prediction model may use an AED architecture.
In the embodiments of the present disclosure, if the multimodal data includes a text modality and an audio modality and the data to be predicted only includes the text modality, the input sequence of the text modality is input into the decoder of the pronunciation prediction model and the pronunciation prediction result is output by the decoder. If the data to be predicted only includes the audio modality, the audio features of the audio modality are input into the encoder of the pronunciation prediction model and the pronunciation prediction result is output by the decoder of the pronunciation prediction model.
In the embodiments of the present disclosure, the pronunciation prediction model can perform pronunciation prediction on the acquired audio modality and/or text modality, and the pronunciation prediction result is obtained by fusing the multimodal information of audio and text, which improves the accuracy of pronunciation prediction; the dual-supervised learning of the audio modality and the text modality also improves the robustness of the pronunciation prediction model. For a given text modality, the pronunciation prediction model outputs a pronunciation prediction result with high accuracy; for a given audio modality, it likewise outputs a pronunciation prediction result with high accuracy. Accurate pronunciation prediction results can thus be obtained from the pronunciation prediction model for both multimodal and single-modality inputs, giving the method wide applicability.
Based on the same conception, the embodiment of the present disclosure also provides a pronunciation prediction device 100.
It should be understood that, in order to implement the above-described functions, the pronunciation prediction device 100 provided in the embodiments of the present disclosure includes corresponding hardware structures and/or software modules that perform the respective functions. The disclosed embodiments may be implemented in hardware or a combination of hardware and computer software, in combination with the various example elements and algorithm steps disclosed in the embodiments of the disclosure. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the embodiments of the present disclosure.
Fig. 8 is a block diagram illustrating a pronunciation prediction device 100 according to an exemplary embodiment. Referring to fig. 8, the apparatus includes an acquisition unit 101, a prediction unit 102.
The obtaining unit 101 is configured to obtain data to be predicted, and analyze a data modality of the data to be predicted.
The prediction unit 102 is configured to perform pronunciation prediction on the data to be predicted based on the pronunciation prediction model and the data modality, so as to obtain a pronunciation prediction result; the input of the pronunciation prediction model is multimodal data, and the output of the pronunciation prediction model is a pronunciation prediction result.
In one embodiment, the prediction unit 102 performs pronunciation prediction on the data to be predicted based on the pronunciation prediction model and the data modality as follows: if the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data, pronunciation prediction is performed on the data to be predicted based on the data modalities corresponding to the multimodal data; if the data modality of the data to be predicted is inconsistent with the data modalities corresponding to the multimodal data, the input position of the first data modality is left empty and pronunciation prediction is performed on the data to be predicted based on the second data modality and the emptied first data modality. The first data modality is the data modality, among the data modalities corresponding to the multimodal data, that is inconsistent with the data modality of the data to be predicted; the second data modality is the data modality, among the data modalities corresponding to the multimodal data, that is consistent with the data modality of the data to be predicted.
In one embodiment, the pronunciation prediction model is a model with an automatic encoder-decoder (AED) architecture, and the prediction unit 102 performs pronunciation prediction on the data to be predicted as follows: an input sequence of the decoder of the pronunciation prediction model is constructed based on the data to be predicted. During the first decoding, a first input sequence is constructed that sequentially comprises a first label, the data to be predicted, a second label, a third label, and a fourth label; pronunciation prediction is performed on the data to be predicted based on the first input sequence to obtain a first output sequence, which sequentially comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing 1 bit. During the i-th decoding, a second input sequence is constructed based on the previous output sequence and sequentially comprises the first label, the data to be predicted, the second label, the third label, the fourth label, and an (i-1)-bit pronunciation prediction result; pronunciation prediction is performed on the data to be predicted based on the second input sequence to obtain a second output sequence, which sequentially comprises the data to be predicted, the second label, the third label, the fourth label, and a pronunciation prediction result containing i bits, where i is an integer greater than 1. The above process is repeated until the decoder of the pronunciation prediction model outputs an output sequence containing a fifth label, yielding the pronunciation prediction result; this output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label, the pronunciation prediction result, and the fifth label. The first label identifies the starting position of the data to be predicted, the second label identifies the start of pronunciation prediction, the third label identifies the language corresponding to the data to be predicted, the fourth label identifies the data modality corresponding to the data to be predicted, and the fifth label identifies the end of pronunciation prediction.
In one embodiment, the data modalities corresponding to the multimodal data include an audio modality and a text modality, and the prediction unit 102 performs pronunciation prediction on the data to be predicted as follows: the input sequence is used as the decoder prompt and the audio features are used as the encoder input to obtain the multimodal joint information encoding, wherein if the data modality of the data to be predicted includes the audio modality but not the text modality, the encoder input is the audio feature of the audio modality included in the data to be predicted and the positions of the first label and the data to be predicted in the decoder input sequence are left empty, and if the data modality of the data to be predicted includes the text modality but not the audio modality, the decoder input is the input sequence of the text modality included in the data to be predicted and the encoder input position is left empty; based on the joint information encoding, the decoder outputs the output sequence to obtain the pronunciation prediction result.
In one embodiment, the pronunciation prediction model is pre-trained in the following manner: a training sample data set is acquired, where the training sample data set includes training sample data pairs and each training sample data pair includes modal data of the multimodal data and a pronunciation result; the data features corresponding to each piece of modal data are extracted separately to obtain multimodal data features; and the first modal data feature of the multimodal data features is used as the encoder input and/or the second modal data feature is used as the decoder prompt, the pronunciation result is used as the decoder output, and the prediction model of the automatic encoder-decoder (AED) architecture is trained autoregressively to obtain the pronunciation prediction model.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 9 is a block diagram of an electronic device 200 for pronunciation prediction, according to an exemplary embodiment.
As shown in FIG. 9, one embodiment of the present disclosure provides an electronic device 200. The electronic device 200 includes, among other things, a memory 201, a processor 202, and an input/output (I/O) interface 203. The memory 201 is used to store instructions, and the processor 202 is configured to invoke the instructions stored in the memory 201 to execute the pronunciation prediction method according to the embodiments of the present disclosure. The processor 202 is coupled to the memory 201 and the I/O interface 203, for example via a bus system and/or another form of connection mechanism (not shown). The memory 201 may be used to store programs and data, including the programs of the pronunciation prediction method involved in the embodiments of the present disclosure, and the processor 202 performs the various functional applications and data processing of the electronic device 200 by running the programs stored in the memory 201.
The processor 202 in the embodiments of the present disclosure may be implemented in at least one hardware form of a digital signal processor (DSP), a field-programmable gate array (FPGA), or a programmable logic array (PLA); the processor 202 may be a central processing unit (CPU), or one or a combination of several other forms of processing unit having data processing and/or instruction execution capabilities.
The memory 201 in embodiments of the present disclosure may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
In the embodiments of the present disclosure, the I/O interface 203 may be used to receive input (for example, numeric or character information, or key signal input related to user settings and function control of the electronic apparatus 200) and to output various information (for example, images or sounds) to the outside. The I/O interface 203 may include one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a mouse, a joystick, a trackball, a microphone, a speaker, and a touch panel.
In some embodiments, the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform any of the methods described above.
In some embodiments, the present disclosure provides a computer program product comprising a computer program that, when executed by a processor, performs any of the methods described above.
Although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
The methods and apparatus of the present disclosure can be implemented using standard programming techniques with various method steps being performed using rule-based logic or other logic. It should also be noted that the words "apparatus" and "module" as used herein and in the claims are intended to include implementations using one or more lines of software code and/or hardware implementations and/or equipment for receiving inputs.
Any of the steps, operations, or procedures described herein may be performed or implemented using one or more hardware or software modules alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code capable of being executed by a computer processor for performing any or all of the described steps, operations, or programs.
The foregoing description of implementations of the present disclosure has been presented for purposes of illustration and description. The foregoing description is not intended to be exhaustive or to limit the disclosure to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosure. The embodiments were chosen and described in order to explain the principles of the present disclosure and its practical application to enable one skilled in the art to utilize the present disclosure in various embodiments and with various modifications as are suited to the particular use contemplated.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be elaborated here.
It is understood that the term "plurality" in this disclosure means two or more, and other quantifiers are to be understood similarly. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship. The singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It is further understood that the terms "first," "second," and the like are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the expressions "first", "second", etc. may be used entirely interchangeably. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that "connected" includes both direct connection where no other member is present and indirect connection where other element is present, unless specifically stated otherwise.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the scope of the appended claims.

Claims (10)

1. A pronunciation prediction method, comprising:
acquiring data to be predicted, and analyzing a data modality of the data to be predicted;
Based on a pronunciation prediction model and the data modality, carrying out pronunciation prediction on the data to be predicted to obtain a pronunciation prediction result;
the input of the pronunciation prediction model is multimodal data, and the output of the pronunciation prediction model is a pronunciation prediction result;
The pronunciation prediction model is a model with an encoder-decoder architecture;
the performing pronunciation prediction on the data to be predicted includes:
Constructing an input sequence of a decoder of the pronunciation prediction model based on the data to be predicted;
During the first decoding, a first input sequence is constructed which sequentially comprises: a first label, the data to be predicted, a second label, a third label and a fourth label; pronunciation prediction is performed on the data to be predicted based on the first input sequence to obtain a first output sequence, wherein the first output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label and a pronunciation prediction result containing one bit;
During the ith decoding, a second input sequence is constructed based on the first output sequence, and the second input sequence sequentially comprises: the first label, the data to be predicted, the second label, the third label, the fourth label and an (i-1)-bit pronunciation prediction result; pronunciation prediction is performed on the data to be predicted based on the second input sequence to obtain a second output sequence, wherein the second output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label and a pronunciation prediction result containing i bits, and i is an integer greater than 1;
Repeatedly executing the above process until the decoder of the pronunciation prediction model outputs an output sequence containing a fifth label, so as to obtain the pronunciation prediction result, wherein this output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label, the pronunciation prediction result and the fifth label;
The first label is used for identifying the starting position of the data to be predicted, the second label is used for identifying the start of pronunciation prediction, the third label is used for identifying the language corresponding to the data to be predicted, the fourth label is used for identifying the data modality corresponding to the data to be predicted, and the fifth label is used for identifying the end of pronunciation prediction.
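For purposes of illustration only, the autoregressive decoding loop recited in claim 1 could be organized as in the following Python sketch. The label token strings, the decoder_step callable, and the max_len limit are hypothetical placeholders introduced for this example and are not part of the claimed implementation.

    # Illustrative sketch of the claim-1 decoding loop; all names are hypothetical.
    BOS_DATA = "<data>"       # first label: start position of the data to be predicted
    BOS_PRON = "<pron>"       # second label: start of pronunciation prediction
    LANG_TAG = "<lang:en>"    # third label: language of the data to be predicted
    MODE_TAG = "<mode:text>"  # fourth label: data modality of the data to be predicted
    EOS_PRON = "<eop>"        # fifth label: end of pronunciation prediction

    def predict_pronunciation(decoder_step, data_tokens, max_len=64):
        """Autoregressively decode a pronunciation sequence.

        decoder_step is a hypothetical callable that maps the current input
        sequence to the next pronunciation token.
        """
        # First decoding: first label, data to be predicted, second/third/fourth labels.
        prompt = [BOS_DATA, *data_tokens, BOS_PRON, LANG_TAG, MODE_TAG]
        pronunciation = []
        for _ in range(max_len):
            # i-th decoding: the prompt plus the (i-1)-bit prediction so far.
            next_token = decoder_step(prompt + pronunciation)
            if next_token == EOS_PRON:  # fifth label ends the prediction
                break
            pronunciation.append(next_token)
        return pronunciation

In this sketch, each pass through the loop corresponds to one decoding step of claim 1, with the accumulated pronunciation tokens playing the role of the (i-1)-bit prediction result in the ith input sequence.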
2. The method of claim 1, wherein the performing pronunciation prediction on the data to be predicted based on the pronunciation prediction model and the data modality comprises:
if the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data, performing pronunciation prediction on the data to be predicted based on the data modalities corresponding to the multimodal data;
if the data modality of the data to be predicted is inconsistent with the data modalities corresponding to the multimodal data, leaving the input position of a first data modality empty, and performing pronunciation prediction on the data to be predicted based on a second data modality and the empty first data modality;
wherein the first data modality is, among the data modalities corresponding to the multimodal data, a data modality inconsistent with the data modality of the data to be predicted;
and the second data modality is, among the data modalities corresponding to the multimodal data, a data modality consistent with the data modality of the data to be predicted.
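As a rough, non-authoritative illustration of the modality handling in claim 2, the helper below leaves the input position of any modality that is absent from the data to be predicted empty. The function name and the dictionary layout are assumptions made for this sketch only.

    # Hypothetical helper illustrating claim 2: absent modalities keep an empty input position.
    def build_modal_inputs(data, modalities=("audio", "text")):
        """Return one input per modality; a modality missing from the data to be
        predicted (the "first data modality") stays empty (None)."""
        inputs = {}
        for modality in modalities:
            inputs[modality] = data.get(modality)  # present ("second") modality, or None
        return inputs

    # Example: text-only data leaves the audio input position empty.
    # build_modal_inputs({"text": "tomato"}) -> {"audio": None, "text": "tomato"}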
3. The method of claim 1, wherein the data modalities corresponding to the multimodal data include an audio modality and a text modality;
The performing pronunciation prediction on the data to be predicted comprises the following steps:
Taking the input sequence as a prompt of the decoder and taking audio features as the input of an encoder to obtain a multimodal joint information encoding, wherein if the data modality of the data to be predicted includes the audio modality and does not include the text modality, the input of the encoder is the audio features of the audio modality included in the data to be predicted, and the first-label and data-to-be-predicted positions of the input sequence of the decoder are empty; and if the data modality of the data to be predicted includes the text modality and does not include the audio modality, the input of the decoder is the input sequence of the text modality included in the data to be predicted, and the input position of the encoder is empty;
And outputting, by the decoder, the output sequence based on the joint information encoding to obtain the pronunciation prediction result.
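The audio-to-encoder and text-to-decoder-prompt split recited in claim 3 might be wired up as in the following sketch; model.encode and model.decode are hypothetical methods assumed for illustration and do not denote any particular library API.

    # Hypothetical sketch of the claim-3 flow; model.encode / model.decode are assumed methods.
    def run_prediction(model, audio_features=None, text_sequence=None):
        # Audio modality present: its features feed the encoder; otherwise the
        # encoder input position is left empty.
        memory = model.encode(audio_features) if audio_features is not None else None
        # Text modality present: the input sequence serves as the decoder prompt;
        # otherwise the first-label / data positions of the prompt stay empty.
        prompt = text_sequence if text_sequence is not None else []
        # The decoder emits the output sequence from the joint information encoding.
        return model.decode(prompt, memory)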
4. The method of claim 1, wherein the pronunciation prediction model is pre-trained by:
Acquiring a training sample data set, wherein the training sample data set includes training sample data pairs, and each training sample data pair includes modal data in multimodal data and a pronunciation result;
Respectively extracting data features corresponding to each modal data to obtain multimodal data features;
And taking a first modal data feature of the multimodal data features as an encoder input and/or taking a second modal data feature as a decoder prompt, taking the pronunciation result as the decoder output, and performing autoregressive training on a prediction model with an encoder-decoder architecture to obtain the pronunciation prediction model.
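One conceivable shape of the autoregressive pre-training in claim 4 is sketched below in PyTorch-style Python. The model signature, tensor shapes, and teacher-forcing arrangement are assumptions made for this example, not the claimed training procedure.

    import torch
    import torch.nn as nn

    def train_step(model, optimizer, audio_feats, prompt_ids, target_ids):
        """One hypothetical autoregressive training step (teacher forcing).

        model is assumed to be an encoder-decoder returning per-token logits;
        audio_feats -> encoder input, prompt_ids -> decoder prompt,
        target_ids -> pronunciation result used as the decoder output target.
        """
        model.train()
        optimizer.zero_grad()
        # Decoder input: prompt followed by the target shifted right by one position.
        decoder_in = torch.cat([prompt_ids, target_ids[:, :-1]], dim=1)
        logits = model(audio_feats, decoder_input_ids=decoder_in)
        # Only the trailing positions predict the pronunciation tokens.
        tail = target_ids.size(1) - 1
        loss = nn.functional.cross_entropy(
            logits[:, -tail:, :].reshape(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1),
        )
        loss.backward()
        optimizer.step()
        return loss.item()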
5. A pronunciation prediction device, comprising:
an acquisition unit, used for acquiring data to be predicted and analyzing a data modality of the data to be predicted;
the prediction unit is used for performing pronunciation prediction on the data to be predicted based on a pronunciation prediction model and the data modality to obtain a pronunciation prediction result;
the input of the pronunciation prediction model is multimodal data, and the output of the pronunciation prediction model is a pronunciation prediction result;
The pronunciation prediction model is a model with an encoder-decoder architecture;
The prediction unit predicts pronunciation of the data to be predicted in the following manner:
Constructing an input sequence of a decoder of the pronunciation prediction model based on the data to be predicted;
During the first decoding, a first input sequence is constructed which sequentially comprises: a first label, the data to be predicted, a second label, a third label and a fourth label; pronunciation prediction is performed on the data to be predicted based on the first input sequence to obtain a first output sequence, wherein the first output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label and a pronunciation prediction result containing one bit;
During the ith decoding, a second input sequence is constructed based on the first output sequence, and the second input sequence sequentially comprises: the first label, the data to be predicted, the second label, the third label, the fourth label and an (i-1)-bit pronunciation prediction result; pronunciation prediction is performed on the data to be predicted based on the second input sequence to obtain a second output sequence, wherein the second output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label and a pronunciation prediction result containing i bits, and i is an integer greater than 1;
Repeatedly executing the above process until the decoder of the pronunciation prediction model outputs an output sequence containing a fifth label, so as to obtain the pronunciation prediction result, wherein this output sequence sequentially comprises the data to be predicted, the second label, the third label, the fourth label, the pronunciation prediction result and the fifth label;
The first label is used for identifying the starting position of the data to be predicted, the second label is used for identifying the start of pronunciation prediction, the third label is used for identifying the language corresponding to the data to be predicted, the fourth label is used for identifying the data modality corresponding to the data to be predicted, and the fifth label is used for identifying the end of pronunciation prediction.
6. The apparatus according to claim 5, wherein the prediction unit performs pronunciation prediction on the data to be predicted based on the pronunciation prediction model and the data modality in the following manner:
if the data modality of the data to be predicted is consistent with the data modalities corresponding to the multimodal data, performing pronunciation prediction on the data to be predicted based on the data modalities corresponding to the multimodal data;
if the data modality of the data to be predicted is inconsistent with the data modalities corresponding to the multimodal data, leaving the input position of a first data modality empty, and performing pronunciation prediction on the data to be predicted based on a second data modality and the empty first data modality;
wherein the first data modality is, among the data modalities corresponding to the multimodal data, a data modality inconsistent with the data modality of the data to be predicted;
and the second data modality is, among the data modalities corresponding to the multimodal data, a data modality consistent with the data modality of the data to be predicted.
7. The apparatus of claim 5, wherein the data modalities corresponding to the multimodal data include an audio modality and a text modality;
The prediction unit performs pronunciation prediction on the data to be predicted in the following manner:
Taking the input sequence as a prompt of the decoder and taking audio features as the input of an encoder to obtain a multimodal joint information encoding, wherein if the data modality of the data to be predicted includes the audio modality and does not include the text modality, the input of the encoder is the audio features of the audio modality included in the data to be predicted, and the first-label and data-to-be-predicted positions of the input sequence of the decoder are empty; and if the data modality of the data to be predicted includes the text modality and does not include the audio modality, the input of the decoder is the input sequence of the text modality included in the data to be predicted, and the input position of the encoder is empty;
And outputting, by the decoder, the output sequence based on the joint information encoding to obtain the pronunciation prediction result.
8. The apparatus of claim 5, wherein the pronunciation prediction model is pre-trained by:
Acquiring a training sample data set, wherein the training sample data set includes training sample data pairs, and each training sample data pair includes modal data in multimodal data and a pronunciation result;
Respectively extracting data features corresponding to each modal data to obtain multimodal data features;
And taking a first modal data feature of the multimodal data features as an encoder input and/or taking a second modal data feature as a decoder prompt, taking the pronunciation result as the decoder output, and performing autoregressive training on a prediction model with an encoder-decoder architecture to obtain the pronunciation prediction model.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the pronunciation prediction method of any one of claims 1 to 4.
10. A storage medium having instructions stored therein that, when executed by a processor, enable the processor to perform the pronunciation prediction method of any one of claims 1 to 4.
CN202410921678.7A 2024-07-10 2024-07-10 Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium Active CN118471266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410921678.7A CN118471266B (en) 2024-07-10 2024-07-10 Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410921678.7A CN118471266B (en) 2024-07-10 2024-07-10 Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium

Publications (2)

Publication Number Publication Date
CN118471266A (en) 2024-08-09
CN118471266B (en) 2024-09-03

Family

ID=92168538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410921678.7A Active CN118471266B (en) 2024-07-10 2024-07-10 Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN118471266B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN114373443A (en) * 2022-01-14 2022-04-19 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, computing device, storage medium, and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565533A (en) * 2022-09-21 2023-01-03 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN115762489A (en) * 2022-10-27 2023-03-07 阿里巴巴达摩院(杭州)科技有限公司 Data processing system and method of voice recognition model and voice recognition method
CN116092485A (en) * 2023-01-30 2023-05-09 上海安亭地平线智能交通技术有限公司 Training method and device of voice recognition model, and voice recognition method and device


Also Published As

Publication number Publication date
CN118471266A (en) 2024-08-09

Similar Documents

Publication Publication Date Title
JP6818941B2 (en) How to Train Multilingual Speech Recognition Networks, Speech Recognition Systems and Multilingual Speech Recognition Systems
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN110223671B (en) Method, device, system and storage medium for predicting prosodic boundary of language
CN110459202B (en) Rhythm labeling method, device, equipment and medium
CN109670185B (en) Text generation method and device based on artificial intelligence
CN113380223B (en) Method, device, system and storage medium for disambiguating polyphone
Das et al. Best of both worlds: Robust accented speech recognition with adversarial transfer learning
CN112287680B (en) Entity extraction method, device and equipment of inquiry information and storage medium
CN111192576A (en) Decoding method, speech recognition device and system
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
CN114999463B (en) Voice recognition method, device, equipment and medium
CN115171176A (en) Object emotion analysis method and device and electronic equipment
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
JP4820240B2 (en) Word classification device, speech recognition device, and word classification program
Yeo et al. Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN118471266B (en) Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium
CN116597807A (en) Speech synthesis method, device, equipment and medium based on multi-scale style
CN111816164A (en) Method and apparatus for speech recognition
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
WO2020235024A1 (en) Information learning device, information processing device, information learning method, information processing method, and program
CN115910065A (en) Lip language identification method, system and medium based on subspace sparse attention mechanism
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant