CN113111652B

CN113111652B - Data processing method and device and computing equipment

Info

Publication number: CN113111652B
Application number: CN202010050687.5A
Authority: CN
Inventors: 陈梦喆; 陈谦; 李博
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2024-02-13
Anticipated expiration: 2040-01-13
Also published as: CN113111652A

Abstract

The embodiment of the application provides a data processing method, a data processing device and computing equipment. The method comprises the following steps: performing voice recognition on the acquired user voice, and converting the acquired user voice into a voice recognition text; determining a plurality of elements obtained by segmentation of the voice recognition text; executing task processing of a plurality of task types on the plurality of elements by using a text processing model; and obtaining a target text corresponding to the voice recognition text according to task processing results of the plurality of task types respectively corresponding to each element. The text processing model is obtained based on training labels which are obtained by cutting training samples and correspond to the task types respectively. According to the technical scheme, the calculated amount is reduced, and the processing accuracy is improved.

Description

Data processing method and device and computing equipment

Technical Field

The embodiment of the application relates to the technical field of data processing, in particular to a data processing method, a data processing device and computing equipment.

Background

In practical applications, there is a need to perform different classification processing on a data object to obtain a data object meeting the expected requirement, for example, in the field of natural language processing, there is often a need for different processing of text or voice.

In current data processing approaches, taking a voice transcription scene as an example, the text obtained by voice recognition often lacks punctuation, contains words that make sentences unsmooth, and usually expresses numbers as Chinese numerals rather than Arabic numerals. Outputting the voice recognition text directly degrades the user experience and raises the reading cost, so the voice recognition text needs text processing, also called post-processing. Because the voice recognition text has several kinds of problems, post-processing currently often requires a plurality of processing models. For example, the voice recognition text is first smoothed, punctuation marks are then added to the smoothed text, and inverse text normalization is finally applied to the punctuated text to obtain the target text.

However, this data processing approach has a long processing link and a large calculation amount, and its processing results are not accurate enough.

Disclosure of Invention

The embodiment of the application provides a data processing method, a data processing device and computing equipment, which are used for solving the technical problems of large calculated amount and low accuracy of processing results in the prior art.

In a first aspect, an embodiment of the present application provides a data processing method, including:

performing voice recognition on the acquired user voice, and converting the acquired user voice into a voice recognition text;

determining a plurality of elements obtained by segmentation of the voice recognition text;

executing task processing of a plurality of task types on the plurality of elements by using a text processing model;

obtaining a target text corresponding to the voice recognition text according to task processing results of the plurality of task types respectively corresponding to each element;

and outputting the target text.

In a second aspect, an embodiment of the present application provides a data processing method, including:

collecting user voice and sending the user voice to a server side, so that the server side can perform voice recognition on the user voice to convert the user voice into voice recognition texts, determine a plurality of elements obtained by segmentation of the voice recognition texts, respectively execute task processing of a plurality of task types on the plurality of elements by using a text processing model, and obtain target texts corresponding to the voice recognition texts according to task processing results of the plurality of task types respectively corresponding to each element;

acquiring the target text sent by the server;

and displaying the target text.

In a third aspect, an embodiment of the present application provides a data processing method, including:

inputting the voice to be processed into a voice processing model;

performing task processing of a plurality of task types on the user voice by utilizing the voice processing model;

obtaining target voice according to the task processing results of the voice to be processed corresponding to the task types;

and outputting the target voice.

In a fourth aspect, an embodiment of the present application provides a data processing method, including:

inputting the text to be processed into a speech synthesis model;

executing task processing of a plurality of task types on the text to be processed by utilizing the voice synthesis model;

obtaining target voice according to task processing results of the text to be processed corresponding to the task types;

and outputting the target voice.

In a fifth aspect, an embodiment of the present application provides a data processing method, including:

inputting the text to be processed into a speech synthesis model;

respectively executing at least one group of task processing on the text to be processed by utilizing the voice synthesis model; wherein each set of task processes corresponds to a plurality of task types;

obtaining target voice corresponding to each group of task processing according to task processing results of the text to be processed corresponding to a plurality of task types in each group;

and outputting the at least one target voice respectively.

In a sixth aspect, an embodiment of the present application provides a data processing method, including:

determining a plurality of elements obtained by segmenting a text to be processed;

and obtaining a target text corresponding to the text to be processed according to task processing results of the plurality of task types respectively corresponding to each element.

In a seventh aspect, an embodiment of the present application provides a data processing method, including:

determining a plurality of sample elements obtained by cutting a training sample;

determining training labels of the plurality of sample elements corresponding to a plurality of task types respectively;

training the text processing model by using the plurality of sample elements and training labels of the plurality of sample elements;

the text processing model is used for respectively executing task processing of a plurality of task types for a plurality of elements obtained by segmenting a text to be processed; and obtaining a target text corresponding to the text to be processed according to task processing results of the plurality of task types respectively corresponding to each element.

In an eighth aspect, in an embodiment of the present application, there is provided a data processing apparatus, including:

the voice recognition module is used for carrying out voice recognition on the acquired user voice and converting the acquired user voice into voice recognition text;

the first determining module is used for determining a plurality of elements obtained by segmentation of the voice recognition text;

the first processing module is used for respectively executing task processing of a plurality of task types on the plurality of elements by utilizing a text processing model;

the second processing module is used for obtaining a target text corresponding to the voice recognition text according to task processing results of the plurality of task types respectively corresponding to each element;

and the first output module is used for outputting the target text.

In a ninth aspect, an embodiment of the present application provides a data processing apparatus, including:

the voice acquisition module is used for acquiring user voice and sending the user voice to the server side so that the server side can perform voice recognition and convert the user voice into voice recognition texts, determine a plurality of elements obtained by segmentation of the voice recognition texts, respectively execute task processing of a plurality of task types on the plurality of elements by using a text processing model, and respectively obtain target texts corresponding to the voice recognition texts according to task processing results of the plurality of task types corresponding to each element;

the text acquisition module is used for acquiring the target text sent by the server;

and the text display module is used for displaying the target text.

In a tenth aspect, an embodiment of the present application provides a data processing apparatus, including:

the third processing module is used for inputting the voice to be processed into the voice processing model; performing task processing of a plurality of task types on the user voice by utilizing the voice processing model;

the fourth processing module is used for obtaining target voice according to task processing results of the user voice corresponding to the task types;

and the second output module is used for outputting the target voice.

In an eleventh aspect, in an embodiment of the present application, there is provided a data processing apparatus, including:

the fifth processing module is used for inputting the text to be processed into the voice synthesis model; executing task processing of a plurality of task types on the text to be processed by utilizing the voice synthesis model;

a sixth processing module, configured to obtain a target voice according to a task processing result of the text to be processed corresponding to the plurality of task types;

and the third output module is used for outputting the target voice.

In a twelfth aspect, in an embodiment of the present application, there is provided a data processing apparatus, including:

The seventh processing module is used for inputting the text to be processed into the voice synthesis model; respectively executing at least one group of task processing on the text to be processed by utilizing the voice synthesis model; wherein each set of task processes corresponds to a plurality of task types;

the eighth processing module is used for obtaining target voice corresponding to each group of task processing according to task processing results of the text to be processed corresponding to the task types in each group;

and the fourth output module is used for outputting at least one target voice.

In a thirteenth aspect, in an embodiment of the present application, there is provided a computing device including a processing component and a storage component;

the storage component stores one or more computer instructions; the one or more computer instructions are to be invoked for execution by the processing component;

the processing assembly is configured to:

and outputting the target text.

In a fourteenth aspect, an embodiment of the present application provides an electronic device, including a processing component, a display component, an acquisition component, and a storage component;

the processing assembly is configured to:

collecting user voice by utilizing the collecting component and sending the user voice to a server side so that the server side can perform voice recognition and convert the user voice into voice recognition texts, determining a plurality of elements obtained by segmentation of the voice recognition texts, respectively executing task processing of a plurality of task types on the plurality of elements by utilizing a text processing model, and obtaining target texts corresponding to the voice recognition texts according to task processing results of the plurality of task types respectively corresponding to each element;

acquiring the target text sent by the server;

and displaying the target text in the display component.

In a fifteenth aspect, embodiments of the present application provide a computing device, including a processing component and a storage component;

The processing assembly is configured to:

inputting the voice to be processed into a voice processing model;

obtaining target voice according to task processing results of the user voice corresponding to the task types;

and outputting the target voice.

In a sixteenth aspect, embodiments of the present application provide a computing device comprising a processing component and a storage component;

the processing assembly is configured to:

inputting the text to be processed into a speech synthesis model;

and outputting the target voice.

In a seventeenth aspect, in an embodiment of the present application, there is provided a computing device including a processing component and a storage component;

The processing assembly is configured to:

inputting the text to be processed into a speech synthesis model;

at least one target voice is output.

In the embodiments of the application, task processing of a plurality of task types is performed on an object to be processed, such as text or voice, by using the corresponding processing model, and a target object, such as text or voice, is obtained according to the task processing results of the object to be processed for the plurality of task types. The processing model is trained in advance on training samples and on the training labels of those samples for each of the task types, so that multiple tasks can be processed in parallel and the data object can be comprehensively processed according to the task processing results for the plurality of task types. This simplifies the processing link, reduces the calculated amount, and, through the comprehensive processing, ensures the accuracy of the processing results.

These and other aspects of the present application will be more readily apparent from the following description of the embodiments.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, a brief description will be given below of the drawings that are needed in the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 illustrates a flow chart of one embodiment of a data processing method provided herein;

FIG. 2 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 3 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 4 is a schematic diagram of a text processing model in a practical application according to an embodiment of the present application;

FIG. 5 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 6a shows a schematic representation of a text display in one practical application of an embodiment of the present application;

FIG. 6b shows a schematic representation of a text display in yet another practical application of an embodiment of the present application;

FIG. 7 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 8 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 9 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 10 is a flow chart illustrating yet another embodiment of a data processing method provided herein;

FIG. 11 is a schematic diagram illustrating the construction of one embodiment of a data processing apparatus provided herein;

FIG. 12 illustrates a schematic diagram of one embodiment of a computing device provided herein;

FIG. 13 is a schematic diagram of a data processing apparatus according to another embodiment of the present application;

FIG. 14 illustrates a schematic diagram of one embodiment of an electronic device provided herein;

FIG. 15 is a schematic view showing the structure of a further embodiment of a data processing apparatus provided in the present application;

FIG. 16 illustrates a schematic diagram of a configuration of yet another embodiment of a computing device provided herein;

FIG. 17 is a schematic diagram of a data processing apparatus according to another embodiment of the present application;

FIG. 18 illustrates a schematic diagram of a configuration of yet another embodiment of a computing device provided herein;

FIG. 19 is a schematic view showing the structure of a further embodiment of a data processing apparatus provided herein;

FIG. 20 illustrates a schematic diagram of a configuration of yet another embodiment of a computing device provided herein.

Detailed Description

In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.

Some of the flows described in the specification, the claims, and the foregoing figures of this application include a number of operations that appear in a particular order. It should be understood that these operations may be performed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 merely distinguish the operations from one another and do not themselves represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.

The technical scheme of the embodiments of the application is applicable to various data processing scenes in which a data object has several kinds of defects and needs to be processed, such as a voice transcription scene, a text processing scene, a text-to-voice scene, a voice-to-voice scene and the like.

Taking speech recognition text as an example, the speech recognition text obtained directly from ASR (Automatic Speech Recognition) processing has at least the following problems: the text has no punctuation marks, contains words that affect the smoothness of sentences, and does not express dates, times, addresses, amounts, and the like in a standard format. These problems hinder reading, so at least post-processing operations such as smoothing processing, punctuation adding, and ITN (Inverse Text Normalization) processing need to be carried out on the speech recognition text. As described in the background, the different kinds of processing are currently implemented with different processing models, each of which must be trained separately, and they are applied to the speech recognition text in a chain: for example, the speech recognition text is first smoothed, punctuation marks are then added to the smoothed text, and ITN processing is finally performed on the punctuated text to obtain the target text.

In the course of implementing the invention, the inventor found that the information needed by the individual processing models often overlaps, so processing with a plurality of separate models lengthens the overall processing link and increases the calculated amount. The different kinds of processing may also depend on one another. For example, adding punctuation to text that has not been smoothed creates more ambiguity, while smoothing performed before punctuation is added cannot determine whether certain words need to be smoothed. As a result, the final processing result is inaccurate, and the accuracy of the target text is affected.

In order to reduce the calculation amount and improve the accuracy of the processing result, the inventor arrived at the technical scheme of the application through a series of studies. In the embodiments of the application, task processing of a plurality of task types is performed on an object to be processed, such as text or voice, by using the corresponding processing model, and a target object, such as text or voice, is obtained according to the task processing results of the object to be processed for the plurality of task types. The processing model is trained in advance on training samples and on the training labels of those samples for each of the task types, so that multiple tasks can be processed in parallel and the data object can be comprehensively processed according to the task processing results for the plurality of task types. This simplifies the processing link, reduces the calculated amount, and, through the comprehensive processing, ensures the accuracy of the processing results.

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

Fig. 1 is a flowchart of one embodiment of a data processing method provided in the embodiments of the present application, where the method may include the following steps:

101: inputting an object to be processed into a processing model;

102: and respectively executing task processing of a plurality of task types on the object to be processed by utilizing the processing model.

The processing model can be obtained in advance based on training samples and training labels corresponding to a plurality of task types of the training samples, so that the processing model can process the tasks of the task types of the object to be processed at the same time.

103: and obtaining a target object corresponding to the object to be processed according to task processing results of the plurality of task types respectively corresponding to the object to be processed.

According to the task processing results of the object to be processed for the plurality of task types, the object to be processed can be comprehensively processed to obtain the target object.

In this embodiment, the processing model is obtained by training based on training samples and training labels respectively corresponding to the task types in advance, so that multitasking parallel processing can be performed simultaneously, and according to task processing results corresponding to a plurality of task types of the object to be processed, comprehensive processing can be performed on the object to be processed, thereby obtaining a target object, simplifying a processing link, reducing the calculated amount, and ensuring the logic relevance between different task types and improving the accuracy of the processing result through comprehensive processing.

In practical applications, the object to be processed may be text or voice, and the target object may likewise be text or voice. The technical scheme of the embodiments of the application is therefore applicable to processing scenes such as text-to-text, voice-to-voice, and text-to-voice, and the following embodiments describe the technical scheme of the application for the different processing scenes.

Fig. 2 is a flowchart of another embodiment of a data processing method according to an embodiment of the present application, where the method may include the following steps:

201: and determining a plurality of elements obtained by segmentation of the text to be processed.

The plurality of elements may be obtained by word segmentation or character segmentation of the text to be processed, so each element may be a word or a single character.

The text to be processed may be a speech recognition text.

202: and respectively executing task processing of a plurality of task types on the plurality of elements by utilizing a text processing model.

The text processing model can be obtained through training based on a plurality of sample elements obtained through segmentation of training samples and training labels respectively corresponding to the plurality of task types, so that the text processing model can respectively process the task of the plurality of task types for each element at the same time. The specific training of the text processing model will be described in detail in the following corresponding embodiments.

203: and obtaining a target text corresponding to the text to be processed according to task processing results of the plurality of task types respectively corresponding to each element.

The text processing model is utilized to obtain the task processing result of each element corresponding to each task type, so that the final target text can be obtained by comprehensively processing each element according to the task processing result of each element corresponding to each task type.

The task processing result of each task type may include whether to perform task processing of the task type, a processing manner, and the like. For example, the task type is a smoothing process, the task processing result of a certain element may be a determination result of whether or not to perform the smoothing process, and if it is determined to perform the smoothing process based on the determination result, the element may be deleted, or the like.

Therefore, optionally, obtaining the target text corresponding to the text to be processed according to the task processing results of the task types corresponding to each element respectively may include:

and combining task processing results of the plurality of task types respectively corresponding to each element, and comprehensively processing each element to obtain a target text corresponding to the text to be processed.

In this embodiment, the text processing model is obtained in advance based on a plurality of sample elements obtained by segmenting a training sample and training labels respectively corresponding to the plurality of task types, so that multitasking parallel processing can be performed simultaneously, and each element can be comprehensively processed according to task processing results respectively corresponding to the plurality of task types by each element, thereby obtaining a target text, simplifying a processing link, reducing the calculation amount, ensuring the logic relevance among different task types by comprehensive processing, and improving the accuracy of the processing results.
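As an illustration only, steps 201 and 202 can be sketched as follows. The function names `segment` and `run_model`, the task keys `"smooth"` and `"punct"`, and the rule-based stand-in for the trained text processing model are all hypothetical; the source does not specify them. The sketch only shows the shape of the data: a single pass over the elements yields, for every element, one result per task type.

```python
from typing import Dict, List


def segment(text: str, by_word: bool = False) -> List[str]:
    """Step 201: split the text into elements.

    by_word=True splits on whitespace (a stand-in for a real word segmenter);
    otherwise each non-space character becomes one element.
    """
    if by_word:
        return text.split()
    return [ch for ch in text if not ch.isspace()]


def run_model(elements: List[str]) -> List[Dict[str, str]]:
    """Step 202: hypothetical stand-in for the trained text processing model.

    Returns one dict per element, holding a result for every task type.
    A real model would predict these values; fixed rules are used here
    only to keep the sketch runnable.
    """
    last = len(elements) - 1
    results = []
    for i, element in enumerate(elements):
        results.append({
            "smooth": "delete" if element in {"um", "uh"} else "keep",
            "punct": "." if i == last else "",
        })
    return results


elements = segment("um I agree", by_word=True)
results = run_model(elements)
for element, result in zip(elements, results):
    print(element, result)
```

Step 203 would then consume `elements` and `results` together to build the target text.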

The text to be processed may be a voice recognition text obtained by collecting user voice and converting it through voice recognition. In practical applications, voice is converted into text in many situations. For example, in an instant messaging scene, the two communicating parties can send voice messages to each other, and when listening to a voice message is inconvenient, the message can be converted into text, which requires voice recognition. As another example, various search engines currently support voice input, which is then converted into text for searching. As a further example, input method software can also support voice input and convert the voice into text for display in real time. In these voice-to-text scenes, the voice recognition text often has several different types of defects to be resolved, so adopting the technical scheme of the application can improve the accuracy of the converted text and reduce the calculated amount. Fig. 3 is a flowchart of yet another embodiment of a data processing method according to an embodiment of the present application, and the method may include the following steps:

301: and carrying out voice recognition on the collected user voice, and converting the collected user voice into voice recognition text.

The technical scheme of the embodiment can be executed by a client for voice acquisition or by a server.

302: and determining a plurality of elements obtained by segmentation of the voice recognition text.

303: and respectively executing task processing of a plurality of task types on the plurality of elements by utilizing a text processing model.

304: and obtaining a target text corresponding to the voice recognition text according to task processing results of the plurality of task types respectively corresponding to each element.

305: and outputting the target text.

Optionally, outputting the target text may comprise displaying the target text in a voice transcription page.

The voice transcription page can be a message list page or an input box page of input method software, for example.

In some embodiments, the comprehensively processing each element according to the task processing results of the task types corresponding to each element, and obtaining the target text corresponding to the speech recognition text may include:

and combining task processing results of the plurality of task types corresponding to each element respectively, and comprehensively processing each element according to conflict processing rules corresponding to the plurality of task types to obtain target texts corresponding to the voice recognition texts.

Different task types may conflict with one another. For example, when the processing of the speech recognition text includes smoothing processing, punctuation processing, and ITN processing, an element that is to be removed by smoothing needs no ITN processing, and if the element is the first element in the text, punctuation processing is also not needed. Conflict processing rules between the plurality of task types may therefore be preconfigured; such rules may include, for example, a processing priority or a processing order among the task types.
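A minimal sketch of such preconfigured conflict processing rules is given below. It encodes only the two examples named in the text (a smoothed element needs no ITN processing, and the first element needs no punctuation processing) and applies them in a fixed order. The `Decision` type, the rule function names, and `CONFLICT_RULES` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Decision:
    smooth: bool  # True: the element is to be removed by smoothing
    itn: bool     # True: the element is to be converted by ITN
    punct: str    # punctuation to add; "" means none


Rule = Callable[[Decision, int], Decision]


def smoothing_overrides_itn(d: Decision, index: int) -> Decision:
    # An element that is smoothed away needs no ITN processing.
    return replace(d, itn=False) if d.smooth else d


def no_punct_on_first_element(d: Decision, index: int) -> Decision:
    # Per the example in the text, the first element gets no punctuation processing.
    return replace(d, punct="") if index == 0 else d


# The list order plays the role of the processing order among task types.
CONFLICT_RULES: List[Rule] = [smoothing_overrides_itn, no_punct_on_first_element]


def resolve(d: Decision, index: int) -> Decision:
    """Apply every conflict rule in order to one element's decisions."""
    for rule in CONFLICT_RULES:
        d = rule(d, index)
    return d


print(resolve(Decision(smooth=True, itn=True, punct=","), index=0))
```

Keeping the rules in an ordered list makes the priority explicit and lets further rules be added without changing `resolve`.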

In one practical application, the plurality of task types may include a smoothing task, a punctuation task, and an inverse text normalization (ITN) task.

Thus, in some embodiments, the performing task processing of a plurality of task types on the plurality of elements, respectively, using the text processing model includes:

using the text processing model to determine, for each of the plurality of elements, whether to perform smoothing processing, whether to perform ITN processing, and the punctuation type to be added;

the obtaining the target text corresponding to the voice recognition text according to the task processing results of the task types respectively corresponding to each element comprises the following steps:

And carrying out smooth processing on each element, adding the punctuation type and/or converting according to the corresponding ITN rule according to the judgment result of whether the smooth processing is carried out on each element, the judgment result of whether the ITN processing is carried out or not and the punctuation type, so as to obtain the target text corresponding to the voice recognition text.

Based on the determination result of whether to perform smoothing, if smoothing is to be performed, the element is deleted from the text, and if not, the element is retained. Based on the determination result of whether to perform ITN processing, if ITN processing is to be performed, the element is converted according to the ITN rule, and if not, the element is left unconverted. The punctuation types to be added may include, for example, comma, period, question mark, exclamation mark, and no punctuation, so that the punctuation mark to be added after the element can be determined from the punctuation type.

For each element, the text processing model can perform the smoothing task, the punctuation task, and the ITN task respectively, and the task processing results comprise a determination result of whether to perform smoothing, a determination result of whether to perform ITN processing, and the punctuation type to be added.

The text processing model can respectively calculate the probability of performing smoothing and the probability of not performing smoothing, the probability of performing ITN processing and the probability of not performing ITN processing, and the probabilities corresponding to the plurality of punctuation types, so that whether to perform smoothing, whether to perform ITN processing, and the punctuation type to be added can be determined based on the corresponding probability values.
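The step from probability values to determination results can be sketched as follows. This is a minimal illustration, not the claimed implementation: the dictionary layout, the name `decide`, the ordering of `PUNCT_TYPES`, and the rule of simply taking the larger probability are all assumptions made for the example.

```python
# Punctuation types named in the description; the ordering is an assumption.
PUNCT_TYPES = ["none", "comma", "period", "question", "exclamation"]


def decide(probs):
    """Turn the per-task probabilities of one element into discrete decisions.

    probs["smooth"] and probs["itn"] are (p_perform, p_not_perform) pairs;
    probs["punct"] holds one probability per entry of PUNCT_TYPES.
    """
    smooth_yes, smooth_no = probs["smooth"]
    itn_yes, itn_no = probs["itn"]
    punct_probs = probs["punct"]
    # Pick the punctuation type with the highest probability.
    best = max(range(len(PUNCT_TYPES)), key=lambda i: punct_probs[i])
    return {
        "smooth": smooth_yes > smooth_no,
        "itn": itn_yes > itn_no,
        "punct": PUNCT_TYPES[best],
    }


# Hypothetical model output for a single element.
example = {
    "smooth": (0.91, 0.09),
    "itn": (0.03, 0.97),
    "punct": [0.10, 0.80, 0.05, 0.03, 0.02],
}
decision = decide(example)
```

Note that the three decisions are taken independently here; any conflict between them is resolved in a later step.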

As can be seen from the above description, there may be processing conflicts between different task types. In some embodiments, therefore, obtaining the target text corresponding to the speech recognition text may include:

according to the determination result of whether to perform smoothing on each element, the determination result of whether to perform ITN processing, and the punctuation type, and in accordance with the conflict processing rules corresponding to the plurality of task types, performing smoothing on each element, adding the punctuation type, and/or converting according to the corresponding ITN rule, so as to obtain the target text corresponding to the speech recognition text.

The conflict handling rules may include a handling priority or handling order of task handling results for a plurality of task types, and so on.

For example, the task processing result of the smoothing task has a higher priority than those of the punctuation task and the ITN task: if the text processing model determines that a certain element needs smoothing, then ITN processing, adding a punctuation mark, and the like are not needed for that element. For each element that is retained, the task processing results of the punctuation task, the ITN task, and the like are applied.

Of course, the conflict processing rules may be determined according to the actual application situation and the task types involved, which is not specifically limited in the present application.
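The priority rule described above, in which smoothing overrides punctuation and ITN, can be sketched as follows. The ITN table, the English tokens, the space separator, and the function name `merge` are hypothetical and serve only to make the example runnable; a real ITN rule set would be far larger.

```python
PUNCT_MARKS = {"none": "", "comma": ",", "period": ".",
               "question": "?", "exclamation": "!"}
ITN_RULES = {"ten": "10"}  # hypothetical one-entry ITN conversion table


def merge(elements, results, sep=" "):
    """Combine per-element task results into the target text.

    Conflict rule assumed here: smoothing has the highest priority, so a
    smoothed element is dropped together with its punctuation and ITN results.
    """
    kept = []
    for element, result in zip(elements, results):
        if result["smooth"]:
            continue  # element removed; its other task results are ignored
        text = ITN_RULES.get(element, element) if result["itn"] else element
        kept.append(text + PUNCT_MARKS[result["punct"]])
    return sep.join(kept)


elements = ["um", "it", "costs", "ten", "dollars"]
results = [
    {"smooth": True, "itn": False, "punct": "comma"},   # comma is discarded
    {"smooth": False, "itn": False, "punct": "none"},
    {"smooth": False, "itn": False, "punct": "none"},
    {"smooth": False, "itn": True, "punct": "none"},
    {"smooth": False, "itn": False, "punct": "period"},
]
target_text = merge(elements, results)
```

A different priority or processing order would be expressed by changing the order of the checks inside the loop.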

In some embodiments, the performing task processing of a plurality of task types on the plurality of elements using the text processing model may include:

inputting the plurality of elements as an input sequence into the text processing model;

encoding the input sequence by utilizing the text processing model to obtain element vectors of the plurality of elements, so that the element vector of each element contains element information of the rest elements;

and respectively executing task processing of a plurality of task types on the element vectors of the plurality of elements.

The text processing model may be implemented by a neural network model, for example an RNN (Recurrent Neural Network) model. In addition, in order to learn the relevance between elements, the text processing model may optionally be implemented by a Transformer model (a neural network model proposed by Google). The text processing model may be composed of an input layer, at least one intermediate layer, and an output layer. The at least one intermediate layer may be a Transformer layer, and each Transformer layer may include a Self-Attention layer and a Point-wise feed-forward network layer.

The input layer may encode the plurality of elements in the input sequence into embedding vectors. The at least one Transformer layer learns the correlations between different elements and yields an element vector for each element, where the element vector of each element includes element information of the remaining elements. The output layer then performs the processing of the plurality of task types, so that the task processing results of each element corresponding to the different task types are obtained.
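How a Self-Attention layer makes each element vector carry information from the remaining elements can be illustrated with a deliberately simplified sketch. It assumes a single attention head with identity query, key, and value projections and omits the feed-forward sublayer, residual connections, and normalization, so it is not a full Transformer layer; the toy embeddings are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention, identity projections."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in vectors:
        # Similarity of this element to every element, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        # Weighted mix of all element vectors: information from the rest
        # of the sequence flows into this element's vector.
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


# Toy embedding vectors for three elements (one-hot for readability).
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
element_vectors = self_attention(embeddings)
```

Before attention each toy vector is non-zero in one position only; afterwards every position is non-zero, which is the sense in which each element vector contains information about the other elements.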

The text processing model may be a feedforward neural network, where each layer contains several neurons for performing different element processing, etc.

For ease of understanding, fig. 4 shows a schematic structural diagram of a Transformer model in a practical application. Assume that the speech recognition text is "hiccup me ten years old" and that character segmentation yields 4 single words, rendered here as "hiccup", "me", "ten" and "years old". The embedding processing of the input layer produces the embedding vectors of the 4 single words. The 4 embedding vectors are then processed by at least one Transformer layer; each Transformer layer learns the relevance between the 4 single words, and the word vectors of the 4 single words are finally obtained, so that each word vector contains word information of the other words. The 4 word vectors are input to the output layer, which performs task processing of a plurality of task types. Assume there are three tasks: a smoothing task, a punctuation task and an ITN task, so that task processing results of the three task types are obtained for each single word. Assume that, in the task processing results, the single word "hiccup" is judged to need smoothing, to need a comma added, and not to need ITN processing; the single words "me" and "ten" are judged not to need smoothing, punctuation, or ITN processing; and the single word "years old" is judged not to need smoothing or ITN processing and to need a period added. In combination with the conflict processing rules, the single word "hiccup" is removed by smoothing without adding a punctuation mark, and the final target text is "me ten years old."

As can be seen from the above description, the text processing model may be obtained by training based on a plurality of sample elements obtained by segmenting a training sample and training labels respectively corresponding to the plurality of task types.

That is, each sample element is given a training label for each task type. The technical solution of the present application is described below from the perspective of model training. Fig. 5 is a flowchart of another embodiment of a data processing method provided in the embodiments of the present application, where the method may include the following steps:

501: a plurality of sample elements obtained by segmentation of the training sample is determined.

Optionally, the determining a plurality of sample elements obtained by slicing the training sample may include:

determining a plurality of sample elements obtained by performing word segmentation or character segmentation on the training sample.

502: and determining training labels of the plurality of sample elements corresponding to the plurality of task types respectively.

For each sample element, a corresponding training label is set for each task type.

For example, the plurality of task types include a smoothing task, a punctuation task, and an ITN task. For a given sample element, the training label corresponding to the smoothing task may indicate whether or not the element is to be smoothed; the training label corresponding to the punctuation task may be the punctuation mark to be added after the element, such as a period or a comma; and the training label corresponding to the ITN task may indicate whether or not ITN processing is to be performed. Therefore, when the text processing model processes a text to be processed, a task processing result can be obtained for each element of the text to be processed and for each task type.
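One possible way to attach a training label for each task type to each sample element is sketched below. The `ElementLabels` record, the index-based annotation format, and the English sample are assumptions made for illustration; the source does not specify how annotations are stored.

```python
from dataclasses import dataclass


@dataclass
class ElementLabels:
    element: str
    smooth: int  # 1 = remove by smoothing, 0 = keep
    punct: str   # punctuation type to add after the element
    itn: int     # 1 = convert by an ITN rule, 0 = leave unchanged


def build_labels(elements, smooth_idx, punct_after, itn_idx):
    """Attach one label per task type to every sample element.

    smooth_idx / itn_idx: sets of element positions annotated for that task.
    punct_after: maps an element position to the punctuation type after it.
    """
    return [
        ElementLabels(
            element=element,
            smooth=int(i in smooth_idx),
            punct=punct_after.get(i, "none"),
            itn=int(i in itn_idx),
        )
        for i, element in enumerate(elements)
    ]


# Hypothetical annotated training sample.
labels = build_labels(
    ["um", "it", "costs", "ten", "dollars"],
    smooth_idx={0},
    punct_after={4: "period"},
    itn_idx={3},
)
```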

503: and training the text processing model by using the plurality of sample elements and the training labels of the plurality of sample elements.

The text processing model is used for respectively executing task processing of a plurality of task types for a plurality of elements obtained by segmenting a text to be processed; and acquiring target texts corresponding to the texts to be processed based on task processing results of the plurality of task types respectively corresponding to each element.

The specific operation of processing the text to be processed by using the text processing model is described in the embodiment shown in fig. 1 and is not repeated here.

In this embodiment, the text processing model is obtained in advance based on a plurality of sample elements obtained by segmenting a training sample and training labels respectively corresponding to the plurality of task types, so that multitasking parallel processing can be performed simultaneously, and each element can be comprehensively processed according to task processing results respectively corresponding to the plurality of task types by each element, thereby obtaining a target text, simplifying a processing link, reducing the calculated amount, ensuring the logic relevance among different task types by comprehensive processing, and improving the accuracy of the processing results.

The text processing model may be a feedforward neural network, and in some embodiments, the training the text processing model using the plurality of sample elements and training labels of each of the plurality of sample elements may include:

inputting the plurality of sample elements into the text processing model to obtain actual processing results of the plurality of sample elements corresponding to the plurality of task types respectively;

and carrying out parameter optimization on the text processing model based on the comparison result of the actual processing results of the plurality of task types and the training labels, which are respectively corresponding to each sample element.

The actual processing result of each sample element for each task type is compared with the corresponding training label to obtain a comparison result. The actual processing result is usually a probability value smaller than 1, whereas the probability value corresponding to the training label is 0 or 1 and represents the expected processing result. Comparing the actual processing result with the training label therefore yields a difference value, which is the comparison result.

And carrying out parameter optimization on the text processing model based on the comparison result until the actual processing result and the expected processing result corresponding to the training label meet the optimization condition.

Optionally, the performing parameter optimization on the text processing model based on the comparison result of the actual processing result and the training label, where each sample element corresponds to each of the plurality of task types, includes:

comparing the actual processing results of each sample element corresponding to the plurality of task types with the training labels to obtain a comparison value for each task type;

weighting the plurality of comparison values corresponding to each sample element to obtain a return value corresponding to each sample element;

and performing parameter optimization on the text processing model by using the return value corresponding to each sample element.

Because task processing of a plurality of task types is performed on each sample element, one comparison value is obtained for each task type, where the comparison value is the difference between the actual processing result and the expected processing result corresponding to the training label. The comparison values corresponding to each sample element can therefore be weighted to obtain the return value of that sample element.

Parameter optimization may be performed on the text processing model based on the return value corresponding to each sample element until, for each sample element and each task type, the actual processing result and the expected processing result corresponding to the training label satisfy the optimization condition, for example, until the comparison value is smaller than a certain value.
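The weighting of per-task comparison values into a single return value can be sketched as follows. The comparison value is taken here as the plain difference between the expected probability (1 for the labelled class) and the predicted probability of that class; the task weights and all names are hypothetical. A cross-entropy term per task would be a common alternative, but the source does not state which measure is used.

```python
# Hypothetical per-task weights; the source does not give concrete values.
TASK_WEIGHTS = {"smooth": 1.0, "punct": 1.0, "itn": 0.5}


def comparison_value(predicted_probs, label_index):
    """Difference between the expected probability (1.0) and the predicted one."""
    return 1.0 - predicted_probs[label_index]


def return_value(predicted, labels, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task comparison values for one sample element."""
    return sum(
        weight * comparison_value(predicted[task], labels[task])
        for task, weight in weights.items()
    )


# Predicted class probabilities and label indices for one sample element.
predicted = {
    "smooth": [0.1, 0.9],      # [not smoothed, smoothed]
    "punct": [0.2, 0.7, 0.1],  # [none, comma, period]
    "itn": [0.8, 0.2],         # [no ITN, ITN]
}
labels = {"smooth": 1, "punct": 1, "itn": 0}
value = return_value(predicted, labels)
```

With these numbers the three comparison values are roughly 0.1, 0.3, and 0.2, and the weighted return value is roughly 0.5. Raising the weight of one task makes its errors dominate parameter optimization.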

As can be seen from the foregoing description, in a practical application the technical solution of the present application may be applied to a speech transcription scenario, in which the speech recognition text obtained by performing speech recognition on collected user speech is processed. For ease of understanding, fig. 6a shows a display schematic diagram of transcribing a voice message into text on a message list page: ASR recognition is first performed on the user speech to obtain an ASR result, and the ASR result is then converted into the target text by the technical solution of the present application and output.

Fig. 6b shows a display schematic diagram of converting speech collected in real time into text in an input box page of input method software: ASR recognition is first performed on the user speech to obtain an ASR result, and the ASR result is then converted into the target text by the technical solution of the present application and output.

Of course, fig. 6a and fig. 6b only illustrate possible application scenarios of the technical solution of the present application, and it can be understood that the present application is not limited thereto. The technical solution of the embodiments of the present application can process speech recognition text accurately, improve the accuracy of the text obtained by speech conversion, and reduce the amount of calculation.

Furthermore, because different users may have different processing requirements for speech recognition text, in some embodiments, the performing task processing for the plurality of task types on the plurality of elements, respectively, using the text processing model may include:

determining a plurality of corresponding task types according to the user characteristics of the user;

and respectively executing task processing of the task types on the elements by using a text processing model.

Personalized task processing can thereby be realized for different users. The user characteristics may include, for example, attribute information such as historical reading habits, age, and gender, or may be determined according to actual needs.

FIG. 7 is a flowchart of another embodiment of a data processing method provided herein, which may include the steps of:

701: collecting user voice and sending the user voice to a server side, so that the server side can perform voice recognition on the user voice to convert the user voice into voice recognition texts, determine a plurality of elements obtained by segmentation of the voice recognition texts, respectively execute task processing of a plurality of task types on the plurality of elements by using a text processing model, and obtain target texts corresponding to the voice recognition texts according to task processing results of the plurality of task types respectively corresponding to each element;

702: acquiring the target text sent by the server;

703: and displaying the target text.

This embodiment describes the technical solution of the present application from the perspective of the client that collects the user speech. The processing of the user speech by the server is detailed in the embodiment shown in fig. 3 and is not repeated here.

Fig. 8 is a flowchart of another embodiment of a data processing method provided in the embodiments of the present application. This embodiment is applicable to a processing scenario of converting speech into speech, and the method may include the following steps:

801: the speech to be processed is input into a speech processing model.

802: and executing task processing of a plurality of task types on the speech to be processed by utilizing the speech processing model.

The speech processing model can be obtained through training based on training samples and training labels of the training samples corresponding to the task types. The training sample may specifically be a training speech or the like.

The plurality of task types may include, for example, language conversion, mood conversion, spoken or standard language conversion, and the like.

803: and obtaining target voice according to the task processing results of the voice to be processed corresponding to the task types.

According to the task processing results of the to-be-processed voice corresponding to the task types, the to-be-processed voice can be comprehensively processed, and therefore target voice is obtained.

For example, according to the task processing results of the speech to be processed corresponding to the plurality of task types, it can be determined whether to convert the language, whether to add mood words, whether to convert into a spoken expression or a standard expression, and so on, so that the speech to be processed is comprehensively processed and the target speech is obtained.

804: and outputting the target voice.

In this embodiment, the speech processing model is obtained in advance based on training samples and training labels corresponding to a plurality of task types, so that multi-task parallel processing can be performed, and the speech to be processed can be comprehensively processed according to its task processing results corresponding to the plurality of task types, thereby obtaining the target speech. This simplifies the processing chain and reduces the amount of calculation, while the comprehensive processing preserves the logical relevance among the different task types and improves the accuracy of the processing result.

In some embodiments, the performing task processing of a plurality of task types on the speech to be processed using the speech processing model includes:

determining a voice scene type corresponding to the voice to be processed and determining a plurality of task types corresponding to the voice scene type;

and executing task processing of the task types on the voice to be processed by utilizing the voice processing model.

The voice scene type may include, for example, a news broadcast scene, an educational training scene, a rural propaganda scene, and the like.

For a news broadcast scene, it is desirable to convert the speech to be processed into standardized speech, for example, speech without spoken-language expressions, without dialect vocabulary, and with a serious tone.

For an educational training scene, it is desirable to convert the speech to be processed into targeted speech, for example, speech with a gentle intonation in which explanations are added for individual words.

For a rural propaganda scene, it is desirable to convert the speech to be processed into a more approachable form of expression, for example, using spoken expressions for individual words and humorous language.

Different voice scene types can preset a plurality of corresponding task types and the like.
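Presetting task types per voice scene type amounts to a lookup table consulted before the speech processing model is run. The sketch below is illustrative only: the scene keys, the task-type identifiers, and the fallback are hypothetical names loosely derived from the scenes described above.

```python
# Hypothetical preset mapping from voice scene type to task types.
SCENE_TASK_TYPES = {
    "news_broadcast": ("standard_expression", "mood_conversion"),
    "education_training": ("mood_conversion", "add_explanations"),
    "rural_propaganda": ("spoken_expression", "mood_conversion"),
}
# Assumed fallback when the scene type has no preset entry.
DEFAULT_TASK_TYPES = ("language_conversion",)


def task_types_for_scene(scene_type):
    """Return the preset task types for a scene type, or the fallback."""
    return SCENE_TASK_TYPES.get(scene_type, DEFAULT_TASK_TYPES)


selected = task_types_for_scene("news_broadcast")
```

The same table-driven approach would apply if the key were a user characteristic rather than a scene type.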

Further, the outputting the target voice may include:

playing the target voice in the user equipment.

Optionally, the technical solution of the embodiment shown in fig. 8 may be executed by the user equipment or, of course, by the server, in which case the target voice is sent to the user equipment for playing.

In an actual application, the user equipment may refer to, for example, an intelligent speaker, a television, or other devices with a speaker function, and the voice to be processed may be sent from a signal source or transmitted by a network side.

In some embodiments, the performing task processing of a plurality of task types on the speech to be processed using the speech processing model comprises:

determining user characteristics corresponding to the user equipment and a plurality of task types corresponding to the user characteristics;

Wherein inputting the speech to be processed into the speech processing model may include:

acquiring voice to be processed transmitted to the user equipment;

and inputting the voice to be processed into a voice processing model.

The speech to be processed may be transmitted to the user equipment by a signal source or by the network side. Alternatively, the speech to be processed that is requested by the user may be obtained from the signal source or the network side according to a user request.

Based on the plurality of task types corresponding to different user characteristics, personalized processing can be realized for different users, and different target voices can be output to different users for the same speech to be processed. For example, when the speech to be processed is a news broadcast, it can be converted into speech in the corresponding dialect based on information such as the user's place of origin, which makes it easier for the user to listen to and improves the user experience.

Fig. 9 is a flowchart of another embodiment of a data processing method provided in the embodiments of the present application. The technical solution of this embodiment is applicable to a Text To Speech (TTS) processing scenario, and the method may include the following steps:

901: the text to be processed is input into a speech synthesis model.

902: and executing task processing of a plurality of task types on the text to be processed by utilizing the voice synthesis model.

The speech synthesis model can be obtained through training based on training samples and the training labels of the training samples corresponding to the plurality of task types. The training sample may specifically be a training text.

The plurality of task types may include, for example, text analysis, speech synthesis, timbre clarity, timbre naturalness, timbre consistency, and the like.

The text analysis may include word segmentation and the like. The speech synthesis may include converting the linguistic description of the text into a speech waveform. Timbre clarity may refer to the percentage of meaningful words that are heard correctly. Timbre naturalness is used to evaluate whether the timbre of the synthesized speech is close to that of a human speaker and whether the intonation of the synthesized words is natural. Timbre consistency is used to evaluate whether a synthesized sentence is smooth, and so on.

903: and obtaining target voice according to the task processing results of the text to be processed corresponding to the task types.

And comprehensively processing the text to be processed according to the task processing results of the text to be processed corresponding to the task types, and obtaining the target voice.

904: and outputting the target voice.

In practical application, for example, in an instant messaging scenario, when an instant messaging message is a text message and a user is inconvenient to read, the technical scheme of the embodiment can be adopted to convert the instant messaging message into voice.

The outputting the target voice may be outputting the target voice in the user equipment.

The text to be processed may be text determined by the user device upon a user request. The technical solution of the embodiment shown in fig. 9 may be executed by the user equipment, or of course, the user equipment may also send the text to be processed to the server side for execution by the server side, and the server side then sends the target voice to the user equipment, and the user equipment plays the target voice.

In this embodiment, the speech synthesis model is obtained in advance based on training samples and training labels corresponding to a plurality of task types, so that multitasking parallel processing can be performed simultaneously, and according to task processing results corresponding to a plurality of task types respectively for the text to be processed, comprehensive processing can be performed on the text to be processed, thereby obtaining target speech, simplifying processing links, reducing the calculated amount, and ensuring logic relevance among different task types and improving accuracy of the processing results through comprehensive processing.

In some embodiments, the performing task processing of a plurality of task types on the text to be processed using the speech synthesis model comprises:

acquiring user characteristics and determining a plurality of task types corresponding to the user characteristics;

and executing task processing of the task types on the text to be processed by utilizing the voice synthesis model.

Thus, personalized processing and the like aiming at different users can be realized.

Performing task processing of a plurality of task types on the text to be processed by using the speech synthesis model may also include:

determining a use scene corresponding to the text to be processed;

determining a plurality of corresponding task types according to the use scene type;

the usage scenario types may include, for example, a news broadcast scenario, an education and training scenario, and the like, so that the task types may also include language processing and the like, to meet the requirements of different usage scenarios.

Furthermore, there may be multiple TTS requirements for the same text. Accordingly, fig. 10 is a flowchart of yet another embodiment of a data processing method provided in the embodiments of the present application, where the method may include the following steps:

1001: the text to be processed is input into a speech synthesis model.

1002: respectively executing at least one group of task processing on the text to be processed by utilizing the voice synthesis model; wherein each set of task processes corresponds to a plurality of task types.

The task types corresponding to each set of task processing may be different.

1003: and obtaining target voice corresponding to each group of task processing according to task processing results of the text to be processed corresponding to the task types in each group.

The task processing process of each group can be detailed in the embodiment shown in fig. 9, and will not be described again.

1004: at least one target voice is output respectively.

By outputting at least one target voice, different user requirements can be met, for example a requirement for bilingual audio output.

The outputting of the at least one target voice respectively may be playing the at least one target voice respectively in the user equipment. The user equipment may be, for example, a mobile terminal such as a mobile phone.

The technical solution of the embodiment shown in fig. 10 may be executed by a user equipment or a server side that communicates with the user equipment, where when the technical solution is executed by the server side, the user equipment sends a text to be processed to the server side, and the server side sends at least one obtained target voice to the user equipment and plays the target voice by the user equipment.

Wherein the performing at least one set of task processing on the text to be processed by using the speech synthesis model respectively may include:

acquiring user characteristics and determining processing requirements corresponding to the user characteristics;

determining at least one set of tasks according to the processing requirements;

and respectively executing task processing on the text to be processed according to the task types in the at least one group of tasks by utilizing the voice synthesis model.

Of course, the at least one group of tasks may also be determined in combination with the usage scenario type of the text to be processed, and each group of tasks may include a plurality of task types.
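Running several groups of tasks over the same text, with one target voice per group, can be sketched as below. The synthesis model is replaced by a stand-in function that returns a descriptive string instead of a waveform, and the group contents and all names are hypothetical.

```python
def run_task_groups(text, task_groups, synthesize):
    """Run one synthesis pass per task group; return one output per group."""
    return [synthesize(text, tuple(group)) for group in task_groups]


def fake_synthesize(text, task_types):
    # Stand-in for the speech synthesis model: a real model would return a
    # waveform; this placeholder only records what would be produced.
    return f"<speech of {text!r} using {'+'.join(task_types)}>"


# Two hypothetical task groups, e.g. for a bilingual output requirement.
task_groups = [
    ["text_analysis", "speech_synthesis_language_a"],
    ["text_analysis", "speech_synthesis_language_b"],
]
target_voices = run_task_groups("hello", task_groups, fake_synthesize)
```

Passing the synthesis function as a parameter keeps the grouping logic independent of whether synthesis runs on the user equipment or on the server side.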

Fig. 11 is a schematic structural diagram of an embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

the voice recognition module 1101 is configured to perform voice recognition on the collected user voice, and convert the collected user voice into a voice recognition text;

a first determining module 1102, configured to determine a plurality of elements obtained by segmentation of the speech recognition text;

a first processing module 1103, configured to perform task processing of a plurality of task types on the plurality of elements by using a text processing model;

a second processing module 1104, configured to obtain a target text corresponding to the speech recognition text according to task processing results corresponding to the plurality of task types respectively for each element;

a first output module 1105, configured to output the target text.

In some embodiments, the text processing model is obtained based on a plurality of sample elements obtained by segmentation of a training sample and training labels of the plurality of sample elements corresponding to the plurality of task types, respectively.

In some embodiments, the second processing module is specifically configured to perform comprehensive processing on each element according to task processing results corresponding to the multiple task types respectively to each element, so as to obtain a target text corresponding to the speech recognition text.

In some embodiments, the second processing module is specifically configured to combine task processing results corresponding to the multiple task types respectively with each element, and comprehensively process each element according to conflict processing rules corresponding to the multiple task types, so as to obtain a target text corresponding to the speech recognition text.

In some embodiments, the plurality of task types includes a smooth task, a punctuation task, and a reverse text normalized ITN task; the first processing module is specifically configured to determine, by using the text processing model, whether to perform smoothing processing, whether to perform ITN processing, and a punctuation type to be added on the plurality of elements;

the second processing module is specifically configured to perform smoothing processing on each element, add the punctuation type, and/or convert according to a corresponding ITN rule according to a result of determining whether each element performs smoothing processing, a result of determining whether each element performs ITN processing, and the punctuation type, so as to obtain a target text corresponding to the speech recognition text.

In some embodiments, the first determining module is specifically configured to determine a plurality of elements obtained by performing word segmentation or character segmentation on the speech recognition text.

In some embodiments, the first processing module is specifically configured to input the plurality of elements as an input sequence into the text processing model; encoding the input sequence by utilizing the text processing model to obtain element vectors of the plurality of elements, so that the element vector of each element contains element information of the rest elements; and respectively executing task processing of a plurality of task types on the element vectors of the plurality of elements.

The apparatus may further include:

the model training module is used for determining a plurality of sample elements obtained by cutting training samples; determining training labels of the plurality of sample elements corresponding to the plurality of task types respectively; and training the text processing model by using the plurality of sample elements and the training labels of the plurality of sample elements.

In some embodiments, the model training module training the text processing model by using the plurality of sample elements and the training labels of the plurality of sample elements includes: inputting the plurality of sample elements into the text processing model to obtain actual processing results of the plurality of sample elements for each of the plurality of task types; and performing parameter optimization on the text processing model based on a comparison, for each sample element, between the actual processing results and the training labels for the plurality of task types.

In some embodiments, the model training module performing parameter optimization on the text processing model based on the comparison, for each sample element, between the actual processing results and the training labels for the plurality of task types includes: comparing the actual processing result of each sample element for each task type with the corresponding training label to obtain a comparison value for each of the plurality of task types; weighting the plurality of comparison values of each sample element to obtain a return value for that sample element; and performing parameter optimization on the text processing model by using the return value of each sample element.
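The weighting of per-task comparison values can be sketched as a weighted sum of per-task losses. Cross-entropy as the comparison value, the task names, and the weights below are assumptions made for illustration; the embodiment does not fix the comparison function or the weights.

```python
import math
from typing import Dict, List


def cross_entropy(probs: List[float], label: int) -> float:
    """Comparison value for one task: negative log-probability of the label."""
    return -math.log(max(probs[label], 1e-12))


def return_value(predictions: Dict[str, List[float]],
                 labels: Dict[str, int],
                 task_weights: Dict[str, float]) -> float:
    """Weighted sum of the per-task comparison values for one sample element."""
    return sum(task_weights[task] * cross_entropy(predictions[task], labels[task])
               for task in labels)


# One sample element: predicted class probabilities and training labels per task.
predictions = {
    "smoothing": [0.5, 0.5],
    "punctuation": [1.0, 0.0, 0.0],
    "itn": [0.25, 0.75],
}
labels = {"smoothing": 0, "punctuation": 0, "itn": 1}
task_weights = {"smoothing": 1.0, "punctuation": 2.0, "itn": 1.0}

value = return_value(predictions, labels, task_weights)
```

In a real training loop this scalar would be the quantity back-propagated to update the shared model parameters; the optimizer itself is omitted here.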

In some embodiments, the model training module determining a plurality of sample elements obtained by segmenting the training samples includes: determining a plurality of sample elements obtained by performing character segmentation or word segmentation on the training samples.

In a practical application, when the text to be processed is a speech recognition text, the first determining module is specifically configured to determine a plurality of elements obtained by segmenting the speech recognition text;

the second processing module is specifically configured to obtain a target text corresponding to the speech recognition text according to task processing results corresponding to the plurality of task types respectively for each element.

The data processing apparatus shown in fig. 11 may perform the data processing method described in the embodiment shown in fig. 3, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

In one possible design, the data processing apparatus of the embodiment shown in FIG. 11 may be implemented as a computing device, as shown in FIG. 12, which may include a storage component 1201 and a processing component 1202;

the storage component 1201 stores one or more computer instructions for execution by the processing component 1202.

The processing component 1202 is configured to:

outputting the target text.

Of course, the computing device may also include other components, such as input/output interfaces and communication components. The input/output interface provides an interface between the processing component and a peripheral interface module, which may be an output device, an input device, etc. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices.

As used herein, a "computing device" may be a remote server, a computer networking device, a chipset, a desktop computer, a notebook computer, a workstation, or any other processing device or equipment.

Where the computing device is a remote server, the processing component, the storage component, etc. may be basic server resources rented or purchased from a cloud computing platform.

The embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer may implement the data processing method of the embodiment shown in fig. 3.

Fig. 13 is a schematic structural diagram of an embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

The voice collection module 1301 is configured to collect a user voice and send the user voice to a server, so that the server performs voice recognition on the user voice to convert the user voice into a voice recognition text, determines a plurality of elements obtained by segmentation of the voice recognition text, respectively performs task processing of a plurality of task types on the plurality of elements by using a text processing model, and obtains a target text corresponding to the voice recognition text according to task processing results of the plurality of task types respectively corresponding to each element;

a text obtaining module 1302, configured to obtain the target text sent by the server;

the text display module 1303 is configured to display the target text.
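The interaction among the three client-side modules can be sketched as below. The class name and the callables standing in for audio capture, the server round trip, and the display are hypothetical; no real network or audio API is implied.

```python
from typing import Callable, List


class TranscriptionClient:
    """Client-side flow: collect voice, send it to the server, display the result."""

    def __init__(self,
                 collect_voice: Callable[[], bytes],
                 send_to_server: Callable[[bytes], str],
                 display: Callable[[str], None]) -> None:
        self.collect_voice = collect_voice      # voice collection module
        self.send_to_server = send_to_server    # returns the server's target text
        self.display = display                  # text display module

    def run(self) -> str:
        audio = self.collect_voice()
        # The server performs recognition, segmentation, multi-task processing,
        # and assembly of the target text; the client only receives the result.
        target_text = self.send_to_server(audio)
        self.display(target_text)
        return target_text


# Stubbed usage: the lambdas replace a microphone and a remote server.
shown: List[str] = []
client = TranscriptionClient(
    collect_voice=lambda: b"raw-audio-bytes",
    send_to_server=lambda audio: "I paid 20 dollars today.",
    display=shown.append,
)
result = client.run()
```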

The data processing apparatus shown in fig. 13 may perform the data processing method described in the embodiment shown in fig. 7, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

In one possible design, the data processing apparatus of the embodiment shown in fig. 13 may be implemented as an electronic device which, as shown in fig. 14, may include a storage component 1401, a processing component 1402, a collection component 1403, and a display component 1404;

The storage component 1401 stores one or more computer instructions, wherein the one or more computer instructions are invoked for execution by the processing component 1402.

The processing component 1402 is configured to:

collecting user voice by utilizing the collection component 1403 and sending the user voice to a server side so that the server side can perform voice recognition and convert the user voice into voice recognition texts, determine a plurality of elements obtained by segmentation of the voice recognition texts, respectively execute task processing of a plurality of task types on the plurality of elements by utilizing a text processing model, and obtain target texts corresponding to the voice recognition texts according to task processing results of the plurality of task types respectively corresponding to each element;

acquiring the target text sent by the server;

the target text is displayed in the display component 1404.

The display component may be an electroluminescent (EL) element, a liquid crystal display or a microdisplay of similar structure, a laser-scanning display that projects directly onto the retina, or the like.

The processing component may output the target text by displaying the target text via the display component.

Of course, the electronic device may also include other components, such as input/output interfaces and communication components. The input/output interface provides an interface between the processing component and a peripheral interface module, which may be an output device, an input device, etc. The communication component is configured to facilitate wired or wireless communication between the electronic device and other devices.

The embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer may implement the data processing method of the embodiment shown in fig. 7.

Fig. 15 is a schematic structural diagram of another embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

a third processing module 1501 for inputting the voice to be processed into a voice processing model; performing task processing of a plurality of task types on the user voice by utilizing the voice processing model;

a fourth processing module 1502, configured to obtain a target voice according to a task processing result of the user voice corresponding to the plurality of task types;

a second output module 1503, configured to output the target voice.

The data processing apparatus shown in fig. 15 may perform the data processing method described in the embodiment shown in fig. 8, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

In one possible design, the data processing apparatus of the embodiment shown in FIG. 15 may be implemented as a computing device, as shown in FIG. 16, which may include a storage component 1601 and a processing component 1602;

the storage component 1601 stores one or more computer instructions that are invoked and executed by the processing component 1602.

The processing component 1602 is configured to:

inputting the voice to be processed into a voice processing model;

and outputting the target voice.

The embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer may implement the data processing method of the embodiment shown in fig. 8.

Fig. 17 is a schematic structural diagram of another embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

a fifth processing module 1701 for inputting text to be processed into a speech synthesis model; executing task processing of a plurality of task types on the text to be processed by utilizing the voice synthesis model;

a sixth processing module 1702 configured to obtain a target voice according to a task processing result of the text to be processed corresponding to the plurality of task types;

and a third output module 1703 for outputting the target voice.

The data processing apparatus shown in fig. 17 may perform the data processing method described in the embodiment shown in fig. 9, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

In one possible design, the data processing apparatus of the embodiment shown in FIG. 17 may be implemented as a computing device, which may include a storage component 1801 and a processing component 1802, as shown in FIG. 18;

the storage component 1801 stores one or more computer instructions for execution by the processing component 1802.

The processing component 1802 is configured to:

inputting the text to be processed into a speech synthesis model;

and outputting the target voice.

The embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer may implement the data processing method of the embodiment shown in fig. 9.

Fig. 19 is a schematic structural diagram of another embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

a seventh processing module 1901 for inputting text to be processed into the speech synthesis model; respectively executing at least one group of task processing on the text to be processed by utilizing the voice synthesis model; wherein each set of task processes corresponds to a plurality of task types;

an eighth processing module 1902, configured to obtain, according to a task processing result of the text to be processed corresponding to a plurality of task types in each group, a target voice corresponding to each group of task processing;

a fourth output module 1903 for outputting at least one target voice.

The data processing apparatus shown in fig. 19 may perform the data processing method described in the embodiment shown in fig. 10, and its implementation principle and technical effects are not repeated. The specific manner in which the respective modules and units of the data processing apparatus in the above embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

In one possible design, the data processing apparatus of the embodiment shown in FIG. 19 may be implemented as a computing device, which may include a storage component 2001 and a processing component 2002, as shown in FIG. 20;

the storage component 2001 stores one or more computer instructions for execution by the processing component 2002.

The processing component 2002 is configured to:

inputting the text to be processed into a speech synthesis model;

at least one target voice is output.

The embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a computer may implement the data processing method of the embodiment shown in fig. 10.

In the various embodiments described above, the processing component may include one or more processors that execute computer instructions to perform all or part of the steps of the methods described above. Of course, the processing component may also be implemented as one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.

The storage component is configured to store various types of data to support operations in the computing device. The storage component may be implemented by any type of volatile or nonvolatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A method of data processing, comprising:

outputting the target text;

the text processing model is obtained through training based on a plurality of sample elements obtained through segmentation of training samples and training labels respectively corresponding to the plurality of task types;

combining task processing results of each element corresponding to the task types respectively, and comprehensively processing each element to obtain a target text corresponding to the voice recognition text;

the task types comprise a smoothing task, a punctuation task and an inverse text normalization (ITN) task;

the performing task processing of a plurality of task types on the plurality of elements respectively by using the text processing model comprises:

according to the determination of whether each element is to be smoothed, the determination of whether each element is to undergo ITN processing, and the punctuation type, smoothing each element, adding the punctuation type and/or converting according to the corresponding ITN rule, to obtain a target text corresponding to the voice recognition text;

the text processing model is specifically obtained by pre-training in the following way:

determining training labels of the plurality of sample elements corresponding to the plurality of task types respectively;

the training the text processing model using the plurality of sample elements and the training labels of each of the plurality of sample elements includes:

performing parameter optimization on the text processing model based on the comparison result of the actual processing results of the plurality of task types and the training labels, which are respectively corresponding to each sample element;

the performing parameter optimization on the text processing model based on the comparison result of the actual processing results of the plurality of task types and the training labels, wherein each sample element corresponds to one of the plurality of task types, includes:

2. The method of claim 1, wherein the combining task processing results that each element corresponds to the task types respectively, performing integrated processing on each element, and obtaining the target text corresponding to the speech recognition text includes:

3. The method of claim 1, wherein determining a plurality of elements obtained by segmentation of speech recognition text comprises:

and determining a plurality of elements obtained by performing character segmentation or word segmentation on the voice recognition text.

4. The method of claim 1, wherein performing task processing of a plurality of task types on the plurality of elements, respectively, using the text processing model comprises:

5. The method according to claim 1, characterized in that the text processing model is pre-trained in particular in the following way:

and training the text processing model by using the plurality of sample elements and the training labels of the plurality of sample elements.

6. The method of claim 1, wherein training the text processing model using the plurality of sample elements and training labels for each of the plurality of sample elements comprises:

7. The method of claim 1, wherein the performing parameter optimization on the text processing model based on the comparison result of the actual processing result and the training label for each sample element respectively corresponding to the plurality of task types comprises:

8. The method of claim 1, wherein determining a plurality of sample elements obtained by segmentation of a training sample comprises:

and determining a plurality of sample elements obtained by performing character segmentation or word segmentation on the training samples.

9. The method of claim 1, wherein the converting the collected user speech into speech recognition text comprises:

receiving user voice acquired by a client;

performing voice recognition on the user voice, and converting the user voice into voice recognition text;

the outputting the target text includes:

and sending the target text to the client so as to output the target text on a display interface of the client.

10. A method of data processing, comprising:

collecting user voice and sending the user voice to a server side, so that the server side can perform voice recognition on the user voice to convert the user voice into voice recognition texts, determine a plurality of elements obtained by segmentation of the voice recognition texts, respectively execute task processing of a plurality of task types on the plurality of elements by using a text processing model, and obtain target texts corresponding to the voice recognition texts according to task processing results of the plurality of task types respectively corresponding to each element; the text processing model is obtained through training based on a plurality of sample elements obtained through segmentation of training samples and training labels respectively corresponding to the plurality of task types;

acquiring the target text sent by the server;

displaying the target text;

11. A method of data processing, comprising:

inputting the voice to be processed into a voice processing model;

outputting the target voice;

performing voice recognition on the target voice and converting the target voice into voice recognition text;

outputting the target text;

12. The method of claim 11, wherein the speech processing model is obtained based on training samples and training labels for the plurality of task types, respectively.

13. The method of claim 12, wherein performing task processing of a plurality of task types on the speech to be processed using the speech processing model comprises:

14. The method of claim 12, wherein the outputting the target speech comprises:

and playing the target voice in the user equipment.

15. The method of claim 14, wherein inputting the speech to be processed into the speech processing model comprises:

acquiring voice to be processed transmitted to the user equipment;

and inputting the voice to be processed into a voice processing model.

16. A method of data processing, comprising:

inputting the text to be processed into a speech synthesis model;

outputting the target voice;

outputting the target text;

17. The method of claim 16, wherein the speech synthesis model is obtained based on training samples and training labels for the plurality of task types, respectively.

18. The method of claim 16, wherein performing tasks of a plurality of task types on the text to be processed using the speech synthesis model comprises:

determining a use scene corresponding to the text to be processed;

19. The method of claim 16, wherein the outputting the target speech comprises:

and playing the target voice in the user equipment.

20. A method of data processing, comprising:

inputting the text to be processed into a speech synthesis model;

respectively outputting at least one target voice;

outputting the target text;

21. A method of data processing, comprising:

obtaining a target text corresponding to the text to be processed according to task processing results of the plurality of task types respectively corresponding to each element;

the obtaining the target text corresponding to the text to be processed according to the task processing results of the task types respectively corresponding to each element comprises the following steps:

combining task processing results of each element corresponding to the task types respectively, and comprehensively processing each element to obtain a target text corresponding to the text to be processed;

according to the determination of whether each element is to be smoothed, the determination of whether each element is to undergo ITN processing, and the punctuation type, smoothing each element, adding the punctuation type and/or converting according to the corresponding ITN rule, to obtain a target text corresponding to the text to be processed;

22. A method of data processing, comprising:

training a text processing model by using the plurality of sample elements and training labels of the plurality of sample elements;

the text processing model is used for respectively executing task processing of a plurality of task types for a plurality of elements obtained by segmenting a text to be processed; according to task processing results of the plurality of task types, which correspond to each element, obtaining a target text corresponding to the text to be processed;

the text processing model is used for respectively executing task processing of a plurality of task types on a plurality of elements obtained by segmenting the text to be processed, and comprises the following steps:

23. A data processing apparatus, comprising:

the first processing module is used for respectively executing task processing of a plurality of task types on the plurality of elements by utilizing a text processing model, wherein the plurality of task types comprise a smoothing task, a punctuation task and an inverse text normalization (ITN) task;

the first output module is used for outputting the target text;

the first processing module is specifically configured to: determine, by using the text processing model, whether each of the plurality of elements is to be smoothed, whether it is to undergo ITN processing, and which punctuation type is to be added;

the second processing module is specifically configured to: according to the determination of whether each element is to be smoothed, the determination of whether each element is to undergo ITN processing, and the punctuation type, smooth each element, add the punctuation type and/or convert according to the corresponding ITN rule, to obtain a target text corresponding to the voice recognition text;

24. A data processing apparatus, comprising:

the voice acquisition module is used for acquiring user voice and sending the user voice to the server side, so that the server side performs voice recognition on the user voice to convert it into a voice recognition text, determines a plurality of elements obtained by segmentation of the voice recognition text, respectively executes task processing of a plurality of task types on the plurality of elements by using a text processing model, and obtains a target text corresponding to the voice recognition text according to the task processing results of the plurality of task types corresponding to each element; the text processing model is obtained through training based on a plurality of sample elements obtained through segmentation of training samples and training labels respectively corresponding to the plurality of task types, wherein the plurality of task types comprise a smoothing task, a punctuation task and an inverse text normalization (ITN) task;

the text display module is used for displaying the target text;

the voice acquisition module is specifically used for: combining task processing results of each element corresponding to the task types respectively, and comprehensively processing each element to obtain a target text corresponding to the voice recognition text;

the voice acquisition module is specifically used for: determining, by using the text processing model, whether each of the plurality of elements is to be smoothed, whether it is to undergo ITN processing, and which punctuation type is to be added;

25. A data processing apparatus, comprising:

the third processing module is used for inputting the voice to be processed into the voice processing model; performing task processing of a plurality of task types on the user voice by utilizing a voice processing model;

the second output module is used for outputting the target voice;

the first recognition module is used for carrying out voice recognition on the target voice and converting the target voice into voice recognition texts;

outputting the target text;

26. A data processing apparatus, comprising:

the third output module is used for outputting the target voice;

the second recognition module is used for carrying out voice recognition on the target voice and converting the target voice into voice recognition texts;

outputting the target text;

27. A data processing apparatus, comprising:

a fourth output module for outputting at least one target voice;

the third recognition module is used for carrying out voice recognition on the target voice and converting the target voice into voice recognition texts;

outputting the target text;

28. A computing device comprising a processing component and a storage component;

the processing component is configured to:

outputting the target text;

29. An electronic device, comprising a processing component, a display component, a collection component and a storage component;

the processing component is configured to:

collecting user voice by utilizing the collection component and sending the user voice to a server side, so that the server side performs voice recognition on the user voice to convert it into a voice recognition text, determines a plurality of elements obtained by segmentation of the voice recognition text, respectively executes task processing of a plurality of task types on the plurality of elements by utilizing a text processing model, and obtains a target text corresponding to the voice recognition text according to task processing results of the plurality of task types respectively corresponding to each element; the text processing model is obtained through training based on a plurality of sample elements obtained through segmentation of training samples and training labels respectively corresponding to the plurality of task types;

acquiring the target text sent by the server;

displaying the target text in the display component;

30. A computing device comprising a processing component and a storage component;

the processing component is configured to:

inputting the voice to be processed into a voice processing model;

outputting the target voice;

outputting the target text;

31. A computing device comprising a processing component and a storage component;

the processing component is configured to:

inputting the text to be processed into a speech synthesis model;

outputting the target voice;

outputting the target text;

32. A computing device comprising a processing component and a storage component;

the processing component is configured to:

inputting the text to be processed into a speech synthesis model;

outputting at least one target voice;

outputting the target text;