CN113420121A - Text processing model training method, voice text processing method and device - Google Patents

Text processing model training method, voice text processing method and device

Info

Publication number
CN113420121A
Authority
CN
China
Prior art keywords
processing model
text processing
text
feature vector
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110704938.1A
Other languages
Chinese (zh)
Other versions
CN113420121B (en)
Inventor
周军
张震
李成章
李鹏
刘建
石瑾
刘睿霖
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, National Computer Network and Information Security Management Center filed Critical Institute of Acoustics CAS
Priority to CN202110704938.1A priority Critical patent/CN113420121B/en
Publication of CN113420121A publication Critical patent/CN113420121A/en
Application granted granted Critical
Publication of CN113420121B publication Critical patent/CN113420121B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text processing model training method, a voice text processing method and a voice text processing device, and relates to the technical field of natural language processing. The method comprises the following steps: crawling dialog text from the Internet to obtain positive samples; performing a transformation operation on the sentences in the dialog text to obtain negative samples and first label information of the negative samples; correspondingly inputting the positive samples and the negative samples into a pre-trained first text processing model and a to-be-trained second text processing model to generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model; and performing knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model. The embodiment of the application can solve the problems in the related art that proofreading a voice text is inefficient, time-consuming and demanding on computing resources.

Description

Text processing model training method, voice text processing method and device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text processing model training method, a speech text processing method, and a speech text processing device.
Background
With the development of natural language processing technology and people's demand for efficiency, speech recognition technology has been widely applied in many areas of everyday life, such as recording conference content and converting it into text as a conference summary, or converting recordings of teachers' lectures into text as classroom notes.
At present, in order to recognize speech accurately and make the result easy for users to understand, the recognized speech is first converted into text and the text is then proofread, so that a text that is easy for the user to understand is obtained. However, when the text proofreading model in the related art is trained, the lack of sufficient training samples makes the model difficult to train; even when training succeeds, the resulting text proofreading model needs multiple iterations to complete proofreading, which is time-consuming, inefficient, and demanding on computing resources.
Disclosure of Invention
The embodiment of the application provides a text processing model training method, a voice text processing method and a voice text processing device, and can solve the problems that in the related art, the efficiency of proofreading a voice text is low, the time consumption is long, and the occupation of computing resources is large.
In a first aspect, an embodiment of the present application provides a text processing model training method, including:
crawling a dialog text from the Internet to obtain a positive sample; the sentences in the dialog text are sentences with correct grammars, and the positive samples are the sentences in the dialog text;
performing conversion operation on the sentences in the dialog text to obtain negative examples and first label information of the negative examples, wherein the sentences in the negative examples are sentences with wrong syntax, and the first label information represents a conversion sequence for converting the positive examples into the negative examples;
correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model to generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model; the dimensionality of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained by training according to the transformation sequence of the positive sample, the negative sample and the negative sample;
and carrying out knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model.
In a possible implementation manner, under the condition that the number of layers of the first text processing model is the same as that of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first prediction correction vector for correcting the negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second prediction correction vector for correcting the negative samples.
In one possible implementation manner, performing knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model, including:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer eigenvector and the second hidden layer eigenvector;
calculating a third mean square error loss between the first attention vector and the second attention vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
In a possible implementation manner, when the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first prediction correction vector that corrects the negative sample, and the second feature vector includes a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second prediction correction vector that corrects the negative sample.
In one possible implementation manner, performing knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model, including:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
comparing the first attention vector of each layer in the M layers with the second attention vector of each layer in the N layers pairwise to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer characteristic vector of each of the M layers with the second hidden layer characteristic vector of each of the N layers pairwise to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first earth mover's distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between a first attention vector of the M layers in the first text processing model and a second attention vector of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
calculating a fifth mean square error loss between the first hidden layer feature vector of the M layers in the first text processing model and the second hidden layer feature vector of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss until the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss converge.
In one possible implementation, the method further includes:
inputting the positive sample and the negative sample into a second text processing model to be trained, and generating a prediction proofreading sequence of the negative sample;
and training the text processing model according to the prediction proofreading sequence of the negative sample and the first label information.
In one possible implementation, the training samples further include a positive sample pair and second label information of the positive sample pair, the second label information represents a conversion sequence for converting the positive sample into the positive sample, and two positive samples in the positive sample pair are the same, and the method further includes:
inputting the positive sample pair into a trained text processing model to generate a prediction proofreading sequence of the positive sample;
and training the text processing model according to the prediction proofreading sequence of the positive sample and the second label information.
In one possible implementation, inputting the positive sample and the negative sample into a second text processing model to be trained, and generating a prediction proofreading sequence of the negative sample includes:
under the condition that the number of the characters in the positive sample is larger than the preset number, inputting the characters of the preset number in the positive sample and the characters corresponding to the characters of the preset number in the positive sample in the negative sample into a second text processing model to be trained according to the sequence from front to back to obtain a prediction proofreading sequence of the characters of the preset number in the negative sample;
and taking the characters remaining in the positive sample and the characters remaining in the negative sample as training samples of the next model training process.
In a second aspect, an embodiment of the present application provides a method for processing a speech text, where the method includes:
recognizing a voice text corresponding to the target voice;
inputting the voice text into a second text processing model according to the first aspect or any one of the possible implementation manners of the first aspect, and determining a collation sequence of the voice text, where the collation sequence represents a collation rule for each character in the voice text;
and correcting the voice text according to the correction sequence to obtain a correction text corresponding to the target voice.
In a third aspect, an embodiment of the present application provides a text processing model training apparatus, where the apparatus includes:
the acquisition module is used for crawling the dialog text from the Internet to obtain a positive sample; the sentences in the dialog text are sentences with correct grammars, and the positive samples are the sentences in the dialog text;
the transformation module is used for performing a transformation operation on the sentences in the dialog text to obtain a negative sample and first label information of the negative sample, wherein the sentences in the negative sample are sentences with grammatical errors, and the first label information represents a transformation sequence for transforming the positive sample into the negative sample;
the generating module is used for correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model to generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model; the dimensionality of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained by training according to the transformation sequence of the positive sample, the negative sample and the negative sample;
and the training module is used for carrying out knowledge distillation on the second text processing model according to the first characteristic vector and the second characteristic vector to obtain a trained second text processing model.
In a possible implementation manner, under the condition that the number of layers of the first text processing model is the same as that of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first prediction correction vector for correcting the negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second prediction correction vector for correcting the negative samples.
In one possible implementation, the training module is configured to:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer eigenvector and the second hidden layer eigenvector;
calculating a third mean square error loss between the first attention vector and the second attention vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
In a possible implementation manner, when the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first prediction correction vector that corrects the negative sample, and the second feature vector includes a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second prediction correction vector that corrects the negative sample.
In one possible implementation, the training module is configured to:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
comparing the first attention vector of each layer in the M layers with the second attention vector of each layer in the N layers pairwise to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer characteristic vector of each of the M layers with the second hidden layer characteristic vector of each of the N layers pairwise to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first earth mover's distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between a first attention vector of the M layers in the first text processing model and a second attention vector of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
calculating a fifth mean square error loss between the first hidden layer feature vector of the M layers in the first text processing model and the second hidden layer feature vector of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss until the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss converge.
In one possible implementation, the apparatus further includes:
the determining module is used for determining a transformation sequence corresponding to the negative sample according to the transformation operation to obtain first label information of the negative sample; wherein the first tag information represents a transform sequence for transforming positive samples into negative samples;
the generating module is also used for inputting the positive sample and the negative sample into a second text processing model to be trained and generating a prediction proofreading sequence of the negative sample;
the training module is also used for training the text processing model according to the prediction proofreading sequence of the negative sample and the first label information.
In a possible implementation manner, the training samples further include a positive sample pair and second label information of the positive sample pair, the second label information represents a conversion sequence for converting the positive sample into the positive sample, two positive samples in the positive sample pair are the same, and the generation module is further configured to input the positive sample pair into a trained text processing model, and generate a prediction proofreading sequence of the positive sample;
the training module is also used for training the text processing model according to the prediction proofreading sequence of the positive sample and the second label information.
In one possible implementation, the generation module is configured to:
under the condition that the number of the characters in the positive sample is larger than the preset number, inputting the characters of the preset number in the positive sample and the characters corresponding to the characters of the preset number in the positive sample in the negative sample into a second text processing model to be trained according to the sequence from front to back to obtain a prediction proofreading sequence of the characters of the preset number in the negative sample;
and taking the characters remaining in the positive sample and the characters remaining in the negative sample as training samples of the next model training process.
In a fourth aspect, an embodiment of the present application provides a speech text processing apparatus, where the apparatus includes:
the recognition module is used for recognizing the voice text corresponding to the target voice;
a determining module, configured to input the voice text into a second text processing model according to the first aspect or any one of the possible implementation manners of the first aspect, and determine a collation sequence of the voice text, where the collation sequence represents a collation rule for each character in the voice text;
and the proofreading module is used for proofreading the voice text according to the proofreading sequence to obtain a proofreading text corresponding to the target voice.
In a fifth aspect, embodiments of the present application provide an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the method provided in the first aspect or any one of the possible implementations of the first aspect, or implements the method provided in the second aspect.
In a sixth aspect, embodiments of the present application provide a computer storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method provided in the first aspect or any one of the possible implementation manners of the first aspect, or implement the method provided in the second aspect.
According to the text processing model training method, the voice text processing method and the voice text processing device, the positive samples are obtained by crawling dialogue texts with correct grammars from the Internet, such as texts related to scene dialogs, conference summary texts and the like, and thus a large number of positive samples can be obtained by crawling the dialogue texts from the Internet; then, the sentences in the dialog text are subjected to transformation operations, such as character deletion, homophone word replacement, natural segment combination and the like, so that the transformed sentences are all sentences with wrong grammars, and negative examples are obtained, and label information of the negative examples and the negative examples is obtained through the transformation operations, so that a large number of negative examples with labeling information can be obtained. And then, respectively inputting the positive sample and the negative sample to the trained first text processing model and the second text processing model to be trained to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, wherein the dimensionality of the second text processing model is smaller than that of the first text processing model. And performing knowledge distillation on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model can learn the features of the first text processing model, and then proofreading the text. Therefore, the second text processing model with the light weight can be obtained to correct the text, and the occupation of resources is reduced. Iteration is not needed in the process of text proofreading by using the second text processing model, so that the text proofreading efficiency is improved, and meanwhile, the occupation of computing resources is reduced.
Drawings
FIG. 1 is a flowchart illustrating a text processing model training method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a method for processing a speech text according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram illustrating a text processing model training apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a speech text processing apparatus according to an embodiment of the present application;
fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.
In the description of the embodiments of the present application, the words "exemplary," "for example," or "for instance" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary," "for example," or "for instance" is not to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of these words is intended to present relevant concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" is only one kind of association relationship describing an associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, B exists alone, and A and B exist at the same time. In addition, the term "plurality" means two or more unless otherwise specified. For example, the plurality of systems refers to two or more systems, and the plurality of screen terminals refers to two or more screen terminals.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
With the development of natural language processing technology and people's demand for efficiency, speech recognition technology has been widely applied in many areas of everyday life, such as recording conference content and converting it into text as a conference summary, or converting recordings of teachers' lectures into text as classroom notes.
Dialog content is usually lengthy, meandering, informal and repetitive, with broken sentences, backtracking, repetition, reconfirmation, hesitation, speaker interruptions and the like, and important information is dispersed across multiple speakers and multiple points in time. Combined with recognition errors introduced during speech recognition, the generated speech recognition text often has poor readability, which makes the content difficult to review, summarize and organize afterwards.
At present, in order to recognize speech accurately and make the result easy for users to understand, the recognized speech is first converted into text and the text is then proofread, so that a text that is easy for the user to understand is obtained. However, when the text proofreading model in the related art is trained, the lack of sufficient training samples makes the model difficult to train; even when training succeeds, the resulting text proofreading model needs multiple iterations to complete proofreading, which is time-consuming, inefficient, and demanding on computing resources.
Based on this, the embodiment of the application provides a text processing model training method, a voice text processing method and a voice text processing device, which can obtain sufficient training texts, and the trained models are lightweight models, so that the occupation of storage resources is reduced, the text proofreading efficiency is improved, and the occupation of computing resources is reduced. The following describes the text processing model training method provided in the embodiment of the present application in detail.
Fig. 1 is a schematic flowchart of a text processing model training method according to an embodiment of the present application. As shown in fig. 1, the text processing model training method provided in the embodiment of the present application may include S101-S104.
S101: crawling a dialog text from the Internet to obtain a positive sample; the sentences in the dialog text are sentences with correct grammar, and the positive samples are the sentences in the dialog text.
To obtain a large number of training samples, dialog text, such as forum dialog text, scenario dialog text, video captions, scripts, etc., may be crawled from the internet. Wherein, the sentences in the dialogue texts in the Internet are sentences with correct grammar. And taking the sentence in the dialog text as a positive sample in the training sample.
In some embodiments, to ensure the accuracy of the positive sample, the crawled dialog text may also be data washed, such as to remove special characters and meaningless spaces, links, pictures, etc. in the dialog text.
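As an illustrative aid, the following Python sketch shows one possible form of this cleaning step; the regular expressions, the forum image-markup format and the character whitelist are assumptions, not details specified by the embodiment.

    import re

    def clean_dialog_line(line: str) -> str:
        line = re.sub(r"https?://\S+", "", line)                 # remove links
        line = re.sub(r"\[img\].*?\[/img\]", "", line)           # remove forum image markup (assumed format)
        line = re.sub(r"[^\w\u4e00-\u9fff，。？！、；：,.?!\s]", "", line)  # drop special characters
        return re.sub(r"\s+", " ", line).strip()                 # collapse meaningless spaces

    print(clean_dialog_line("我们明天上午开会 http://example.com [img]a.png[/img] 请准时参加！"))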
S102: and performing conversion operation on the sentences in the dialog text to obtain negative examples and first label information of the negative examples, wherein the sentences in the negative examples are sentences with wrong syntax, and the first label information represents a conversion sequence for converting the positive examples into the negative examples.
The negative sample corresponding to a positive sample is a sentence with grammatical errors, so the positive sample and the negative sample together form a parallel corpus pair for text error correction. In the embodiment of the application, the sentences in the dialog text are turned into grammatically incorrect sentences through transformation operations, for example deleting punctuation, replacing homophones, merging natural paragraphs, deleting characters at random, adding characters, and so on. In this way, negative samples corresponding to the positive samples can be generated. For example, the positive sample "i am on your beach street, you are coming home" can be converted into "i am then on your beach street", and so on.
A transform sequence that transforms the negative examples into the positive examples may also be determined based on the transform operation, resulting in first label information for the negative examples. Wherein the transformation sequence represents a transformation operation corresponding to each character in a sentence. For example, the sentence "i go back to your fast-forwarding bar on the beach street" is converted into the sentence "i go back to your fast-forwarding bar on the beach street", and the corresponding conversion sequence is "hold, replace with, hold, delete, hold".
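A minimal sketch of this sample-construction step is given below, assuming a character-level tag set (KEEP / REPLACE_x / DELETE) as one possible encoding of the "hold / replace with / delete" operations; the homophone table, corruption probabilities and filler character are illustrative assumptions.

    import random

    HOMOPHONES = {"在": "再", "的": "地", "做": "作"}   # tiny illustrative table

    def corrupt(sentence: str, seed: int = 0):
        """Corrupt a correct sentence and record, per character of the negative sample,
        the edit that restores the positive sample."""
        rng = random.Random(seed)
        neg_chars, labels = [], []
        for ch in sentence:
            if ch in HOMOPHONES and rng.random() < 0.5:
                neg_chars.append(HOMOPHONES[ch])    # homophone replacement
                labels.append(f"REPLACE_{ch}")      # correction: replace back with the original character
            else:
                neg_chars.append(ch)
                labels.append("KEEP")               # correction: hold
            if rng.random() < 0.1:
                neg_chars.append("嗯")              # randomly insert a filler character
                labels.append("DELETE")             # correction: delete the insertion
        return "".join(neg_chars), labels

    negative, tags = corrupt("我在海滩街上等你")
    print(negative, tags)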
S103: correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model to generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model; the dimensionality of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained by training according to the transformation sequence of the positive sample, the negative sample and the negative sample.
The first text processing model is obtained by training a transformation sequence of positive samples, negative samples and negative samples in advance. The second text processing model may be a pre-constructed untrained model, or may be a model generated based on the first text processing model, for example, by extracting a plurality of intermediate layers from the first text processing model to construct the second text processing model. Here, the dimensions of the second text processing model are smaller than the dimensions of the first text processing model.
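One possible construction of the smaller second model is sketched below, under the assumption that the student keeps the teacher's hidden size and simply takes every second encoder layer; the embodiment also allows a student with a smaller hidden dimension, in which case the layers cannot be copied directly and the projection matrix introduced later bridges the two dimensions.

    from transformers import BertConfig, BertModel

    teacher = BertModel(BertConfig(num_hidden_layers=12, hidden_size=768))  # stands in for the trained first model
    student = BertModel(BertConfig(num_hidden_layers=6, hidden_size=768))

    # Copy every second encoder layer of the teacher, plus the embedding (input) layer, into the student.
    for s_idx, t_idx in enumerate(range(1, 12, 2)):
        student.encoder.layer[s_idx].load_state_dict(teacher.encoder.layer[t_idx].state_dict())
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())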
And correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model to generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model.
In some embodiments, the target layers of the first text-processing model include an input layer, a hidden layer, and an output layer, and the target layers of the second text-processing model include an input layer, a hidden layer, and an output layer.
The number of layers of the first text processing model and the number of layers of the second text processing model may be the same or different.
Under the condition that the number of layers of the first text processing model is the same as that of the second text processing model, the first feature vector comprises a first input layer feature vector, a first hidden layer feature vector, a first attention vector and a first prediction correction vector for correcting negative samples. The first hidden layer feature vector refers to the feature vectors determined by all hidden layers of the first text processing model; for example, if there are 3 hidden layers, the first hidden layer feature vector is determined by all 3 of those hidden layers. The first attention vector is the attention vector determined by all hidden layers of the first text processing model. The second feature vector comprises a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second prediction correction vector for correcting negative samples. The second hidden layer feature vector is the feature vector determined by all hidden layers of the second text processing model, and the second attention vector is the attention vector determined jointly by all hidden layers of the second text processing model.
And under the condition that the number of layers of the first text processing model is different from that of the second text processing model, setting the number of layers of the first text processing model as M and the number of layers of the second text processing model as N, the first feature vector comprises a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector and a first prediction correction vector for correcting the negative sample, and the second feature vector comprises a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector and a second prediction correction vector for correcting the negative sample.
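The sketch below illustrates how the target-layer vectors of S103 could be collected with a Transformer encoder; the choice of BERT, the student configuration and the use of a pretrained Chinese checkpoint are assumptions, and the prediction correction vectors would come from a token-classification head that is omitted here.

    import torch
    from transformers import BertConfig, BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    teacher = BertModel.from_pretrained("bert-base-chinese",
                                        output_hidden_states=True, output_attentions=True)
    student = BertModel(BertConfig(num_hidden_layers=4, hidden_size=312,
                                   num_attention_heads=12, intermediate_size=1200,
                                   vocab_size=teacher.config.vocab_size,
                                   output_hidden_states=True, output_attentions=True))

    batch = tokenizer(["我再海滩街上等你嗯"], return_tensors="pt")   # a toy negative sample
    with torch.no_grad():
        t_out = teacher(**batch)
        s_out = student(**batch)

    # hidden_states[0] is the input-layer (embedding) vector; the remaining entries are the hidden layers.
    t_input, t_hidden, t_attn = t_out.hidden_states[0], t_out.hidden_states[1:], t_out.attentions
    s_input, s_hidden, s_attn = s_out.hidden_states[0], s_out.hidden_states[1:], s_out.attentions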
S104: and carrying out knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model.
In some embodiments, knowledge distillation is performed on the second text processing model based on the first feature vector and the second feature vector, such that the second text processing model learns the parametric features of the first text processing model, where the number of layers of the first text processing model is the same as the number of layers of the second text processing model.
In S104, a projection matrix is determined according to the dimension of the first text processing model and the dimension of the second text processing model.
And calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector. Wherein the first mean square error loss L_embd satisfies the following formula (1):
L_embd = MSE(E_S W, E_T)    (1)
where E_S denotes the second input layer feature vector of the second text processing model, W denotes the projection matrix, and E_T denotes the first input layer feature vector of the first text processing model.
And calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector. Wherein the second mean square error loss L_hidden satisfies the following formula (2):
L_hidden = Σ_i MSE(H_i^S W, H_i^T)    (2)
where W denotes the projection matrix, H_i^S denotes the second hidden layer feature vector of the i-th layer of the second text processing model, and H_i^T denotes the first hidden layer feature vector of the i-th layer of the first text processing model.
A third mean square error loss between the first attention vector and the second attention vector is then calculated. Wherein the third mean square error loss L_atten satisfies the following formula (3):
L_atten = (1/h) Σ_i MSE(A_i^S, A_i^T)    (3)
where h is the number of attention heads, A_i^S denotes the attention vector of the i-th layer of the second text processing model, and A_i^T denotes the attention vector of the i-th layer of the first text processing model.
And calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter. Wherein the cross entropy loss L_pred satisfies the following formula (4):
L_pred = -softmax(z_T) · log_softmax(z_S / t)    (4)
where z_T denotes the first prediction correction vector, z_S denotes the second prediction correction vector, and t denotes the temperature parameter.
And updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
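A condensed PyTorch sketch of the four terms in formulas (1)-(4) for the equal-layer-count case is given below; the tensor shapes, the equal weighting of the terms and the treatment of the 1/h factor (absorbed by the element-wise mean of mse_loss) are assumptions.

    import torch
    import torch.nn.functional as F

    def distillation_loss(t_input, s_input,     # input-layer vectors [B, L, d_T], [B, L, d_S]
                          t_hidden, s_hidden,   # lists of per-layer hidden states
                          t_attn, s_attn,       # lists of per-layer attention matrices [B, h, L, L]
                          t_logits, s_logits,   # prediction correction vectors [B, L, num_tags]
                          W,                    # projection matrix [d_S, d_T]
                          temperature=2.0):
        l_embd = F.mse_loss(s_input @ W, t_input)                                   # formula (1)
        l_hidden = sum(F.mse_loss(s @ W, t) for s, t in zip(s_hidden, t_hidden))    # formula (2)
        l_atten = sum(F.mse_loss(s, t) for s, t in zip(s_attn, t_attn))             # formula (3)
        l_pred = -(F.softmax(t_logits, dim=-1) *
                   F.log_softmax(s_logits / temperature, dim=-1)).sum(-1).mean()    # formula (4)
        return l_embd + l_hidden + l_atten + l_pred

    # usage: loss = distillation_loss(...); loss.backward(); optimizer.step() updates the second model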
In some embodiments, in the case where the number of layers of the first text processing model is different from the number of layers of the second text processing model, knowledge distillation is performed on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model learns the parameter features of the first text processing model. Here, each layer of the first text processing model and each layer of the second text processing model has a weight; the larger a layer's weight, the greater its contribution when the second text processing model learns from the first text processing model. During initialization the same weight is assigned to every layer: for example, if the first text processing model has M layers and the second text processing model has N layers, the weight w_i^T of each layer in the first text processing model is 1/M and the weight w_j^S of each layer in the second text processing model is 1/N.
In step S104, a projection matrix is determined according to the dimension of the first text processing model and the dimension of the second text processing model; a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model is calculated according to the projection matrix, the first input layer feature vector and the second input layer feature vector, and a cross entropy loss of the first prediction correction vector and the second prediction correction vector is calculated according to a preset temperature parameter. The first mean square error loss satisfies the above formula (1) and the cross entropy loss satisfies the above formula (4), which are not repeated here.
Then, the first attention vector of each of the M layers and the second attention vector of each of the N layers are compared pairwise to obtain an attention loss matrix between the first text processing model and the second text processing model. And comparing the first hidden layer characteristic vector of each layer in the M layers with the second hidden layer characteristic vector of each layer in the N layers in pairs to obtain a hidden layer loss matrix between the first text processing model and the second text processing model.
Here, the attention loss matrix refers to a loss matrix between all layers of the first text processing model and all layers of the second text processing model; the hidden layer loss matrix refers to a loss matrix between all hidden layers of the first text processing model and all hidden layers of the second text processing model.
After the attention loss matrix and the hidden layer loss matrix are determined, a first earth mover's distance (EMD) matrix is calculated according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix. And a second EMD matrix is calculated according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix.
And calculating a fourth mean square error loss between the first attention vectors of the M layers in the first text processing model and the second attention vectors of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix. Wherein the fourth mean square error loss L_attn satisfies the following formula (5):
L_attn = Σ_{i=1..M} Σ_{j=1..N} f_ij^attn · D_ij^attn    (5)
where M denotes the number of layers of the first text processing model, N denotes the number of layers of the second text processing model, f_ij^attn denotes the element of the first EMD matrix between the i-th layer of the first text processing model and the j-th layer of the second text processing model, and D_ij^attn denotes the corresponding element of the attention loss matrix.
And calculating a fifth mean square error loss between the first hidden layer feature vectors of the M layers in the first text processing model and the second hidden layer feature vectors of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix. Wherein the fifth mean square error loss L_hidden satisfies the following formula (6):
L_hidden = Σ_{i=1..M} Σ_{j=1..N} f_ij^hidn · D_ij^hidn    (6)
where M denotes the number of layers of the first text processing model, N denotes the number of layers of the second text processing model, f_ij^hidn denotes the element of the second EMD matrix between the i-th layer of the first text processing model and the j-th layer of the second text processing model, and D_ij^hidn denotes the corresponding element of the hidden layer loss matrix.
And updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the fourth mean square error loss and the fifth mean square error loss until the fourth mean square error loss and the fifth mean square error loss converge.
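The sketch below illustrates the EMD computation for the mismatched-layer case using a standard transportation linear program; the use of SciPy, the pairwise-MSE cost and the toy dimensions are assumptions, and in practice the costs come from the attention and hidden layer loss matrices described above.

    import numpy as np
    from scipy.optimize import linprog

    def emd_flow(w_teacher, w_student, cost):
        """Solve the transportation problem: the flow matrix F minimizing sum(F * cost)."""
        M, N = cost.shape
        A_eq, b_eq = [], []
        for i in range(M):                        # each teacher layer sends out exactly its weight
            row = np.zeros(M * N); row[i * N:(i + 1) * N] = 1
            A_eq.append(row); b_eq.append(w_teacher[i])
        for j in range(N):                        # each student layer receives exactly its weight
            row = np.zeros(M * N); row[j::N] = 1
            A_eq.append(row); b_eq.append(w_student[j])
        res = linprog(cost.reshape(-1), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=(0, None), method="highs")
        return res.x.reshape(M, N)

    def layer_cost(teacher_feats, student_feats):
        """Pairwise MSE between every teacher layer and every student layer (the loss matrix)."""
        D = np.zeros((len(teacher_feats), len(student_feats)))
        for i, tf in enumerate(teacher_feats):
            for j, sf in enumerate(student_feats):
                D[i, j] = np.mean((tf - sf) ** 2)
        return D

    rng = np.random.default_rng(0)
    t_feats = [rng.normal(size=8) for _ in range(4)]   # M = 4 teacher layers (toy features)
    s_feats = [rng.normal(size=8) for _ in range(2)]   # N = 2 student layers
    w_t, w_s = np.full(4, 1 / 4), np.full(2, 1 / 2)    # uniform initial layer weights

    D_attn = layer_cost(t_feats, s_feats)
    F_attn = emd_flow(w_t, w_s, D_attn)
    print((F_attn * D_attn).sum())                     # formula (5)-style loss, with total flow 1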
In some embodiments, the method for training a text processing model provided by the embodiment of the present application further includes a training process of the first text processing model. Specifically, inputting a positive sample and a negative sample into a second text processing model to be trained, and generating a prediction proofreading sequence of the negative sample; and training the text processing model according to the prediction proofreading sequence of the negative sample and the first label information. Here, in order to ensure the accuracy of the first text processing model, manually labeled positive samples and negative samples may also be input into the first text processing model for training in the training process.
In some embodiments, to ensure accuracy of the first text processing model, the training samples further include a positive sample pair and second label information for the positive sample pair, the second label information representing a conversion sequence to convert the positive sample into the positive sample, the two positive samples in the positive sample pair being the same. In the training process, the positive sample pair can be input into a trained text processing model to generate a prediction proofreading sequence of the positive sample; and training the text processing model according to the prediction proofreading sequence of the positive sample and the second label information.
As one example, a sufficient number of synthesized positive and negative sample pairs are combined with real manually annotated data to perform a three-stage training that transitions from synthetic data to real data. First, the model is trained only on the synthesized pairs of grammatically incorrect sentences and their grammatically correct counterparts. Then, the parameters of the trained first text processing model are fine-tuned using a small number of manually labeled positive and negative samples. Finally, the parameters are further fine-tuned using a small number of manually labeled positive and negative samples together with positive sample pairs, thereby improving the performance of the model.
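A high-level sketch of this three-stage schedule follows; the train_epochs helper, the dataset variables, and the epoch and learning-rate values are hypothetical placeholders rather than values fixed by the embodiment.

    def three_stage_training(model, synthetic_pairs, labeled_pairs, positive_pairs, train_epochs):
        # Stage 1: synthesized (negative, positive) sentence pairs only
        train_epochs(model, synthetic_pairs, epochs=3, lr=5e-5)
        # Stage 2: fine-tune on a small set of manually labeled pairs
        train_epochs(model, labeled_pairs, epochs=2, lr=2e-5)
        # Stage 3: further fine-tune on labeled pairs plus identity (positive, positive) pairs
        train_epochs(model, labeled_pairs + positive_pairs, epochs=1, lr=1e-5)
        return model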
In some embodiments, since the second text processing model may have a limit on the number of input characters, long texts need to be segmented. In S103, when the number of characters in the positive sample is greater than a preset number, the first preset number of characters of the positive sample, together with the characters of the negative sample that correspond to them, are input into the second text processing model to be trained in order from front to back, so as to obtain a predicted collation sequence for the preset number of characters of the negative sample; the remaining characters of the positive sample and of the negative sample are then used as training samples for the next round of model training.
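A small sketch of this chunking step is shown below; it assumes the positive and negative samples are character-aligned (in practice the correspondence follows from the transformation sequence), and the character limit is illustrative.

    def chunk_pair(positive: str, negative: str, max_chars: int = 128):
        """Return the pair fed to the student this round and the remainder queued for the next round."""
        if len(positive) <= max_chars:
            return (positive, negative), None
        head = (positive[:max_chars], negative[:max_chars])
        tail = (positive[max_chars:], negative[max_chars:])
        return head, tail

    current_pair, remainder = chunk_pair("一" * 300, "二" * 300)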
According to the text processing model training method provided by the embodiment of the application, positive samples are obtained by crawling dialogue texts with correct grammars from the Internet, such as texts related to scene dialogue, conference summary texts and the like, and a large number of positive samples can be obtained by crawling the dialogue texts from the Internet; then, the sentences in the dialog text are subjected to transformation operations, such as character deletion, homophone word replacement, natural segment combination and the like, so that the transformed sentences are all sentences with wrong grammars, and negative examples are obtained, and label information of the negative examples and the negative examples is obtained through the transformation operations, so that a large number of negative examples with labeling information can be obtained. And then, respectively inputting the positive sample and the negative sample to the trained first text processing model and the second text processing model to be trained to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, wherein the dimensionality of the second text processing model is smaller than that of the first text processing model. And performing knowledge distillation on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model can learn the features of the first text processing model, and then proofreading the text. Therefore, the second text processing model with the light weight can be obtained to correct the text, and the occupation of resources is reduced. Iteration is not needed in the process of text proofreading by using the second text processing model, so that the text proofreading efficiency is improved, and meanwhile, the occupation of computing resources is reduced.
The embodiment of the present application further provides an application scheme of the second text processing model, which is described in detail below.
Fig. 2 is a schematic flowchart of a speech text processing method provided in an embodiment of the present application, and as shown in fig. 2, the speech text processing method provided in the embodiment of the present application may include S201 to S203.
S201: and recognizing the voice text corresponding to the target voice.
The target voice may be any voice obtained by any means, for example, a telephone recording, a conference recording, a voice generated during a voice chat. And after the target voice is acquired, identifying the target voice so as to determine a voice text corresponding to the target voice.
S202: and inputting the voice text into a second text processing model, and determining a proofreading sequence of the voice text, wherein the proofreading sequence represents a proofreading rule of each character in the voice text.
The phonetic text is input into the second text processing model, and a collation sequence of the phonetic text can be determined. For example, if the phonetic text is "like fireworks lost in the wind", the collation sequence is "hold, replace with like, hold,".
S203: and performing proofreading on the voice text according to the proofreading sequence to obtain a proofreading text corresponding to the target voice.
For example, the phonetic text is "as if fireworks were lost in the wind", the collation sequence is "hold, replace as, hold,", and the collated text is "as if fireworks were lost in the wind".
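A hedged sketch of S203 follows; the tag vocabulary (KEEP / DELETE / REPLACE_x) is an assumed encoding of the "hold / delete / replace with ..." rules in the collation sequence.

    def apply_collation(text: str, tags: list) -> str:
        out = []
        for ch, tag in zip(text, tags):
            if tag == "KEEP":                      # hold the character
                out.append(ch)
            elif tag == "DELETE":                  # drop the character
                continue
            elif tag.startswith("REPLACE_"):       # replace with the character carried by the tag
                out.append(tag[len("REPLACE_"):])
        return "".join(out)

    print(apply_collation("我再海滩街上等你嗯",
                          ["KEEP", "REPLACE_在"] + ["KEEP"] * 6 + ["DELETE"]))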
According to the method for processing the voice text, the positive samples are obtained by crawling dialogue texts with correct grammars from the Internet, such as texts related to scene dialogue, conference summary texts and the like, and a large number of positive samples can be obtained by crawling the dialogue texts from the Internet; then, the sentences in the dialog text are subjected to transformation operations, such as character deletion, homophone word replacement, natural segment combination and the like, so that the transformed sentences are all sentences with wrong grammars, and negative examples are obtained, and label information of the negative examples and the negative examples is obtained through the transformation operations, so that a large number of negative examples with labeling information can be obtained. And then, respectively inputting the positive sample and the negative sample to the trained first text processing model and the second text processing model to be trained to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, wherein the dimensionality of the second text processing model is smaller than that of the first text processing model. And performing knowledge distillation on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model can learn the features of the first text processing model, and then proofreading the text. Therefore, the second text processing model with the light weight can be obtained to correct the text, and the occupation of resources is reduced. Iteration is not needed in the process of text proofreading by using the second text processing model, so that the text proofreading efficiency is improved, and meanwhile, the occupation of computing resources is reduced.
Based on the text processing model training method in the above embodiment, the embodiment of the present application further provides a text processing model training device. Fig. 3 is a schematic structural diagram of a text processing model training apparatus 300 according to an embodiment of the present disclosure, and as shown in fig. 3, the apparatus 300 may include an obtaining module 301, a transforming module 302, a generating module 303, and a training module 304.
The acquisition module 301 is configured to crawl dialog texts from the Internet to obtain positive samples; the sentences in the dialog texts are grammatically correct sentences, and the positive samples are the sentences in the dialog texts;
a transformation module 302, configured to perform transformation operations on the sentences in the dialog text to obtain a negative sample and first label information of the negative sample, where the sentences in the negative sample are grammatically incorrect sentences, and the first label information represents a transformation sequence for transforming the positive sample into the negative sample;
the generating module 303 is configured to correspondingly input the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model, and generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model; the dimensionality of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained by training according to the transformation sequence of the positive sample, the negative sample and the negative sample;
and the training module 304 is configured to perform knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model.
In a possible implementation manner, under the condition that the number of layers of the first text processing model is the same as that of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first prediction correction vector for correcting the negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second prediction correction vector for correcting the negative samples.
In one possible implementation, the training module 304 is configured to:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer eigenvector and the second hidden layer eigenvector;
calculating a third mean square error loss between the first attention vector and the second attention vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
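As a rough illustration of the same-depth distillation objective described above, the following Python (PyTorch) sketch combines the four loss terms. The variable names, the use of KL divergence for the temperature-scaled soft cross entropy, the equal weighting of the terms, and the assumption that teacher and student attention maps have matching shapes are all assumptions not fixed by the present application.

    import torch
    import torch.nn.functional as F

    def distill_loss(t, s, W_proj, temperature=2.0):
        # t, s: dicts of teacher / student target-layer outputs; W_proj maps the
        # student dimension up to the teacher dimension (the projection matrix).
        l_input  = F.mse_loss(s["input_emb"] @ W_proj, t["input_emb"])   # first MSE loss (input layers)
        l_hidden = F.mse_loss(s["hidden"] @ W_proj, t["hidden"])         # second MSE loss (hidden layers)
        l_attn   = F.mse_loss(s["attention"], t["attention"])            # third MSE loss (attention vectors)
        # temperature-scaled soft targets on the predicted correction vectors
        l_ce = F.kl_div(F.log_softmax(s["logits"] / temperature, dim=-1),
                        F.softmax(t["logits"] / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2
        return l_input + l_hidden + l_attn + l_ce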
In a possible implementation manner, when the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first prediction correction vector that corrects the negative sample, and the second feature vector includes a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second prediction correction vector that corrects the negative sample.
In one possible implementation, the training module 304 is configured to:
comparing the first attention vector of each layer in the M layers with the second attention vector of each layer in the N layers pairwise to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer characteristic vector of each of the M layers with the second hidden layer characteristic vector of each of the N layers pairwise to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first earth mover's distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between a first attention vector of the M layers in the first text processing model and a second attention vector of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
calculating a fifth mean square error loss between the first hidden layer feature vector of the M layers in the first text processing model and the second hidden layer feature vector of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the fourth mean square error loss and the fifth mean square error loss until the fourth mean square error loss and the fifth mean square error loss converge.
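For illustration, the following Python (PyTorch) sketch shows one possible way to compute the EMD-weighted attention loss when the teacher has M layers and the student has N. The uniform initial layer weights and, above all, the simple outer-product transport plan are assumptions for this sketch; an exact implementation would solve the transportation problem over the layer weights and the loss matrix to obtain the flow.

    import torch
    import torch.nn.functional as F

    def emd_attention_loss(t_attn, s_attn, w_t, w_s):
        # t_attn: list of M teacher attention maps; s_attn: list of N student
        # attention maps (each map assumed to have the same shape).
        M, N = len(t_attn), len(s_attn)
        # pairwise attention loss matrix between every teacher and student layer
        cost = torch.stack([torch.stack([F.mse_loss(a, b) for b in s_attn]) for a in t_attn])
        # stand-in transport plan: outer product of the layer weights; the true EMD
        # flow would be obtained by solving the transport problem on (w_t, w_s, cost)
        flow = w_t.unsqueeze(1) * w_s.unsqueeze(0)      # shape (M, N)
        return (flow * cost).sum()                      # EMD-weighted attention loss

    # layer weights, initialised uniformly and updated as training proceeds
    w_t = torch.full((12,), 1.0 / 12)   # e.g. a 12-layer first text processing model
    w_s = torch.full((4,), 1.0 / 4)     # e.g. a 4-layer second text processing model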
In one possible implementation, the apparatus further includes:
the determining module is used for determining a transformation sequence corresponding to the negative sample according to the transformation operation to obtain first label information of the negative sample; wherein the first tag information represents a transform sequence for transforming positive samples into negative samples;
the generating module 303 is further configured to input the positive sample and the negative sample into a second text processing model to be trained, and generate a prediction proofreading sequence of the negative sample;
the training module 304 is further configured to train the text processing model according to the prediction proofreading sequence of the negative examples and the first label information.
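As an illustration of this supervised training step, the following Python (PyTorch) sketch trains the second text processing model to predict one proofreading tag per character of the negative sample against the first label information. The model interface, the tag vocabulary, and the choice of a standard cross-entropy loss are assumptions.

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()

    def supervised_step(model, neg_char_ids, label_tag_ids):
        # model returns per-character tag logits of shape (batch, seq_len, num_tags)
        logits = model(neg_char_ids)
        return criterion(logits.view(-1, logits.size(-1)), label_tag_ids.view(-1))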
In a possible implementation manner, the training samples further include a positive sample pair and second label information of the positive sample pair, where the two positive samples in the pair are identical and the second label information represents the transformation sequence for transforming the positive sample into itself; the generating module 303 is further configured to input the positive sample pair into the trained text processing model and generate a prediction proofreading sequence of the positive sample;
the training module 304 is further configured to train the text processing model according to the prediction proofreading sequence of the positive sample and the second label information.
In one possible implementation, the generating module 303 is configured to:
when the number of characters in the positive sample is larger than a preset number, inputting the first preset number of characters of the positive sample, together with the corresponding characters of the negative sample, into the second text processing model to be trained in order from front to back, to obtain a prediction proofreading sequence for those characters of the negative sample;
and taking the remaining characters of the positive sample and the remaining characters of the negative sample as training samples for the next round of model training.
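For illustration, the following Python sketch shows one possible form of this chunking rule: only the first preset number of characters is fed in the current training round, and the remainder is deferred to the next round. The window size and the assumption that the positive and negative samples are character-aligned are assumptions for this sketch.

    def split_for_training(positive, negative, tags, max_chars=128):
        # Keep the first max_chars characters for this round; defer the rest.
        if len(positive) <= max_chars:
            return (positive, negative, tags), None
        head = (positive[:max_chars], negative[:max_chars], tags[:max_chars])
        tail = (positive[max_chars:], negative[max_chars:], tags[max_chars:])
        return head, tail   # tail becomes a training sample in the next round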
The text processing model training device provided in the embodiment of the present application can execute the method steps in the embodiment shown in fig. 1, and achieve the same technical effect, and for avoiding repetition, details are not repeated here.
The text processing model training device provided in the embodiment of the present application obtains positive samples by crawling grammatically correct dialog texts, such as scene-dialog texts and conference summary texts, from the Internet; because such texts are plentiful, a large number of positive samples can be collected. The sentences in the dialog texts are then subjected to transformation operations such as character deletion, homophone replacement, and natural-paragraph merging, so that every transformed sentence is grammatically incorrect; this yields the negative samples, and since the transformations are known, the label information of each negative sample is obtained at the same time, giving a large number of labeled negative samples. The positive samples and negative samples are then input into the trained first text processing model and the second text processing model to be trained, respectively, to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, where the dimensionality of the second text processing model is smaller than that of the first text processing model. Knowledge distillation is performed on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model learns the features of the first text processing model and can then proofread text. A lightweight second text processing model is thus obtained for text proofreading, which reduces resource occupation. Because no iteration is needed when the second text processing model proofreads text, proofreading efficiency is improved while the occupation of computing resources is reduced.
Based on the voice text processing method in the above embodiment, the embodiment of the present application further provides a voice text processing apparatus. Fig. 4 is a schematic structural diagram of a speech text processing apparatus 400 according to an embodiment of the present application, and as shown in fig. 4, the apparatus 400 may include a recognition module 401, a determination module 402, and a collation module 403.
The recognition module 401 is configured to recognize a speech text corresponding to the target speech.
A determining module 402, configured to input the phonetic text into the second text processing model trained by the text processing model training method described above, and determine a collation sequence of the phonetic text, where the collation sequence represents a collation rule for each character in the phonetic text.
And the proofreading module 403 is configured to proofread the voice text according to the collation sequence to obtain a collated text corresponding to the target voice.
The speech text processing apparatus provided in the embodiment of the present application can perform the method steps in the embodiment shown in fig. 2, and achieve the same technical effect, and for avoiding repetition, details are not described here again.
The speech text processing device provided in the embodiment of the present application obtains positive samples by crawling grammatically correct dialog texts, such as scene-dialog texts and conference summary texts, from the Internet; because such texts are plentiful, a large number of positive samples can be collected. The sentences in the dialog texts are then subjected to transformation operations such as character deletion, homophone replacement, and natural-paragraph merging, so that every transformed sentence is grammatically incorrect; this yields the negative samples, and since the transformations are known, the label information of each negative sample is obtained at the same time, giving a large number of labeled negative samples. The positive samples and negative samples are then input into the trained first text processing model and the second text processing model to be trained, respectively, to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, where the dimensionality of the second text processing model is smaller than that of the first text processing model. Knowledge distillation is performed on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model learns the features of the first text processing model and can then proofread text. A lightweight second text processing model is thus obtained for text proofreading, which reduces resource occupation. Because no iteration is needed when the second text processing model proofreads text, proofreading efficiency is improved while the occupation of computing resources is reduced.
An electronic device provided in an embodiment of the present application is described below.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device provided in the embodiment of the present application may be used to implement the text processing model training method or the speech text processing method described in the foregoing method embodiment.
The electronic device may comprise a processor 501 and a memory 502 in which computer program instructions are stored.
Specifically, the processor 501 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 502 may include removable or non-removable (or fixed) media, where appropriate. The memory 502 may be internal or external to the electronic device, where appropriate. In a particular embodiment, the memory 502 is a non-volatile solid-state memory.
The memory may include read-only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical/tangible memory storage devices. Thus, in general, the memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software comprising computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the methods of the present application.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement any one of the text processing model training methods or the speech text processing method in the above embodiments.
In one example, the electronic device can also include a communication interface 505 and a bus 510. As shown in fig. 5, the processor 501, the memory 502, and the communication interface 505 are connected via a bus 510 to complete communication therebetween.
The communication interface 505 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present application.
Bus 510 includes hardware, software, or both, coupling the components of the electronic device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low pin count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), another suitable bus, or a combination of two or more of these. Bus 510 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
In addition, in combination with the above embodiments, the embodiments of the present application may be implemented by providing a computer storage medium. The computer storage medium having computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the text processing model training methods or speech-to-text processing methods of the above embodiments.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware for performing the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing describes only specific embodiments of the present application. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described here again. It should be understood that the scope of the present application is not limited thereto; any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall fall within the scope of the present application.

Claims (10)

1. A method for training a text processing model, the method comprising:
crawling a dialog text from the Internet to obtain a positive sample; the sentences in the dialog text are sentences with correct grammars, and the positive samples are the sentences in the dialog text;
performing transformation operation on the sentences in the dialog text to obtain a negative sample and first label information of the negative sample, wherein the sentences in the negative sample are sentences with wrong syntax, and the first label information represents a transformation sequence for transforming the positive sample into the negative sample;
correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model to generate a first feature vector of a target layer of the first text processing model and a second feature vector of the target layer of the second text processing model; the dimensionality of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained by training according to the transformation sequence of the positive sample, the negative sample and the negative sample;
and carrying out knowledge distillation on the second text processing model according to the first characteristic vector and the second characteristic vector to obtain a trained second text processing model.
2. The method of claim 1, wherein in a case that the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector comprises a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first prediction correction vector that corrects the negative examples, and the second feature vector comprises a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second prediction correction vector that corrects the negative examples.
3. The method of claim 2, wherein knowledge distillation of the second text processing model based on the first feature vector and the second feature vector to obtain a trained second text processing model comprises:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector;
calculating a third mean square error loss between the first attention vector and the second attention vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss, and the cross entropy loss.
4. The method according to claim 1, wherein in a case where the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector comprises a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first prediction correction vector that corrects the negative sample, and the second feature vector comprises a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second prediction correction vector that corrects the negative sample.
5. The method of claim 4, wherein knowledge distillation of the second text processing model based on the first feature vector and the second feature vector to obtain a trained second text processing model comprises:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to a preset temperature parameter;
comparing the first attention vector of each of the M layers with the second attention vector of each of the N layers pairwise to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer feature vector of each of the M layers with the second hidden layer feature vector of each of the N layers pairwise to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first earth mover's distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between a first attention vector of M layers in the first text processing model and a second attention vector of N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
calculating a fifth mean square error loss between a first hidden layer feature vector of M layers in the first text processing model and a second hidden layer feature vector of N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix;
updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss until the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss converge.
6. The method according to any one of claims 1-5, further comprising:
inputting the positive sample and the negative sample into a second text processing model to be trained, and generating a prediction proofreading sequence of the negative sample;
and training the text processing model according to the prediction proofreading sequence of the negative sample and the first label information.
7. The method of claim 6, wherein the training samples further include a positive sample pair and second label information for the positive sample pair, the second label information indicating a conversion sequence to convert the positive sample into the positive sample, two positive samples in the positive sample pair being the same, the method further comprising:
inputting the positive sample pair into a trained text processing model to generate a prediction proofreading sequence of the positive sample;
and training the text processing model according to the prediction proofreading sequence of the positive sample and the second label information.
8. The method of any of claims 1-5, wherein inputting the positive and negative examples into a second text processing model to be trained, generating a predictive collation sequence for the negative example comprises:
when the number of characters in the positive sample is larger than a preset number, inputting the first preset number of characters of the positive sample, together with the corresponding characters of the negative sample, into the second text processing model to be trained in order from front to back, to obtain a prediction proofreading sequence for those characters of the negative sample;
and taking the characters remaining in the positive sample and the characters remaining in the negative sample as training samples of the next model training process.
9. A method for speech text processing, the method comprising:
recognizing a voice text corresponding to the target voice;
inputting the phonetic text into a second text processing model trained by the method according to any one of claims 1-8, and determining a collation sequence of the phonetic text, the collation sequence representing a collation rule for each character in the phonetic text;
and performing proofreading on the voice text according to the proofreading sequence to obtain a proofreading text corresponding to the target voice.
10. A text processing model training apparatus, the apparatus comprising:
the acquisition module is used for crawling the dialog text from the Internet to obtain a positive sample; the sentences in the dialog text are sentences with correct grammars, and the positive samples are the sentences in the dialog text;
a transformation module, configured to perform transformation operations on the sentences in the dialog text to obtain a negative sample and first label information of the negative sample, wherein the sentences in the negative sample are grammatically incorrect sentences, and the first label information represents a transformation sequence for transforming the positive sample into the negative sample;
the generating module is used for correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a to-be-trained second text processing model, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the dimensionality of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained by training according to the transformation sequence of the positive sample, the negative sample and the negative sample;
and the training module is used for carrying out knowledge distillation on the second text processing model according to the first characteristic vector and the second characteristic vector to obtain a trained second text processing model.
CN202110704938.1A 2021-06-24 2021-06-24 Text processing model training method, voice text processing method and device Active CN113420121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110704938.1A CN113420121B (en) 2021-06-24 2021-06-24 Text processing model training method, voice text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110704938.1A CN113420121B (en) 2021-06-24 2021-06-24 Text processing model training method, voice text processing method and device

Publications (2)

Publication Number Publication Date
CN113420121A true CN113420121A (en) 2021-09-21
CN113420121B CN113420121B (en) 2023-07-28

Family

ID=77717625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110704938.1A Active CN113420121B (en) 2021-06-24 2021-06-24 Text processing model training method, voice text processing method and device

Country Status (1)

Country Link
CN (1) CN113420121B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126555A (en) * 2018-10-31 2020-05-08 浙江宇视科技有限公司 Neural network model training method, device, equipment and storage medium
CN111325223A (en) * 2018-12-13 2020-06-23 中国电信股份有限公司 Deep learning model training method and device and computer readable storage medium
CN110209817A (en) * 2019-05-31 2019-09-06 安徽省泰岳祥升软件有限公司 Training method, device and the text handling method of text-processing model
CN112487182A (en) * 2019-09-12 2021-03-12 华为技术有限公司 Training method of text processing model, and text processing method and device
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN111178039A (en) * 2019-12-18 2020-05-19 北京明略软件系统有限公司 Model training method and device, and method and device for realizing text processing
CN111611377A (en) * 2020-04-22 2020-09-01 淮阴工学院 Knowledge distillation-based multi-layer neural network language model training method and device
CN111625635A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Question-answer processing method, language model training method, device, equipment and storage medium
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium
CN112084789A (en) * 2020-09-14 2020-12-15 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN112883180A (en) * 2021-02-24 2021-06-01 挂号网(杭州)科技有限公司 Model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯于树 (FENG Yushu): "Research on Knowledge Distillation in Convolutional Neural Network Models", China Master's Theses Full-text Database, Social Sciences II, no. 2 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023245869A1 (en) * 2022-06-23 2023-12-28 北京百度网讯科技有限公司 Speech recognition model training method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN113420121B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN1667699B (en) Generating large units of graphonemes with mutual information criterion for letter to sound conversion
CN111985213A (en) Method and device for correcting voice customer service text
CN110781663B (en) Training method and device of text analysis model, text analysis method and device
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN110569505B (en) Text input method and device
CN111401064B (en) Named entity identification method and device and terminal equipment
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN113192497B (en) Speech recognition method, device, equipment and medium based on natural language processing
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN113299282B (en) Voice recognition method, device, equipment and storage medium
CN112507695A (en) Text error correction model establishing method, device, medium and electronic equipment
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN111192570A (en) Language model training method, system, mobile terminal and storage medium
CN114861637B (en) Spelling error correction model generation method and device, and spelling error correction method and device
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN114218945A (en) Entity identification method, device, server and storage medium
CN116070632A (en) Informal text entity tag identification method and device
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN113420121A (en) Text processing model training method, voice text processing method and device
CN113673228A (en) Text error correction method, text error correction device, computer storage medium and computer program product
Tanaka et al. Neural speech-to-text language models for rescoring hypotheses of dnn-hmm hybrid automatic speech recognition systems
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN111462734A (en) Semantic slot filling model training method and system
CN113345409B (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant