CN111639477B - Text reconstruction training method and system

Text reconstruction training method and system

Info

Publication number
CN111639477B
Authority
CN
China
Prior art keywords
text
sequence
character
neural network
loss
Prior art date
Legal status
Active
Application number
CN202010484716.9A
Other languages
Chinese (zh)
Other versions
CN111639477A (en)
Inventor
王丙栋
游世学
Current Assignee
Beijing Zhongke Huilian Technology Co., Ltd.
Original Assignee
Beijing Zhongke Huilian Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Huilian Technology Co., Ltd.
Priority to CN202010484716.9A
Publication of CN111639477A
Application granted
Publication of CN111639477B
Legal status: Active

Classifications

    • G06F40/00 Handling natural language data; G06F40/10 Text processing; G06F40/12 Use of codes for handling textual entities; G06F40/151 Transformation
    • G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology; G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology; G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/08 Learning methods

Abstract

The invention discloses a text reconstruction training method and system. The method comprises the following steps: constructing a training sample; encoding the concatenation of the original text and its pronunciation sequence with a neural network text sequence model; obtaining a word segmentation classification result with a sigmoid feedforward neural network classifier; generating the characters of the output sequence one by one from beginning to end, with a preset GEN_LENGTH as the maximum generation length; and optimizing the neural network parameters by calculating the word segmentation loss and the text generation loss, forming a joint loss as their weighted sum, and optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss. The text reconstruction training method and system provided by the invention use text encoding and a generative neural network to convert a given original text into a target text, correcting wrongly written characters, supplementing missing characters, removing redundant characters, and normalizing word usage, so as to eliminate text errors and improve text quality.

Description

Text reconstruction training method and system
Technical Field
The invention relates to the technical field of text generation and deep learning, in particular to a text reconstruction training method and a text reconstruction training system.
Background
In the information age, text is an important information carrier in internet multimedia; the amount of data is huge, and its sources and authors are numerous. Because of visually similar characters, similar pronunciations, stroke input errors, slips of the tongue, inaccurate speech recognition, uneven writing skill, and other problems, wrong characters and non-standard wording occasionally appear in text, causing information to be transmitted incorrectly and misunderstood. Existing text error correction methods based on tree models or deep neural networks cannot effectively handle missing characters, redundant characters, mispronunciation errors in speech recognition results, or non-standard word usage.
Disclosure of Invention
The invention aims to provide a text reconstruction training method and a text reconstruction training system.
In order to achieve the purpose, the invention provides the following scheme:
A text reconstruction training method comprises the following steps:
S1, constructing a training sample, wherein the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
S2, based on the training sample, encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
S3, performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, wherein the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
S4, generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, specifically comprising the following steps:
S401, converting the numeric ID corresponding to the previous character into a first input vector;
S402, using an attention mechanism, calculating the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
S403, obtaining an LSTM output vector based on the third input vector;
S404, classifying the LSTM output vector to generate the next character;
judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
S5, optimizing the neural network parameters, specifically comprising:
S501, calculating the word segmentation loss from the true word segmentation labeling sequence and the predicted word segmentation result using cross entropy;
S502, calculating the text generation loss from the true target text and the generated text using cross entropy;
S503, carrying out weighted summation of the word segmentation loss and the text generation loss to obtain the joint loss;
S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss.
Optionally, step S1 of constructing a training sample comprising an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text specifically includes:
S101, collecting articles, paragraphs and sentences into a text set, splitting the text set into a sentence set, and taking each sentence in the sentence set as an original text;
S102, annotating the original text with pronunciations to obtain the pronunciation sequence;
S103, segmenting the original text into words, marking the ending character of each word as 1 and every non-ending character as 0, to obtain the word segmentation labeling sequence;
S102, if a standard text exists, taking the standard text as the target text; otherwise, copying the original text as the target text.
Optionally, the step S1 further includes:
S104, judging whether the original text has already been rewritten, and entering the next step if it has not;
S105, generating a random number between 0 and 1, and entering the next step if the random number is larger than the rewriting threshold;
S106, rewriting the original text to obtain a new text, taking the new text as the original text, and then jumping to step S103, wherein the rewriting method is, at a random position of the original text, to delete the character there, insert a random character, or replace the character there with a new random character.
Optionally, in step S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss specifically includes:
optimizing the parameters of the neural network with a gradient descent algorithm using the joint text generation and word segmentation loss function, where the joint loss function is:
Loss(θ) = Σ_{(s,p,c,t)∈D} [ ω_g · Loss_g(θ) + ω_c · Loss_c(θ) ]
where Loss(θ) is the joint loss, θ denotes the parameters of the neural network, D is the training data set, s is an original text, p is a pronunciation sequence, c is a word segmentation labeling sequence, t is a target text, Loss_g(θ) is the text generation loss, Loss_c(θ) is the word segmentation loss, ω_g is the weight of the text generation loss, ω_c is the weight of the word segmentation loss, the sum of ω_g and ω_c equals 1, and both Loss_g(θ) and Loss_c(θ) are calculated using cross entropy.
The invention also provides a text reconstruction system which applies the above text reconstruction training method and comprises:
the training sample construction module is used for constructing a training sample, and the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
the text encoder is used for encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model based on the training sample to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
the word segmentation classifier is used for performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, and the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
the text generator is used for generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end;
and the neural network parameter optimization unit is used for optimizing the neural network parameters and learning the neural network model parameters of the text encoder, the word segmentation classifier and the text generator by using the text generation and word segmentation joint loss function.
Optionally, the text generator specifically includes:
the embedding layer is used for converting the numeric ID corresponding to the previous character into a first input vector;
the attention layer is used for calculating, with an attention mechanism, the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
the LSTM unit is used for obtaining an LSTM output vector based on the third input vector;
the softmax multi-classifier is used for classifying the LSTM output vector to generate a next character;
and the loop judgment module is used for judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects: the text reconstruction training method and system use a text encoder and a generative neural network, combine a word segmentation task with the text generation task, and use both the textual features and the pronunciation features of the text to convert a given original text into a target text, correcting wrongly written characters, supplementing missing characters, removing redundant characters, and normalizing word usage, so as to eliminate text errors and improve text quality; both supervised learning and self-supervised learning are supported.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is an illustration of a single training sample according to the present invention;
FIG. 2 is a flow chart of the automatic generation of training samples in accordance with the present invention;
FIG. 3 is a diagram of a neural network architecture according to the present invention;
FIG. 4 is a flow chart of neural network model training in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a text reconstruction training method and a text reconstruction training system.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1-4, the text reconstruction training method provided by the present invention includes the following steps:
S1, constructing a training sample, wherein the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
S2, based on the training sample, encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
S3, performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, wherein the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
S4, generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, specifically comprising the following steps:
S401, converting the numeric ID corresponding to the previous character into a first input vector;
S402, using an attention mechanism, calculating the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
S403, obtaining an LSTM output vector based on the third input vector;
S404, classifying the LSTM output vector to generate the next character;
judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
S5, optimizing the neural network parameters, specifically comprising:
S501, calculating the word segmentation loss from the true word segmentation labeling sequence and the predicted word segmentation result using cross entropy;
S502, calculating the text generation loss from the true target text and the generated text using cross entropy;
S503, carrying out weighted summation of the word segmentation loss and the text generation loss to obtain the joint loss;
S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss.
As shown in fig. 2, step S1 of constructing a training sample comprising an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text specifically includes:
S101, collecting articles, paragraphs and sentences into a text set, splitting the text set into a sentence set, and taking each sentence in the sentence set as an original text;
S102, annotating the original text with pronunciations to obtain the pronunciation sequence;
S103, segmenting the original text into words, marking the ending character of each word as 1 and every non-ending character as 0, to obtain the word segmentation labeling sequence;
S102, if a standard text exists, taking the standard text as the target text; otherwise, copying the original text as the target text.
The step S1 further includes:
S104, judging whether the original text has already been rewritten, and entering the next step if it has not;
S105, generating a random number between 0 and 1, and entering the next step if the random number is larger than the rewriting threshold;
S106, rewriting the original text to obtain a new text, taking the new text as the original text, and then jumping to step S103, wherein the rewriting method is, at a random position of the original text, to delete the character there, insert a random character, or replace the character there with a new random character (a minimal sketch of this corruption step is given after this list).
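For concreteness, the following is a minimal Python sketch of the optional rewriting (corruption) step referenced above (steps S104-S106), assuming character-level operations; the names maybe_rewrite and char_pool and the default rewrite_threshold value are illustrative assumptions and are not prescribed by the patent.

```python
import random

def maybe_rewrite(original_text, char_pool, rewrite_threshold=0.5, already_rewritten=False):
    """Sketch of S104-S106: randomly delete, insert, or replace one character.

    char_pool is an assumed list of candidate characters for random insertions
    and replacements; rewrite_threshold plays the role of the rewriting
    threshold in step S105.
    """
    if already_rewritten:                      # S104: skip texts already rewritten
        return original_text
    if random.random() <= rewrite_threshold:   # S105: only rewrite above the threshold
        return original_text
    chars = list(original_text)                # S106: corrupt one random position
    pos = random.randrange(len(chars))
    op = random.choice(["delete", "insert", "replace"])
    if op == "delete":
        del chars[pos]
    elif op == "insert":
        chars.insert(pos, random.choice(char_pool))
    else:
        chars[pos] = random.choice(char_pool)
    return "".join(chars)
```

Since the method then jumps back to step S103 rather than to the target-text step, the corrupted sentence serves as the new original text while the uncorrupted sentence remains the target text.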
As shown in fig. 1, a single training sample consists of 4 parts: original text, pronunciation sequence, word segmentation labeling sequence and target text. The training samples are either obtained by manually labeling the original text or are automatically generated by a computer program.
In the example of FIG. 1, the original text begins with a redundant filler character pronounced "en", and the character pronounced "guo" is an incorrect homophone of the intended character. The pronunciation sequence corresponding to the original text is "en guo hang ming tian de fei ji"; the corresponding word segmentation labeling sequence is "10101101", where "1" marks the ending character of a word and "0" marks a non-ending character; the corresponding target text is the standard, error-free "national aviation tomorrow's plane" (guo hang ming tian de fei ji).
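The word segmentation labeling sequence of this example can be reproduced with the small sketch below, a minimal illustration assuming the sentence segments into words of lengths 1, 2, 2, 1 and 2 characters (the lengths implied by "10101101"); the placeholder strings stand in for the Chinese characters of FIG. 1.

```python
def segmentation_labels(words):
    """Mark the ending character of each word with '1' and all other characters with '0'."""
    return "".join("0" * (len(word) - 1) + "1" for word in words)

# Placeholder words with lengths 1, 2, 2, 1, 2, mirroring the FIG. 1 example
print(segmentation_labels(["a", "bc", "de", "f", "gh"]))  # prints "10101101"
```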
In step S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss specifically includes:
optimizing the parameters of the neural network with a gradient descent algorithm using the joint text generation and word segmentation loss function, where the joint loss function is:
Loss(θ) = Σ_{(s,p,c,t)∈D} [ ω_g · Loss_g(θ) + ω_c · Loss_c(θ) ]
where Loss(θ) is the joint loss, θ denotes the parameters of the neural network, D is the training data set, s is an original text, p is a pronunciation sequence, c is a word segmentation labeling sequence, t is a target text, Loss_g(θ) is the text generation loss, Loss_c(θ) is the word segmentation loss, ω_g is the weight of the text generation loss, ω_c is the weight of the word segmentation loss, the sum of ω_g and ω_c equals 1, and both Loss_g(θ) and Loss_c(θ) are calculated using cross entropy.
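A minimal sketch of how this joint loss could be computed is given below, assuming PyTorch, per-character cross entropy over the decoding vocabulary for the generation loss, and binary cross entropy on the classifier's sigmoid outputs for the word segmentation loss; the default weights w_g = w_c = 0.5 merely satisfy the constraint that the two weights sum to 1 and are not values given by the patent.

```python
import torch
import torch.nn.functional as F

def joint_loss(gen_logits, target_ids, seg_probs, seg_labels, w_g=0.5, w_c=0.5):
    """Joint loss = w_g * Loss_g + w_c * Loss_c.

    gen_logits: (gen_len, vocab_size) scores over the decoding vocabulary
    target_ids: (gen_len,) numeric IDs of the target text characters
    seg_probs:  (src_len,) sigmoid outputs of the word segmentation classifier
    seg_labels: (src_len,) gold 0/1 word segmentation labels
    """
    loss_g = F.cross_entropy(gen_logits, target_ids)                # text generation loss
    loss_c = F.binary_cross_entropy(seg_probs, seg_labels.float())  # word segmentation loss
    return w_g * loss_g + w_c * loss_c
```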
The invention also provides a text reconstruction system which applies the above text reconstruction training method and comprises:
the training sample construction module is used for constructing a training sample, and the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
the text encoder is used for encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model based on the training sample to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
the word segmentation classifier is used for performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, and the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word; in this embodiment the sigmoid feedforward neural network classifier has 2 layers, and layer 1 uses no activation function;
the text generator is used for generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, and specifically comprises:
the embedding layer is used for converting the numeric ID corresponding to the previous character into a first input vector;
the attention layer is used for calculating, with an attention mechanism, the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
the LSTM unit is used for obtaining an LSTM output vector based on the third input vector;
the softmax multi-classifier is used for classifying the LSTM output vector to generate a next character;
the loop judgment module is used for judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
and the neural network parameter optimization unit is used for optimizing the neural network parameters and learning the neural network model parameters of the text encoder, the word segmentation classifier and the text generator by using the text generation and word segmentation joint loss function.
The text encoder: a neural network text sequence model is used as the text encoder; this application uses, but is not limited to, the ALBERT model. The input of the text encoder is obtained by appending the pronunciation sequence to the original text to form the input text sequence, and then looking up the encoding vocabulary to obtain the input ID sequence. The encoding vocabulary is a mapping from characters and pronunciations to numeric IDs, with different characters or pronunciations mapped to different numeric IDs. The text encoder encodes the input ID sequence to obtain the first coding sequence; each position of the first coding sequence stores the encoding vector corresponding to the ID at the same position of the input ID sequence. According to the start and end positions of the original text in the input text sequence, the first coding sequence is truncated to obtain the second coding sequence. Optionally, during training, after the input ID sequence has been converted into an input vector sequence by the embedding layer of the text encoder, word-level dropout is applied to the input vector sequence with a certain probability; word-level dropout means that if a position of the sequence is dropped, the entire vector at that position is dropped.
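A minimal sketch of this encoder-side processing is given below, assuming a hypothetical encoder wrapper that exposes separate embed (IDs to input vectors) and encode (vectors to encodings) steps so that word-level dropout can be applied in between; the dropout probability and the wrapper interface are illustrative assumptions, and the underlying model could be ALBERT or any other text sequence model.

```python
import torch

def encode_sample(original_text, pronunciations, coding_vocab, encoder,
                  dropout_p=0.1, training=True):
    """Return the second coding sequence for one training sample.

    coding_vocab: dict mapping characters and pronunciations to distinct numeric IDs
    encoder: assumed wrapper with embed() and encode() methods (e.g. around ALBERT)
    """
    # Input text sequence: original characters followed by the pronunciation sequence
    tokens = list(original_text) + list(pronunciations)
    input_ids = torch.tensor([[coding_vocab[t] for t in tokens]])   # (1, seq_len)

    vectors = encoder.embed(input_ids)                              # (1, seq_len, hidden)
    if training and dropout_p > 0:
        # Word-level dropout: drop whole positions, zeroing the entire vector there
        keep = (torch.rand(vectors.shape[:2]) > dropout_p).float().unsqueeze(-1)
        vectors = vectors * keep

    first_coding_seq = encoder.encode(vectors)                      # (1, seq_len, hidden)
    # Truncate at the start and end positions of the original text within the input
    second_coding_seq = first_coding_seq[:, :len(original_text), :]
    return second_coding_seq
```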
The word segmentation classifier: a single-layer or multi-layer feedforward neural network is used as the word segmentation classifier to perform binary classification on each vector in the second coding sequence; the classification result indicates whether the character at the corresponding position in the original text is the end character of a word. The last layer of the word segmentation classifier has a single neuron and uses sigmoid as its activation function. In the embodiment of this application, the word segmentation classifier has 2 layers, and layer 1 uses no activation function.
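A minimal sketch of this two-layer classifier (layer 1 without an activation function, a single sigmoid neuron in the last layer) follows; the hidden width of layer 1 is an assumption.

```python
import torch
import torch.nn as nn

class WordSegmentationClassifier(nn.Module):
    """Binary classifier applied to every vector of the second coding sequence."""

    def __init__(self, hidden_size):
        super().__init__()
        self.layer1 = nn.Linear(hidden_size, hidden_size)  # layer 1: no activation
        self.layer2 = nn.Linear(hidden_size, 1)            # last layer: one neuron

    def forward(self, second_coding_seq):                  # (batch, src_len, hidden)
        scores = self.layer2(self.layer1(second_coding_seq))
        # Sigmoid gives the probability that each character ends a word
        return torch.sigmoid(scores).squeeze(-1)           # (batch, src_len)
```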
The text generator: an LSTM is used as the sequence generator. Taking <S> as the initial input character, the numeric ID corresponding to <S> is obtained from the decoding vocabulary as the initial input ID, and with the preset GEN_LENGTH as the maximum length of the generated sequence, the characters of the sequence are generated one by one from beginning to end. The decoding vocabulary and the encoding vocabulary are separate tables that map characters and pronunciations to numeric IDs. In each step of the generation loop, the embedding layer of the text generator converts the input ID into the first input vector of the LSTM; an attention mechanism computes the attention of the first input vector over the word-ending characters in the second coding sequence to obtain the second input vector; the sum of the first and second input vectors is normalized with layer normalization to obtain the third input vector; the third input vector is fed into the LSTM unit to obtain the LSTM output vector; the LSTM output vector is then fed into the softmax multi-classifier to obtain output ID probabilities; the output ID with the highest probability is taken as the target ID of this loop step, and the target character corresponding to the target ID is obtained by looking up the decoding vocabulary. In general, the target ID of the previous loop step is the input ID of the next loop step. Optionally, during training, starting from the second loop step, the input ID of the n-th loop step is the numeric ID (in the decoding vocabulary) of the (n-1)-th character of the target text.
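The generation loop can be sketched as follows. This is one possible reading of the description: dot-product attention over the word-ending encodings, a single LSTM cell, greedy (argmax) decoding, and optional teacher forcing from the target text are assumptions where the patent leaves the details open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGenerator(nn.Module):
    def __init__(self, decode_vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(decode_vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, decode_vocab_size)  # softmax multi-classifier

    def attend(self, query, word_end_vectors):
        # Attention of the first input vector over the word-ending character encodings
        weights = F.softmax(word_end_vectors @ query, dim=-1)         # (num_word_ends,)
        return weights @ word_end_vectors                             # second input vector

    def generate(self, start_id, word_end_vectors, gen_length, target_ids=None):
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        prev_id, generated = start_id, []                              # start_id: ID of <S>
        for step in range(gen_length):                                 # at most GEN_LENGTH characters
            v1 = self.embed(torch.tensor([prev_id]))[0]                # first input vector
            v2 = self.attend(v1, word_end_vectors)                     # second input vector
            v3 = self.norm(v1 + v2)                                    # third input vector
            h, c = self.lstm(v3.unsqueeze(0), (h, c))                  # LSTM output vector
            next_id = int(self.classifier(h).argmax(dim=-1))           # highest-probability target ID
            generated.append(next_id)
            # Teacher forcing (training): from the second step on, feed the ID of the
            # previous target-text character instead of the previously generated ID
            prev_id = target_ids[step] if target_ids is not None else next_id
        return generated
```

The generated numeric IDs are mapped back to characters through the decoding vocabulary; at inference time the loop simply feeds each generated ID back in as the next input.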
The text reconstruction training method and system provided by the invention use a text encoder and a generative neural network, combine a word segmentation task with the text generation task, and use both the textual features and the pronunciation features of the text to convert a given original text into a target text, correcting wrongly written characters, supplementing missing characters, removing redundant characters, and normalizing word usage, so as to eliminate text errors and improve text quality; both supervised learning and self-supervised learning are supported.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be changes to the specific embodiments and the scope of application in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (6)

1. A text reconstruction training method is characterized by comprising the following steps:
S1, constructing a training sample, wherein the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
S2, based on the training sample, encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
S3, performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, wherein the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
S4, generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, specifically comprising the following steps:
S401, converting the numeric ID corresponding to the previous character into a first input vector;
S402, using an attention mechanism, calculating the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
S403, obtaining an LSTM output vector based on the third input vector;
S404, classifying the LSTM output vector to generate the next character;
judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
S5, optimizing the neural network parameters, specifically comprising:
S501, calculating the word segmentation loss from the true word segmentation labeling sequence and the predicted word segmentation result using cross entropy;
S502, calculating the text generation loss from the true target text and the generated text using cross entropy;
S503, carrying out weighted summation of the word segmentation loss and the text generation loss to obtain the joint loss;
S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss.
2. The text reconstruction training method according to claim 1, wherein step S1 of constructing a training sample comprising an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text specifically includes:
S101, collecting articles, paragraphs and sentences into a text set, splitting the text set into a sentence set, and taking each sentence in the sentence set as an original text;
S102, annotating the original text with pronunciations to obtain the pronunciation sequence;
S103, segmenting the original text into words, marking the ending character of each word as 1 and every non-ending character as 0, to obtain the word segmentation labeling sequence;
S102, if a standard text exists, taking the standard text as the target text; otherwise, copying the original text as the target text.
3. The text reconstruction training method according to claim 2, wherein the step S1 further comprises:
S104, judging whether the original text has already been rewritten, and entering the next step if it has not;
S105, generating a random number between 0 and 1, and entering the next step if the random number is larger than the rewriting threshold;
S106, rewriting the original text to obtain a new text, taking the new text as the original text, and then jumping to step S103, wherein the rewriting method is, at a random position of the original text, to delete the character there, insert a random character, or replace the character there with a new random character.
4. The text reconstruction training method according to claim 1, wherein in step S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss specifically includes:
optimizing the parameters of the neural network with a gradient descent algorithm using the joint text generation and word segmentation loss function, where the joint loss function is:
Loss(θ) = Σ_{(s,p,c,t)∈D} [ ω_g · Loss_g(θ) + ω_c · Loss_c(θ) ]
where Loss(θ) is the joint loss, θ denotes the parameters of the neural network, D is the training data set, s is an original text, p is a pronunciation sequence, c is a word segmentation labeling sequence, t is a target text, Loss_g(θ) is the text generation loss, Loss_c(θ) is the word segmentation loss, ω_g is the weight of the text generation loss, ω_c is the weight of the word segmentation loss, the sum of ω_g and ω_c equals 1, and both Loss_g(θ) and Loss_c(θ) are calculated using cross entropy.
5. A text reconstruction system applied to the text reconstruction training method according to any one of claims 1 to 4, comprising:
the training sample construction module is used for constructing a training sample, and the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
the text encoder is used for encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model based on the training sample to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
the word segmentation classifier is used for performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, and the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
the text generator is used for generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end;
and the neural network parameter optimization unit is used for optimizing the neural network parameters and learning the neural network model parameters of the text encoder, the word segmentation classifier and the text generator by using the text generation and word segmentation joint loss function.
6. The text reconstruction system of claim 5, wherein the text generator specifically comprises:
the embedding layer is used for converting the numeric ID corresponding to the previous character into a first input vector;
the attention layer is used for calculating, with an attention mechanism, the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
the LSTM unit is used for obtaining an LSTM output vector based on the third input vector;
the softmax multi-classifier is used for classifying the LSTM output vector to generate the next character; and the loop judgment module is used for judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step.
Application CN202010484716.9A, priority date 2020-06-01, filing date 2020-06-01: Text reconstruction training method and system, granted as CN111639477B (Active)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484716.9A CN111639477B (en) 2020-06-01 2020-06-01 Text reconstruction training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010484716.9A CN111639477B (en) 2020-06-01 2020-06-01 Text reconstruction training method and system

Publications (2)

Publication Number Publication Date
CN111639477A 2020-09-08
CN111639477B 2023-04-18

Family

ID=72331127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484716.9A Active CN111639477B (en) 2020-06-01 2020-06-01 Text reconstruction training method and system

Country Status (1)

Country Link
CN (1) CN111639477B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149415A (en) * 2020-10-12 2020-12-29 清华大学 Training method and device of text generation model and readable storage medium
CN116361858B (en) * 2023-04-10 2024-01-26 北京无限自在文化传媒股份有限公司 User session resource data protection method and software product applying AI decision

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017160393A1 (en) * 2016-03-18 2017-09-21 Google Inc. Globally normalized neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Also Published As

Publication number Publication date
CN111639477A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111310471B (en) Travel named entity identification method based on BBLC model
Sang et al. Introduction to the CoNLL-2001 shared task: Clause identification
CN110738057B (en) Text style migration method based on grammar constraint and language model
CN115393692A (en) Generation formula pre-training language model-based association text-to-image generation method
CN111639477B (en) Text reconstruction training method and system
CN114118065A (en) Chinese text error correction method and device in electric power field, storage medium and computing equipment
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN111930939A (en) Text detection method and device
CN112507695A (en) Text error correction model establishing method, device, medium and electronic equipment
CN111522961A (en) Attention mechanism and entity description based industrial map construction method
CN113609285A (en) Multi-mode text summarization system based on door control fusion mechanism
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
Tada et al. Robust understanding of robot-directed speech commands using sequence to sequence with noise injection
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN114547370A (en) Video abstract extraction method and system
CN111401012B (en) Text error correction method, electronic device and computer readable storage medium
CN115906815A (en) Error correction method and device for modifying one or more types of wrong sentences
CN114707492A (en) Vietnamese grammar error correction method and device fusing multi-granularity characteristics
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
US11727062B1 (en) Systems and methods for generating vector space embeddings from a multi-format document
CN113010635B (en) Text error correction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant