CN111639477B - Text reconstruction training method and system

Text reconstruction training method and system

Info

Publication number
CN111639477B
Authority
CN
China
Prior art keywords
text
sequence
character
neural network
loss
Prior art date
Legal status
Active
Application number
CN202010484716.9A
Other languages
Chinese (zh)
Other versions
CN111639477A (en)
Inventor
王丙栋
游世学
Current Assignee
Beijing Zhongke Huilian Technology Co., Ltd.
Original Assignee
Beijing Zhongke Huilian Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Huilian Technology Co., Ltd.
Priority to CN202010484716.9A
Publication of CN111639477A
Application granted
Publication of CN111639477B
Legal status: Active

Classifications

    • G06F40/00 Handling natural language data; G06F40/10 Text processing; G06F40/12 Use of codes for handling textual entities; G06F40/151 Transformation
    • G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology; G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology; G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/08 Learning methods

Abstract

The invention discloses a text reconstruction training method and system. The method comprises the following steps: constructing a training sample; encoding the concatenation of the original text and its pronunciation sequence with a neural network text sequence model; obtaining a word segmentation classification result with a sigmoid feedforward neural network classifier; generating the characters of the output sequence one by one from beginning to end, with a preset GEN_LENGTH as the maximum generation length; and optimizing the neural network parameters by calculating the word segmentation loss and the text generation loss, forming a joint loss as their weighted sum, and optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss. The text reconstruction training method and system provided by the invention use text encoding and a generative neural network to convert a given original text into a target text, correcting wrongly written characters, supplementing missing characters, removing redundant characters, and normalizing word usage, so as to eliminate text errors and improve text quality.

Description

Text reconstruction training method and system
Technical Field
The invention relates to the technical field of text generation and deep learning, in particular to a text reconstruction training method and a text reconstruction training system.
Background
In the information age, text is an important information carrier in internet multimedia; the amount of data is huge, and its sources and authors are numerous. Because of visually similar characters, similar pronunciations, stroke input errors, slips of the tongue, inaccurate speech recognition, uneven writing skill, and other problems, wrong characters and non-standard wording occasionally appear in text, causing information to be transmitted incorrectly and misunderstood. Existing text error correction methods based on tree models or deep neural networks cannot effectively handle missing characters, redundant characters, mispronunciation errors in speech recognition results, or non-standard word usage.
Disclosure of Invention
The invention aims to provide a text reconstruction training method and a text reconstruction training system.
In order to achieve the purpose, the invention provides the following scheme:
A text reconstruction training method comprises the following steps:
S1, constructing a training sample, wherein the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
S2, based on the training sample, encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
S3, performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, wherein the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
S4, generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, specifically comprising the following steps:
S401, converting the numeric ID corresponding to the previous character into a first input vector;
S402, using an attention mechanism, calculating the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
S403, obtaining an LSTM output vector based on the third input vector;
S404, classifying the LSTM output vector to generate the next character;
judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
S5, optimizing the neural network parameters, specifically comprising:
S501, calculating the word segmentation loss from the true word segmentation labeling sequence and the predicted word segmentation result using cross entropy;
S502, calculating the text generation loss from the true target text and the generated text using cross entropy;
S503, carrying out weighted summation of the word segmentation loss and the text generation loss to obtain the joint loss;
S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss.
Optionally, step S1 of constructing a training sample comprising an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text specifically includes:
S101, collecting articles, paragraphs and sentences into a text set, splitting the text set into a sentence set, and taking each sentence in the sentence set as an original text;
S102, annotating the original text with pronunciations to obtain the pronunciation sequence;
S103, segmenting the original text into words, marking the ending character of each word as 1 and every non-ending character as 0, to obtain the word segmentation labeling sequence;
S102, if a standard text exists, taking the standard text as the target text; otherwise, copying the original text as the target text.
Optionally, the step S1 further includes:
S104, judging whether the original text has already been rewritten, and entering the next step if it has not;
S105, generating a random number between 0 and 1, and entering the next step if the random number is larger than the rewriting threshold;
S106, rewriting the original text to obtain a new text, taking the new text as the original text, and then jumping to step S103, wherein the rewriting method is, at a random position of the original text, to delete the character there, insert a random character, or replace the character there with a new random character.
Optionally, in step S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss specifically includes:
optimizing the parameters of the neural network with a gradient descent algorithm using the joint text generation and word segmentation loss function, where the joint loss function is:
Loss(θ) = Σ_{(s,p,c,t)∈D} [ ω_g · Loss_g(θ) + ω_c · Loss_c(θ) ]
where Loss(θ) is the joint loss, θ denotes the parameters of the neural network, D is the training data set, s is an original text, p is a pronunciation sequence, c is a word segmentation labeling sequence, t is a target text, Loss_g(θ) is the text generation loss, Loss_c(θ) is the word segmentation loss, ω_g is the weight of the text generation loss, ω_c is the weight of the word segmentation loss, the sum of ω_g and ω_c equals 1, and both Loss_g(θ) and Loss_c(θ) are calculated using cross entropy.
The invention also provides a text reconstruction system which applies the above text reconstruction training method and comprises:
the training sample construction module is used for constructing a training sample, and the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
the text encoder is used for encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model based on the training sample to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
the word segmentation classifier is used for performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, and the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
the text generator is used for generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end;
and the neural network parameter optimization unit is used for optimizing the neural network parameters and learning the neural network model parameters of the text encoder, the word segmentation classifier and the text generator by using the text generation and word segmentation joint loss function.
Optionally, the text generator specifically includes:
the embedding layer is used for converting the numeric ID corresponding to the previous character into a first input vector;
the attention layer is used for calculating, with an attention mechanism, the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
the LSTM unit is used for obtaining an LSTM output vector based on the third input vector;
the softmax multi-classifier is used for classifying the LSTM output vector to generate a next character;
and the loop judgment module is used for judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects: the text reconstruction training method and system use a text encoder and a generative neural network, combine a word segmentation task with the text generation task, and use both the textual features and the pronunciation features of the text to convert a given original text into a target text, correcting wrongly written characters, supplementing missing characters, removing redundant characters, and normalizing word usage, so as to eliminate text errors and improve text quality; both supervised learning and self-supervised learning are supported.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is an illustration of a single training sample according to the present invention;
FIG. 2 is a flow chart of the automatic generation of training samples in accordance with the present invention;
FIG. 3 is a diagram of a neural network architecture according to the present invention;
FIG. 4 is a flow chart of neural network model training in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a text reconstruction training method and a text reconstruction training system.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1-4, the text reconstruction training method provided by the present invention includes the following steps:
S1, constructing a training sample, wherein the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
S2, based on the training sample, encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
S3, performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, wherein the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
S4, generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, specifically comprising the following steps:
S401, converting the numeric ID corresponding to the previous character into a first input vector;
S402, using an attention mechanism, calculating the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
S403, obtaining an LSTM output vector based on the third input vector;
S404, classifying the LSTM output vector to generate the next character;
judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
S5, optimizing the neural network parameters, specifically comprising:
S501, calculating the word segmentation loss from the true word segmentation labeling sequence and the predicted word segmentation result using cross entropy;
S502, calculating the text generation loss from the true target text and the generated text using cross entropy;
S503, carrying out weighted summation of the word segmentation loss and the text generation loss to obtain the joint loss;
S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss.
As shown in fig. 2, step S1 of constructing a training sample comprising an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text specifically includes:
S101, collecting articles, paragraphs and sentences into a text set, splitting the text set into a sentence set, and taking each sentence in the sentence set as an original text;
S102, annotating the original text with pronunciations to obtain the pronunciation sequence;
S103, segmenting the original text into words, marking the ending character of each word as 1 and every non-ending character as 0, to obtain the word segmentation labeling sequence;
S102, if a standard text exists, taking the standard text as the target text; otherwise, copying the original text as the target text.
The step S1 further includes:
S104, judging whether the original text has already been rewritten, and entering the next step if it has not;
S105, generating a random number between 0 and 1, and entering the next step if the random number is larger than the rewriting threshold;
S106, rewriting the original text to obtain a new text, taking the new text as the original text, and then jumping to step S103, wherein the rewriting method is, at a random position of the original text, to delete the character there, insert a random character, or replace the character there with a new random character (a minimal sketch of this corruption step is given after this list).
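For concreteness, the following is a minimal Python sketch of the optional rewriting (corruption) step referenced above (steps S104-S106), assuming character-level operations; the names maybe_rewrite and char_pool and the default rewrite_threshold value are illustrative assumptions and are not prescribed by the patent.

```python
import random

def maybe_rewrite(original_text, char_pool, rewrite_threshold=0.5, already_rewritten=False):
    """Sketch of S104-S106: randomly delete, insert, or replace one character.

    char_pool is an assumed list of candidate characters for random insertions
    and replacements; rewrite_threshold plays the role of the rewriting
    threshold in step S105.
    """
    if already_rewritten:                      # S104: skip texts already rewritten
        return original_text
    if random.random() <= rewrite_threshold:   # S105: only rewrite above the threshold
        return original_text
    chars = list(original_text)                # S106: corrupt one random position
    pos = random.randrange(len(chars))
    op = random.choice(["delete", "insert", "replace"])
    if op == "delete":
        del chars[pos]
    elif op == "insert":
        chars.insert(pos, random.choice(char_pool))
    else:
        chars[pos] = random.choice(char_pool)
    return "".join(chars)
```

Since the method then jumps back to step S103 rather than to the target-text step, the corrupted sentence serves as the new original text while the uncorrupted sentence remains the target text.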
As shown in fig. 1, a single training sample consists of 4 parts: original text, pronunciation sequence, word segmentation labeling sequence and target text. The training samples are either obtained by manually labeling the original text or are automatically generated by a computer program.
In the example of FIG. 1, the original text begins with a redundant filler character pronounced "en", and the character pronounced "guo" is an incorrect homophone of the intended character. The pronunciation sequence corresponding to the original text is "en guo hang ming tian de fei ji"; the corresponding word segmentation labeling sequence is "10101101", where "1" marks the ending character of a word and "0" marks a non-ending character; the corresponding target text is the standard, error-free "national aviation tomorrow's plane" (guo hang ming tian de fei ji).
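The word segmentation labeling sequence of this example can be reproduced with the small sketch below, a minimal illustration assuming the sentence segments into words of lengths 1, 2, 2, 1 and 2 characters (the lengths implied by "10101101"); the placeholder strings stand in for the Chinese characters of FIG. 1.

```python
def segmentation_labels(words):
    """Mark the ending character of each word with '1' and all other characters with '0'."""
    return "".join("0" * (len(word) - 1) + "1" for word in words)

# Placeholder words with lengths 1, 2, 2, 1, 2, mirroring the FIG. 1 example
print(segmentation_labels(["a", "bc", "de", "f", "gh"]))  # prints "10101101"
```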
In step S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss specifically includes:
optimizing the parameters of the neural network with a gradient descent algorithm using the joint text generation and word segmentation loss function, where the joint loss function is:
Loss(θ) = Σ_{(s,p,c,t)∈D} [ ω_g · Loss_g(θ) + ω_c · Loss_c(θ) ]
where Loss(θ) is the joint loss, θ denotes the parameters of the neural network, D is the training data set, s is an original text, p is a pronunciation sequence, c is a word segmentation labeling sequence, t is a target text, Loss_g(θ) is the text generation loss, Loss_c(θ) is the word segmentation loss, ω_g is the weight of the text generation loss, ω_c is the weight of the word segmentation loss, the sum of ω_g and ω_c equals 1, and both Loss_g(θ) and Loss_c(θ) are calculated using cross entropy.
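A minimal sketch of how this joint loss could be computed is given below, assuming PyTorch, per-character cross entropy over the decoding vocabulary for the generation loss, and binary cross entropy on the classifier's sigmoid outputs for the word segmentation loss; the default weights w_g = w_c = 0.5 merely satisfy the constraint that the two weights sum to 1 and are not values given by the patent.

```python
import torch
import torch.nn.functional as F

def joint_loss(gen_logits, target_ids, seg_probs, seg_labels, w_g=0.5, w_c=0.5):
    """Joint loss = w_g * Loss_g + w_c * Loss_c.

    gen_logits: (gen_len, vocab_size) scores over the decoding vocabulary
    target_ids: (gen_len,) numeric IDs of the target text characters
    seg_probs:  (src_len,) sigmoid outputs of the word segmentation classifier
    seg_labels: (src_len,) gold 0/1 word segmentation labels
    """
    loss_g = F.cross_entropy(gen_logits, target_ids)                # text generation loss
    loss_c = F.binary_cross_entropy(seg_probs, seg_labels.float())  # word segmentation loss
    return w_g * loss_g + w_c * loss_c
```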
The invention also provides a text reconstruction system which applies the above text reconstruction training method and comprises:
the training sample construction module is used for constructing a training sample, and the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
the text encoder is used for encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model based on the training sample to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
the word segmentation classifier is used for performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, and the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word; in this embodiment the sigmoid feedforward neural network classifier has 2 layers, and layer 1 uses no activation function;
the text generator is used for generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, and specifically comprises:
the embedding layer is used for converting the numeric ID corresponding to the previous character into a first input vector;
the attention layer is used for calculating, with an attention mechanism, the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
the LSTM unit is used for obtaining an LSTM output vector based on the third input vector;
the softmax multi-classifier is used for classifying the LSTM output vector to generate a next character;
the loop judgment module is used for judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
and the neural network parameter optimization unit is used for optimizing the neural network parameters and learning the neural network model parameters of the text encoder, the word segmentation classifier and the text generator by using the text generation and word segmentation joint loss function.
The text encoder: a neural network text sequence model is used as the text encoder; this application uses, but is not limited to, the ALBERT model. The input of the text encoder is obtained by appending the pronunciation sequence to the original text to form the input text sequence, and then looking up the encoding vocabulary to obtain the input ID sequence. The encoding vocabulary is a mapping from characters and pronunciations to numeric IDs, with different characters or pronunciations mapped to different numeric IDs. The text encoder encodes the input ID sequence to obtain the first coding sequence; each position of the first coding sequence stores the encoding vector corresponding to the ID at the same position of the input ID sequence. According to the start and end positions of the original text in the input text sequence, the first coding sequence is truncated to obtain the second coding sequence. Optionally, during training, after the input ID sequence has been converted into an input vector sequence by the embedding layer of the text encoder, word-level dropout is applied to the input vector sequence with a certain probability; word-level dropout means that if a position of the sequence is dropped, the entire vector at that position is dropped.
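A minimal sketch of this encoder-side processing is given below, assuming a hypothetical encoder wrapper that exposes separate embed (IDs to input vectors) and encode (vectors to encodings) steps so that word-level dropout can be applied in between; the dropout probability and the wrapper interface are illustrative assumptions, and the underlying model could be ALBERT or any other text sequence model.

```python
import torch

def encode_sample(original_text, pronunciations, coding_vocab, encoder,
                  dropout_p=0.1, training=True):
    """Return the second coding sequence for one training sample.

    coding_vocab: dict mapping characters and pronunciations to distinct numeric IDs
    encoder: assumed wrapper with embed() and encode() methods (e.g. around ALBERT)
    """
    # Input text sequence: original characters followed by the pronunciation sequence
    tokens = list(original_text) + list(pronunciations)
    input_ids = torch.tensor([[coding_vocab[t] for t in tokens]])   # (1, seq_len)

    vectors = encoder.embed(input_ids)                              # (1, seq_len, hidden)
    if training and dropout_p > 0:
        # Word-level dropout: drop whole positions, zeroing the entire vector there
        keep = (torch.rand(vectors.shape[:2]) > dropout_p).float().unsqueeze(-1)
        vectors = vectors * keep

    first_coding_seq = encoder.encode(vectors)                      # (1, seq_len, hidden)
    # Truncate at the start and end positions of the original text within the input
    second_coding_seq = first_coding_seq[:, :len(original_text), :]
    return second_coding_seq
```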
The word segmentation classifier: a single-layer or multi-layer feedforward neural network is used as the word segmentation classifier to perform binary classification on each vector in the second coding sequence; the classification result indicates whether the character at the corresponding position in the original text is the end character of a word. The last layer of the word segmentation classifier has a single neuron and uses sigmoid as its activation function. In the embodiment of this application, the word segmentation classifier has 2 layers, and layer 1 uses no activation function.
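A minimal sketch of this two-layer classifier (layer 1 without an activation function, a single sigmoid neuron in the last layer) follows; the hidden width of layer 1 is an assumption.

```python
import torch
import torch.nn as nn

class WordSegmentationClassifier(nn.Module):
    """Binary classifier applied to every vector of the second coding sequence."""

    def __init__(self, hidden_size):
        super().__init__()
        self.layer1 = nn.Linear(hidden_size, hidden_size)  # layer 1: no activation
        self.layer2 = nn.Linear(hidden_size, 1)            # last layer: one neuron

    def forward(self, second_coding_seq):                  # (batch, src_len, hidden)
        scores = self.layer2(self.layer1(second_coding_seq))
        # Sigmoid gives the probability that each character ends a word
        return torch.sigmoid(scores).squeeze(-1)           # (batch, src_len)
```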
The text generator: an LSTM is used as the sequence generator. Taking <S> as the initial input character, the numeric ID corresponding to <S> is obtained from the decoding vocabulary as the initial input ID, and with the preset GEN_LENGTH as the maximum length of the generated sequence, the characters of the sequence are generated one by one from beginning to end. The decoding vocabulary and the encoding vocabulary are separate tables that map characters and pronunciations to numeric IDs. In each step of the generation loop, the embedding layer of the text generator converts the input ID into the first input vector of the LSTM; an attention mechanism computes the attention of the first input vector over the word-ending characters in the second coding sequence to obtain the second input vector; the sum of the first and second input vectors is normalized with layer normalization to obtain the third input vector; the third input vector is fed into the LSTM unit to obtain the LSTM output vector; the LSTM output vector is then fed into the softmax multi-classifier to obtain output ID probabilities; the output ID with the highest probability is taken as the target ID of this loop step, and the target character corresponding to the target ID is obtained by looking up the decoding vocabulary. In general, the target ID of the previous loop step is the input ID of the next loop step. Optionally, during training, starting from the second loop step, the input ID of the n-th loop step is the numeric ID (in the decoding vocabulary) of the (n-1)-th character of the target text.
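The generation loop can be sketched as follows. This is one possible reading of the description: dot-product attention over the word-ending encodings, a single LSTM cell, greedy (argmax) decoding, and optional teacher forcing from the target text are assumptions where the patent leaves the details open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGenerator(nn.Module):
    def __init__(self, decode_vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(decode_vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, decode_vocab_size)  # softmax multi-classifier

    def attend(self, query, word_end_vectors):
        # Attention of the first input vector over the word-ending character encodings
        weights = F.softmax(word_end_vectors @ query, dim=-1)         # (num_word_ends,)
        return weights @ word_end_vectors                             # second input vector

    def generate(self, start_id, word_end_vectors, gen_length, target_ids=None):
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        prev_id, generated = start_id, []                              # start_id: ID of <S>
        for step in range(gen_length):                                 # at most GEN_LENGTH characters
            v1 = self.embed(torch.tensor([prev_id]))[0]                # first input vector
            v2 = self.attend(v1, word_end_vectors)                     # second input vector
            v3 = self.norm(v1 + v2)                                    # third input vector
            h, c = self.lstm(v3.unsqueeze(0), (h, c))                  # LSTM output vector
            next_id = int(self.classifier(h).argmax(dim=-1))           # highest-probability target ID
            generated.append(next_id)
            # Teacher forcing (training): from the second step on, feed the ID of the
            # previous target-text character instead of the previously generated ID
            prev_id = target_ids[step] if target_ids is not None else next_id
        return generated
```

The generated numeric IDs are mapped back to characters through the decoding vocabulary; at inference time the loop simply feeds each generated ID back in as the next input.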
The text reconstruction training method and system provided by the invention use a text encoder and a generative neural network, combine a word segmentation task with the text generation task, and use both the textual features and the pronunciation features of the text to convert a given original text into a target text, correcting wrongly written characters, supplementing missing characters, removing redundant characters, and normalizing word usage, so as to eliminate text errors and improve text quality; both supervised learning and self-supervised learning are supported.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be changes to the specific embodiments and the scope of application in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (6)

1. A text reconstruction training method is characterized by comprising the following steps:
S1, constructing a training sample, wherein the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
S2, based on the training sample, encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
S3, performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, wherein the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
S4, generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end, specifically comprising the following steps:
S401, converting the numeric ID corresponding to the previous character into a first input vector;
S402, using an attention mechanism, calculating the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
S403, obtaining an LSTM output vector based on the third input vector;
S404, classifying the LSTM output vector to generate the next character;
judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step;
S5, optimizing the neural network parameters, specifically comprising:
S501, calculating the word segmentation loss from the true word segmentation labeling sequence and the predicted word segmentation result using cross entropy;
S502, calculating the text generation loss from the true target text and the generated text using cross entropy;
S503, carrying out weighted summation of the word segmentation loss and the text generation loss to obtain the joint loss;
S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss.
2. The text reconstruction training method according to claim 1, wherein step S1 of constructing a training sample comprising an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text specifically includes:
S101, collecting articles, paragraphs and sentences into a text set, splitting the text set into a sentence set, and taking each sentence in the sentence set as an original text;
S102, annotating the original text with pronunciations to obtain the pronunciation sequence;
S103, segmenting the original text into words, marking the ending character of each word as 1 and every non-ending character as 0, to obtain the word segmentation labeling sequence;
S102, if a standard text exists, taking the standard text as the target text; otherwise, copying the original text as the target text.
3. The text reconstruction training method according to claim 2, wherein the step S1 further comprises:
S104, judging whether the original text has already been rewritten, and entering the next step if it has not;
S105, generating a random number between 0 and 1, and entering the next step if the random number is larger than the rewriting threshold;
S106, rewriting the original text to obtain a new text, taking the new text as the original text, and then jumping to step S103, wherein the rewriting method is, at a random position of the original text, to delete the character there, insert a random character, or replace the character there with a new random character.
4. The text reconstruction training method according to claim 1, wherein in step S504, optimizing the parameters of the neural network with a gradient descent algorithm on the joint loss specifically includes:
optimizing the parameters of the neural network with a gradient descent algorithm using the joint text generation and word segmentation loss function, where the joint loss function is:
Loss(θ) = Σ_{(s,p,c,t)∈D} [ ω_g · Loss_g(θ) + ω_c · Loss_c(θ) ]
where Loss(θ) is the joint loss, θ denotes the parameters of the neural network, D is the training data set, s is an original text, p is a pronunciation sequence, c is a word segmentation labeling sequence, t is a target text, Loss_g(θ) is the text generation loss, Loss_c(θ) is the word segmentation loss, ω_g is the weight of the text generation loss, ω_c is the weight of the word segmentation loss, the sum of ω_g and ω_c equals 1, and both Loss_g(θ) and Loss_c(θ) are calculated using cross entropy.
5. A text reconstruction system applied to the text reconstruction training method according to any one of claims 1 to 4, comprising:
the training sample construction module is used for constructing a training sample, and the training sample comprises an original text, a pronunciation sequence, a word segmentation labeling sequence and a target text;
the text encoder is used for encoding the concatenation of the original text and the pronunciation sequence with a neural network text sequence model based on the training sample to obtain a first coding sequence, and then truncating the first coding sequence at the start and end positions of the original text within the concatenated sequence to obtain a second coding sequence corresponding to the original text;
the word segmentation classifier is used for performing binary classification on each vector in the second coding sequence with a sigmoid feedforward neural network classifier to obtain a word segmentation classification result, and the word segmentation classification result indicates whether the character at the corresponding position in the original text is the end character of a word;
the text generator is used for generating text, with the preset GEN_LENGTH as the maximum length of the generated sequence, generating each character of the sequence in turn from beginning to end;
and the neural network parameter optimization unit is used for optimizing the neural network parameters and learning the neural network model parameters of the text encoder, the word segmentation classifier and the text generator by using the text generation and word segmentation joint loss function.
6. The text reconstruction system of claim 5, wherein the text generator specifically comprises:
the embedding layer is used for converting the numeric ID corresponding to the previous character into a first input vector;
the attention layer is used for calculating, with an attention mechanism, the attention of the first input vector over the word-ending characters in the second coding sequence as a second input vector, and normalizing the sum of the first input vector and the second input vector with layer normalization to obtain a third input vector;
the LSTM unit is used for obtaining an LSTM output vector based on the third input vector;
the softmax multi-classifier is used for classifying the LSTM output vector to generate the next character; and the loop judgment module is used for judging whether the number of generated characters exceeds the maximum number of characters to generate; if not, taking the newly generated character as the previous character and jumping back to step S401 to generate the next character in the loop; otherwise, proceeding to the next step.
Application CN202010484716.9A, priority date 2020-06-01, filing date 2020-06-01: Text reconstruction training method and system, granted as CN111639477B (Active)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484716.9A CN111639477B (en) 2020-06-01 2020-06-01 Text reconstruction training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010484716.9A CN111639477B (en) 2020-06-01 2020-06-01 Text reconstruction training method and system

Publications (2)

Publication Number Publication Date
CN111639477A 2020-09-08
CN111639477B 2023-04-18

Family

ID=72331127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484716.9A Active CN111639477B (en) 2020-06-01 2020-06-01 Text reconstruction training method and system

Country Status (1)

Country Link
CN (1) CN111639477B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149415A (en) * 2020-10-12 2020-12-29 清华大学 Training method and device of text generation model and readable storage medium
CN116361858B (en) * 2023-04-10 2024-01-26 北京无限自在文化传媒股份有限公司 User session resource data protection method and software product applying AI decision

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017160393A1 (en) * 2016-03-18 2017-09-21 Google Inc. Globally normalized neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Also Published As

Publication number Publication date
CN111639477A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111310471B (en) Travel named entity identification method based on BBLC model
Sang et al. Introduction to the CoNLL-2001 shared task: Clause identification
CN110738057B (en) Text style migration method based on grammar constraint and language model
CN115393692A (en) Generation formula pre-training language model-based association text-to-image generation method
CN111639477B (en) Text reconstruction training method and system
CN114118065A (en) Chinese text error correction method and device in electric power field, storage medium and computing equipment
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN111930939A (en) Text detection method and device
CN112507695A (en) Text error correction model establishing method, device, medium and electronic equipment
CN111522961A (en) Attention mechanism and entity description based industrial map construction method
CN113609285A (en) Multi-mode text summarization system based on door control fusion mechanism
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
Tada et al. Robust understanding of robot-directed speech commands using sequence to sequence with noise injection
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN114547370A (en) Video abstract extraction method and system
CN111401012B (en) Text error correction method, electronic device and computer readable storage medium
CN115906815A (en) Error correction method and device for modifying one or more types of wrong sentences
CN114707492A (en) Vietnamese grammar error correction method and device fusing multi-granularity characteristics
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
US11727062B1 (en) Systems and methods for generating vector space embeddings from a multi-format document
CN113010635B (en) Text error correction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant