CN112257456A - Text editing technology-based training method and device for text generation model - Google Patents

Text editing technology-based training method and device for text generation model

Info

Publication number
CN112257456A
Authority
CN
China
Prior art keywords
text
source
source text
generation model
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011139506.2A
Other languages
Chinese (zh)
Inventor
孙超
王健宗
吴天博
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011139506.2A priority Critical patent/CN112257456A/en
Priority to PCT/CN2020/131757 priority patent/WO2021189890A1/en
Publication of CN112257456A publication Critical patent/CN112257456A/en
Pending legal-status Critical Current



Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

The invention discloses a method and a device for training a text generation model based on a text editing technology. The method comprises the following steps: acquiring a preset source text set; editing the source text set with a preset text editor to obtain a target text set of the source text set; constructing a vocabulary from the source text set and the target text set; processing each source text according to the vocabulary and the target text of that source text to obtain a first label sequence; inputting each source text into a text generation model to be trained to obtain a second label sequence; and adjusting the configuration parameters of the text generation model according to the first and second label sequences. This training method, which belongs to the technical field of machine learning, greatly improves the training efficiency of the text generation model and the accuracy with which the model generates high-semantic text.

Description

Text editing technology-based training method and device for text generation model
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a text generation model training method and device based on a text editing technology.
Background
Text generation is an important task in the field of natural language processing and a significant challenge for artificial intelligence. Although text generation can assist professionals in specialized writing, such as legal document completion, automatic news generation, text summarization, and text statement generation, training a text generation model depends on a large amount of data, and high-quality text data in specific fields is scarce, so the accuracy of the high-semantic text generated by such models is low.
Disclosure of Invention
The embodiments of the invention provide a method and a device for training a text generation model based on a text editing technology, which solve the problem in the prior art that a text generation model needs a large amount of high-quality text data for training before it can accurately generate high-semantic text.
In a first aspect, an embodiment of the present invention provides a method for training a text generation model based on a text editing technology, including:
acquiring a preset source text set;
editing the source text set according to a preset text editor to obtain a target text set of the source text set;
constructing a vocabulary table according to the source text set and the target text set;
processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence;
inputting each source text into a text generation model to be trained to obtain a second label sequence;
and adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
In a second aspect, an embodiment of the present invention provides a training apparatus for a text generation model based on a text editing technology, including:
the first acquisition unit is used for acquiring a preset source text set;
the editing unit is used for editing the source text set according to a preset text editor to obtain a target text set of the source text set;
the first construction unit is used for constructing a vocabulary according to the source text set and the target text set;
the processing unit is used for processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence;
the input unit is used for inputting each source text into a text generation model to be trained to obtain a second label sequence;
and the first adjusting unit is used for adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for training a text generation model based on a text editing technology as described in the first aspect.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the method for training a text generation model based on a text editing technology according to the first aspect.
The embodiments of the invention provide a training method and device for a text generation model based on a text editing technology. During training, after the source text set required for training is obtained, each source text in the set is edited to obtain its target text. A vocabulary is then built from the source texts and target texts, and each source text is processed with the vocabulary and its target text to obtain a first label sequence; at the same time, the source text is input into the text generation model to be trained to obtain a second label sequence. The similarity between the first and second label sequences is calculated to adjust the configuration parameters of the model to be trained. Here the first label sequence is a character string obtained without the text generation model, while the second is a character string produced by the model; the target text of a source text can be recovered from the first label sequence, and a target text similar to it from the second. This training method can complete the training of the text model even when the source text set is small, greatly improving the training efficiency of the text generation model and the accuracy of generating high-semantic text.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a training method for a text generation model based on a text editing technology according to an embodiment of the present invention;
fig. 2 is a schematic sub-flow diagram of a training method for a text generation model based on a text editing technology according to an embodiment of the present invention;
fig. 3 is a schematic sub-flow chart of a training method for a text generation model based on a text editing technology according to an embodiment of the present invention;
fig. 4 is a schematic sub-flow chart of a training method for a text generation model based on a text editing technology according to an embodiment of the present invention;
fig. 5 is a schematic sub-flow chart of a training method for a text generation model based on a text editing technology according to an embodiment of the present invention;
fig. 6 is a schematic sub-flow chart of a training method for a text generation model based on a text editing technology according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a training apparatus for generating a text model based on a text editing technique according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of sub-units of a training apparatus for a text generation model based on a text editing technique according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of another sub-unit of a training apparatus for a text generation model based on a text editing technology according to an embodiment of the present invention;
FIG. 10 is a schematic block diagram of another sub-unit of a training apparatus for a text generation model based on a text editing technology according to an embodiment of the present invention;
FIG. 11 is a schematic block diagram of another subunit of a training apparatus for a text generation model based on a text editing technology according to an embodiment of the present invention;
FIG. 12 is a schematic block diagram of another sub-unit of a training apparatus for a text generation model based on a text editing technology according to an embodiment of the present invention;
FIG. 13 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flowchart of a training method of a text generation model based on a text editing technology according to an embodiment of the present invention. The method is built and runs in a server. When a text generation model is trained in the server, after the source text set required for training is obtained, each source text in the set is edited to obtain its target text. Each source text is then processed with a preset vocabulary and its target text to obtain a first label sequence; at the same time, the source text is input into the text generation model to be trained to obtain a second label sequence. The similarity between the first and second label sequences is calculated to adjust the configuration parameters of the model to be trained, so that training can still be completed even when the source text set is small, greatly improving the training efficiency of the text generation model and the accuracy of generating high-semantic text.
As shown in fig. 1, the method includes steps S110 to S160.
S110, acquiring a preset source text set.
Acquire a preset source text set. Specifically, the source text set is the data set used to train the text generation model; the number of texts in the set can be configured according to user requirements and may be large or small. In the embodiment of the invention, a source text set with a small data volume is used to train the text generation model.
S120, editing the source text set according to a preset text editor to obtain a target text set of the source text set.
Edit the source text set with a preset text editor to obtain the target text set of the source text set. Specifically, the text editor is a text editing tool used to edit each source text in the source text set into a high-semantic target text; Notepad under Windows, TextEdit under Mac OS X, or vi, emacs, gedit and the like under Linux can be used to edit each source text. For example, when the source text is "Xiaoming was born in 1993. Xiaoming was born in Shanghai.", the target text produced by the text editor is "Xiaoming was born in Shanghai in 1993."
S130, constructing a vocabulary according to the source text set and the target text set.
Construct a vocabulary from the source text set and the target text set. Specifically, for each target text in the target text set, the words of the target text that do not exist in the corresponding source text are taken as words of the vocabulary. In constructing the vocabulary, to reduce the amount of computation in its subsequent use, the vocabulary generally needs to be optimized so that it is as small as possible: candidate words are screened according to their frequency of occurrence in the target text set, for example by removing words that occur fewer than ten times in the target text set, which yields the optimized vocabulary. In the embodiment of the invention the constructed vocabulary is stored in a blockchain, ensuring the security of vocabulary storage.
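By way of illustration only, the following Python sketch shows one way such a vocabulary could be built and frequency-filtered; the whitespace tokenizer and the ten-occurrence cutoff are stand-ins for whatever segmenter and threshold a real deployment would use:

```python
from collections import Counter

def tokenize(text):
    # Placeholder segmenter: splits on whitespace. A real Chinese system
    # would use a word segmenter such as the reverse maximum matching
    # method described later in this document.
    return text.split()

def build_vocabulary(pairs, min_freq=10):
    """Collect words that appear in a target text but not in its source
    text, then drop candidates seen fewer than min_freq times across the
    target text set (the frequency filter described above)."""
    counts = Counter()
    for source, target in pairs:
        source_words = set(tokenize(source))
        for word in tokenize(target):
            if word not in source_words:
                counts[word] += 1
    return {word for word, c in counts.items() if c >= min_freq}
```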
In one embodiment, as shown in fig. 2, step S130 includes sub-steps S131 and S132.
S131, constructing the longest common subsequence of each source text and the target text of each source text according to each source text and the target text of each source text.
Construct the longest common subsequence of each source text and its target text. Specifically, the longest common subsequence of each source text and its target text is obtained with the longest-common-subsequence technique, defined as follows: a sequence S that is a subsequence of each of two or more known sequences, and is the longest of all sequences meeting this condition, is called the longest common subsequence of the known sequences.
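The patent does not prescribe a specific algorithm for this step; the textbook dynamic-programming construction, sketched below in Python for two character strings, is one standard way to realize it:

```python
def longest_common_subsequence(a, b):
    """O(len(a) * len(b)) dynamic program over two character strings;
    returns one longest common subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCS length of a[:i], b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk the table backwards to recover the subsequence itself.
    chars, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))
```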
In an embodiment, as shown in fig. 3, step S131 includes substeps S1311 and S1312.
S1311, acquiring the subsequence set of each source text and the subsequence set of the target text of each source text.
Acquire the subsequence set of each source text and the subsequence set of its target text. Specifically, the subsequences of each source text are obtained by splitting the source text without changing its character order; the resulting subsequences form the subsequence set of that source text, and the subsequence set of the target text of each source text is obtained in the same way.
S1312, matching each subsequence in the set of subsequences of each source text with each subsequence in the set of subsequences of the target text respectively to obtain a set of common subsequences of each source text and the target text of each source text, and taking the longest common subsequence in the set of common subsequences as the longest common subsequence.
Match each subsequence in the subsequence set of each source text with each subsequence in the subsequence set of the target text to obtain the set of common subsequences of the source text and its target text, and take the longest sequence in that set as the longest common subsequence. Specifically, every sequence in the common subsequence set is a common subsequence of the source text and its target text, and the longest among them is their longest common subsequence.
S132, constructing the vocabulary according to the target text of each source text and the longest public subsequence.
Construct the vocabulary from the target text of each source text and the longest common subsequence. Specifically, since the longest common subsequence is common to each source text and its target text, the words of the target text that do not appear in the longest common subsequence are extracted and used as words of the vocabulary, completing its construction.
In one embodiment, as shown in fig. 4, step S132 includes substeps S1321 and S1322.
S1321, performing word segmentation processing on the target text of each source text to obtain words of the target text of each source text.
Perform word segmentation on the target text of each source text to obtain the words of that target text. Specifically, this embodiment adopts the reverse maximum matching method, a string-based word segmentation method, to segment the target texts. The process is as follows: let L be the number of Chinese characters in the longest entry of a preset dictionary, and start processing from the end of the target text's character string. At the beginning of each cycle, the last L characters of the string are taken as the processing object and looked up in the dictionary. If the dictionary contains such an L-character word, the match succeeds and the processing object is segmented off as a word; if not, the first Chinese character of the processing object is removed, the remaining string becomes the new processing object, and matching is repeated until segmentation succeeds. Each completed round of matching segments off one word, and the cycle repeats until all words in the target text have been segmented.
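A minimal Python sketch of the reverse maximum matching procedure just described might look as follows:

```python
def reverse_maximum_match(text, dictionary):
    """Reverse maximum matching: starting from the end of the string, try
    the last L characters against the dictionary (L = longest entry),
    dropping the first character of the window after each failed lookup;
    an unmatched single character falls through as a one-character word."""
    L = max(len(entry) for entry in dictionary)
    words, end = [], len(text)
    while end > 0:
        size = min(L, end)
        while size > 1 and text[end - size:end] not in dictionary:
            size -= 1  # remove the first character of the processing object
        words.append(text[end - size:end])
        end -= size
    return list(reversed(words))
```

With a toy dictionary, for instance, reverse_maximum_match("研究生命起源", {"研究", "研究生", "生命", "起源"}) returns ["研究", "生命", "起源"].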
S1322, matching the words of the target text of each source text with the longest public subsequence to obtain the words forming the vocabulary from the words of the target text of each source text.
Match the words of the target text of each source text with the longest common subsequence to obtain the words that constitute the vocabulary. Specifically, whether a word of the target text exists in the longest common subsequence determines whether the match succeeds: if the match succeeds, the word is not a vocabulary word; if it fails, the word is taken as a word of the vocabulary.
S140, processing each source text according to the vocabulary and the target text of each source text to obtain a first label sequence.
Process each source text according to the vocabulary and its target text to obtain a first label sequence. Specifically, no source text contains the words of the vocabulary. Each source text is labeled according to the longest common subsequence of the source text and its target text; the labeled source text is then split to obtain its characters, the characters are matched with the words of the vocabulary to form new words, and the matched words are spliced to obtain the first label sequence.
In an embodiment, as shown in fig. 5, step S140 includes sub-steps S141, S142, S143, and S144.
S141, labeling each source text according to the longest public subsequence to obtain each labeled source text.
Label each source text according to the longest common subsequence to obtain each labeled source text. Specifically, each character of the source text that belongs to the longest common subsequence is given a first label, recorded as the symbol "keep"; each character that does not belong to the longest common subsequence is given a second label, recorded as "delete". In this way every source text becomes a text annotated with first and second labels. For example, for the source text "Xiaoming was born in 1993. Xiaoming was born in Shanghai.", the label sequence of the labeled source text consists of "keep" labels on the characters that belong to the longest common subsequence and "delete" labels on all remaining characters.
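Under the assumption that the longest common subsequence is aligned with the source text by a greedy left-to-right scan (the patent does not fix the alignment procedure), the keep/delete labeling could be sketched as:

```python
def label_source(source, lcs):
    """Tag each source character 'keep' if it matches the next unconsumed
    character of the longest common subsequence, else 'delete'. The greedy
    left-to-right alignment is an assumption, not taken from the patent."""
    labels, k = [], 0
    for ch in source:
        if k < len(lcs) and ch == lcs[k]:
            labels.append("keep")
            k += 1
        else:
            labels.append("delete")
    return labels
```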
S142, performing word segmentation processing on each labeled source text to obtain a character set of each labeled source text.
Perform word segmentation on each labeled source text to obtain its character set. Specifically, the character set of a labeled source text is the set of its characters carrying the first label. Segmentation proceeds as follows: first, split between adjacent characters carrying the first and second labels; then split the stretches carrying the first label character by character. For example, for the source text "Xiaoming was born in 1993. Xiaoming was born in Shanghai.", the character set finally obtained from the labeled source text is [Xiao, Ming, in, 1993, year, born, in, Shang, Hai], where every character carries the first label.
S143, respectively matching the words in the vocabulary with the characters in the character set to obtain a word set.
Match the words of the vocabulary with the characters of the character set to obtain a word set. Specifically, each word in the vocabulary is matched with characters in the character set to form a new word; the new word is then looked up in a preset dictionary to check whether it exists there. New words absent from the dictionary are ignored, and the new words that survive this screening form the word set.
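The pairing rule below (concatenating a vocabulary word with a kept character on either side) is an assumption made purely for illustration; the patent says only that candidate new words absent from the preset dictionary are discarded:

```python
def build_word_set(vocabulary, characters, dictionary):
    """Pair each vocabulary word with each kept character and keep the
    combinations that exist in the preset dictionary; combinations absent
    from the dictionary are ignored, as described above. The concatenation
    rule is an illustrative assumption."""
    word_set = []
    for word in vocabulary:
        for ch in characters:
            for candidate in (ch + word, word + ch):
                if candidate in dictionary:
                    word_set.append(candidate)
    return word_set
```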
S144, splicing the words in the word set to obtain the first label sequence.
Splice the words of the word set to obtain the first label sequence. Specifically, the words of the word set are spliced in the order in which the characters appear in the labeled source text. Splicing yields at least one text; the spliced texts are then parsed syntactically, and the sentence that best fits the source text is selected as the first label sequence, from which the target text of the source text can be predicted. For example, for the source text "Xiaoming was born in 1993. Xiaoming was born in Shanghai.", the first label sequence interleaves "keep" and "delete" labels with insertion labels drawn from the vocabulary, the inserted labels marking the vocabulary words to be added.
S150, inputting each source text into a text generation model to be trained to obtain a second label sequence.
Input each source text into the text generation model to be trained to obtain a second label sequence. Specifically, the text generation model to be trained follows an encoder-decoder architecture, i.e., the model consists of an encoder and a decoder. After a source text is input into the model, a second label sequence is obtained through encoding and decoding, and the target text of the source text can be predicted from it. In the embodiment of the invention, the encoder adopts a pre-trained Chinese RoBERTa model composed of 12 Transformer layers; the decoder adopts a single Transformer layer, which preserves accuracy while keeping the model's inference speed acceptable.
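A hedged PyTorch sketch of this architecture follows; the checkpoint name hfl/chinese-roberta-wwm-ext, the attention-head count, and the tag-set size are assumptions for illustration, not details taken from the patent:

```python
import torch.nn as nn
from transformers import BertModel  # HFL's Chinese RoBERTa checkpoints use the BERT architecture

class TaggingModel(nn.Module):
    """12-layer pretrained encoder plus a single Transformer decoder
    layer emitting one tag distribution per source token."""
    def __init__(self, num_tags, checkpoint="hfl/chinese-roberta-wwm-ext"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size  # 768 for the 12-layer model
        self.decoder = nn.TransformerDecoderLayer(d_model=hidden, nhead=8,
                                                  batch_first=True)
        self.classifier = nn.Linear(hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        memory = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # The single decoder layer attends over the encoder states; here the
        # encoder output doubles as the decoder input for per-token tagging.
        decoded = self.decoder(memory, memory)
        return self.classifier(decoded)  # (batch, seq_len, num_tags)
```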
S160, adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
Adjust the configuration parameters of the text generation model according to the first and second label sequences. Specifically, the target text of a source text can be obtained from either label sequence, but the first label sequence is derived directly from the target text, while the second is generated by the model under training; the target text obtained from the first label sequence is therefore the more accurate of the two. The similarity between the two sequences is calculated, and the configuration parameters of the model are adjusted accordingly so that the second label sequence it generates moves closer to the first, thereby completing the training of the text generation model.
In one embodiment, as shown in fig. 6, step S160 includes sub-steps S161 and S162.
S161, obtaining the similarity between the second label sequence and the first label sequence.
Acquire the similarity between the second label sequence and the first label sequence. Specifically, to compute the similarity, the first and second label sequences are first vectorized, and the distance between the two vectors is then calculated and used as the similarity between them: the longer the distance, the lower the similarity; the shorter the distance, the higher the similarity. The embodiment of the invention obtains the similarity by means of the Euclidean distance, a commonly used distance measure that refers to the true distance between two points in n-dimensional space, or equivalently the natural length of a vector. The Euclidean distance between the second label sequence and the first label sequence is calculated as

d(x_1, x_2) = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}

where n denotes the dimension of the vectors, x_{1k} is the k-th component of the first label sequence's vector, and x_{2k} is the k-th component of the second label sequence's vector.
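In code the distance computation is direct; how the label sequences are vectorized (e.g., one-hot per label) is left open by the patent and assumed here:

```python
import math

def euclidean_distance(x1, x2):
    """Distance between two equal-length label-sequence vectors; a shorter
    distance means a higher similarity."""
    assert len(x1) == len(x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
```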
S162, if the similarity is lower than a preset threshold value, adjusting configuration parameters of the text generation model according to the similarity.
If the similarity is lower than a preset threshold, adjust the configuration parameters of the text generation model according to the similarity. Specifically, the preset threshold determines whether the parameters of the text generation model need to be adjusted so that it generates high-semantic text more accurately. The threshold can be set according to actual conditions and is not limited here.
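One way to read this step, sketched below under the assumptions that the squared Euclidean distance serves as the training loss and that similarity is a decreasing function of distance (the patent specifies neither):

```python
import torch

def maybe_update(optimizer, first_vec, second_vec, threshold):
    """One hedged training step: map distance to a similarity in (0, 1]
    and, if it falls below the preset threshold, take a gradient step on
    the squared distance. second_vec must come from the model so that
    gradients flow back to its parameters."""
    distance = torch.dist(second_vec, first_vec)   # Euclidean distance
    similarity = 1.0 / (1.0 + distance.item())     # assumed mapping
    if similarity < threshold:
        optimizer.zero_grad()
        (distance ** 2).backward()                 # assumed loss
        optimizer.step()
    return similarity
```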
The invention relates to a text generation model training method based on a text editing technology, which comprises the steps of obtaining a preset source text set; editing the source text set according to a preset text editor to obtain a target text set of the source text set; constructing a vocabulary table according to the source text set and the target text set; processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence; inputting each source text into a text generation model to be trained to obtain a second label sequence; and adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence. The training method of the text generation model based on the text editing technology not only greatly improves the training efficiency of the text generation model, but also improves the accuracy of the text generation model for generating high-semantic texts.
The embodiment of the invention also provides a training device 100 for a text generation model based on a text editing technology, which is used for executing any embodiment of the training method for the text generation model based on the text editing technology. Specifically, referring to fig. 7, fig. 7 is a schematic block diagram of a training apparatus 100 for a text generation model based on a text editing technology according to an embodiment of the present invention.
As shown in fig. 7, the training apparatus 100 for generating a model based on a text editing technology includes a first obtaining unit 110, an editing unit 120, a first constructing unit 130, a processing unit 140, an input unit 150, and a first adjusting unit 160.
The first obtaining unit 110 is configured to obtain a preset source text set.
The editing unit 120 is configured to edit the source text set according to a preset text editor to obtain a target text set of the source text set.
A first constructing unit 130, configured to construct a vocabulary according to the source text set and the target text set.
In another embodiment of the present invention, as shown in fig. 8, the first building unit 130 includes: a second building element 131 and a third building element 132.
A second constructing unit 131, configured to construct a longest common subsequence of the each source text and the target text of the each source text according to the each source text and the target text of the each source text.
In another embodiment of the present invention, as shown in fig. 9, the second building unit 131 includes: a second acquisition unit 1311 and a first matching unit 1312.
A second obtaining unit 1311, configured to obtain the set of subsequences of each source text and the set of subsequences of the target text of each source text.
A first matching unit 1312, configured to match each subsequence in the set of subsequences of each source text with each subsequence in the set of subsequences of the target text, respectively, to obtain a set of common subsequences of each source text and the target text of each source text, and use a longest common subsequence in the set of common subsequences as the longest common subsequence.
A third constructing unit 132, configured to construct the vocabulary according to the target text of each source text and the longest common subsequence.
In another embodiment of the present invention, as shown in fig. 10, the third building unit 132 includes: a first word segmentation unit 1321 and a second matching unit 1322.
The first word segmentation unit 1321 is configured to perform word segmentation on the target text of each source text to obtain a word of the target text of each source text.
A second matching unit 1322 is configured to match the words of the target text of each source text with the longest common subsequence to obtain the words constituting the vocabulary from the words of the target text of each source text.
The processing unit 140 is configured to process each source text in the source text set according to the vocabulary and a target text of each source text in the source text set to obtain a first tag sequence.
In another embodiment of the present invention, as shown in fig. 11, the processing unit 140 includes: an annotation unit 141, a second participle unit 142, a third matching unit 143, and a concatenation unit 144.
A labeling unit 141, configured to label each source text according to the longest common subsequence to obtain each labeled source text.
A second word segmentation unit 142, configured to perform word segmentation processing on each labeled source text to obtain a character set of each labeled source text.
The third matching unit 143 is configured to match the words in the vocabulary with the characters in the character set respectively to obtain a word set.
A splicing unit 144, configured to splice the terms in the term set to obtain the first tag sequence.
An input unit 150, configured to input each source text into a text generation model to be trained to obtain a second label sequence.
A first adjusting unit 160, configured to adjust configuration parameters of the text generation model according to the first tag sequence and the second tag sequence.
In another embodiment of the present invention, as shown in fig. 12, the first adjusting unit 160 includes: a third acquisition unit 161 and a second adjustment unit 162.
A third obtaining unit 161, configured to obtain a similarity between the second tag sequence and the first tag sequence.
A second adjusting unit 162, configured to adjust the configuration parameters of the text generation model according to the similarity if the similarity is lower than a preset threshold.
The training device 100 for the text generation model based on the text editing technology provided by the embodiment of the invention is configured to perform the foregoing process: acquiring a preset source text set; editing the source text set according to a preset text editor to obtain a target text set of the source text set; constructing a vocabulary according to the source text set and the target text set; processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence; inputting each source text into a text generation model to be trained to obtain a second label sequence; and adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
Referring to fig. 13, fig. 13 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Referring to fig. 13, the device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a method of training a text generation model based on text editing techniques.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 may be caused to perform a training method of a text generation model based on a text editing technology.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 13 is a block diagram of only a portion of the configuration associated with aspects of the present invention and does not constitute a limitation of the apparatus 500 to which aspects of the present invention may be applied, and that a particular apparatus 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: acquiring a preset source text set; editing the source text set according to a preset text editor to obtain a target text set of the source text set; constructing a vocabulary table according to the source text set and the target text set; processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence; inputting each source text into a text generation model to be trained to obtain a second label sequence; and adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
Those skilled in the art will appreciate that the embodiment of the apparatus 500 shown in fig. 13 does not constitute a limitation on the specific construction of the apparatus 500, and in other embodiments, the apparatus 500 may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the apparatus 500 may only include the memory and the processor 502, and in such embodiments, the structure and function of the memory and the processor 502 are the same as those of the embodiment shown in fig. 13, and are not repeated herein.
It should be understood that in this embodiment, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor.
In another embodiment of the present invention, a computer storage medium is provided. The storage medium may be a non-volatile computer-readable storage medium. The storage medium stores a computer program 5032, wherein the computer program 5032 when executed by the processor 502 performs the steps of: acquiring a preset source text set; editing the source text set according to a preset text editor to obtain a target text set of the source text set; constructing a vocabulary table according to the source text set and the target text set; processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence; inputting each source text into a text generation model to be trained to obtain a second label sequence; and adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatuses, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both; to illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above in terms of their functions. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a storage medium. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a device 500 (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A training method of a text generation model based on a text editing technology is characterized by comprising the following steps:
acquiring a preset source text set;
editing the source text set according to a preset text editor to obtain a target text set of the source text set;
constructing a vocabulary table according to the source text set and the target text set;
processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence;
inputting each source text into a text generation model to be trained to obtain a second label sequence;
and adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
2. The method for training a text generation model based on a text editing technology as claimed in claim 1, wherein the constructing the vocabulary from the source text set and the target text set comprises:
constructing the longest common subsequence of each source text and the target text of each source text according to each source text and the target text of each source text;
and constructing the vocabulary according to the target text and the longest public subsequence of each source text.
3. The method for training a text generation model based on a text editing technique according to claim 2, wherein the constructing the longest common subsequence of each source text and the target text of each source text from the each source text and the target text of each source text comprises:
acquiring a subsequence set of each source text and a subsequence set of a target text of each source text;
and matching each subsequence in the subsequence set of each source text with each subsequence in the subsequence set of the target text to obtain a common subsequence set of each source text and the target text of each source text, and taking the longest common subsequence in the common subsequence set as the longest common subsequence.
4. The method for training a text generation model based on a text editing technique as claimed in claim 2, wherein the constructing the vocabulary according to the target text of each source text and the longest common subsequence comprises:
performing word segmentation processing on the target text of each source text to obtain words of the target text of each source text;
matching words of the target text of each source text with the longest common subsequence to obtain words constituting the vocabulary from the words of the target text of each source text.
5. The method for training the text generation model based on the text editing technology as claimed in claim 2, wherein the processing each source text according to the predetermined vocabulary and the target text of each source text to obtain the first label sequence comprises:
labeling each source text according to the longest public subsequence to obtain each labeled source text;
performing word segmentation processing on each labeled source text to obtain a character set of each labeled source text;
matching words in the vocabulary with the characters in the character set respectively to obtain a word set;
splicing the words in the word set to obtain the first label sequence.
6. The method for training the text generation model based on the text editing technology according to claim 5, wherein the splicing the words in the word set to obtain the first label sequence comprises:
and splicing the words in the word set according to the arrangement sequence of the characters in each labeled source text to obtain the first label sequence.
7. The method for training the text generation model based on the text editing technology according to claim 1, wherein the adjusting the configuration parameters of the text generation model according to the first tag sequence and the second tag sequence comprises:
acquiring the similarity between the second label sequence and the first label sequence;
and if the similarity is lower than a preset threshold value, adjusting the configuration parameters of the text generation model according to the similarity.
8. A training device for a text generation model based on a text editing technology is characterized by comprising:
the first acquisition unit is used for acquiring a preset source text set;
the editing unit is used for editing the source text set according to a preset text editor to obtain a target text set of the source text set;
the first construction unit is used for constructing a vocabulary according to the source text set and the target text set;
the processing unit is used for processing each source text according to the vocabulary and the target text of each source text in the source text set to obtain a first label sequence;
the input unit is used for inputting each source text into a text generation model to be trained to obtain a second label sequence;
and the first adjusting unit is used for adjusting configuration parameters of the text generation model according to the first label sequence and the second label sequence.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of training a text generation model based on text editing techniques according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, causes the processor to carry out a method of training a text generation model based on text editing techniques according to any one of claims 1 to 7.
CN202011139506.2A 2020-10-22 2020-10-22 Text editing technology-based training method and device for text generation model Pending CN112257456A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011139506.2A CN112257456A (en) 2020-10-22 2020-10-22 Text editing technology-based training method and device for text generation model
PCT/CN2020/131757 WO2021189890A1 (en) 2020-10-22 2020-11-26 Text generation model training method and apparatus based on text editing technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011139506.2A CN112257456A (en) 2020-10-22 2020-10-22 Text editing technology-based training method and device for text generation model

Publications (1)

Publication Number Publication Date
CN112257456A (en) 2021-01-22

Family

ID=74264135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011139506.2A Pending CN112257456A (en) 2020-10-22 2020-10-22 Text editing technology-based training method and device for text generation model

Country Status (2)

Country Link
CN (1) CN112257456A (en)
WO (1) WO2021189890A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011149A (en) * 2021-03-04 2021-06-22 中国科学院自动化研究所 Text error correction method and system
CN113435183A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Text generation method, device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168952B (en) * 2017-05-15 2021-06-04 北京百度网讯科技有限公司 Information generation method and device based on artificial intelligence
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN109933662B (en) * 2019-02-15 2021-03-12 北京奇艺世纪科技有限公司 Model training method, information generation method, device, electronic equipment and computer readable medium
CN110263350A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Model training method, device, computer readable storage medium and computer equipment
CN110097085B (en) * 2019-04-03 2023-04-14 阿里巴巴集团控股有限公司 Lyric text generation method, training method, device, server and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011149A (en) * 2021-03-04 2021-06-22 中国科学院自动化研究所 Text error correction method and system
CN113435183A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Text generation method, device and storage medium
CN113435183B (en) * 2021-06-30 2023-08-29 平安科技(深圳)有限公司 Text generation method, device and storage medium

Also Published As

Publication number Publication date
WO2021189890A1 (en) 2021-09-30

Similar Documents

Publication Publication Date Title
CN108091328B (en) Speech recognition error correction method and device based on artificial intelligence and readable medium
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN107315737B (en) Semantic logic processing method and system
CN110532554B (en) Chinese abstract generation method, system and storage medium
CN106649783B (en) Synonym mining method and device
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN110569505B (en) Text input method and device
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN113033182B (en) Text creation assisting method, device and server
CN112257456A (en) Text editing technology-based training method and device for text generation model
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN112507734A (en) Roman Uygur language-based neural machine translation system
CN108664464B (en) Method and device for determining semantic relevance
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN114970514A (en) Artificial intelligence based Chinese word segmentation method, device, computer equipment and medium
CN113268989A (en) Polyphone processing method and device
CN113076749A (en) Text recognition method and system
CN112487813A (en) Named entity recognition method and system, electronic equipment and storage medium
CN114970524B (en) Controllable text generation method and device
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
Hewlett et al. Bootstrap voting experts
CN116129883A (en) Speech recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination