CN118468822B - Target domain text generation method and system - Google Patents

Target domain text generation method and system

Info

Publication number: CN118468822B
Application number: CN202410918161.2A
Authority: CN (China)
Other versions: CN118468822A (Chinese)
Prior art keywords: text, sequence, target, token, word
Inventors: 汪自立, 张越
Assignee: Chengdu Jiafa Antai Education Technology Co., Ltd.
Legal status: Active (granted)
Classification: Machine Translation

Abstract

The invention discloses a target domain text generation method and system, relating to the field of data processing. The method comprises the following steps: receiving a target domain text generation request sent by user equipment; generating a target domain vocabulary according to the theme parameter and the text data set corresponding to the target domain; generating a conditional token sequence according to the theme parameter, the text type parameter, and the domain text attribute; invoking a target domain text generation model to obtain a decoded token sequence based on the conditional token sequence; and screening out a target decoded token sequence using the word count range and the target domain vocabulary, then decoding the target decoded token sequence into the target domain text. Through the theme parameter and the text type parameter, the system can generate text related to a particular domain and type; by constructing the target domain vocabulary and the conditional token sequence, a structured information basis is provided for text generation. By precisely controlling each aspect of the text generation process, content redundancy and off-topic drift are avoided.

Description

Target domain text generation method and system
Technical Field
The invention relates to the field of data processing, and in particular to a target domain text generation method and system.
Background
In teaching and other text application scenarios, it is often necessary to generate text that matches a given theme. For automatic text generation, the prior art struggles to ensure that the content neither deviates from the predetermined theme nor contains inappropriate information, so the generated text may exceed the user-specified theme or scope. Likewise, when generating text of a particular length, it is difficult to control the word count precisely, and the output may exceed or fall short of the word count limit required by the user.
Disclosure of Invention
The main purpose of the invention is to provide a target domain text generation method and system that can generate a conditional token sequence related to a specific domain and type through a theme parameter and a text type parameter, screen the decoded token sequence using the word count range and the target domain vocabulary, control the length and quality of the generated text, and avoid content redundancy or deviation from the theme. By precisely controlling each aspect of the text, the end user obtains more satisfactory text content.
To achieve the above object, the embodiments of the present application provide the following technical solutions:
According to a first aspect of the embodiments of the present application, a target domain text generation method is provided, including:
receiving a target domain text generation request sent by user equipment, wherein the request carries a theme parameter, a text type parameter, a word count range, and a domain text attribute;
generating a target domain vocabulary according to the theme parameter and the text data set corresponding to the target domain;
generating a conditional token sequence according to the theme parameter, the text type parameter, and the domain text attribute;
invoking a target domain text generation model to obtain a decoded token sequence based on the conditional token sequence;
and screening out a target decoded token sequence using the word count range and the target domain vocabulary, and decoding the target decoded token sequence into the target domain text.
Invoking the target domain text generation model to obtain the decoded token sequence based on the conditional token sequence includes:
invoking the target domain text generation model to calculate a set of probability values for the candidate tokens at the next sequence position of the conditional token sequence;
sorting the probability values in the set from largest to smallest;
and adding the token with the maximum probability value to the initialized token sequence to generate the decoded token sequence.
Optionally, screening out the target decoded token sequence using the word count range and the target domain vocabulary includes:
decoding the decoded token sequence to obtain the corresponding decoded word;
searching the target domain vocabulary for the decoded word; if it is absent, deleting the current token from the decoded token sequence, adding the token at the next position in the probability ranking to the decoded token sequence, and continuing with the subsequent steps; if it is present, appending the decoded token to the tail of the conditional token sequence;
and repeating the above steps until the conditional token sequence satisfies the word count range, thereby obtaining the target decoded token sequence.
Optionally, generating the conditional token sequence according to the theme parameter, the text type parameter, and the domain text attribute includes:
generating a conditional text according to the theme parameter, the text type parameter, and the domain text attribute;
and encoding the conditional text to obtain the conditional token sequence.
Optionally, generating the conditional text according to the theme parameter, the text type parameter, and the domain text attribute includes:
serializing the domain text attribute to obtain a domain text attribute processing result;
serializing the theme parameter and the text type parameter to obtain an annotation content processing result;
and concatenating, in order, the domain text attribute start identifier, the domain text attribute processing result, the start identifier of the theme parameter and the text type parameter, the annotation content processing result, and the to-be-generated text identifier to obtain the conditional text.
Optionally, the target domain text generation model is trained according to the following steps:
processing the training data set to obtain an input token sequence, a target token sequence, and a target identifier sequence;
batch-sampling the input token sequence, the target token sequence, and the target identifier sequence to obtain batch input token sequences, batch target token sequences, and batch target identifier sequences;
inputting the batch input token sequences into a pre-trained target domain text generation model to obtain the probability distribution of the predicted tokens;
calculating the cross entropy loss between the probability distribution of the predicted tokens and the target token sequence;
calculating the gradients of the target domain text generation model by minimizing the cross entropy loss and updating the weights;
and repeating the above steps, stopping training when the number of training iterations reaches a preset number of rounds.
Optionally, the training data set comprises material texts; processing the training data set to obtain the input token sequence, the target token sequence, and the target identifier sequence includes:
serializing and encoding the training data set to obtain a training condition token sequence;
constructing a 0-identifier sequence of the same length as the training condition token sequence;
concatenating, in order, the material text start identifier, the material text, and the material text end identifier to obtain a training generation text, and encoding it to obtain a training generation token sequence;
constructing a 1-identifier sequence of the same length as the training generation token sequence;
concatenating the training condition token sequence and the training generation token sequence to obtain the input token sequence, and concatenating the 0-identifier sequence and the 1-identifier sequence to obtain an input identifier sequence;
shifting the input token sequence left by one token position to obtain the target token sequence;
appending a 1 identifier to the tail of the input identifier sequence and shifting it left by one position to obtain the target identifier sequence;
and repeating the above steps until the training data set has been processed into input token sequences, target token sequences, and target identifier sequences.
Optionally, the training data set further includes the theme parameter, the text type parameter, and the domain text attribute corresponding to each material text; serializing and encoding the training data set to obtain the training condition token sequence includes:
serializing the domain text attribute corresponding to the material text to obtain a training domain text attribute processing result;
serializing the theme parameter and the text type parameter corresponding to the material text to obtain a training annotation content processing result;
concatenating, in order, the domain text attribute start identifier of the material text, the training domain text attribute processing result, the start identifier of the theme parameter and the text type parameter of the material text, and the training annotation content processing result to obtain a training condition text;
and encoding the training condition text to obtain the training condition token sequence.
Optionally, generating the target domain vocabulary according to the theme parameter and the text data set corresponding to the target domain includes:
extracting, according to the theme parameter, all words under the theme parameter from the text data set corresponding to the target domain as a first candidate vocabulary;
performing grammatical normalization on the candidate words in the first candidate vocabulary to obtain a second candidate vocabulary;
and merging the first candidate vocabulary and the second candidate vocabulary to obtain the target domain vocabulary.
According to a second aspect of the embodiments of the present application, a target domain text generation system is provided, the system including:
a request receiving module, configured to receive a target domain text generation request sent by user equipment, wherein the request carries a theme parameter, a text type parameter, a word count range, and a domain text attribute;
a target domain vocabulary module, configured to generate a target domain vocabulary according to the theme parameter and the text data set corresponding to the target domain;
a conditional token sequence module, configured to generate a conditional token sequence according to the theme parameter, the text type parameter, and the domain text attribute;
a decoded token sequence module, configured to invoke a target domain text generation model to obtain a decoded token sequence based on the conditional token sequence;
and a target domain text generation module, configured to screen out a target decoded token sequence using the word count range and the target domain vocabulary and decode the target decoded token sequence into the target domain text.
According to a third aspect of the embodiments of the present application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the method of the first aspect.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, having stored thereon computer-readable instructions executable by a processor to implement the method of the first aspect.
In summary, the embodiments of the present application provide a target domain text generation method and system: a target domain text generation request sent by user equipment is received, carrying a theme parameter, a text type parameter, a word count range, and a domain text attribute; a target domain vocabulary is generated according to the theme parameter and the text data set corresponding to the target domain; a conditional token sequence is generated according to the theme parameter, the text type parameter, and the domain text attribute; a target domain text generation model is invoked to obtain a decoded token sequence based on the conditional token sequence; and a target decoded token sequence is screened out using the word count range and the target domain vocabulary and decoded into the target domain text. Through the theme parameter and the text type parameter, the system generates text related to a particular domain and type; by constructing the target domain vocabulary and the conditional token sequence, a structured information basis is provided for text generation. Screening the decoded token sequence with the word count range and the target domain vocabulary controls the length and quality of the generated text and avoids content redundancy or deviation from the theme. By precisely controlling each aspect of the text, the end user obtains more satisfactory text content.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the scope of the invention.
FIG. 1 is a schematic diagram of a target domain text generation flow provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of overall logic of generating a target domain text according to an embodiment of the present application;
FIG. 3 is a block diagram of a target domain text generation system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that all directional indicators (such as up, down, left, right, front, and rear) in the embodiments of the present invention are merely used to explain the relative positional relationship, movement, etc. between the components in a particular posture (as shown in the drawings); if the particular posture changes, the directional indicator changes accordingly.
Furthermore, descriptions such as "first" and "second" are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless specifically defined otherwise.
In the present invention, unless specifically stated and limited otherwise, the terms "connected," "affixed," and the like are to be construed broadly, and for example, "affixed" may be a fixed connection, a removable connection, or an integral body; can be mechanically or electrically connected; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In addition, the technical solutions of the embodiments of the present invention may be combined with each other, but it is necessary to be based on the fact that those skilled in the art can implement the technical solutions, and when the technical solutions are contradictory or cannot be implemented, the combination of the technical solutions should be considered as not existing, and not falling within the scope of protection claimed by the present invention.
FIG. 1 shows a target domain text generation method provided by an embodiment of the present application, where the method includes:
Step 101: receiving a target domain text generation request sent by user equipment, wherein the request carries a theme parameter, a text type parameter, a word count range, and a domain text attribute;
Step 102: generating a target domain vocabulary according to the theme parameter and the text data set corresponding to the target domain;
Step 103: generating a conditional token sequence according to the theme parameter, the text type parameter, and the domain text attribute;
Step 104: invoking a target domain text generation model to obtain a decoded token sequence based on the conditional token sequence;
Step 105: screening out a target decoded token sequence using the word count range and the target domain vocabulary, and decoding the target decoded token sequence into the target domain text.
In a possible implementation of step 104, invoking the target domain text generation model to obtain the decoded token sequence based on the conditional token sequence includes:
invoking the target domain text generation model to calculate a set of probability values for the candidate tokens at the next sequence position of the conditional token sequence; sorting the probability values in the set from largest to smallest; and adding the token with the maximum probability value to the initialized token sequence to generate the decoded token sequence.
The model first receives the conditional token sequence and then computes a set of probability values for the candidate tokens at the next position in the sequence. This relies on the predictive capability of the language model: based on the preceding token sequence, the model predicts the most likely next token. After obtaining the set of candidate probability values, the model sorts them from largest to smallest to find the most likely candidate, i.e., the token with the highest probability value. Once sorting is complete, the model selects the token with the maximum probability value and appends it to the current decoded token sequence. This operation is repeated until the decoded token sequence reaches the desired length or another termination condition is met.
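To make the greedy decoding loop concrete, the following minimal Python/PyTorch sketch implements the procedure described above; the model interface, tensor shapes, and names are assumptions for illustration, not the patent's reference implementation:

```python
import torch

def greedy_decode(model, conditional_tokens, max_len, eos_id):
    """Greedy decoding sketch: repeatedly append the highest-probability token.

    Assumes `model(input_ids)` returns logits of shape [1, seq_len, vocab_size];
    this interface is illustrative only.
    """
    decoded = []                            # initialized (empty) decoded token sequence
    context = list(conditional_tokens)      # conditional token sequence as the prompt
    for _ in range(max_len):
        input_ids = torch.tensor([context + decoded])
        with torch.no_grad():
            logits = model(input_ids)
        probs = torch.softmax(logits[0, -1], dim=-1)     # probabilities for the next position
        ranked = torch.argsort(probs, descending=True)   # sort from largest to smallest
        next_id = int(ranked[0])                         # token with the maximum probability
        if next_id == eos_id:                            # termination condition
            break
        decoded.append(next_id)                          # extend the decoded token sequence
    return decoded
```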
In a possible implementation of step 105, screening out the target decoded token sequence using the word count range and the target domain vocabulary includes:
decoding the decoded token sequence to obtain the corresponding decoded word; searching the target domain vocabulary for the decoded word; if it is absent, deleting the current token from the decoded token sequence, adding the token at the next position in the probability ranking to the decoded token sequence, and continuing with the subsequent steps; if it is present, appending the decoded token to the tail of the conditional token sequence; and repeating the above steps until the conditional token sequence satisfies the word count range, thereby obtaining the target decoded token sequence.
If the decoded word exists in the target domain vocabulary, it meets the domain requirement and is appended to the tail of the conditional token sequence so that generation of the next token can continue. If the decoded word does not exist in the target domain vocabulary, it does not meet the domain requirement and must be adjusted: the non-conforming token is deleted from the current decoded token sequence, and the token with the next-largest probability value in the probability ranking is selected and added to the decoded token sequence. Decoding and screening then continue on the new decoded token sequence, and these steps are repeated until the conditional token sequence falls within the specified word count range. When the conditional token sequence reaches the required word count range, the screening process stops, and the resulting decoded token sequence is the target decoded token sequence. This ensures that the generated text satisfies the word count requirement and uses the professional vocabulary of the target domain, improving the quality and relevance of the text.
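A hedged sketch of this screening loop, building on the greedy decoder above, might look as follows; `rank_next_tokens` is a hypothetical helper returning candidate token ids sorted by descending probability, and the tokenizer interface is likewise assumed:

```python
def generate_with_whitelist(model, tokenizer, conditional_tokens,
                            domain_vocab, min_words, max_words):
    """Filtered decoding sketch: a candidate token is kept only if its decoded
    word is non-lexical (e.g. punctuation) or appears in the domain vocabulary."""
    decoded = []
    while True:
        words = len(tokenizer.decode(decoded).split())
        if min_words <= words <= max_words:
            break                          # word count range satisfied
        if words > max_words:
            decoded.pop()                  # overshoot: drop the last token
            break
        ranked = rank_next_tokens(model, conditional_tokens + decoded)  # hypothetical helper
        for token_id in ranked:            # walk the probability ranking, largest first
            word = tokenizer.decode([token_id]).strip().lower()
            if not word.isalpha() or word in domain_vocab:
                decoded.append(token_id)   # accept: append to the tail
                break                      # otherwise try the next-ranked token
        else:
            break                          # no whitelisted candidate; stop early
    return decoded                         # target decoded token sequence
```

In practice the stopping test would also check for a sentence-final token so the output does not end mid-sentence.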
In a possible implementation of step 103, generating the conditional token sequence according to the theme parameter, the text type parameter, and the domain text attribute includes:
generating a conditional text according to the theme parameter, the text type parameter, and the domain text attribute; and encoding the conditional text to obtain the conditional token sequence.
In a possible implementation of step 103, generating the conditional text according to the theme parameter, the text type parameter, and the domain text attribute includes:
serializing the domain text attribute to obtain a domain text attribute processing result; serializing the theme parameter and the text type parameter to obtain an annotation content processing result; and concatenating, in order, the domain text attribute start identifier, the domain text attribute processing result, the start identifier of the theme parameter and the text type parameter, the annotation content processing result, and the to-be-generated text identifier to obtain the conditional text.
Serializing the domain text attribute converts the domain-related information into a format the model can understand and process, yielding the domain text attribute processing result. Serializing the theme parameter and the text type parameter yields the annotation content processing result. The processed domain text attribute, theme parameter, and text type parameter are concatenated in a fixed order to form the conditional text. The concatenated conditional text is then encoded, i.e., converted into a token sequence the model can understand; this encoding may involve tokenization and mapping tokens to ids or word vectors. The encoded conditional token sequence is used as input when invoking the target domain text generation model to generate the decoded token sequence.
In this way, key information such as domain knowledge, theme, and text type is effectively integrated into the generation process, so that text meeting the user's requirements is produced. At the same time, the serialization and concatenation steps ensure that the model takes all relevant conditions into account when generating text.
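As an illustrative sketch of the serialization and concatenation just described (the identifier strings and field names are placeholders, not the patent's actual markers):

```python
def build_conditional_text(domain_attrs: dict, theme: str, text_type: str) -> str:
    """Serialize the domain text attributes and the annotation content, then
    concatenate them with start identifiers and a to-be-generated marker."""
    attr_part = "; ".join(f"{k}: {v}" for k, v in domain_attrs.items())  # attribute serialization
    annotation_part = f"theme: {theme}; type: {text_type}"               # annotation serialization
    return ("<attr_start>" + attr_part +
            "<annotation_start>" + annotation_part +
            "<text_to_generate>")

# Example: conditional text for a computer-science narration; the result is
# then encoded by the tokenizer into the conditional token sequence.
cond_text = build_conditional_text(
    {"vocabulary": "algorithm, data structure", "grammar": "precise terminology"},
    theme="computer science", text_type="narration")
```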
In one possible implementation, the target domain text generation model is trained according to the following steps:
processing the training data set to obtain an input token sequence, a target token sequence, and a target identifier sequence; batch-sampling these sequences to obtain batch input token sequences, batch target token sequences, and batch target identifier sequences; inputting the batch input token sequences into a pre-trained target domain text generation model to obtain the probability distribution of the predicted tokens; calculating the cross entropy loss between the probability distribution of the predicted tokens and the target token sequence; calculating the model's gradients by minimizing the cross entropy loss and updating the weights; and repeating these steps until the number of training iterations reaches a preset number of rounds.
In this way, the information in the training data, including text content, themes, types, and attributes, is fully exploited to improve the relevance and accuracy of text generation. At the same time, through cross entropy loss and gradient descent optimization, the model learns to generate high-quality text.
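A minimal training-loop sketch consistent with these steps is given below; the data loader and batching are assumptions, and the 0/1 target identifier sequence is used here as a loss mask over the generated part:

```python
import torch
import torch.nn.functional as F

def train_model(model, loader, num_epochs, lr=3e-5):
    """Training sketch: masked next-token cross entropy with gradient descent.
    Each batch carries input ids, left-shifted target ids, and a 0/1 mask."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(num_epochs):                               # preset number of rounds
        for input_ids, target_ids, target_mask in loader:     # batch sampling
            logits = model(input_ids)                         # [batch, len, vocab]
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   target_ids.reshape(-1), reduction="none")
            mask = target_mask.reshape(-1).float()
            loss = (loss * mask).sum() / mask.sum()           # only the generated part counts
            optimizer.zero_grad()
            loss.backward()                                   # gradients of the model
            optimizer.step()                                  # weight update
```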
In one possible embodiment, the training data set comprises material texts; processing the training data set to obtain the input token sequence, the target token sequence, and the target identifier sequence includes:
serializing and encoding the training data set to obtain a training condition token sequence; constructing a 0-identifier sequence of the same length as the training condition token sequence; concatenating, in order, the material text start identifier, the material text, and the material text end identifier to obtain a training generation text, and encoding it to obtain a training generation token sequence; constructing a 1-identifier sequence of the same length as the training generation token sequence; concatenating the training condition token sequence and the training generation token sequence to obtain the input token sequence, and concatenating the 0-identifier sequence and the 1-identifier sequence to obtain an input identifier sequence; shifting the input token sequence left by one token position to obtain the target token sequence; appending a 1 identifier to the tail of the input identifier sequence and shifting it left by one position to obtain the target identifier sequence; and repeating the above steps until the training data set has been processed into input token sequences, target token sequences, and target identifier sequences.
Each material text in the training data set is serialized, i.e., broken into token-level units such as characters or subwords. The serialized tokens are then encoded into a format the model can understand, such as integer indices or word vectors, yielding the training condition token sequence. A 0-identifier sequence of the same length as this token sequence is created to distinguish the different parts of the sequence (or to serve as padding).
The material text is concatenated in a fixed order (start identifier, material text, end identifier) to form the training generation text, which is encoded to obtain the training generation token sequence. A 1-identifier sequence of the same length is created to mark this generated part of the sequence.
The training condition token sequence and the training generation token sequence are concatenated to form the input token sequence, and the 0-identifier sequence and the 1-identifier sequence are concatenated to form the input identifier sequence. The input token sequence is shifted left by one token position, i.e., each token moves one position to the left, yielding the target token sequence used to train the sequence generation model to predict the next token. A 1 identifier is appended to the tail of the input identifier sequence, which is then shifted left by one position to obtain the target identifier sequence. These steps are repeated until all material texts in the training data set have been processed into input token sequences, target token sequences, and target identifier sequences. Using these processed sequences as inputs and targets, the target domain text generation model, typically a sequence generation model such as a recurrent neural network (RNN) or a Transformer, is trained.
The above steps provide structured and labeled training data for the model, helping the model learn how to generate text meeting certain conditions.
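The construction of one training example can be sketched as follows; the identifier ids and list-based representation are illustrative:

```python
def build_training_example(cond_ids, material_ids, bos_id, eos_id):
    """Build input/target token sequences and 0/1 identifier sequences."""
    gen_ids = [bos_id] + material_ids + [eos_id]          # start id + material + end id
    input_ids = cond_ids + gen_ids                        # spliced input token sequence
    seg_ids = [0] * len(cond_ids) + [1] * len(gen_ids)    # 0s: condition, 1s: generation

    target_ids = input_ids[1:]            # left offset by one token position
    target_seg = (seg_ids + [1])[1:]      # append a 1 id at the tail, then shift left
    # The model is fed input_ids[:-1]; position i is trained to predict
    # target_ids[i], with target_seg[i] masking out the conditional part.
    return input_ids[:-1], seg_ids[:-1], target_ids, target_seg[:-1]
```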
In a possible implementation, the training data set further includes the theme parameter, the text type parameter, and the domain text attribute corresponding to each material text; serializing and encoding the training data set to obtain the training condition token sequence includes:
serializing the domain text attribute corresponding to the material text to obtain a training domain text attribute processing result; serializing the theme parameter and the text type parameter corresponding to the material text to obtain a training annotation content processing result; concatenating, in order, the domain text attribute start identifier of the material text, the training domain text attribute processing result, the start identifier of the theme parameter and the text type parameter of the material text, and the training annotation content processing result to obtain a training condition text; and encoding the training condition text to obtain the training condition token sequence.
In a possible implementation of step 102, generating the target domain vocabulary according to the theme parameter and the text data set corresponding to the target domain includes:
extracting, according to the theme parameter, all words under the theme parameter from the text data set corresponding to the target domain as a first candidate vocabulary; performing grammatical normalization on the candidate words in the first candidate vocabulary to obtain a second candidate vocabulary; and merging the first candidate vocabulary and the second candidate vocabulary to obtain the target domain vocabulary.
According to the provided theme parameter, all words related to the theme are retrieved from the text data set corresponding to the target domain. These words form a preliminary candidate vocabulary of terms directly related to the specific theme. Each candidate word in the first candidate vocabulary then undergoes grammatical normalization, which may include stemming, lemmatization, or part-of-speech tagging, converting the words into standard forms for unified processing and analysis. The normalized words form the second candidate vocabulary; this helps ensure that the words are grammatically well-formed and can be used more effectively by the text generation model. Finally, the first and second candidate vocabularies are merged into the final target domain vocabulary; this merging may involve deduplication, sorting, or other processing to ensure the quality and consistency of the vocabulary.
The resulting target domain vocabulary is used in the text generation task, as mentioned in step 105, to filter the decoded token sequence, ensuring that the generated text is lexically consistent with the target domain. The generated text is thus tightly bound to the specific domain at the lexical level, improving its relevance and professionalism, while grammatical normalization improves the accuracy and usability of the vocabulary.
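A sketch of the vocabulary construction for English text follows; the corpus structure is assumed, and NLTK's WordNet lemmatizer is one possible normalizer (it requires the WordNet data to be downloaded):

```python
from nltk.stem import WordNetLemmatizer  # pip install nltk; nltk.download('wordnet')

def build_domain_vocab(corpus_by_theme: dict, theme: str) -> set:
    """Extract all words under the theme, normalize them, and merge the lists."""
    first = {w.lower() for text in corpus_by_theme[theme]   # first candidate vocabulary
             for w in text.split() if w.isalpha()}
    lemmatizer = WordNetLemmatizer()
    second = {lemmatizer.lemmatize(w) for w in first}       # grammatical normalization
    return first | second    # the union also removes duplicates
```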
The overall flow of the target domain text generation method provided by the embodiment of the present application is described in detail below with reference to FIG. 2. It uses machine learning, in particular sequence generation models from natural language processing, to generate target text in the domain specified by the user.
The first stage: determining the training data set based on the annotation content specified by the user.
Step 1: collect the material texts corresponding to each target domain in the database.
Step 2: obtain the annotation content of the material texts, imported or set by the user. The annotation content may be added to the material text by experts in the target domain, and it includes the theme parameter, the organization mode and its attributes, the difficulty level, the word count range, and the applicable domain text attributes.
I. Theme parameter: the subject matter addressed by the text in the target domain material.
II. Text type parameter: includes dialogue, narration, role playing, etc.; the specific subordinate attributes of each text type parameter include:
a. When the text type parameter is the dialogue organization mode, the attributes include: the number of dialogue turns, the gender and names of the dialogue characters, etc.
b. When the text type parameter is the narrative organization mode, the attributes include: letters, news, stories, biographies, etc.
c. When the text type parameter is the role playing organization mode, the attributes include: context type, role type, rule type, etc.
III. Difficulty level: the difficulty of the material text.
IV. Word count range: the range of the number of words in the material text.
V. Domain text attribute: the text attributes of the domain for which the material is suitable.
Step 3: add the corresponding domain text attribute information according to the domain text attribute in the annotation content. The domain text attribute refers to the language elements used in a specific domain, including vocabulary, phrases, sentence patterns, and grammatical structures, which are the basic components of domain text. For example, in the field of computer science: vocabulary: "algorithm", "programming", "data structure"; phrases: "user interface", "cloud computing"; sentence patterns: "To optimize performance, ... is employed"; grammar: use of precise terminology and descriptions, e.g., "the time complexity of the algorithm is O(n)".
Step 4: combine the material text, the annotation content, and the domain text attributes to form the training data.
The second stage: training the target domain text generation model based on the training data set.
A pre-trained target domain text generation model is designed, comprising an embedding layer, a coding layer, and a language layer, using a Transformer architecture with an autoregressive mechanism. Specifically:
a. Embedding layer: consists of a word embedding layer and a position embedding layer. The word embedding layer receives the token sequence and outputs a word embedding vector sequence; the position embedding layer receives the position sequence and outputs a position vector sequence. The position sequence (positional encoding) provides the model with information about the position of each token in the sentence: since the model otherwise has no notion of order, a method is needed to tell it the relative or absolute position of each token in the sequence. The position sequence corresponds to the token sequence and has the same length; during computation, the model constructs a position sequence of the same length as the token sequence, encoded as [0, 1, 2, 3, 4, …, n]. The position encoding is typically a sequence of vectors of the same length as the token sequence, one vector per position; these vectors are predefined, usually generated by mathematical functions, so that the encoding vector at each position is unique. The word embedding vectors (the embedded representation of the token sequence) and the position encoding vectors are added (or concatenated) to form the final representation of each token, which is fed into the coding layer of the Transformer model.
b. Coding layer: composed of multiple stacked Transformer layers with an autoregressive (causal) mechanism. The coding layer receives the word embedding vector sequence and the position vector sequence and outputs a semantic hidden vector sequence; taking the position-enhanced word embedding vectors as input, it processes them through self-attention mechanisms and feed-forward networks to extract the deep semantic information of the sequence.
c. Language layer: composed of a fully connected layer that receives the semantic hidden vector sequence output by the coding layer and, after the fully connected computation, produces the probability distribution over tokens via a Softmax function. The output of the coding layer is fed into the language layer, typically a fully connected layer, to predict the probability distribution of the next token.
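The three-layer structure can be sketched in PyTorch as follows; the dimensions, context length, and the use of nn.TransformerEncoder with a causal mask are assumptions standing in for the full model:

```python
import torch
import torch.nn as nn

class DomainTextLM(nn.Module):
    """Sketch: word + position embeddings, causally masked Transformer layers
    (the coding layer), and a fully connected language layer."""
    def __init__(self, vocab_size, d_model=768, n_layers=12, n_heads=12, max_len=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # word embedding layer
        self.pos_emb = nn.Embedding(max_len, d_model)      # position embedding layer
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.coding = nn.TransformerEncoder(layer, n_layers)   # coding layer
        self.language = nn.Linear(d_model, vocab_size)         # language layer

    def forward(self, input_ids):
        n = input_ids.size(1)
        pos = torch.arange(n, device=input_ids.device)     # position sequence [0, 1, ..., n-1]
        x = self.tok_emb(input_ids) + self.pos_emb(pos)    # add word and position vectors
        mask = nn.Transformer.generate_square_subsequent_mask(n).to(input_ids.device)
        h = self.coding(x, mask=mask)                      # semantic hidden vector sequence
        return self.language(h)   # logits; softmax over the last dim gives token probabilities
```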
Based on the above structure, training the target domain text generation model on the training data includes the following steps:
Step 1: process the training data to obtain the token sequence set, specifically:
i. Traverse the training data; for the training data under each document, serialize the domain text attribute and the annotation content respectively to obtain a domain text attribute serialization result and an annotation content serialization result.
ii. Concatenate, in order, the domain text attribute start identifier, the domain text attribute serialization result, the annotation content start identifier, and the annotation content serialization result to form the training condition text.
iii. Tokenize and encode the training condition text with a tokenizer to obtain the training condition token sequence.
iv. Construct a 0-identifier sequence of the same length as the training condition token sequence.
v. Concatenate, in order, the material text identifier, the material text, and the end identifier to obtain the training generation text.
vi. Tokenize and encode the training generation text with the tokenizer to obtain the training generation token sequence.
vii. Construct a 1-identifier sequence of the same length as the training generation token sequence.
viii. Concatenate the training condition token sequence and the training generation token sequence to obtain the input token sequence; concatenate the 0-identifier sequence and the 1-identifier sequence to obtain the input identifier sequence.
ix. Shift the input token sequence left by one position to obtain the target token sequence. Append a 1 identifier to the tail of the input identifier sequence and shift it left by one position to obtain the target identifier sequence.
x. Repeat the above steps until every document in the training data, i.e., all the training data, has been encoded into input token sequences, target token sequences, and target identifier sequences.
In the sequence generation task, the model needs to distinguish which tokens belong to the conditional text (the input part) and which belong to the generated text (the part the model must produce). By constructing the 0-identifier and 1-identifier sequences, the model can identify which tokens belong to each part; with these distinct identifier sequences, the model can be directed to attend to the conditional text while generating, and the training loss can be restricted to the generated part.
Step 2: train the target domain text generation model using a gradient descent algorithm, specifically including:
i. Perform batch sampling without replacement on the input token sequences, target token sequences, and target identifier sequences in the training set to obtain batch input token sequences, batch target token sequences, and batch target identifier sequences.
ii. Input the batch input token sequences into the target domain text generation model to obtain the probability distribution of the predicted tokens.
iii. Calculate the cross entropy loss between the probability distribution of the predicted tokens and the target sequence according to the following formula:

Loss = -\frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \sum_{c \in C} \mathbb{1}[y_i = c] \log p_{i,c}

where p denotes the probability distribution of the predicted tokens, y is the target sequence, w is the batch target identifier sequence (serving as a loss mask), and N is the batch target token sequence length; c and C are the current token and the set of all tokens in the vocabulary, respectively.
iv. Minimize the cross entropy loss through the optimizer, compute the gradients of the model, and complete the weight update.
v. Repeat the above steps until all the data has been sampled.
vi. Train until the preset number of rounds is reached.
Regarding the training details, the following settings are used: for the optimizer, AdamW is selected, with the learning rate set to 3e-5, the weight decay coefficient set to 0.01, beta1 = 0.9, and beta2 = 0.95; for the learning rate schedule, a linear decay strategy is selected with a decay coefficient of 0.1 and 100 warmup steps. Training runs for 2000 steps with the batch size set to 8, on 4 A800 GPUs with ZeRO-2 parallelism.
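The optimizer and schedule stated above can be expressed as the following configuration sketch; the model object is assumed, the decay-to-0.1 reading of the linear schedule is an assumption, and ZeRO-2 sharding across the 4 GPUs would be configured in the distributed training framework (e.g., DeepSpeed), outside this snippet:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5,
                              betas=(0.9, 0.95), weight_decay=0.01)

warmup_steps, total_steps, final_factor = 100, 2000, 0.1

def lr_factor(step: int) -> float:
    """Linear warmup over 100 steps, then linear decay toward 0.1 of the peak."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 1.0 - (1.0 - final_factor) * min(1.0, progress)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
# Per the text: 2000 training steps, batch size 8; call scheduler.step()
# after each optimizer.step().
```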
The third stage: responding to the user's target domain text generation request and generating the conditional token sequence according to the annotation content and the domain text attribute.
Step 1: obtain the whitelist vocabulary, i.e., determine the list of words such that only those words are used in the generated material.
For English text, the whitelist vocabulary in the embodiment of the present application is obtained as follows: extract the words appearing in each domain from the target domain text database, organized by domain, as candidate words; lemmatize the candidate words to obtain second candidate words; and merge the candidate words and the second candidate words to obtain the whitelist vocabulary.
For Chinese text processing, Chinese has no morphological inflection of the kind English has, so Chinese "lemmatization" generally means converting a word into its standard or basic form rather than reducing inflected forms. For Chinese text, the whitelist vocabulary in the embodiment of the present application is obtained as follows: extract the words appearing in each domain from the target domain text database, organized by domain, as candidate words; normalize the candidate words to obtain second candidate words; and merge the candidate words and the second candidate words to obtain the whitelist vocabulary.
The normalization process may include the following steps. Vocabulary standardization: ensure that standard word forms are used, e.g., avoid miswritten characters and non-standard abbreviations. Dialect removal: if the material is intended for a broad range of Chinese users, dialect words specific to a region may need to be removed or replaced with Mandarin vocabulary. Word sense reduction: in some cases, ambiguous words may need to be mapped to their most common meaning to avoid ambiguity. Synonym substitution: where text difficulty must be controlled or a particular reader population accommodated, uncommon or hard-to-understand words may be replaced with more common synonyms. Removal or replacement of rare words: for educational materials, especially materials for beginners, rare or unusual words may need to be removed or replaced. Grammar conformance: ensure that the text conforms to Chinese grammatical norms, including correct word order, collocation, and sentence structure.
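An illustrative normalization pass for Chinese candidate words follows; every mapping table here is a hypothetical stand-in for curated linguistic resources:

```python
# Hypothetical lookup tables; real systems would use curated resources.
DIALECT_TO_MANDARIN = {"晓得": "知道"}     # dialect word -> Mandarin equivalent
SYNONYM_SIMPLIFY = {"迅捷": "快速"}        # uncommon word -> common synonym
RARE_WORDS = {"饕餮"}                      # rare words to drop for beginners

def normalize_candidates(words):
    """Apply the normalization steps described above to a set of candidate words."""
    normalized = set()
    for w in words:
        w = DIALECT_TO_MANDARIN.get(w, w)  # dialect removal
        w = SYNONYM_SIMPLIFY.get(w, w)     # synonym substitution
        if w in RARE_WORDS:                # remove rare words
            continue
        normalized.add(w)
    return normalized
```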
Step 2: obtain the user-set domain text attribute and annotation content as the generation conditions.
Step 3: serialize the domain text attribute and the annotation content to form the conditional text, and encode it with the tokenizer to obtain the conditional token sequence.
The initial conditional token sequence is computed from the given conditions as follows: a. serialize the domain text attribute; b. serialize the annotation content; c. concatenate the domain text attribute start identifier, the domain text attribute serialization result, the annotation content start identifier, the annotation content serialization result, and the to-be-generated text identifier to obtain the conditional text; d. encode the conditional text with the tokenizer to obtain the conditional token sequence.
Step 4: initialize the decoded token sequence, i.e., the decoded sequence is empty when generation starts.
The fourth stage: generating the decoded token sequence based on the target domain text generation model.
Step 1: use the trained target domain text generation model to predict the next token in the text generation process.
A: input the conditional token sequence: the prepared conditional token sequence is fed into the trained target domain text generation model.
B: predict token probabilities: based on the input conditional token sequence, the model predicts the possible next tokens and calculates a probability value for each; this probability reflects how likely the model considers each token to be the next one.
C: sort the tokens: all possible tokens are sorted from high to low by probability value.
D: select the highest-probability token: the token with the highest probability value is selected as the predicted next token.
E: add to the decoded sequence: the highest-probability token is appended to the decoded token sequence.
The trained model thus predicts the next token, and the token with the highest probability is selected: the conditional token sequence is input into the target domain text generation model to predict the next-token probabilities, where the next-token probability distribution is the one represented by the semantic hidden vector at the last position; the tokens are sorted by probability from largest to smallest, and the current token with the maximum probability is appended to the decoded token sequence. In this way, the model builds up a continuous text sequence step by step.
Step 2: decode the decoded token sequence to generate a temporary text.
The fifth stage: screening out the target decoded token sequence using the annotation content and the whitelist vocabulary, and decoding the target decoded token sequence into the target domain text.
Step 1: traverse the words in the temporary text to determine whether each exists in the whitelist. If the currently predicted word is not on the whitelist, remove the current token from the decoded token sequence, return the generation process to the previous step, and predict again the token corresponding to the next probability value until a word on the whitelist is found. If it is present, continue.
Step 2: replace the to-be-generated text identifier at the tail of the conditional token sequence with the current token.
Step 3: repeat the above steps until the temporary text obtained by decoding the decoded token sequence satisfies the annotation conditions or reaches the preset length, yielding the target domain material text.
Screening through the whitelist ensures that all words are within the allowed range. Before a predicted token is added to the decoded token sequence, it must be checked whether the token is present in the whitelist. If the word is on the whitelist, indicating that it appears in the textbook, it is accepted and added to the decoding sequence. If the predicted word is not on the whitelist, i.e., it is not a textbook word, it is not added to the decoding sequence. This ensures that the generated text contains only words the student should be familiar with, thereby guaranteeing the suitability of the material.
The embodiment of the present application covers data preprocessing, model training, condition setting, and final text generation and screening. The method can be applied to various fields such as news generation, story creation, listening material generation, and technical document writing.
English listening materials are resources produced in English for listening training or teaching. In design, their content should be limited to what the student can understand; for example, listening materials for the primary school stage should match the vocabulary, grammar, and life experience that pupils have learned. The technical solution is explained below with reference to the production of an actual listening material, in three parts: data collection, model training, and target material generation.
The first stage: collecting the training data.
Step 1: collect the material texts from existing English listening test questions, for example, English listening test question materials matched to the compulsory education curriculum.
Step 2: English listening test-setting experts annotate the material texts; the annotation content includes the theme, the organization mode and its attributes, the difficulty level, the word count range, and the applicable teaching material unit.
I. Theme: the subject matter addressed by the text in the English listening material.
II. Organization mode and its attributes: the organization modes include dialogue, narration, role playing, etc., and the specific subordinate attributes include: a. when the material is in the dialogue organization mode, the attributes include the number of dialogue turns and the gender and names of the dialogue characters, etc.; b. when the material is in the narrative organization mode, the attributes include letters, news, stories, biographies, etc.; c. when the material is in the role playing organization mode, the attributes include context type, role type, rule type, etc.
III. Difficulty level: the difficulty of the listening material; in the embodiment of the present application, difficulty is divided into three levels: difficult, medium, and easy.
IV. Word count range: the range of the number of words in the listening material; in the embodiment of the present application, it may be divided into "0-50", "50-100", "100-150", "150-200", "200-250", "250-300", "300-350", "350-400", and "400-500".
V. Teaching material unit: the textbook and unit for which the material is suitable, e.g., Grade 7 first-volume Unit 1 or Grade 8 first-volume Unit 7.
Step 3: Add the information of the corresponding teaching material unit according to the teaching material unit in the annotation content. The information includes elements such as words, phrases, sentence patterns, grammar, and objectives.
The related teaching material unit information is associated with the material according to the annotation content. The unit information in this example specifically includes the following elements: vocabulary, the new words learned in the unit; phrases, the important English phrases learned in the unit; sentence patterns, the important English sentence patterns learned in the unit; grammar, the grammar knowledge learned in the unit; objectives, the teaching objectives covered in the unit.
Step 4: Combine the material text, the annotation content, and the teaching material unit information into training data.
The second stage: training the target field text generation model based on the training data.
Baichuan-7B is selected as the architecture of the target field text generation model, and the Baichuan-7B weights are used as initialization parameters. The pre-trained target field text generation model comprises an embedding layer, a coding layer, and a language layer, and uses a Transformer architecture with an autoregressive mechanism. The structure is specifically as follows:
A. Embedding layer: composed of a word embedding layer and a position embedding layer. The word embedding layer receives the word order sequence and outputs a word embedding vector sequence; the position embedding layer receives the position sequence and outputs a position vector sequence.
B. Coding layer: composed of multiple Transformer layers with an autoregressive (causal) attention mechanism. The coding layer receives the word embedding vector sequence and the position vector sequence and outputs a semantic hidden vector sequence.
C. Language layer: composed of a fully connected layer. It receives the semantic hidden vector sequence output by the coding layer and, after the fully connected computation, obtains the probability distribution of each word order through a Softmax function.
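As an illustration, the pre-trained weights can be loaded with the Hugging Face transformers library; the checkpoint id below is the public Baichuan-7B repository and is an assumption of this sketch, not something stated in the patent.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize the target field text generation model from public Baichuan-7B
# weights (embedding layer + Transformer coding layers + language layer).
tokenizer = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan-7B", trust_remote_code=True)
```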
Training the target field text generation model comprises the following steps:
Step 1: Process the training data to obtain a word order sequence set, specifically:
i. Traverse the training data, serialize the teaching material unit information and the annotation content respectively, and splice the text in the order: teaching material unit information start identifier, teaching material unit information serialization result, annotation content start identifier, annotation content serialization result, to obtain the training condition text.
ii. Tokenize the training condition text with the tokenizer to obtain the condition word order sequence. In this embodiment, <reserved_100> is selected as the start identifier of the field text attribute and <reserved_101> as the start identifier of the annotation content; the condition text is encoded with the Baichuan-7B tokenizer to obtain the condition word order sequence.
iii. Construct a 0 identification sequence with the same length as the condition word order sequence.
iv. Splice the text in the order: material text identifier, material text, end identifier, to obtain the generated text. In this embodiment, <reserved_105> is used as the material text identifier and </s> as the end identifier.
v. Tokenize the generated text with the tokenizer to obtain the generated word order sequence.
vi. Construct a 1 identification sequence with the same length as the generated word order sequence.
vii. Concatenate the condition word order sequence and the generated word order sequence to obtain the input word order sequence, and concatenate the 0 identification sequence and the 1 identification sequence to obtain the input identification sequence.
viii. Offset the input word order sequence one position to the left to obtain the target word order sequence.
ix. Append a 1 identifier to the tail of the input identification sequence and offset it one position to the left to obtain the target identification word order sequence.
x. Repeat the above steps until all training data are encoded into input word order sequences, target word order sequences, and target identification word order sequences.
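A minimal Python sketch of steps i-x follows, assuming a Hugging Face-style tokenizer in which the reserved markers are registered as special tokens; the trailing next-token formulation is an equivalent rendering of the literal append-and-shift of steps viii-ix.

```python
def build_sequences(tokenizer, unit_info_text, annotation_text, material_text):
    # Steps i-ii: splice and encode the condition text.
    cond_text = ("<reserved_100>" + unit_info_text +
                 "<reserved_101>" + annotation_text)
    cond_ids = tokenizer.encode(cond_text, add_special_tokens=False)
    cond_mask = [0] * len(cond_ids)                # step iii: 0 identification sequence

    # Steps iv-v: splice and encode the generated text.
    gen_text = "<reserved_105>" + material_text + "</s>"
    gen_ids = tokenizer.encode(gen_text, add_special_tokens=False)
    gen_mask = [1] * len(gen_ids)                  # step vi: 1 identification sequence

    input_ids = cond_ids + gen_ids                 # step vii: input word order sequence
    loss_mask = cond_mask + gen_mask               # step vii: input identification sequence

    # Steps viii-ix amount to next-token prediction: position i is scored
    # against token i+1, and only generated-text positions carry loss weight.
    target_ids = input_ids[1:]
    target_mask = loss_mask[1:]
    return input_ids[:-1], target_ids, target_mask
```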
Step 2: training a target field text generation model by using a gradient descent algorithm, wherein the method specifically comprises the following steps of:
i. And carrying out unreplaced batch sampling on the training set to obtain a batch command sequence, a batch target command sequence and a batch target identification command sequence.
And ii, inputting the batch word order sequence into a text generation model in the target field to obtain the probability distribution of the predicted word order.
And calculating the cross entropy loss between the probability distribution of the predicted word and the target sequence according to the following formula.
P represents the probability distribution of the predicted word order, y is the target word order sequence, w is the batch target identification word order sequence, and N is the batch target word order sequence length. C and C are the current vocabulary, all the vocabularies in the vocabulary, respectively.
And iv, minimizing cross entropy loss through an optimizer, calculating gradient of the estimated model and finishing weight updating.
And v, repeating the steps a-d until all training data are sampled.
Repeating the steps a-e until the preset number of rounds is reached, and stopping training.
In the training aspect, the training is performed by adopting the method described in the method, and on training details, the following settings are performed: in terms of an optimizer, adamw is selected, the learning rate is set to 3e-5, the weight attenuation coefficient is set to 0.01, beta 1 is 0.9, and beta 2 is 0.95. In the aspect of a learning rate scheduling strategy, a linear attenuation strategy with preheating is selected, the attenuation coefficient is 0.1, and the preheating is performed in 100 steps. In the embodiment of the application, 2000 steps are performed in total, the batch size is set to 8, 4 cards A800 are selected for training, and zero-2 is selected in a parallel mode.
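Under these settings, a single training step might look as follows in PyTorch. The masked cross entropy mirrors the loss formula above (here averaged over the masked positions, which differs from the 1/N normalization only by a constant factor), and get_linear_schedule_with_warmup is used as an approximation of the warmup-plus-linear-decay schedule (it decays to zero rather than to a 0.1 floor); multi-GPU ZeRO-2 sharding is omitted from this sketch.

```python
import torch
import torch.nn.functional as F
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5,
                              betas=(0.9, 0.95), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=2000)

def training_step(batch):
    # batch tensors, each of shape [B, T]: input_ids, target_ids, target_mask
    logits = model(batch["input_ids"]).logits          # [B, T, |C|]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["target_ids"].reshape(-1),
        reduction="none")
    w = batch["target_mask"].reshape(-1).float()
    loss = (per_token * w).sum() / w.sum()             # masked cross entropy
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```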
The third stage: input the user's text generation requirements into the generation model to generate the English listening material.
Step 1: Obtain a whitelist vocabulary, i.e., determine a word list ensuring that only those words are used in the generated material.
The whitelist vocabulary in this embodiment is obtained as follows: words appearing in each unit are extracted from the People's Education Press textbook, organized by unit, as candidate words; part-of-speech reduction is performed on the candidate words to obtain second candidate words; the candidate words and the second candidate words together form the whitelist vocabulary, and only these words are used in the generated material.
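A sketch of the whitelist construction follows, reading "part-of-speech reduction" as lemmatization and using NLTK's WordNet lemmatizer as one possible implementation; both choices are assumptions of this sketch, not the patent's stated tooling.

```python
from nltk.stem import WordNetLemmatizer  # assumes nltk and its wordnet data are installed

def build_whitelist(unit_words):
    """Step 1 sketch: textbook words per unit plus their lemmatized forms."""
    lemmatizer = WordNetLemmatizer()
    candidates = {w.lower() for unit in unit_words for w in unit}
    second = {lemmatizer.lemmatize(w, pos=p)       # second candidate words
              for w in candidates for p in ("n", "v", "a")}
    return candidates | second                     # union forms the whitelist
```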
Step 2: the teaching material unit information and the annotation content are provided as the generation conditions.
At the time of material generation, the user is required to provide teaching material unit information and annotation content. In this example, seven-grade booklet 5 units are taken as an example.
Step 3: serializing the teaching material unit information and the annotation content to form a conditional text, and encoding by using a word segmentation device to obtain a conditional word order sequence.
In the embodiment of the application, the condition text is spliced by the teaching material unit information initial identification, the teaching material unit information serialization result, the annotation content initial identification, the annotation content serialization result and the material text identification. And further encoding the text in the last step by a word segmentation device to obtain a conditional word order sequence.
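Concretely, the inference-time condition might be assembled as follows; serialize stands in for the unspecified serialization routine and is hypothetical.

```python
# Steps 3-4: build the condition word order sequence and the empty decoded sequence.
cond_text = ("<reserved_100>" + serialize(unit_info) +   # serialize() is hypothetical
             "<reserved_101>" + serialize(annotation) +
             "<reserved_105>")                           # ends with the material text marker
cond_ids = tokenizer.encode(cond_text, add_special_tokens=False)
decoded_ids = []                                         # step 4: empty at the start
```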
Step 4: initializing a decoded word order sequence. The decoding sequence is null at the beginning of generation.
Step 5: the trained target domain text generation model is used to predict the next token probability in the text generation process. The target field text generation model after the conditional word order input training is completed is provided, and the next word order probability refers to the word order probability represented by the semantic hidden vector of the last position.
A: inputting a conditional command sequence: the prepared conditional word order sequence is input into a trained target domain text generation model.
B: predicting word order probability: the model predicts the next possible word order based on the entered conditional word order sequence and calculates a probability value for each possible word order. This probability reflects the likelihood that the model considers each vocabulary to be the next vocabulary.
C: ordering the word order: all possible tokens are ordered from high to low according to their probability values.
D: selecting the highest probability word order: and selecting the word order with the highest probability as the predicted next word order. In this example, it is assumed that the lexical number 10287 has the highest probability.
E: added to the decoded sequence: the vocabulary with the highest probability is added to the decoded vocabulary sequence. The decoded sequence may be empty initially and the sequence becomes [10287] after addition.
And predicting the next word order by using the trained model, and selecting the word order with the highest probability. And inputting the conditional token sequence into a target field text generation model to predict the probability of the next token. The next lexical probability refers to the lexical probability represented by the semantic hidden vector of the last position. And sorting the vocabularies from large to small according to the vocabularies probability, and adding the current vocabularies with the maximum probability into the decoded vocabularies sequence. In this way, the model can build up successive text sequences step by step.
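Steps A-C above can be condensed into a helper that returns the ranked word orders and their probabilities from the last position; this is a sketch assuming a Hugging Face-style causal language model.

```python
import torch

@torch.no_grad()
def next_token_distribution(ids):
    # Run the model over the current sequence; the next word order
    # probability is read from the last position's hidden state.
    logits = model(torch.tensor([ids])).logits      # [1, T, |C|]
    probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the vocabulary
    ranked = torch.argsort(probs, descending=True)  # step C: sort high to low
    return ranked, probs
```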
Step 6: generating and screening: and decoding the decoded command sequence to generate a temporary text. The example decodes the decoded word order sequence [10287] to obtain the temporary text as "have".
Before adding a predicted token to the decoded token sequence, it is necessary to check whether this token is present in the white list. If the phrase is on the white list, indicating that it is a word that appears in the textbook, it may be accepted and added to the decoding sequence. If the predicted phrase is not on the white list, i.e., it is not a word in the textbook, then this phrase will not be added to the decoding sequence. This ensures that the generated text contains only words that the student should be familiar with, thereby ensuring the accuracy of the material.
Step 7: screening is performed through a white list to ensure that all words are within the allowable range. Traversing the words in the temporary text determines whether the words exist in the whitelist. If the current predicted word is not on the white list, the current word is removed from the decoded word sequence, the generation process returns to the previous step, and the next word is predicted again until a word on the white list is found. If so, then continue, e.g., only the word having exists in the temporary text, and the word exists in the whitelist, so continue.
Step 8: the current vocabulary is added to the tail of the conditional vocabulary sequence. For example, [10287] is added to the conditional token sequence tail.
Step 9: by repeating the above steps, a complete text sequence is gradually constructed until the generated text meets all annotation conditions or reaches a predetermined length.
Repeating 5-8 continuous iteration until the temporary text obtained by decoding the decoded word order sequence meets the annotation condition, and obtaining the English listening material text. And decoding the word order sequence to obtain a temporary text which is the English listening material text.
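Putting Steps 5-9 together, the following sketch performs greedy decoding with whitelist fallback. The word-level check is a simplification of the patent's screening, and next_token_distribution is the helper sketched above.

```python
def generate_material(cond_ids, whitelist, eos_id, max_len=500):
    ids, decoded = list(cond_ids), []
    while len(decoded) < max_len:
        ranked, _ = next_token_distribution(ids)
        for tok in ranked.tolist():                 # step 7: fall back down the ranking
            if tok == eos_id:                       # the end identifier is always accepted
                ids.append(tok)
                decoded.append(tok)
                break
            text = tokenizer.decode(decoded + [tok], skip_special_tokens=True)
            words = [w.strip(".,!?'\"").lower() for w in text.split()]
            if all(w in whitelist for w in words if w):
                ids.append(tok)                     # step 8: append to the condition tail
                decoded.append(tok)
                break
        else:                                       # no candidate passed the screening
            break
        if decoded[-1] == eos_id:                   # stop at the end identifier
            break
    return tokenizer.decode(decoded, skip_special_tokens=True)
```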
This technique can generate English listening materials that meet the requirements, in particular complete listening materials whose vocabulary does not exceed the syllabus and whose word count falls within the required range. In this process, the whitelist ensures that the generated text contains only words that appear in the textbook, thereby ensuring the suitability and applicability of the material.
The flow can be applied in a system: the receiving module receives the annotation content input by the user and transmits it to the generating module; the generating module generates listening material that meets the requirements according to the method above and passes it to the output module; the output module receives the listening material text and returns it to the user through the HTTP protocol.
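As one possible packaging, the three modules can be exposed as a single HTTP service. The FastAPI schema below is purely illustrative; build_condition is a hypothetical helper wrapping the serialization and encoding of Step 3, and model, tokenizer, whitelist, and eos_id are assumed to be loaded at startup.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerationRequest(BaseModel):   # receiving module: mirrors the request parameters
    theme: str
    text_type: str
    word_range: str
    attributes: dict

@app.post("/generate")
def generate(req: GenerationRequest):
    cond_ids = build_condition(req)   # hypothetical: serialize + tokenize the conditions
    text = generate_material(cond_ids, whitelist, eos_id)   # generating module
    return {"material": text}         # output module: returned over HTTP
```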
The embodiment of the application provides a target field text generation method, which is characterized by receiving a target field text generation request sent by user equipment, wherein the target field text generation request carries a theme parameter, a text type parameter, a word number range and a field text attribute; generating a target field word list according to the theme parameters and the text data set corresponding to the target field; generating a conditional word order sequence according to the theme parameter, the text type parameter and the field text attribute; invoking a target field text generation model to obtain a decoded word order sequence based on the conditional word order sequence; and screening out a target decoding word order sequence by utilizing the word number range and the target field word list, and decoding the target decoding word order sequence into a target field text. Through the theme parameters and the text type parameters, the system is able to generate text that is related to a particular domain and type; by constructing the target field word list and the conditional word order sequence, a structured information basis is provided for text generation. And screening the decoded word order sequence by utilizing the word number range and the target field word list, controlling the length and quality of the generated text, and avoiding content redundancy or deviation from the theme. By precisely controlling various aspects of the text, the end user may obtain more satisfactory and desirable text content.
Based on the same technical concept, the embodiment of the application also provides a target field text generation system, as shown in fig. 3, wherein the system comprises:
A request receiving module 301, configured to receive a target domain text generation request sent by a user equipment, where the target domain text generation request carries a theme parameter, a text type parameter, a word number range, and a domain text attribute;
the target domain vocabulary module 302 is configured to generate a target domain vocabulary according to the theme parameters and the text data set corresponding to the target domain;
a conditional word order sequence module 303, configured to generate a conditional word order sequence according to the theme parameter, the text type parameter, and the domain text attribute;
the decoded word order sequence module 304 is configured to invoke a target field text generation model to obtain a decoded word order sequence based on the conditional word order sequence;
a target domain text generation module 305 for screening out a target decoding word order sequence by using the word number range and the target domain word list, and decoding the target decoding word order sequence into a target domain text.
The embodiment of the application also provides electronic equipment corresponding to the method provided by the embodiment. Referring to fig. 4, a schematic diagram of an electronic device according to some embodiments of the present application is shown. The electronic device 20 may include: a processor 200, a memory 201, a bus 202 and a communication interface 203, the processor 200, the communication interface 203 and the memory 201 being connected by the bus 202; the memory 201 stores a computer program executable on the processor 200, and the processor 200 executes the method according to any of the foregoing embodiments of the present application when the computer program is executed.
The memory 201 may include a high-speed random access memory (RAM: Random Access Memory), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented through at least one physical port (which may be wired or wireless), and the internet, a wide area network, a local area network, a metropolitan area network, etc. may be used.
Bus 202 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. The memory 201 is configured to store a program, and the processor 200 executes the program after receiving an execution instruction, and the method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 200 or implemented by the processor 200.
The processor 200 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may be a general-purpose processor, including a central processing unit (Central Processing Unit, abbreviated as CPU), a network processor (Network Processor, abbreviated as NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and, in combination with its hardware, performs the steps of the above method.
The electronic device provided by the embodiment of the application arises from the same inventive concept as the method provided by the embodiment of the application, and therefore has the same beneficial effects as the method it adopts, runs, or implements.
The present application further provides a computer readable storage medium corresponding to the method provided in the foregoing embodiments, referring to fig. 5, the computer readable storage medium is shown as an optical disc 30, on which a computer program (i.e. a program product) is stored, where the computer program, when executed by a processor, performs the method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer-readable storage medium provided by the above-described embodiments of the present application arises from the same inventive concept as the method provided by the embodiments of the present application, and therefore has the same beneficial effects as the method adopted, run, or implemented by the application program stored on it.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the specification and drawings of the present invention or direct/indirect application in other related technical fields are included in the scope of the present invention.

Claims (5)

1. A method for generating a target domain text, the method comprising:
Receiving a target field text generation request sent by user equipment, wherein the target field text generation request carries a theme parameter, a text type parameter, a word number range and a field text attribute; the text type parameters comprise parameters of dialogue type, statement type and role playing type, and the text attribute of the field represents language elements used in the setting field, including vocabulary, phrase, sentence pattern and grammar structure;
Extracting all words under the theme parameters from the text data set corresponding to the target field according to the theme parameters, and taking the words as a first candidate word list; respectively carrying out grammar standardization processing on the candidate words in the first candidate word list to obtain a second candidate word list; combining the first candidate word list and the second candidate word list to obtain the target field word list;
Generating a conditional text according to the theme parameter, the text type parameter and the field text attribute; encoding the conditional text to obtain the conditional command sequence;
Carrying out serialization processing on the field text attribute to obtain a field text attribute processing result; carrying out serialization processing on the theme parameters and the text type parameters to obtain annotation content processing results; sequentially splicing the initial identification of the field text attribute, the processing result of the field text attribute, the initial identification of the theme parameter and the text type parameter, the processing result of the annotation content and the text identification to be generated to obtain the conditional text;
Invoking a target field text generation model to calculate a probability value set of candidate word orders of the next sequence position in the conditional word order sequence; sorting probability values in the probability value set from large to small; adding the word order with the maximum probability value into the initialized word order sequence to generate the decoded word order sequence;
Decoding the decoded word order sequence to obtain corresponding decoded vocabulary; searching whether the decoded vocabulary exists in the target field word list; if not, deleting the current word order from the decoded word order sequence, adding the word order at the next position in the probability value ranking to the decoded word order sequence, and continuing with the subsequent steps; if so, adding the decoded word order to the tail of the conditional word order sequence; repeating the above steps until the conditional word order sequence meets the word number range, obtaining a target decoded word order sequence, and decoding the target decoded word order sequence into the target field text.
2. The method of claim 1, wherein the target domain text generation model is trained as follows:
Processing based on the training data set to obtain an input command sequence, a target command sequence and a target identification command sequence;
Batch sampling is carried out on the input word order sequence, the target word order sequence and the target identification word order sequence to obtain batch word order sequences, batch target word order sequences and batch target identification word order sequences;
Inputting the batch word order sequence into a pre-trained target field text generation model to obtain probability distribution of predicted word orders;
calculating the cross entropy loss between the probability distribution of the predicted word order and the target word order sequence;
calculating the gradient of the target field text generation model by minimizing cross entropy loss and updating the weight;
repeating the steps until the training times reach the preset number of rounds, and stopping training.
3. The method of claim 2, wherein the training dataset comprises material text; the processing based on the training data set to obtain an input word order sequence, a target word order sequence and a target identification word order sequence comprises the following steps:
Carrying out serialization processing and coding on the training data set to obtain a training condition word order sequence;
Constructing a 0 identification sequence with the same length as the training condition word order sequence;
Sequentially splicing in the order of the material text start identifier, the material text, and the material text end identifier to obtain a training generation text, and encoding it to obtain a training generation word order sequence;
Constructing a 1 identification sequence with the same length as the training generation word order sequence;
splicing the training condition word order sequence and the training generation word order sequence to obtain an input word order sequence, and splicing the 0 identification sequence and the 1 identification sequence to obtain an input identification sequence;
performing left offset of 1 word order position on the input word order sequence to obtain a target word order sequence;
Adding 1 identification command at the tail of the input identification sequence and performing left side offset of 1 command position to obtain a target identification command sequence;
repeating the steps until the training data set is processed into an input word order sequence, a target word order sequence and a target identification word order sequence.
4. The method of claim 3, wherein the training dataset further comprises a subject parameter, a text type parameter, and a domain text attribute corresponding to the material text; the step of carrying out serialization processing and coding on the training data set to obtain a training condition word order sequence comprises the following steps:
Carrying out serialization processing on the field text attribute corresponding to the material text to obtain a training field text attribute processing result;
Carrying out serialization processing on the theme parameters and the text type parameters corresponding to the material text to obtain a training annotation content processing result;
sequentially splicing the initial identification of the field text attribute of the material text, the training field text attribute processing result, the initial identification of the theme parameter and the text type parameter of the material text and the training annotation content processing result to obtain a training condition text;
And encoding the training condition text to obtain the training condition word order sequence.
5. A target domain text generation system, the system comprising:
The request receiving module is used for receiving a target field text generation request sent by user equipment, wherein the target field text generation request carries a theme parameter, a text type parameter, a word number range and a field text attribute; the text type parameters comprise parameters of dialogue type, statement type and role playing type, and the text attribute of the field represents language elements used in the setting field, including vocabulary, phrase, sentence pattern and grammar structure;
The target field vocabulary module is used for extracting all words under the theme parameters from the text data set corresponding to the target field according to the theme parameters to serve as a first candidate vocabulary; respectively carrying out grammar standardization processing on the candidate words in the first candidate word list to obtain a second candidate word list; combining the first candidate word list and the second candidate word list to obtain the target field word list;
The conditional order sequence module is used for generating a conditional text according to the theme parameter, the text type parameter and the field text attribute; encoding the conditional text to obtain the conditional command sequence;
The decoded word order sequence module is used for carrying out serialization processing on the field text attribute to obtain a field text attribute processing result; carrying out serialization processing on the theme parameters and the text type parameters to obtain annotation content processing results; sequentially splicing the initial identification of the field text attribute, the processing result of the field text attribute, the initial identification of the theme parameter and the text type parameter, the processing result of the annotation content and the text identification to be generated to obtain the conditional text; invoking a target field text generation model to calculate a probability value set of candidate word orders of the next sequence position in the conditional word order sequence; sorting probability values in the probability value set from large to small; adding the word order with the maximum probability value into the initialized word order sequence to generate the decoded word order sequence;
The target field text generation module is used for decoding the decoded word order sequence to obtain corresponding decoded vocabulary; searching whether the decoded vocabulary exists in the target field word list; if not, deleting the current word order from the decoded word order sequence, adding the word order at the next position in the probability value ranking to the decoded word order sequence, and continuing with the subsequent steps; if so, adding the decoded word order to the tail of the conditional word order sequence; repeating the above steps until the conditional word order sequence meets the word number range, obtaining a target decoded word order sequence, and decoding the target decoded word order sequence into the target field text.
CN202410918161.2A 2024-07-10 2024-07-10 Target field text generation method and system Active CN118468822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410918161.2A CN118468822B (en) 2024-07-10 2024-07-10 Target field text generation method and system

Publications (2)

Publication Number Publication Date
CN118468822A CN118468822A (en) 2024-08-09
CN118468822B (en) 2024-09-20

Family

ID=92167228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410918161.2A Active CN118468822B (en) 2024-07-10 2024-07-10 Target field text generation method and system

Country Status (1)

Country Link
CN (1) CN118468822B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987104A (en) * 2021-09-28 2022-01-28 浙江大学 Ontology guidance-based generating type event extraction method
CN118313345A (en) * 2024-06-07 2024-07-09 成都佳发安泰教育科技股份有限公司 Text data set processing method, system, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298436B * 2019-06-28 2023-05-09 乐山金蜜工业卫士服务股份有限公司 Data-to-text generation model based on pointer networks
US11748555B2 (en) * 2021-01-22 2023-09-05 Bao Tran Systems and methods for machine content generation
CN116090010A (en) * 2023-02-01 2023-05-09 北京邮电大学 Context contact-based text generation type steganography method

Also Published As

Publication number Publication date
CN118468822A (en) 2024-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant