CN116484808A - Method and device for generating controllable text for official document - Google Patents

Method and device for generating controllable text for official document Download PDF

Info

Publication number
CN116484808A
CN116484808A (application CN202310441289.XA)
Authority
CN
China
Prior art keywords
text
information
written
document
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310441289.XA
Other languages
Chinese (zh)
Inventor
陈政华
刘学谦
马延美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fangcun Wuyou Technology Development Co ltd
Original Assignee
Beijing Fangcun Wuyou Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Fangcun Wuyou Technology Development Co ltd filed Critical Beijing Fangcun Wuyou Technology Development Co ltd
Priority to CN202310441289.XA priority Critical patent/CN116484808A/en
Publication of CN116484808A publication Critical patent/CN116484808A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and a device for controllable text generation for official documents. The method comprises the following steps: acquiring the text to be continued; acquiring document-type information of the text to be continued according to preset document-type rules; obtaining topic information from the text to be continued; acquiring keyword information and last-sentence information from the text to be continued; acquiring a trained continuation model; and inputting the document-type information, topic information, keyword information, and last-sentence information into the trained continuation model to obtain the continuation text. The method combines these four kinds of information when continuing the article, so the generated text is strongly correlated with and grounded in the official-document domain and can provide good writing assistance to document writers.

Description

Method and device for generating controllable text for official document
Technical Field
The application relates to the technical field of text generation, and in particular to a method and device for controllable text generation for official documents.
Background
In the field of automatic text generation, controllable content generation has gradually come into view in order to meet users' varied demands. In this research area, researchers have proposed several controllability techniques for natural language generation. In the official-document domain, the need for controllability in automatic generation is even more pronounced than in general natural language, owing to the domain's special authoring requirements and strictness. Controllable generation for official documents has remained largely unexplored because training corpora of documents are scarce. In general natural language generation, there is a large body of research on controllability, such as Rigid Formats Controlled Text Generation (SongNet), A Compositional Benchmark for Fine-grained Controllable Text Style Transfer (StylePTB), and A Conditional Transformer Language Model for Controllable Generation (CTRL). Building on these existing research achievements, this patent gradually introduces the proposed method.
The latest research progress in controllable text generation comprises three stages:
Stage one: annotating a large amount of training corpus for the controllability requirement, and constructing and tuning a task-specific controllable training objective function.
Stage two: on the basis of a pre-trained model, adjusting the decoding strategy so that the generated result contains the target content (such as topic keywords) as much as possible.
Stage three: on the basis of a pre-trained model, controlling the generated content by adjusting model inputs in different formats.
Although the achievements of these three stages can, to some extent, steer generated content toward users' needs, they have relatively large shortcomings in extensibility, control granularity, or performance. Specifically, constructing a controllable training objective function for a specific task depends on a large amount of annotated corpus, but annotating large quantities of high-quality controllable corpus is impractical: it is time-consuming, labor-intensive, and expensive. Another approach controls generated content by adjusting the decoding strategy, which places high demands on the generation capability of the model itself; if the model performs poorly, even a perfect decoding strategy will fail. The third approach controls generation by adding extra input on the basis of an existing pre-trained model, and thus also depends on the pre-trained model itself. On this basis, the present application combines prompt-learning techniques (prompt/adapter/LoRA learning) and uses the document type, topic, and paragraphs related to the document title as prompts to control document generation.
Despite the great advances in pre-training-based text generation, text produced by existing general-purpose generation models still suffers from various problems such as logical errors, unfocused and divergent content, and sentence repetition. Moreover, the official-document domain has very distinctive characteristics: documents frequently cite other documents, names of government institutions appear very frequently, and the document types currently in use are numerous, each with its own writing conventions. As a result, a general-purpose generation model applied to the document domain exhibits problems such as confusing file names and government-institution names in citations and generating content that does not conform to the conventions of the intended document type. Models trained on the general domain are therefore difficult to use in the document domain.
Disclosure of Invention
The invention aims to provide a controllable text generation method for official documents that solves at least one of the above technical problems.
In one aspect of the present invention, there is provided a controllable text generation method for official documents, comprising:
acquiring the text to be continued;
acquiring document-type information of the text to be continued according to preset document-type rules;
obtaining topic information from the text to be continued;
acquiring keyword information and last-sentence information from the text to be continued;
acquiring a trained continuation model;
and inputting the document-type information, topic information, keyword information, and last-sentence information into the trained continuation model to obtain the continuation text.
Optionally, acquiring the document-type information of the text to be continued according to the preset document-type rules includes:
acquiring the title information of the text to be continued;
and identifying the document-type information in the title information according to the document-type rules.
Optionally, obtaining the topic information from the text to be continued includes:
acquiring a trained topic classification model;
extracting title features from the title information;
and inputting the title features into the trained topic classification model to obtain the topic information.
Optionally, obtaining the keyword information from the text to be continued includes:
acquiring a paragraph relevant to the text to be continued according to the title information;
and extracting the keyword information from the relevant paragraph.
Optionally, the trained continuation model includes an encoding stack of 6 Transformer encoder layers and a decoding stack of 6 Transformer decoder layers, each encoder layer and decoder layer including an attention layer, a feed-forward network module, and a layer normalization module.
Optionally, the training loss function of the continuation model is:
L(C) = L₂(C) + λ₁ · L₁(C), where
L₁(C) is the pre-training loss of the GPT model and L₂(C) is the loss of the GPT model in the fine-tuning stage.
Optionally, L₂(C) is obtained using the following formula:
L₂(C) = Σ log P(y | c₁, …, c₃₂, x₃₃, …, x_m), where
L₂(C) is the loss of the GPT model in the fine-tuning stage; c is the prompt information, x is the input text, and y is the cosine similarity between the generated text and the annotated reference text.
The application also provides a controllable text generation device for official documents, comprising:
a text acquisition module, configured to acquire the text to be continued;
a document-type information acquisition module, configured to acquire the document-type information of the text to be continued according to preset document-type rules;
a keyword information acquisition module, configured to acquire the keyword information and last-sentence information of the text to be continued;
a continuation model acquisition module, configured to acquire a trained continuation model;
and a continuation information acquisition module, configured to input the document-type information, topic information, keyword information, and last-sentence information into the trained continuation model to obtain the continuation text.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the controllable text generation method for official documents described above when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the controllable text generation method for official documents described above.
Advantageous effects
The method combines the document-type information, topic information, keyword information, and last-sentence information of the text to be continued when continuing the article, so the generated text is strongly correlated with and grounded in the official-document domain and can provide good writing assistance to document writers. The method addresses the problems in the prior art that file names and government-institution names are confused in citations, that generated content does not conform to the conventions of the intended document type, and that models trained on the general domain are difficult to use in the document domain.
Drawings
Fig. 1 is a flow chart of a controllable text generation method for official documents according to an embodiment of the present application.
Fig. 2 is a detailed flowchart of a controllable text generation method for official documents according to an embodiment of the present application.
Fig. 3 is a schematic diagram of an electronic device for implementing the controllable text generation method for official documents in an embodiment of the present application.
Detailed Description
In order to make the purposes, technical solutions and advantages of the implementation of the present application more clear, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the accompanying drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are some, but not all, of the embodiments of the present application. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application. Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a controllable text generation method for official documents according to an embodiment of the present application. Fig. 2 is a detailed flowchart of the method.
The controllable text generation method for official documents shown in figs. 1 and 2 comprises the following steps:
Step 1: acquiring the text to be continued;
Step 2: acquiring document-type information of the text to be continued according to preset document-type rules;
Step 3: obtaining topic information from the text to be continued;
Step 4: acquiring keyword information and last-sentence information from the text to be continued;
Step 5: acquiring a trained continuation model;
Step 6: inputting the document-type information, topic information, keyword information, and last-sentence information into the trained continuation model to obtain the continuation text.
The method combines these four kinds of information when continuing the article, so the generated text is strongly correlated with and grounded in the official-document domain and can provide good writing assistance to document writers. It addresses the prior-art problems of confused file names and government-institution names in citations, generated content that does not conform to the conventions of the intended document type, and general-domain models that are difficult to use in the document domain.
In this embodiment, acquiring the document-type information of the text to be continued according to the preset document-type rules includes:
acquiring the title information of the text to be continued;
and identifying the document-type information in the title information according to the document-type rules.
Specifically, the present application first obtains an existing document data set. In this embodiment, the data set contains more than 600,000 documents, and title information can be obtained from each document.
The corresponding document-type rules are obtained by analyzing the titles: the titles of the 15 standard document types all end with the name of the document type, but because online data formats vary widely, regular expressions must be set to extract the corresponding document type. Taking the "通知" (notification) type as an example, the corresponding regular expression can be written as (通[\xa0\u3000]{0,}知[)）]{0,})$. Fifteen similar rules are designed for the different document types, corresponding to the 15 document types respectively.
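The rule lookup described above can be sketched as follows. This is a minimal illustration, not the patent's actual rule table: only two of the 15 document types are shown, and the exact regular expressions (including the "公告" pattern and the whitespace character classes) are assumptions modeled on the "通知" example above.

```python
import re

# Illustrative subset of the 15 document-type rules; each pattern anchors on
# the type name at the end of the title, tolerating full-width spaces and a
# trailing closing parenthesis, as in the "通知" example above.
TYPE_RULES = {
    "通知": re.compile(r"通[\xa0\u3000]{0,}知[)）]{0,}$"),
    "公告": re.compile(r"公[\xa0\u3000]{0,}告[)）]{0,}$"),  # assumed pattern
}

def extract_document_type(title: str):
    """Return the document type whose rule matches the end of the title,
    or None when no rule matches."""
    for doc_type, pattern in TYPE_RULES.items():
        if pattern.search(title.strip()):
            return doc_type
    return None
```

For example, `extract_document_type("关于开展安全检查的通知")` returns `"通知"`, while a title ending in no known type name yields `None`.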
In this embodiment, obtaining the topic information from the text to be continued includes:
acquiring a trained topic classification model;
extracting title features from the title information;
and inputting the title features into the trained topic classification model to obtain the topic information.
Specifically, the 600,000 documents were initially grouped using K-means (a clustering algorithm). From the clustered document data, 5,000 titles were sampled and annotated, keeping the classes as balanced as possible; a multi-class model built on the textCNN network structure was then trained on the annotated data to classify document titles. There are 22 categories in total, following the topic taxonomy published by the State Council.
In this embodiment, obtaining the keyword information from the text to be continued includes:
acquiring a paragraph relevant to the text to be continued according to the title information;
and extracting the keyword information from the relevant paragraph.
Specifically, the 600,000 documents are stored paragraph by paragraph in an Elasticsearch (ES) search database. The invention retrieves the paragraph most relevant to the title from the ES database, obtains five keywords of that paragraph using the TextRank algorithm (a graph-based ranking algorithm for text), and joins the keywords with the Chinese enumeration comma (、).
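The TextRank step can be sketched as follows. This is a simplified, unweighted version assuming the retrieved paragraph has already been tokenized into a word list; the window size, damping factor, and iteration count are illustrative defaults, not values from the patent.

```python
from collections import defaultdict

def textrank_keywords(words, top_k=5, window=3, d=0.85, iters=50):
    """Simplified TextRank: build an undirected co-occurrence graph over a
    sliding window, run PageRank-style score propagation, and return the
    top-k scoring words."""
    graph = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        # Each node receives score from its neighbors, split evenly
        # across each neighbor's edges.
        scores = {
            w: (1 - d) + d * sum(scores[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

In the pipeline described above, the five returned keywords would then be joined with "、" to form part of the prompt.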
In this embodiment, the continuation model is trained using the document-type information, topic information, and keyword information extracted from the 600,000 documents: the document type, topic, and keywords are joined with "#" to form the prompt for model training, with <S> as the end symbol of the prompt. The prompt length is fixed at 32, padded with <PAD> when shorter.
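The prompt format described above can be sketched as follows. The per-character token granularity for the joined string is an assumption (the patent does not specify the tokenizer), and truncation of over-long prompts is added for robustness.

```python
PROMPT_LEN = 32  # fixed prompt length described above

def build_prompt(doc_type, topic, keywords, prompt_len=PROMPT_LEN):
    """Join document type, topic, and 、-separated keywords with '#',
    append the <S> end symbol, and pad with <PAD> to the fixed length."""
    joined = "#".join([doc_type, topic, "、".join(keywords)])
    tokens = list(joined) + ["<S>"]
    if len(tokens) < prompt_len:
        tokens += ["<PAD>"] * (prompt_len - len(tokens))
    return tokens[:prompt_len]
```

For example, `build_prompt("通知", "安全", ["检查", "整改"])` yields a 32-token prompt ending in a run of `<PAD>` tokens.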
In this embodiment, the continuation model of the present application includes 6 Transformer encoder layers and 6 Transformer decoder layers, where each encoder layer and decoder layer includes an attention layer, a feed-forward network module, and a layer normalization module.
Specifically, the GPT-2 model structure is used as the infrastructure, with 6 encoder layers and 6 decoder layers; each layer comprises an attention layer, a feed-forward network, and layer normalization. The attention layer in this patent adopts a 12-head multi-head self-attention mechanism.
The text input to the model is split into two parts: the first part is the prompt text of fixed length 32, and the second part is the source text. The prompt text is encoded by the encoder, and the source text is used for the generation task. After the encoding layers, the semantic representation E of the prompt is obtained; E is added to the output h_l of the last decoder layer to obtain the final hidden vector H, and the model output y is obtained through a fully connected layer and an activation layer.
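The fusion step just described — H = E + h_l, followed by the fully connected layer and a softmax activation — can be sketched numerically as follows. The vector dimensions are illustrative, and plain Python lists stand in for the model's tensors.

```python
import math

def fuse_and_project(E, h_l, W_y):
    """Add the prompt semantic vector E element-wise to the last decoder
    hidden state h_l, project through the final fully connected layer W_y,
    and apply softmax to obtain the output distribution y."""
    H = [e + h for e, h in zip(E, h_l)]                  # H = E + h_l
    logits = [sum(H[i] * W_y[i][j] for i in range(len(H)))
              for j in range(len(W_y[0]))]               # H · W_y
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [z / total for z in exps]                     # softmax
```

With an identity projection and E = [1, 0], h_l = [0, 1], the fused vector is [1, 1] and the softmax output is uniform.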
The loss function of model training is:
L(C) = L₂(C) + λ₁ · L₁(C), where λ₁ is a constant;
L₁(C) is the pre-training loss of the GPT model:
L₁(C) = Σᵢ log P(uᵢ | uᵢ₋ₖ, …, uᵢ₋₁; θ), where
U is the input token sequence of the model, k is the size of the context window, and θ denotes the parameters of the model.
h₀ = U·Wₑ + Wₚ, where
Wₑ is the token embedding matrix and Wₚ is the position embedding matrix.
L₂(C) is the loss of the GPT model in the fine-tuning stage:
L₂(C) = Σ log P(y | c₁, …, c₃₂, x₃₃, …, x_m), where
C is the input of the model, E is the semantic vector produced by encoding the first 32 tokens (the prompt) of the model input, h_l is the hidden vector of the last decoder layer, and W_y is the parameter matrix of the final fully connected layer. The prompt information E is introduced into the L₂(C) loss; in this way, the model is influenced by the control information during both training and prediction, so that the generated text is controlled and more accurate.
In this embodiment, when the application is used, all the text before the cursor (the writing position), i.e., the text to be continued, is first acquired and analyzed; the preset 15-class document-type rules are applied to the title to determine the document type to be written. The title is classified using FastText to obtain the topic of the text. The title is then used to query ES for the most relevant paragraph, and 5 keywords are extracted from that paragraph using TextRank. The three pieces of information (document type, topic, and keywords) are joined with "#", and the last sentence before the cursor is appended to form the model input, which is fed into the model to continue the article.
In this embodiment, the present application also performs the following post-processing after the text is generated, in part to avoid the serially numbered paragraphs that commonly appear in official documents:
(1) removing serially numbered content from the generated text;
(2) desensitizing legal names, institution names, person names, and the like in the generated text;
(3) using overlapping-word retrieval to avoid repeated wording;
(4) comparing the semantics of the currently generated text with previously generated text to catch semantic repetition;
(5) checking and correcting punctuation marks in the text using a punctuation-checking technique.
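The first two post-processing steps above can be sketched as follows. The serial-number regex, the sensitive-name list, and the mask token are all illustrative assumptions; the repetition checks and punctuation correction of steps (3)–(5) are omitted here.

```python
import re

# Illustrative list of names to desensitize; the patent does not specify one.
SENSITIVE_NAMES = ["XX省人民政府", "张三"]

def postprocess(generated: str) -> str:
    """Drop lines that begin with serial numbers (一、 二、 (1) etc.) and
    mask sensitive names with a placeholder."""
    lines = [ln for ln in generated.splitlines()
             if not re.match(r"^\s*[(（]?[一二三四五六七八九十0-9]+[)）.、]", ln)]
    text = "\n".join(lines)
    for name in SENSITIVE_NAMES:
        text = text.replace(name, "***")
    return text
```

For example, a generated passage containing a "一、…" heading loses that line, and any occurrence of a listed name is replaced by "***".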
In the official-document domain, writing documents is tedious and demanding, and staff in government institutions rarely have enough time to collect and organize the relevant material, so document writing relies on the writer's rich experience and knowledge. The Regulations on the Handling of Official Documents of Party and Government Organs, published in 2012, classify official documents into 15 categories, and different document types correspond to different matters, which further increases the difficulty of writing. The invention builds a controllable text generation model for the document domain by means of textual information such as the document type, the document topic, and related document paragraphs, so that the generated text conforms to the requirements of the document domain and assists writers in this field.
The prompt used in the model of the present application incorporates three types of information: the document type, obtained from the title by rule-based analysis; the topic of the article, obtained from the title by a classification model; and the keywords of the paragraph related to the title, obtained by retrieving the paragraph from the ES database and applying the TextRank algorithm.
The invention deeply combines the relevant characteristics of official documents; the generated text is strongly correlated with and grounded in the document domain and can provide good writing assistance to document writers.
The application also provides a controllable text generation device for official documents, comprising a text acquisition module, a document-type information acquisition module, a keyword information acquisition module, a continuation model acquisition module, and a continuation information acquisition module, wherein:
the text acquisition module is configured to acquire the text to be continued;
the document-type information acquisition module is configured to acquire the document-type information of the text to be continued according to preset document-type rules;
the keyword information acquisition module is configured to acquire the keyword information and last-sentence information of the text to be continued;
the continuation model acquisition module is configured to acquire a trained continuation model;
and the continuation information acquisition module is configured to input the document-type information, topic information, keyword information, and last-sentence information into the trained continuation model to obtain the continuation text.
It should be noted that the foregoing explanation of the method embodiment is also applicable to the system of the present embodiment, and is not repeated here.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the controllable text generation method for official documents when executing the computer program.
The application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the controllable text generation method for official documents.
Fig. 3 is an exemplary block diagram of an electronic device capable of implementing the controllable text generation method for official documents provided according to one embodiment of the present application.
As shown in fig. 3, the electronic device includes an input device 501, an input interface 502, a central processor 503, a memory 504, an output interface 505, and an output device 506. The input interface 502, the central processor 503, the memory 504, and the output interface 505 are connected to one another through a bus 507, while the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and thereby to the other components of the electronic device. Specifically, the input device 501 receives input information from the outside and transmits it to the central processor 503 through the input interface 502; the central processor 503 processes the input information based on computer-executable instructions stored in the memory 504 to generate output information, stores the output information temporarily or permanently in the memory 504, and then transmits it to the output device 506 through the output interface 505; the output device 506 outputs the information outside the electronic device for use by the user.
That is, the electronic device shown in fig. 3 may also be implemented to include: a memory storing computer-executable instructions; and one or more processors that, when executing the computer-executable instructions, implement the controllable text generation method for official documents described in connection with fig. 1.
In one embodiment, the electronic device shown in FIG. 3 may be implemented to include: a memory 504 configured to store executable program code; one or more processors 503 configured to execute the executable program code stored in the memory 504 to perform the controllable text generation method for documents in the above embodiments.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and the media may be implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Furthermore, the word "comprising" does not exclude other elements or steps. A plurality of units, modules or means recited in the apparatus claims may also be implemented, in software or hardware, by a single unit or means.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The processor referred to in this embodiment may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may be used to store computer programs and/or modules, and the processor implements the various functions of the apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function, an image playing function, etc.); the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
In this embodiment, if the integrated modules/units of the apparatus/terminal device are implemented in the form of software functional units and sold or used as a separate product, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program, which may be stored in a computer-readable storage medium; when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. While the preferred embodiments have been described, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention, and it is intended that the scope of the invention be limited only by the appended claims.
While the invention has been described in detail in the foregoing general description and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims (10)

1. A controllable text generation method for documents, characterized by comprising the following steps:
acquiring a text to be written;
acquiring text type information of a text to be written according to a preset document type rule;
obtaining theme information according to the text to be written;
acquiring keyword information and tail sentence information of the text to be written according to the text to be written;
acquiring a trained continuous writing model;
and inputting the text type information, the theme information, the keyword information and the tail sentence information of the text to be written into the trained continuous writing model to acquire the writing information.
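Read as a pipeline, the steps of claim 1 can be sketched in a few lines. The following is a minimal, self-contained illustration; the type rules, stand-in helper functions, and prompt layout are assumptions made for illustration, not the patent's actual implementation:

```python
import re

# Illustrative stand-in for the preset document type rules of claim 2:
# map a title marker to a document type (hypothetical rule table).
DOC_TYPE_RULES = {"通知": "notice", "报告": "report", "请示": "request"}

def doc_type_from_title(title: str) -> str:
    """Identify the document type from the title via the preset rules."""
    for marker, doc_type in DOC_TYPE_RULES.items():
        if marker in title:
            return doc_type
    return "unknown"

def tail_sentence(text: str) -> str:
    """Tail-sentence information: the last sentence of the text to be continued."""
    sentences = [s.strip() for s in re.split(r"[。.!?！？]", text) if s.strip()]
    return sentences[-1] if sentences else ""

def build_prompt(doc_type: str, topic: str, keywords: list, tail: str) -> str:
    """Concatenate the four control signals into one conditioning prompt
    for the continuation model (the tag layout is hypothetical)."""
    return f"[TYPE]{doc_type}[TOPIC]{topic}[KW]{','.join(keywords)}[TAIL]{tail}"

title = "关于开展安全检查的通知"
body = "为加强安全管理。现将有关事项通知如下"
prompt = build_prompt(doc_type_from_title(title), "safety",
                      ["安全", "检查"], tail_sentence(body))
# `prompt` would then be fed to the trained continuation model.
```

Conditioning generation on explicit type, topic, keyword, and tail-sentence signals, rather than on raw text alone, is what makes the continuation "controllable" in the sense of the claims.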
2. The controllable document generation method for documents according to claim 1, wherein the obtaining the document type information of the document to be written according to the preset document type rule comprises:
acquiring title information of a text to be written;
and identifying the text type information in the title information according to the document type rule.
3. The controllable document generation method of claim 2, wherein the obtaining the subject information according to the text to be written comprises:
acquiring a trained topic classification model;
extracting title characteristics of the title information;
and inputting the title features into a trained topic classification model to acquire topic information.
4. The controllable document generation method of claim 3, wherein said obtaining keyword information according to the text to be written comprises:
acquiring a relevant paragraph of the text to be written according to the title information;
keyword information is extracted according to the relevant paragraphs.
5. The method of controllable text generation for documents of claim 4, wherein said trained continuous writing model comprises an encoding layer of 6 Transformer models and a decoding layer of 6 Transformer models, each of the encoding layer and the decoding layer comprising an attention layer, a residual network module, and a layer normalization module.
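The per-layer structure named in claim 5 — an attention sublayer followed by a residual (skip) connection and layer normalization, stacked six deep — can be sketched from scratch. The shapes and the unparameterized attention below are illustrative assumptions, not the patent's trained model:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each sequence position to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x: np.ndarray) -> np.ndarray:
    """Unparameterized scaled dot-product attention over the sequence."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def encoder_layer(x: np.ndarray) -> np.ndarray:
    """One layer: attention, residual connection, then layer normalization."""
    return layer_norm(x + self_attention(x))

x = np.random.randn(10, 64)   # (sequence length, model dimension)
for _ in range(6):            # 6 stacked layers, as in the claim
    x = encoder_layer(x)
```

The decoder side would additionally attend over the encoder output, and a production model carries learned projection weights and a feed-forward sublayer in each layer; this sketch only shows the residual-plus-normalization skeleton the claim names.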
6. The method for controllable text generation for documents as claimed in claim 5, wherein the model training loss function of the continuous writing model during training is:
L(C) = L2(C) + λ·L1(C); wherein
L1(C) is the loss of the GPT model in pre-training, and L2(C) is the loss of the GPT model in the fine-tuning stage.
7. The controllable document generation method of claim 6, wherein said L2(C) is obtained using the formula:
L2(C) = Σ log P(y | c1, …, c32, x33, …, xm); wherein
L2(C) is the loss of the GPT model in the fine-tuning stage; c is the prompt information, x is the input text, and y is the cosine similarity value of the generated text and the labeled generated text.
8. A controllable document generation device for a document, characterized in that the controllable document generation device for a document comprises:
the device comprises a text to be written acquisition module, a text to be written acquisition module and a text processing module, wherein the text to be written acquisition module is used for acquiring a text to be written;
the text type information acquisition module is used for acquiring text type information of the text to be written according to a preset document type rule;
the keyword information acquisition module is used for acquiring keyword information and tail sentence information of the text to be written according to the text to be written;
the continuous writing model acquisition module is used for acquiring a trained continuous writing model;
and the writing information acquisition module is used for inputting the text type information, the theme information, the keyword information and the tail sentence information of the text to be written into the trained continuous writing model to acquire the writing information.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the controllable document generation method for documents according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, is capable of implementing the controllable text generation method for documents according to any one of claims 1 to 7.
CN202310441289.XA 2023-04-23 2023-04-23 Method and device for generating controllable text for official document Pending CN116484808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310441289.XA CN116484808A (en) 2023-04-23 2023-04-23 Method and device for generating controllable text for official document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310441289.XA CN116484808A (en) 2023-04-23 2023-04-23 Method and device for generating controllable text for official document

Publications (1)

Publication Number Publication Date
CN116484808A true CN116484808A (en) 2023-07-25

Family

ID=87215110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310441289.XA Pending CN116484808A (en) 2023-04-23 2023-04-23 Method and device for generating controllable text for official document

Country Status (1)

Country Link
CN (1) CN116484808A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113961A (en) * 2023-10-20 2023-11-24 中电数创(北京)科技有限公司 Agent-based document writing method and system
CN117113961B (en) * 2023-10-20 2024-02-09 中电数创(北京)科技有限公司 Agent-based document writing method and system
CN117807963A (en) * 2024-03-01 2024-04-02 之江实验室 Text generation method and device in appointed field
CN117807963B (en) * 2024-03-01 2024-04-30 之江实验室 Text generation method and device in appointed field

Similar Documents

Publication Publication Date Title
US11720572B2 (en) Method and system for content recommendation
CN108319668B (en) Method and equipment for generating text abstract
CN116701431A (en) Data retrieval method and system based on large language model
CN111858913A (en) Method and system for automatically generating text abstract
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN110750635A (en) Joint deep learning model-based law enforcement recommendation method
CN106021234A (en) Label extraction method and system
CN116484808A (en) Method and device for generating controllable text for official document
CN115618866A (en) Method and system for paragraph identification and subject extraction of engineering project bid document
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
US20220414338A1 (en) Topical vector-quantized variational autoencoders for extractive summarization of video transcripts
Dwivedi et al. Sentiment analytics for crypto pre and post covid: Topic modeling
Orasan A hybrid method for clause splitting in unrestricted English texts
CN109190112B (en) Patent classification method, system and storage medium based on dual-channel feature fusion
CN114842982B (en) Knowledge expression method, device and system for medical information system
JP2023071785A (en) Acoustic signal search device, acoustic signal search method, data search device, data search method and program
CN110347921A (en) A kind of the label abstracting method and device of multi-modal data information
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
CN113515949A (en) Weakly supervised semantic entity recognition using general and target domain knowledge
CN113076740A (en) Synonym mining method and device in government affair service field
US20240086448A1 (en) Detecting cited with connections in legal documents and generating records of same
CN109710844A (en) The method and apparatus for quick and precisely positioning file based on search engine
US11947898B2 (en) System and method of content brief generation using machine learning
CN115688771B (en) Document content comparison performance improving method and system
US20220318521A1 (en) System and method of headline generation using natural language modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination