CN116796796A - GPT architecture-based automatic document generation method and device - Google Patents

GPT architecture-based automatic document generation method and device

Info

Publication number
CN116796796A
Authority
CN
China
Prior art keywords
gpt
text
training
language model
architecture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310009751.9A
Other languages
Chinese (zh)
Inventor
马延美
刘学谦
王来奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fangcun Wuyou Technology Development Co ltd
Original Assignee
Beijing Fangcun Wuyou Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Fangcun Wuyou Technology Development Co ltd filed Critical Beijing Fangcun Wuyou Technology Development Co ltd
Priority to CN202310009751.9A
Publication of CN116796796A
Legal status: Pending (Current)

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for automatically generating documents based on a GPT architecture. The GPT architecture-based automatic document generation method comprises the following steps: acquiring a trained GPT-OD language model; acquiring text information input by a user; and inputting the text information into the trained GPT-OD language model to obtain the document information output by the GPT-OD language model. The method applies data preprocessing to the document corpus: a simple and efficient data preprocessing strategy is used to filter the large-scale document corpus, which improves the document generation capability of the model.

Description

GPT architecture-based automatic document generation method and device
Technical Field
The application relates to the technical field of text generation, and in particular to a GPT architecture-based automatic document generation method and device.
Background
The existing automatic document generation technology mainly comprises three mainstream methods: automatic document generation based on grammatical and syntactic rules, automatic document generation based on retrieval, and automatic document generation based on shallow deep networks such as RNN/LSTM. The three methods are described in detail below.
The respective disadvantages of these three existing technologies are analyzed as follows:
(1) The rule-based method requires a large number of manually designed rule templates and extensive annotation by linguistic experts, and the generated documents lack diversity, which weakens the effect of automatic document generation.
(2) The retrieval-based method uses text retrieval and ranking techniques to select suitable documents from a document corpus. Because it recommends existing documents to the user, the sentences are fluent. Its drawbacks are that it cannot generate new text, and that retrieval and ranking may only capture surface-level semantic relevance, making it difficult to capture the real meaning.
(3) The deep-learning-based generation method mainly uses an encoder-decoder structure to generate replies; a typical technique is the Seq2Seq network structure. Its advantage is that no rules are needed: how to generate text can be learned automatically from existing dialogue text, and the deep neural network can learn the semantic mapping from input data to output text end to end without manual feature engineering. However, deep neural models tend to have a large number of parameters, while most text generation datasets are very small, so deep neural networks easily overfit on these datasets and cannot generalize in practical applications.
It is therefore desirable to have a solution that solves or at least alleviates the above-mentioned drawbacks of the prior art.
Disclosure of Invention
The application aims to provide a GPT architecture-based automatic document generation method to solve at least one of the above technical problems.
The application provides the following scheme:
according to one aspect of the present application, there is provided a GPT architecture-based automatic document generation method, including:
acquiring a trained GPT-OD language model;
acquiring text information input by a user;
and inputting the text information into the trained GPT-OD language model so as to obtain the document information output by the GPT-OD language model.
Optionally, before obtaining the trained GPT-OD language model, the GPT architecture-based automatic document generation method further includes:
acquiring a training set;
and training a GPT-2 language model with the training set to obtain the GPT-OD language model.
Optionally, the training set includes a plurality of text sets;
the training of the GPT-2 language model by the training set includes:
preprocessing each text set in the training set;
and training the GPT-2 language model according to the preprocessed text set.
Optionally, the preprocessing of each text set in the training set includes:
processing each text set as follows:
acquiring the number of line-break characters in the text set;
judging, according to the number of line-break characters, whether the text set exceeds a preset number of lines; and if not,
deleting the text set that does not exceed the preset number of lines.
Optionally, the preprocessing of each text set in the training set further includes:
judging, according to the number of line-break characters, whether the text set exceeds the preset number of lines; and if so,
judging whether the character count of the text set is smaller than a first preset character count; and if so,
deleting the text set whose character count is smaller than the first preset character count.
Optionally, the preprocessing of each text set in the training set further includes:
judging whether the character count of the text set is smaller than the first preset character count; and if not,
judging, according to the line-break characters and the character count of the text set, whether there are 5 consecutive lines each containing fewer than 15 characters; and if so,
deleting the text set in which each of 5 consecutive lines contains fewer than 15 characters.
Optionally, the preprocessing of each text set in the training set further includes:
judging, according to the line-break characters and the character count of the text set, whether there are 5 consecutive lines each containing fewer than 15 characters; and if not,
identifying the text set and judging whether it contains at least two sequence numbers; and if so,
judging whether the sequence numbers are consecutive; and if not,
deleting the text set whose sequence numbers are not consecutive.
Optionally, the preprocessing of each text set in the training set further includes:
judging whether the sequence numbers are consecutive; and if so,
acquiring a preset character database, wherein the preset character database includes at least one preset character;
checking each line of the text set to judge whether the line includes none of the preset characters in the preset character database; and if so,
deleting the line that includes none of the preset characters in the preset character database.
Optionally, the training of the GPT-2 language model according to the preprocessed text sets includes:
assigning a weight value to each text set during the training process;
and resampling each text set according to its weight value during training.
The application also provides a GPT architecture-based automatic document generation device, which comprises:
the GPT-OD language model acquisition module is used for acquiring a trained GPT-OD language model;
the text information acquisition module is used for acquiring text information input by a user;
and the document information acquisition module is used for inputting the text information into the trained GPT-OD language model so as to acquire the document information output by the GPT-OD language model.
The GPT architecture-based automatic document generation method provided by the application has the following advantages:
data preprocessing for the document corpus: a simple and efficient data preprocessing strategy is used to filter the large-scale document corpus, which improves the document generation capability of the model;
resampling strategy in GPT-OD: a high-quality document has a higher sampling probability and is trained on more times, and vice versa, which yields a better training result.
Drawings
Fig. 1 is a flow chart of a method for automatically generating a document based on a GPT architecture according to an embodiment of the application.
Fig. 2 is a block diagram of an electronic device for performing the GPT architecture-based automatic document generation method according to an embodiment of the present application.
FIG. 3 is a diagram of a GPT series model architecture in one embodiment of the application.
Fig. 4 is a schematic diagram of a generation effect of a method for automatically generating a document based on a GPT architecture according to an embodiment of the present application.
Fig. 5 is a schematic diagram of another generation effect of the automatic document generation method based on the GPT architecture according to an embodiment of the present application.
Fig. 6 is a schematic diagram of another generation effect of the automatic document generation method based on the GPT architecture according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a loss function curve of the automatic document generation method based on the GPT architecture of the present application.
Detailed Description
The following describes the technical solutions in the embodiments of the present application clearly and completely with reference to the accompanying drawings, which show some but not all of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Fig. 1 is a flow chart of a method for automatically generating a document based on a GPT architecture according to an embodiment of the application.
The automatic document generation method based on the GPT architecture shown in FIG. 1 comprises the following steps:
step 1: acquiring a trained GPT-OD language model;
step 2: acquiring text information input by a user;
step 3: and inputting the text information into the trained GPT-OD language model so as to obtain the document information output by the GPT-OD language model.
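A minimal sketch of these three steps is given below, assuming the fine-tuned GPT-OD checkpoint is stored in Hugging Face format under a hypothetical local path; the tokenizer class and sampling parameters are illustrative assumptions, since the patent does not disclose them.

```python
# Minimal sketch of steps 1-3; the checkpoint path, tokenizer choice and
# sampling parameters are illustrative assumptions, not values from the patent.
from transformers import BertTokenizerFast, GPT2LMHeadModel

model_path = "./gpt-od"  # hypothetical local checkpoint
tokenizer = BertTokenizerFast.from_pretrained(model_path)  # Chinese GPT-2 checkpoints often ship a BERT-style tokenizer
model = GPT2LMHeadModel.from_pretrained(model_path)        # step 1: acquire the trained GPT-OD model
model.eval()

# Step 2: text information input by the user.
user_text = "关于加强科研项目管理工作的通知"

# Step 3: feed the text into the trained GPT-OD model and decode the generated document.
input_ids = tokenizer(user_text, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_length=512,            # GPT-2 handles sequences of up to 1024 tokens
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```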
In this embodiment, before obtaining the trained GPT-OD language model, the GPT architecture-based automatic document generation method further includes:
acquiring a training set;
and training a GPT-2 language model with the training set to obtain the GPT-OD language model.
In this embodiment, the training set includes a plurality of text sets;
the training of the GPT-2 language model by the training set includes:
preprocessing each text set in the training set;
and training the GPT-2 language model according to the preprocessed text set.
In this embodiment, the preprocessing of each text set in the training set includes:
processing each text set as follows:
acquiring the number of line-break characters in the text set;
judging, according to the number of line-break characters, whether the text set exceeds a preset number of lines; and if not,
deleting the text set that does not exceed the preset number of lines.
In this embodiment, the preprocessing of each text set in the training set further includes:
judging, according to the number of line-break characters, whether the text set exceeds the preset number of lines; and if so,
judging whether the character count of the text set is smaller than a first preset character count; and if so,
deleting the text set whose character count is smaller than the first preset character count.
In this embodiment, the preprocessing of each text set in the training set further includes:
judging whether the character count of the text set is smaller than the first preset character count; and if not,
judging, according to the line-break characters and the character count of the text set, whether there are 5 consecutive lines each containing fewer than 15 characters; and if so,
deleting the text set in which each of 5 consecutive lines contains fewer than 15 characters.
In this embodiment, the preprocessing of each text set in the training set further includes:
judging, according to the line-break characters and the character count of the text set, whether there are 5 consecutive lines each containing fewer than 15 characters; and if not,
identifying the text set and judging whether it contains at least two sequence numbers; and if so,
judging whether the sequence numbers are consecutive; and if not,
deleting the text set whose sequence numbers are not consecutive.
In this embodiment, the preprocessing of each text set in the training set further includes:
judging whether the sequence numbers are consecutive; and if so,
acquiring a preset character database, wherein the preset character database includes at least one preset character;
checking each line of the text set to judge whether the line includes none of the preset characters in the preset character database; and if so,
deleting the line that includes none of the preset characters in the preset character database.
In this embodiment, the training of the GPT-OD language model according to the preprocessed text sets includes:
assigning a weight value to each text set during the training process;
and resampling each text set according to its weight value during training.
The application will now be described in further detail by way of examples, which should not be construed as limiting the application in any way.
Step 1: acquiring a trained GPT-OD language model;
step 2: acquiring text information input by a user;
step 3: and inputting the text information into the trained GPT-OD language model so as to obtain the document information output by the GPT-OD language model.
Referring to fig. 4 to 6: in the embodiment shown in fig. 4, if we input a sentence, the description on the right-hand side of fig. 4 can be obtained through the GPT-OD language model of the present application.
In the embodiment shown in fig. 5, if we enter several keywords separated by punctuation marks, the description on the right-hand side of fig. 5 can be obtained through the GPT-OD language model of the present application.
In the embodiment shown in fig. 6, if we input a passage, for example, "research in modern engineering and technological science with multidisciplinary integration has been greatly enhanced, driving the development of basic science and engineering technology and forming a complete modern science and technology system", the description on the right-hand side of fig. 6 can be obtained through the GPT-OD language model of the present application.
In this embodiment, the data set used to train the GPT-OD language model of the present application mainly includes two parts: an open-source Chinese corpus such as CLUECorpusSmall, and about 230,000 public document files collected in a legally compliant manner. The following filtering strategies are mainly adopted when cleaning the training set:
files with fewer than 5 lines, fewer than 500 characters, or with 5 consecutive lines each containing fewer than 15 characters are filtered out;
files in which the sequence numbers are not consecutive are filtered out;
lines that contain none of the preset characters (punctuation marks such as '。', '：', '；' or any ordinal character from 一 (one) to 五十 (fifty)) are filtered out.
Specifically, the preprocessing logic for each text set in the training set is as follows:
First, about 230,000 document files are collected, in a legally compliant manner, from publicly disclosed document data.
1. According to the line-break characters in each file, files with fewer than 5 lines are filtered out;
2. files containing fewer than 500 characters are filtered out;
3. according to the line-break characters and the character count of each file, files containing 5 consecutive lines each with fewer than 15 characters are filtered out;
4. the sequence numbers (一 to 五十, i.e. one to fifty) in each file are extracted, and files in which the sequence numbers are not consecutive are filtered out;
5. each line in each file is checked, and any line that contains none of the preset characters (punctuation marks such as '。', '：', '；' or any ordinal character from 一 to 五十) is filtered out.
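A minimal sketch of these five filtering rules is given below, assuming each file is held as a plain-text string. The exact preset character list and the handling of ordinals are not fully specified in the description, so the reconstructed character set and the simplified ordinal range (一 to 十 only) are assumptions for illustration.

```python
# Sketch of the five filtering rules; the preset character list and the
# simplified ordinal handling (一..十 only) are assumptions for illustration.
import re

CN_ORDINALS = "一二三四五六七八九十"
ORDINAL_VALUE = {ch: i + 1 for i, ch in enumerate(CN_ORDINALS)}
PRESET_CHARS = set("。：；、") | set(CN_ORDINALS)   # reconstructed "preset character database"

def clean_file(doc: str):
    """Return the cleaned document, or None if the whole file is filtered out."""
    lines = doc.split("\n")
    # Rule 1: fewer than 5 lines (counted via line-break characters).
    if len(lines) < 5:
        return None
    # Rule 2: fewer than 500 characters in total.
    if len(doc) < 500:
        return None
    # Rule 3: 5 consecutive lines, each containing fewer than 15 characters.
    for i in range(len(lines) - 4):
        if all(len(line) < 15 for line in lines[i:i + 5]):
            return None
    # Rule 4: at least two sequence numbers ("一、", "二、", ...) that are not consecutive.
    ordinals = [ORDINAL_VALUE[m] for m in re.findall(r"([一二三四五六七八九十])、", doc)]
    if len(ordinals) >= 2 and ordinals != list(range(ordinals[0], ordinals[0] + len(ordinals))):
        return None
    # Rule 5: drop every line that contains none of the preset characters.
    kept = [line for line in lines if any(ch in PRESET_CHARS for ch in line)]
    return "\n".join(kept)
```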
Referring to fig. 3, in this embodiment, the GPT-2 language model was trained by OpenAI on WebText, a 40 GB dataset crawled from the web. Its architecture is built only from Transformer decoder blocks and can process sequences of at most 1024 tokens; each token flows through all decoder blocks along its path. The full GPT-2 model has 1.5 billion parameters, and the network structure is shown in fig. 3 below.
The model architecture based on GPT-2 is a standard Transformer with 12 layers, 16 attention heads, a hidden dimension of 1024 and an FFN dimension of 4096. Training uses a cross-entropy loss with a label smoothing rate of 0.1, and each batch contains at most 4096 tokens. In the experiments, the Adam optimizer with betas (0.9, 0.98), 2000 warmup updates and an inverse square root learning rate scheduler were used; the maximum learning rate was set to 5e-5, the number of training epochs to 5 and the batch size to 32, and the model was saved every 2000 training steps. Our model was trained on 8 NVIDIA Tesla V100 GPUs.
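The following sketch reproduces this optimization setup in PyTorch with the Hugging Face GPT-2 implementation. The framework choice, the configuration field values and the scheduler implementation are assumptions based on the figures reported above; the original training code is not disclosed in the patent.

```python
# Sketch of the reported training setup (12 layers, 16 heads, hidden 1024,
# FFN 4096, Adam betas (0.9, 0.98), 2000 warmup updates, inverse square root
# schedule, peak LR 5e-5, label smoothing 0.1). Framework and field names are
# assumptions; the original implementation is not disclosed.
import torch
from torch.optim.lr_scheduler import LambdaLR
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=12, n_head=16, n_embd=1024, n_inner=4096, n_positions=1024)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.98))
warmup_steps = 2000

def inv_sqrt_schedule(step: int) -> float:
    # Linear warmup for 2000 updates, then decay proportional to 1/sqrt(step).
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = LambdaLR(optimizer, lr_lambda=inv_sqrt_schedule)

# Cross-entropy loss with a label smoothing rate of 0.1 (padding positions ignored).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)
```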
Because different files have different importance for training, a weight value is assigned to each text set during training, and the text sets are resampled according to their weight values. Specifically, we designed a data resampling strategy that lets high-quality files appear more frequently while maintaining a balance such that every file appears at least once during the whole training process. Whereas previous methods sample randomly from the training corpus during training, the resampling technique samples according to the weight of each file: the higher the quality of a document, the higher its sampling probability and the more times it is trained on, and vice versa. The resampling calculation formula is as follows:
w = w1 × w2 × w3 (1)
where w1 is the weight obtained from the total character count of the text set, w2 is the weight obtained from the number of lines in the text set, and w3 is the weight obtained from the average number of characters per line in the text set. The overall weight w of a file is the product of the three weights; the larger the weight, the more important the file, and vice versa. Here N1 is the total number of characters in the text set, N2 is the number of lines in the text set, and N3 is the average number of characters per line in the text set.
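A sketch of this resampling strategy is given below. Equation (1) only states that w is the product of w1, w2 and w3; how the counts N1, N2 and N3 are mapped to the component weights is not specified, so normalizing each count by its maximum over the corpus is an assumption made here for illustration.

```python
# Sketch of the weighted resampling strategy in equation (1); the mapping from
# N1, N2, N3 to w1, w2, w3 (normalization by the corpus maximum) is assumed.
import random

def counts(doc: str):
    lines = [l for l in doc.split("\n") if l]
    n1 = len(doc)                    # N1: total number of characters in the file
    n2 = len(lines)                  # N2: number of lines in the file
    n3 = n1 / max(n2, 1)             # N3: average number of characters per line
    return n1, n2, n3

def file_weights(corpus):
    stats = [counts(doc) for doc in corpus]
    maxima = [max(s[i] for s in stats) for i in range(3)]
    # w = w1 * w2 * w3; the larger the weight, the more important the file.
    return [(s[0] / maxima[0]) * (s[1] / maxima[1]) * (s[2] / maxima[2]) for s in stats]

def resample(corpus, num_samples):
    weights = file_weights(corpus)
    # Every file appears at least once; high-weight (high-quality) files are drawn more often.
    sampled = list(corpus)
    extra = max(num_samples - len(corpus), 0)
    sampled += random.choices(corpus, weights=weights, k=extra)
    return sampled
```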
During training, the model is optimized iteratively. The iterative process is as follows: first, large-scale pre-training based on the GPT-2 architecture is performed on Chinese corpora such as CLUECorpusSmall; then, continued pre-training is performed on the large collection of preprocessed document corpus to obtain the current model, GPT-OD. The loss function curve during model training is shown in fig. 7.
The technical scheme of the application is trained on large-scale document data disclosed in a legally compliant manner, and is the first domestic automatic document generation model that can be applied to actual online business;
the technical scheme of the application can adjust, to a certain extent, the diversity, number and topic of the generated documents through the temperature sampling technique, as illustrated in the sketch below;
the technical scheme of the application is highly robust: it accepts any topic and any characters within a length of 1024 tokens, which fully supports online business scenarios;
the technical scheme of the application can serve online business with a model of only 100M parameters; the model is compact without losing performance.
The application also provides a document automatic generation device based on the GPT framework, which comprises a GPT-OD language model acquisition module, a text information acquisition module and a document information acquisition module, wherein the GPT-OD language model acquisition module is used for acquiring a trained GPT-OD language model; the text information acquisition module is used for acquiring text information input by a user; and the document information acquisition module is used for inputting the text information into the trained GPT-OD language model so as to acquire document information output by the GPT-OD language model.
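For illustration, the cooperation of the three modules can be sketched as follows; the class and method names are hypothetical and only show how the modules interact, not the actual implementation of the device.

```python
# Hypothetical sketch of the three modules of the generation device; names are
# assumptions for illustration only.
class GPTODDocumentGenerator:
    def __init__(self, model, tokenizer):
        # GPT-OD language model acquisition module: holds the trained model.
        self.model = model
        self.tokenizer = tokenizer

    def acquire_text(self, user_input: str) -> str:
        # Text information acquisition module: receives the user's input text.
        return user_input.strip()

    def generate_document(self, user_input: str) -> str:
        # Document information acquisition module: feeds the text into GPT-OD
        # and returns the generated document information.
        input_ids = self.tokenizer(self.acquire_text(user_input), return_tensors="pt").input_ids
        output_ids = self.model.generate(
            input_ids, max_length=512, do_sample=True,
            pad_token_id=self.tokenizer.pad_token_id,
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```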
Fig. 2 is a block diagram of an electronic device according to one or more embodiments of the present application.
As shown in fig. 2, the present application also discloses an electronic device, including: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the GPT architecture-based document automatic generation method.
The present application also provides a computer-readable storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the steps of a GPT architecture-based document automatic generation method.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The electronic device includes a hardware layer, an operating system layer running on top of the hardware layer, and an application layer running on top of the operating system. The hardware layer comprises a central processing unit (CPU), a memory management unit (MMU), a memory, and other hardware. The operating system may be any one or more computer operating systems that implement control of the electronic device via processes, such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a Windows operating system. In addition, in the embodiment of the present application, the electronic device may be a handheld device such as a smartphone or a tablet computer, or an electronic device such as a desktop computer or a portable computer, which is not particularly limited in the embodiment of the present application.
The execution body controlled by the electronic device in the embodiment of the application can be the electronic device or a functional module in the electronic device, which can call a program and execute the program. The electronic device may obtain firmware corresponding to the storage medium, where the firmware corresponding to the storage medium is provided by the vendor, and the firmware corresponding to different storage media may be the same or different, which is not limited herein. After the electronic device obtains the firmware corresponding to the storage medium, the firmware corresponding to the storage medium can be written into the storage medium, specifically, the firmware corresponding to the storage medium is burned into the storage medium. The process of burning the firmware into the storage medium may be implemented by using the prior art, and will not be described in detail in the embodiment of the present application.
The electronic device may further obtain a reset command corresponding to the storage medium, where the reset command corresponding to the storage medium is provided by the provider, and the reset commands corresponding to different storage media may be the same or different, which is not limited herein.
At this time, the storage medium of the electronic device is a storage medium in which the corresponding firmware is written, and the electronic device may respond to a reset command corresponding to the storage medium in which the corresponding firmware is written, so that the electronic device resets the storage medium in which the corresponding firmware is written according to the reset command corresponding to the storage medium. The process of resetting the storage medium according to the reset command may be implemented in the prior art, and will not be described in detail in the embodiments of the present application.
For convenience of description, the above devices are described as being functionally divided into various units and modules. Of course, the functions of the units, modules may be implemented in one or more pieces of software and/or hardware when implementing the application.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For simplicity of explanation, the methodologies are shown and described as a series of acts; however, it should be understood and appreciated by those of ordinary skill in the art that the methodologies are not limited by the order of the acts, as some acts may occur in a different order or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the acts involved are not necessarily required by the embodiments of the application.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (10)

1. The automatic document generation method based on the GPT architecture is characterized by comprising the following steps of:
acquiring a trained GPT-OD language model;
acquiring text information input by a user;
and inputting the text information into the trained GPT-OD language model so as to obtain the document information output by the GPT-OD language model.
2. The GPT architecture-based automatic document generation method of claim 1, wherein prior to obtaining the trained GPT-OD language model, the GPT architecture-based automatic document generation method further comprises:
acquiring a training set;
and training a GPT-2 language model with the training set to obtain the GPT-OD language model.
3. The GPT architecture-based automatic document generation method of claim 2, wherein the training set comprises a plurality of text sets;
the training of the GPT-2 language model by the training set includes:
preprocessing each text set in the training set;
and training the GPT-2 language model according to the preprocessed text set.
4. The GPT architecture-based automatic document generation method of claim 3, wherein the preprocessing of each text set in the training set comprises:
processing each text set as follows:
acquiring the number of line-break characters in the text set;
judging, according to the number of line-break characters, whether the text set exceeds a preset number of lines; and if not,
deleting the text set that does not exceed the preset number of lines.
5. The GPT architecture-based automatic document generation method of claim 4, wherein the preprocessing of each text set in the training set further comprises:
judging, according to the number of line-break characters, whether the text set exceeds the preset number of lines; and if so,
judging whether the character count of the text set is smaller than a first preset character count; and if so,
deleting the text set whose character count is smaller than the first preset character count.
6. The GPT architecture-based automatic document generation method of claim 5, wherein the preprocessing of each text set in the training set further comprises:
judging whether the character count of the text set is smaller than the first preset character count; and if not,
judging, according to the line-break characters and the character count of the text set, whether there are 5 consecutive lines each containing fewer than 15 characters; and if so,
deleting the text set in which each of 5 consecutive lines contains fewer than 15 characters.
7. The GPT architecture-based automatic document generation method of claim 6, wherein the preprocessing of each text set in the training set further comprises:
judging, according to the line-break characters and the character count of the text set, whether there are 5 consecutive lines each containing fewer than 15 characters; and if not,
identifying the text set and judging whether it contains at least two sequence numbers; and if so,
judging whether the sequence numbers are consecutive; and if not,
deleting the text set whose sequence numbers are not consecutive.
8. The GPT architecture-based automatic document generation method of claim 7, wherein the preprocessing of each text set in the training set further comprises:
judging whether the sequence numbers are consecutive; and if so,
acquiring a preset character database, wherein the preset character database includes at least one preset character;
checking each line of the text set to judge whether the line includes none of the preset characters in the preset character database; and if so,
deleting the line that includes none of the preset characters in the preset character database.
9. The GPT architecture-based automatic document generation method of claim 8, wherein training the GPT-OD language model according to the preprocessed text sets comprises:
assigning a weight value to each text set during the training process;
and resampling each text set according to its weight value during training.
10. The automatic document generation device based on the GPT architecture is characterized by comprising:
the GPT-OD language model acquisition module is used for acquiring a trained GPT-OD language model;
the text information acquisition module is used for acquiring text information input by a user;
and the document information acquisition module is used for inputting the text information into the trained GPT-OD language model so as to acquire the document information output by the GPT-OD language model.
CN202310009751.9A 2023-01-04 2023-01-04 GPT architecture-based automatic document generation method and device Pending CN116796796A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310009751.9A CN116796796A (en) 2023-01-04 2023-01-04 GPT architecture-based automatic document generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310009751.9A CN116796796A (en) 2023-01-04 2023-01-04 GPT architecture-based automatic document generation method and device

Publications (1)

Publication Number Publication Date
CN116796796A 2023-09-22

Family

ID=88046729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310009751.9A Pending CN116796796A (en) 2023-01-04 2023-01-04 GPT architecture-based automatic document generation method and device

Country Status (1)

Country Link
CN (1) CN116796796A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592436A (en) * 2023-11-23 2024-02-23 知学云(北京)科技股份有限公司 Automatic document generation system based on artificial intelligence technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination