CN117332180A - Method, device and storage medium for intelligent research report writing based on a large language model

Info

Publication number: CN117332180A
Application number: CN202311630628.5A
Authority: CN (China)
Prior art keywords: data, writing, research report, index
Other languages: Chinese (zh)
Other versions: CN117332180B
Inventors: 吴福文, 康维鹏, 唐逐时, 杨胜利
Current and original assignee: Zheshang Futures Co., Ltd.
Application filed by Zheshang Futures Co., Ltd.
Priority to CN202311630628.5A
Publication of CN117332180A; application granted; publication of CN117332180B
Legal status: Granted; Active

Classifications

    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F40/14 Tree-structured documents
    • G06F40/194 Calculation of difference between files
    • G06F40/30 Semantic analysis
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The application relates to a method, device and storage medium for intelligent research report writing based on a large language model. Research report data are collected and preprocessed to obtain original report-writing data; information extraction is performed on the original report-writing data to obtain report-writing prompts, and each prompt, together with its corresponding original report-writing data, forms a report-writing training corpus; a large language model is trained on the report-writing training corpus to obtain a report-writing model; subject information set by a user is acquired, and a report text is generated by the report-writing model from the subject information; finally, market data related to the report text are acquired and inserted into the text in chart form to obtain the final research report. Through artificial intelligence and natural language processing, the method automatically generates research reports rich in charts from existing manually written research report data.

Description

Method, device and storage medium for intelligent research report writing based on a large language model
Technical Field
The application relates to the field of automatic writing, and in particular to a method, device and storage medium for intelligent research report writing based on a large language model.
Background
Writing today is still dominated by traditional manual work: manually collected data are analysed as indices, a conclusion is drawn, and the text is then written by hand. In more and more applications, however, there is a need to reduce cost, free up labour and improve efficiency through automated machine writing.
Manual writing currently means analysing a large amount of data, drawing conclusions from indices, and then typing up the report. This usually takes a long time; building indices from large amounts of data is error-prone, and careless mistakes during manual entry often force the whole process to be redone, at great cost in effort. For analysis reports produced every month or quarter, the whole process is repeated each time with only the data and conclusions changing, which is very inefficient. For example, a securities company must write an equity research report every quarter; the report is produced by hand after manually collecting and analysing the data and repeatedly polishing the wording. With automatic writing software, by contrast, the model can be continuously optimised for different companies, industries and time periods simply by replacing parameters, and the report is then generated directly, saving a great deal of manpower and material resources. Compared with traditional handwriting, no manual entry is needed each time; only the model needs to be continuously optimised according to the needs of the article.
What is needed, therefore, is automatic writing software that suits a variety of scenarios, offers writers a simple writing mode and allows the model to be continuously optimised, solving the labour and material cost of manual writing and saving expense. At the same time, automated machine writing avoids the errors that carelessness introduces and the problems they cause.
Disclosure of Invention
The embodiments of the application provide a method, device and storage medium for intelligent research report writing based on a large language model, so as to at least solve the problems in the related art that manual writing consumes manpower and material resources and that writing costs are high.
In a first aspect, an embodiment of the present application provides a method for intelligent research report writing based on a large language model, comprising:
collecting research report data and preprocessing it to obtain original report-writing data;
performing information extraction on the original report-writing data to obtain report-writing prompts, each prompt forming, together with its corresponding original report-writing data, a report-writing training corpus;
training a large language model on the report-writing training corpus to obtain a report-writing model;
acquiring subject information set by a user, and generating a report text through the report-writing model and the subject information;
and acquiring market data related to the report text, and inserting the market data into the report text in chart form to obtain the final research report.
In an embodiment, performing information extraction on the original report-writing data to obtain a report-writing prompt comprises:
extracting subject–event/index relations from the original report-writing data through a UniLM-UIE model to obtain subject–event/index relation pairs;
acquiring event description texts and associated data indices related to the relation pairs from current mainstream news and from a market database table;
and generating the report-writing prompt from the relation pairs, the event description texts and the associated data indices.
In an embodiment, the report-writing training corpus comprises a plurality of sentences, and training the large language model on the corpus to obtain the report-writing model comprises:
extracting fragments from any sentence to generate a plurality of sentence fragments;
replacing the positions of the fragments in the original sentence with markers to obtain a sentence to be trained, and splicing all the fragments after the sentence to be trained to obtain a spliced sentence;
predicting the spliced sentence in an autoregressive training manner to obtain the attention weight information of a bidirectional attention mechanism;
and training the large language model according to the report-writing training corpus and the attention weight information to obtain the report-writing model.
In an embodiment, generating the report text through the report-writing model and the subject information comprises:
collecting news events and data indices related to the subject information;
generating a prompt text from the subject information, the news events and the data indices;
and generating, by the report-writing model, the report text from the prompt text.
In an embodiment, acquiring the market data related to the report text comprises:
identifying, through an index recognition model, the index category to which the index information in the report text belongs;
aligning and mapping the index category with the indices stored in an index library, and obtaining after alignment the index name corresponding to the index category;
and extracting from a data index library the market data corresponding to the index name over a preset period.
In an embodiment, the index library comprises a market index library and a data index library, and aligning and mapping the index category with the indices stored in the index library comprises:
traversing all indices in the market index library, calculating their similarity to the index category, and performing type alignment with the index of maximum similarity;
and after type alignment, performing standardised alignment between the data indices in the report text and the indices in the data index library to obtain the index names.
In an embodiment, inserting the market data into the report text in chart form to obtain the final research report comprises:
converting the market data into pictures or tables according to preset rules;
assembling and rendering the report text together with the pictures or tables to obtain a draft report;
and rendering the content of the draft report according to the style set by the user to obtain the final research report.
In an embodiment, collecting the research report data and preprocessing it to obtain the original report-writing data comprises:
extracting openly shared original report data from research analysis platforms, the original report data comprising document data and web page data;
parsing the document data through a parsing tool to obtain first feature information;
performing feature recognition on the web page data through a web page information extraction model to obtain second feature information;
and eliminating non-report information from the first feature information and the second feature information to obtain the original report-writing data.
In a second aspect, an embodiment of the present application provides a computer device comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the method for intelligent research report writing based on a large language model according to the first aspect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for intelligent research report writing based on a large language model according to the first aspect.
The method, device and storage medium for intelligent research report writing based on a large language model provided by the embodiments of the application have at least the following technical effects:
Research report data are collected and preprocessed to obtain original report-writing data; information extraction is performed on the original report-writing data to obtain report-writing prompts, each prompt forming a report-writing training corpus together with its corresponding original report-writing data; a large language model is trained on the corpus to obtain a report-writing model; subject information set by a user is acquired, and a report text is generated through the report-writing model and the subject information; market data related to the report text are acquired and inserted into the text in chart form to obtain the final research report. In summary, the method for intelligent research report writing based on a large language model provided by the embodiments of the application automatically generates research reports from existing manually written report data through artificial intelligence and natural language processing, without manual intervention. Its range of application is wide, including but not limited to futures research reports and financial reports, improving production efficiency and reducing time and labour cost. Moreover, the large-language-model-based intelligent writing supports automatic report writing for the futures industry with rich charts.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flowchart of a method for intelligent research report writing based on a large language model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training sentence according to an embodiment of the present application;
FIG. 3 is a schematic diagram of predicting the spliced sentence in an autoregressive training manner according to an embodiment of the present application;
FIG. 4 is a flowchart of traversing all indices in the market index library and calculating their similarity to the index category according to an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to illustrate the application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art, without inventive effort, on the basis of the embodiments provided herein fall within the scope of protection of the present application.
Obviously, the drawings in the following description are only some examples or embodiments of the present application, and a person of ordinary skill in the art may apply the present application to other similar situations according to these drawings without inventive effort. Moreover, while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking for those of ordinary skill having the benefit of this disclosure, and should not be construed as going beyond routine design, fabrication or manufacture.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The method of the application is mainly aimed at automatic writing of futures investment research reports. Its core is to learn automatically from existing manually written research reports, so that a machine can imitate the ability of a human writer. The preparation of the writing training corpus from manual reports, the automatic generation of writing prompts, the construction and training of the writing model, and the chart enrichment of the generated text are each described in detail below, as shown in FIG. 1.
Step S1, collecting research report data and preprocessing it to obtain the original report-writing data. This step mainly collects, as raw data, futures research reports manually written by futures analysts, and cleans and processes them to prepare high-quality training data. The raw report data are mainly rich-text data in PDF and web page form; their content usually covers research analysis of the whole futures market, of certain sectors (such as ferrous metals, precious metals, agricultural products, chemicals, fuel oil and the like) or of a single variety, and typically presents basic facts, research analysis and argued forecasts of subsequent trends from the angles of industry supply-and-demand fundamentals, capital flows, technicals and the like, together with conclusive analysis results.
Preparation of the writing training data therefore mainly comprises acquisition of the raw report data, cleaning of the raw data, and deep processing of the data.
Step 11, crawling openly shared raw report data from research analysis platforms, the raw data comprising document data and web page data.
Specifically, the raw report data are collected mainly by crawling openly shared research data from professional research analysis portals, chiefly research-category articles in multi-source heterogeneous forms such as research news and PDF research reports. According to the storage format of the content, this embodiment divides the data into two basic forms: PDF document data and HTML web page data.
Step 12, parsing the document data through a parsing tool to obtain the first feature information.
PDF documents are the main data form of research reports, and this embodiment mainly extracts information such as title, body text and time. PDF parsing tools such as PDFBox and iText are used to parse the document, traverse and read the PDF document objects (texts, pictures, tables, attachments and the like) page by page, and extract feature information such as text content, character encoding, font size, bold face, leading chapter marks or chapter numbers, leading line breaks, lines short of a full line of characters, and absence of punctuation, so as to identify specific content parts of the PDF document such as the table of contents, title, hierarchical headings, subheadings, headers, footers and body text.
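As a minimal illustration of this kind of layout-feature extraction, the sketch below uses the Python library pdfplumber (an assumption for illustration; the embodiment itself names the Java tools PDFBox and iText) to read a PDF page by page and flag likely headings by font size:

```python
import pdfplumber

def extract_features(path: str):
    """Walk a PDF and collect per-line text plus simple layout features."""
    lines = []
    with pdfplumber.open(path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            # Collect the raw text line by line.
            for line in (page.extract_text() or "").splitlines():
                lines.append({"page": page_no, "text": line})
            # Each character carries its font size and font name.
            sizes = [c["size"] for c in page.chars]
            if sizes:
                body_size = sorted(sizes)[len(sizes) // 2]  # median ~ body font
                # Heuristic: characters notably larger than the body font
                # are candidates for titles or hierarchical headings.
                heading_chars = [c["text"] for c in page.chars
                                 if c["size"] > body_size * 1.2]
                if heading_chars:
                    lines.append({"page": page_no,
                                  "heading_candidate": "".join(heading_chars)})
    return lines
```

A real pipeline would additionally use bold face, chapter numbering and punctuation cues, as described above, to separate headers, footers and body text.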
Step 13, performing feature recognition on the web page data through a web page information extraction model to obtain the second feature information.
Specifically, for content extraction from HTML web pages, this embodiment can exploit structural features of web page nodes such as tags, tag texts and tag attribute values, using the deep FreeDOM model for recognition. The model adopts a view-structured modelling idea: it can learn information such as the row/column blocks, individual contents and character styles of HTML tags, and effectively recognise noise data such as HTML tags, font styles, picture links, structural layout styles, JS code, comments and hyperlinks, thereby better completing body-text extraction from HTML pages.
The deep FreeDOM model mainly comprises the following three steps:
The first step is modelling and learning with the path information of the web page DOM tree. It mainly performs representation learning on the local information inside each node of the DOM tree, specifically the node's tag, tag text, tag attribute values and the like. In this embodiment word embeddings are used to encode these features, yielding a vectorised representation of the DOM tree nodes. On this basis, a Softmax classification model can be used to decide whether a node contains a body-text fragment, giving the candidate body-content nodes;
The second step characterises the node relations in the web page DOM tree. It mainly learns the structural dependencies among HTML nodes, including content-block distribution and auxiliary information, and judges through pairwise modelling of node dependencies whether a given node pair stands in a Value-Value relation. The information representation of a node pair consists of the vectorised representations of the nodes themselves, the HTML tag-path relation between them, and their position encodings;
Finally, after these two stages, the node representations and node relations of the web page DOM tree are obtained; they are assembled into a node sequence in DOM-tree order, the vectors are concatenated, and an LSTM sequential semantic model recognises and extracts the text content, finally yielding the body text of the HTML page.
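A highly simplified sketch of the first step (node vectorisation plus a Softmax content classifier) might look as follows in PyTorch; all layer sizes and the integer feature ids are illustrative assumptions, not the patent's actual FreeDOM implementation:

```python
import torch
import torch.nn as nn

class DomNodeClassifier(nn.Module):
    """Embed (tag, text, attribute) features of a DOM node and score
    whether the node carries body-text content (step one of FreeDOM)."""
    def __init__(self, vocab_size: int = 10_000, dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)   # bag of feature ids
        self.classify = nn.Linear(dim, 2)               # content vs. noise

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor):
        node_vec = self.embed(token_ids, offsets)       # one vector per node
        return torch.log_softmax(self.classify(node_vec), dim=-1)

# Toy usage: two nodes, features already mapped to integer ids.
model = DomNodeClassifier()
tokens = torch.tensor([1, 5, 9, 2, 7])   # ids of tag/text/attr tokens
offsets = torch.tensor([0, 3])           # node 0 -> ids[0:3], node 1 -> ids[3:]
scores = model(tokens, offsets)          # (2, 2) log-probabilities
```

The second and third steps would add pairwise node-relation features and an LSTM over the node sequence, as described above.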
Step 14, eliminating non-report information from the first and second feature information to obtain the original report-writing data.
Specifically, since HTML news-type data may contain much noise, this embodiment needs a further recognition step to clean and filter out non-report information. The small-sample learning model ERNIE is mainly used to classify texts into the report class and the non-report class (mainly news, announcements, quotation bulletins and the like). The ERNIE model converts the text classification task into a cloze (gap-filling) form; in this embodiment the report class is labelled 1 and the non-report class 0. Whereas conventional fine-tuning feeds the [CLS] vector into the parameters to be learned, the [MASK] approach of this embodiment converts the "1-yes, 0-no" label into a prediction at the [MASK] position, turning the classification problem into a cloze-style sequence prediction problem. The classifier is no longer randomly initialised but initialised with the pre-trained vectors of the two label words, making full use of the parameters learned by the pre-trained model, and the non-report information is thereby removed.
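A hedged sketch of this cloze-style filtering, scoring the verbalizer tokens 是/否 ("yes"/"no") at the [MASK] position with a masked language model via Hugging Face transformers (the embodiment uses ERNIE's own prompt-tuning machinery; the model name and template here are assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-3.0-base-zh")
model = AutoModelForMaskedLM.from_pretrained("nghuyong/ernie-3.0-base-zh")

def is_research_report(text: str) -> bool:
    # Cloze template: "Is this text a research report? [MASK]"
    prompt = f"这段文字是研报吗？{tokenizer.mask_token}。{text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    yes_id = tokenizer.convert_tokens_to_ids("是")
    no_id = tokenizer.convert_tokens_to_ids("否")
    # Compare the two label-word scores instead of training a randomly
    # initialised classification head.
    return bool(logits[yes_id] > logits[no_id])
```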
Through the above processing, a large amount of high-quality original report-writing data is obtained.
Step S2, performing information extraction on the original report-writing data to obtain report-writing prompts, each prompt forming a report-writing training corpus together with its corresponding original report-writing data. After the foregoing processing, this embodiment has obtained high-quality report data. However, since futures research writing is a process of generating content from existing materials, the embodiment possesses the final written result but not the prompt words behind the original data, and must therefore extract them in reverse from the existing writing result. That is, the report-writing prompt must be extracted from the finished report, and the prompt and the report text together form one piece of training data, i.e. <prompt, report content>.
The process of investment research writing is to comprehensively assess the situation of a certain variety, sector or the whole market from fundamentals, news, technicals and capital flows, and to derive the reasoning that yields the argument logic and the final market opinion. Therefore, in order to construct the prompt for a research article, this embodiment needs to extract and recognise these parts separately and supplement the corresponding news and market data. The whole process is as follows:
Step S21, extracting subject–event/index relations from the original report-writing data through the UniLM-UIE model to obtain subject–event/index relation pairs;
Step S22, acquiring event description texts and associated data indices related to the relation pairs from current mainstream news and from a market database table;
Step S23, generating the report-writing prompt from the relation pairs, the event description texts and the associated data indices.
In a preferred embodiment, entities such as variety, sector and market, and entities such as index and time, are first extracted and recognised. Here this embodiment adopts the UniLM-UIE model (Unified Language Model combined with the Universal Information Extraction framework, UIE) to implement multi-task unified extraction, completing tasks such as entity recognition, event extraction and event-element extraction needed to build industry-chain knowledge graphs across several futures domains. The model is based on the Transformer structure but adopts only its decoder part, and by modifying the attention mask it realises different language-model behaviours such as bidirectional, causal and Seq2Seq. Adopting only the decoder reduces model complexity and improves efficiency while preserving the capability of the underlying large model, thereby supporting multi-task pre-training on large-scale data in a joint training manner.
In the UniLM-UIE model, this embodiment uses a sequence-to-sequence attention mask (Seq2Seq Attention Mask) to realise bidirectional modelling of the input and unidirectional modelling of the output, i.e. conditional generation. Training of UniLM-UIE is relatively simple: the input sequence comprises the conditional input and the output sequence, and the attention mask is constructed from them. The UniLM model also incorporates two special markers, [CLS] and [SEP]: the [CLS] tag marks the start of a sentence and the [SEP] tag separates different sentences. A text of input length n=5, after basic character/word segmentation and position vectorisation, is processed through several Transformer layers, including the mask-matrix mechanism that controls the pre-training task, so that prediction attends only to the features relevant to the specific task; the semantic vector output generated for that task is finally turned into readable text through the decoder. When performing the Transformer processing, the attention mask mechanism can implement bidirectional, unidirectional and Seq2Seq language models; the sequence-to-sequence language model is adopted here. Compared with the traditional CRF approach, the UniLM-UIE model described here uses sequence-to-sequence input and output and can support multi-class prediction outputs with overlapping positions.
The input form adopted in this embodiment is as follows:
instruction[SEP]text[SEP]schema-type[SEP]element_1[unused1]element_2[unused2]...[SEP];
where instruction denotes the task type, e.g. extraction of a particular entity such as "variety", "index" or "event".
The specific end-to-end output form of UniLM-UIE is as follows:
value_1[unused1]value_2[unused2]...<T>value_3[unused1]value_4[unused2]...<T>[SEP];
where value_i is a generated value based on a character span (mainly recording the span's start/end positions and a result probability between 0 and 1), the predicted extraction type is a sentinel token of the form [unusedN], and from the correspondence between the element types of the input schema and the sentinel tokens the embodiment knows, for instance, that value_1 is of type element_1. The special division symbol <T> distinguishes different tuples, i.e. each tuple divided by <T> is a complete structured tuple.
For example, for the text "Glass supply is sufficient and stocks have risen slightly; as the peak season of glass demand draws to a close, glass prices may slide further", tasks such as variety, index and event are assembled with the text and fed into the UIE model for recognition, completing the identification of subjects (sector/variety and the like), indices, times and events in the text. The final result after UIE recognition is:
Variety: [(glass, [0, 1]), (glass, [18, 19])]
Index: [(supply, [2, 3]), (stock, [9, 10]), (demand, [20, 21]), (price, [31, 32])]
Event: [(peak season of glass demand drawing to a close, [18, 26])]
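Assembling the model input and parsing the structured output are plain string operations; a small sketch following the forms above (everything else is illustrative):

```python
SEP, T_SEP = "[SEP]", "<T>"

def build_input(instruction: str, text: str, schema_type: str,
                elements: list[str]) -> str:
    """Assemble instruction[SEP]text[SEP]schema-type[SEP]el_1[unused1]..."""
    schema = "".join(f"{el}[unused{i + 1}]" for i, el in enumerate(elements))
    return f"{instruction}{SEP}{text}{SEP}{schema_type}{SEP}{schema}{SEP}"

def parse_output(output: str, elements: list[str]) -> list[dict]:
    """Split the generated sequence on <T> into tuples, then map each
    value back to its element type via the sentinel tokens."""
    tuples = []
    for chunk in output.replace(SEP, "").split(T_SEP):
        record, rest = {}, chunk
        for i, el in enumerate(elements):
            sentinel = f"[unused{i + 1}]"
            if sentinel in rest:
                value, rest = rest.split(sentinel, 1)
                if value:
                    record[el] = value
        if record:
            tuples.append(record)
    return tuples

inp = build_input("entity extraction", "玻璃供应充足...", "variety-index",
                  ["variety", "index"])
print(parse_output("玻璃[unused1]供应[unused2]<T>玻璃[unused1]价格[unused2]<T>",
                   ["variety", "index"]))
# [{'variety': '玻璃', 'index': '供应'}, {'variety': '玻璃', 'index': '价格'}]
```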
In the training phase of the UniLM-UIE model, this embodiment initialises from RoBERTa-Large and uses the T5 pre-training method. T5 ("Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer") is a large-scale generative pre-trained model whose main idea is that every task can be cast as Text2Text: the input is a text, and the output of the corresponding task can also be converted to text format, i.e. an end-to-end formulation. T5 pre-training is divided into a supervised part and an unsupervised part. Specifically, for unsupervised pre-training this embodiment uses only MLM pre-training with a bidirectional attention mask during the RoBERTa-Large training stage (RoBERTa is a strongly optimised variant of the BERT training strategy); since the UIE input/output form constructed here is a piece of structured text extremely close to the T5 pre-training form, the training effect on the downstream task is improved. In the fine-tuning stage, the embodiment mainly adopts few-shot supervised training on extraction-task data (entity recognition, event extraction and relation extraction), which achieves good results when training data are limited.
Through the processing of the UniLM-UIE model, the embodiment extracts a large amount of entity information from the original report texts, specifically entity-category data such as variety, sector, market, time and index.
Second, this embodiment needs to further determine the fundamentals, news, technicals and capital-flow content described for each subject. For example, for a written article (obtained as above), the embodiment must determine whether the variety discussed in it (e.g. glass or iron ore) is being discussed in terms of fundamentals or technicals, and which indices and events the article specifically relates to on that aspect. That is, the embodiment determines the relation between subjects such as variety/sector/market and the specific data indices (e.g. output, stock, total imports/exports, warehouse receipts) or specific events (e.g. output increase, tariff reduction) of the fundamentals/technicals/capital flows and so on.
In this embodiment, the above UniLM-UIE is used to perform <subject, event/index> relation extraction together with event and relation extraction, obtaining the relations between variety/sector/market and index/event in the original research article. The embodiment thus completes the recognition of the main entities and relations of the original article.
Finally, index details and associated index data are queried according to the entity and relation results, the event descriptions are enriched, and the report-writing prompt is finally formed. For example, based on the subject–event relation pair <crude oil, decreasing export/import share>, the embodiment looks up the main news reported at the time and extracts the detailed description text of the event, forming the prompt text at the news-event level. In addition, according to the specific data indices involved in the article, the embodiment looks up the corresponding and associated index data in the market database table. The embodiment thus completes the enrichment of the event description texts and associated data indices of the original article, which together form the event prompt for the original research article. Prompts are generated according to the following pattern:
According to the following materials, please write a research article about (xx variety/sector/market), writing mainly from (xxx fundamentals/news/technicals/capital flows) and the like, and give corresponding operation suggestions; the whole should be no less than xx words; (# material attachment)
Fundamentals: (data index/list, e.g. the 2023 cotton planting area is enlarged/reduced by xx; rainfall/sunny days in the main cotton-producing areas have all increased/decreased ......);
News: (event data list, e.g. increasing/decreasing import/export tariffs for a specific region, etc.);
Technicals: (technical index data list, e.g. cotton has closed higher for 5 consecutive sessions, with the closing price at a 2-year high ......);
Capital flows: (capital index data list, e.g. cotton futures fund positions increased by x% ......).
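A sketch of this automatic prompt assembly as a plain template fill (the template wording mirrors the pattern above; the data structures and the helper name build_prompt are illustrative assumptions):

```python
PROMPT_TEMPLATE = (
    "According to the following materials, please write a research article "
    "about {subject}, writing mainly from {aspects} and the like, and give "
    "corresponding operation suggestions; no less than {min_words} words;\n"
    "Fundamentals: {fundamentals}\n"
    "News: {news}\n"
    "Technicals: {technicals}\n"
    "Capital flows: {funds}\n"
)

def build_prompt(subject: str, aspects: list[str], min_words: int,
                 materials: dict[str, list[str]]) -> str:
    """Assemble the report-writing prompt from extracted relation pairs,
    event descriptions and associated index data."""
    return PROMPT_TEMPLATE.format(
        subject=subject,
        aspects="/".join(aspects),
        min_words=min_words,
        fundamentals="; ".join(materials.get("fundamentals", [])),
        news="; ".join(materials.get("news", [])),
        technicals="; ".join(materials.get("technicals", [])),
        funds="; ".join(materials.get("funds", [])),
    )

prompt = build_prompt(
    "cotton", ["fundamentals", "technicals"], 800,
    {"fundamentals": ["2023 planting area reduced by 3%"],
     "technicals": ["5 consecutive up sessions, 2-year closing high"]},
)
```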
The embodiment thus completes the automatic assembly and generation of the report-writing prompt, which together with the original article forms the final report-writing training corpus <prompt, original research article>. Through the depth and structure of the prompt, the machine can imitate the thinking process of a research writer and thus write with a certain logical depth.
Step S3, training the pre-built large language model on the report-writing training corpus, the corpus comprising a plurality of sentences.
Through steps S1 and S2, the embodiment completes all preparation of the report-writing training data; the construction and training of the report-writing model are described in detail below. The model performs a generative conversion of the prompt and finally yields the written result. This embodiment adopts ChatGLM, based on the GLM large-language-model framework, to convert prompts into writing results automatically:
Step S31, extracting fragments from any sentence to generate a plurality of sentence fragments;
Step S32, replacing the positions of the fragments in the original sentence with markers to obtain the sentence to be trained, and splicing all fragments after the sentence to be trained to obtain the spliced sentence;
Step S33, predicting the spliced sentence in an autoregressive training manner to obtain the attention weight information of a bidirectional attention mechanism;
Step S34, training the large language model according to the report-writing training corpus and the attention weight information to obtain the report-writing model.
Specifically, ChatGLM is a generative dialogue language model based on the General Language Model (GLM) architecture; ChatGLM-6B is the model released after large-scale Chinese-English bilingual pre-training, with 6.2 billion parameters, a modified two-dimensional RoPE rotary position encoding, and a conventional feed-forward network (FFN) structure. Although ChatGLM-6B has only 6.2 billion parameters, supplemented with supervised fine-tuning, feedback bootstrapping, reinforcement learning from human feedback and other techniques, it performs excellently on Chinese tasks while greatly reducing inference cost and improving efficiency.
First, ChatGLM randomly extracts fragments from sentences: for example, for a sentence "x1, x2, x3, x4, x5, x6" in the training corpus, two fragments are randomly generated: "x3" and "x5, x6". In general, the extracted fragments amount to about 15% of the original sentence length.
Then the original sentence and the extracted fragments are separated, and the positions of the extracted fragments in the original sentence are replaced with [M] mask markers; as in FIG. 2, this yields the masked string Part A "x1, x2, [M], x4, [M]" and the sentence fragments Part B "x3" and "x5, x6" of the original sentence.
Finally, the extracted fragments are randomly spliced after the original sentence; referring to FIG. 3, sentence and fragments are connected with special markers (e.g. [M], [S], [E]), and the extracted fragments are predicted in an autoregressive training manner, obtaining the decoder/encoder attention weight matrices of the bidirectional attention mechanism of the Transformer structure. What is mainly learned is the attention weight of the output semantic vectors over the different input parts, realising the output text's selection preference over the original input text, i.e. the preference for and selection of the prompt's writing material.
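A toy sketch of this GLM-style span corruption on a whitespace-tokenised sentence (the 15% ratio follows the description above; masking a single span is a simplification, everything else is illustrative):

```python
import random

def glm_corrupt(tokens: list[str], mask_ratio: float = 0.15):
    """Mask a random span of the sentence (Part A) and append the span
    for autoregressive prediction (Part B), GLM-style."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - n_mask + 1)
    span = tokens[start:start + n_mask]               # fragment to predict
    part_a = tokens[:start] + ["[M]"] + tokens[start + n_mask:]
    part_b = ["[S]"] + span + ["[E]"]
    return part_a + part_b

random.seed(0)
print(glm_corrupt(["x1", "x2", "x3", "x4", "x5", "x6"]))
# e.g. ['x1', 'x2', '[M]', 'x4', 'x5', 'x6', '[S]', 'x3', '[E]']
```

During training, Part A is attended bidirectionally while Part B is predicted left to right, which yields the attention-mask pattern described above.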
When the ChatGLM-6B pre-trained language model is used for report writing, the <prompt, original report text> pairs generated in this embodiment are put into the content form {"content": "prompt", "summary": "research article"}; the ChatGLM integration in PaddleNLP is used for further generation-task fine-tuning of the model, and training is terminated by conditions such as the number of iterations and the loss value (which measures the divergence between the model's predicted output and the reference).
In this embodiment, the whole training process adopts Parameter-Efficient Fine-Tuning (PEFT) to fine-tune ChatGLM for the task. PEFT can adapt a pre-trained language model (PLM, here ChatGLM) to various downstream application tasks efficiently without fine-tuning all of its parameters: only a small number of (or additional) model parameters are tuned while most pre-trained parameters stay fixed, greatly reducing computation and storage cost, while state-of-the-art PEFT techniques can match the performance of full fine-tuning. This embodiment uses chatglm-6b; note that when the ChatGLM model is loaded, it is initialised through skip_init in PyTorch, which first places the parameters on the meta device, so the corresponding initialisation flag needs to be set to False when loading the model.
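As an illustration of such parameter-efficient tuning, the sketch below uses the Hugging Face peft library with LoRA (the embodiment itself fine-tunes through PaddleNLP; the LoRA hyper-parameters are assumptions, and the data record follows the {"content", "summary"} form above):

```python
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b",
                                          trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b",
                                  trust_remote_code=True)

# Freeze the 6.2B pre-trained weights; train only small LoRA adapters.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM,
                  r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a tiny fraction is trainable

# One training record in the corpus form used above.
record = {"content": "According to the following materials, please write ...",
          "summary": "Glass supply remains ample this quarter ..."}
```

The training loop itself (iteration count and loss-based stopping, as described above) is omitted for brevity.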
Step S4, acquiring the subject information set by the user, and generating the report text through the trained report-writing model and the corresponding prompt. The embodiment generates the report text according to the following steps:
Step S41, collecting news events and data indices related to the subject information;
Step S42, generating a prompt text from the subject information, the news events and the data indices;
Step S43, generating, by the report-writing model, the report text from the prompt text.
Specifically, after the ChatGLM fine-tuning, the embodiment obtains a report-generation model that can perform generative conversion of prompts and finally produce the report text. For an online writing task, the embodiment only needs to collect the automated news events and key data indices for the given variety/sector/market subject, form them into a prompt text, and pass it to the writing model to finally obtain the written result.
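A minimal online-inference sketch using the public ChatGLM-6B chat interface (assuming the fine-tuned weights from the previous step have been merged or loaded; the subject, events and indices in the prompt are illustrative):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b",
                                          trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b",
                                  trust_remote_code=True).half().cuda().eval()

prompt = ("According to the following materials, please write a research "
          "article about iron ore, writing mainly from fundamentals/news; "
          "Fundamentals: port stocks down 5% month-on-month; "
          "News: steel-mill production curbs announced")
report_text, _history = model.chat(tokenizer, prompt, history=[])
print(report_text)   # plain-text report, to be chart-enriched in step S5
```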
Step S5, acquiring the market data related to the report text, and inserting them into the report text in chart form to obtain the final research report. The previous four steps explained in full how to prepare the report-writing data, generate the prompts and train the model; with them, the embodiment can automatically collect event news and list the key indices of a certain variety/sector or market to form a prompt, and the model can generate plain-text output. In the futures research field, however, a plain-text result cannot convey enough information to the user efficiently, so this embodiment also inserts, based on the text, charts of the corresponding data-index market trends to increase the richness of the report. The following describes in detail how a final report rich in charts is automatically generated from already written text.
Step S51, identifying, through an index recognition model, the index category to which the index information in the report text belongs. This embodiment uses the UIE model to extract and recognise the indices in the report text; the specific index recognition model is the UniLM-UIE model of step S2, which can effectively recognise entity information such as variety, sector and index in the text. Here the embodiment focuses on the index categories in the text.
Specifically, the index category is aligned and mapped with the indices stored in the index library, and after alignment the index name corresponding to the category is obtained. The index library comprises a market index library and a data index library; the similarity between every index in the market index library and the index category is calculated by traversal, and type alignment is performed with the index of maximum similarity; after type alignment, standardised alignment is performed between the data indices in the report text and the indices in the data index library to obtain the index names.
Step S52, aligning and mapping the index category with the indices stored in the index library, and obtaining after alignment the index name corresponding to the index category.
For example, since an index in the automatically written text may differ textually from the index names in the market database table, the embodiment needs to map the index recognised by UIE onto the corresponding index in the market index library. This semantic-similarity alignment mapping is performed with the deep Siamese network model SBERT (Sentence-BERT). Both sub-networks of SBERT use the pre-trained language model BERT and share parameters; the overall structure is shown in FIG. 4. The two BERT towers semantically encode index U (recognised by the UIE model) and index V (from the data index library) respectively; pooling and semantic vectorisation yield vectors u and v, whose similarity is computed with the cosine similarity measure, and the index with the highest similarity is taken for alignment. After this alignment mapping, the embodiment performs standardised mapping of the data indices in the ChatGLM plain-text output onto the data index library.
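A hedged sketch of this alignment with the sentence-transformers library (the pre-trained model name is an assumption; the embodiment would use its own trained SBERT towers):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def align_index(recognised: str, library_indices: list[str]) -> str:
    """Map a UIE-recognised index string onto the closest index name
    in the market index library by cosine similarity."""
    u = model.encode(recognised, convert_to_tensor=True)
    v = model.encode(library_indices, convert_to_tensor=True)
    scores = util.cos_sim(u, v)[0]          # one score per library index
    return library_indices[int(scores.argmax())]

# Toy usage with hypothetical library entries.
print(align_index("港口库存", ["港口铁矿石库存", "螺纹钢产量", "焦炭价格"]))
```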
Step S53, extracting from the data index library the market data corresponding to the index name over a preset period. According to the aligned indices, the specific data for a certain time period are looked up in the data index library, and open-source tools such as ECharts are used for data assembly and rendering of various chart styles, finally yielding a research report whose rich chart styles match the text.
Step S54, converting the market data into pictures or tables according to preset rules;
Step S55, assembling and rendering the report text together with the pictures or tables to obtain a draft report;
Step S56, rendering the content of the draft report according to the style set by the user to obtain the final research report.
For example, the embodiment can apply formatting such as font size, style and line breaks to the text; it mainly uses the UIE model for heading recognition (i.e. sequence labelling) and applies the corresponding style rendering according to the recognition results for each heading level, so that the finally written research report appears as an article with an attractive style, a clean hierarchy and rich charts.
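A small pyecharts sketch of the rule-based chart conversion in step S54 (pyecharts is the Python wrapper around the ECharts tool named above; the rule, index name and data values are illustrative assumptions):

```python
from pyecharts import options as opts
from pyecharts.charts import Line

def render_index_chart(index_name: str, dates: list[str],
                       values: list[float], out: str = "chart.html") -> str:
    """Render one aligned index's period data as a line chart; the HTML
    (or a screenshot of it) is then assembled into the report text."""
    chart = (
        Line()
        .add_xaxis(dates)
        .add_yaxis(index_name, values)
        .set_global_opts(title_opts=opts.TitleOpts(title=index_name))
    )
    return chart.render(out)

render_index_chart("glass inventory (10kt)",
                   ["2023-09", "2023-10", "2023-11"],
                   [512.0, 534.5, 528.0])
```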
In summary, the method for intelligent research report writing based on a large language model provided by the embodiments of the application automatically generates research reports from existing manually written report data through artificial intelligence and natural language processing, without manual intervention. Its range of application is wide, including but not limited to futures research reports and financial reports, improving production efficiency and reducing time and labour cost. Moreover, the intelligent report writing based on a large language model can face the futures industry and perform automatic writing with rich charts, realising research on, and an attempt at, automatic report writing in the futures field.
In a second aspect, an embodiment of the present application provides an electronic device; FIG. 5 is a block diagram of the electronic device according to an exemplary embodiment. As shown in FIG. 5, the electronic device may comprise a processor 11 and a memory 12 storing computer program instructions.
In particular, the processor 11 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 12 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 12 may comprise a Hard Disk Drive (HDD), floppy Disk Drive, solid state Drive (Solid State Drive, SSD), flash memory, optical Disk, magneto-optical Disk, tape, or universal serial bus (Universal Serial Bus, USB) Drive, or a combination of two or more of the foregoing. The memory 12 may include removable or non-removable (or fixed) media, where appropriate. The memory 12 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 12 is a Non-Volatile (Non-Volatile) memory. In particular embodiments, memory 12 includes Read-Only Memory (ROM) and random access Memory (Random Access Memory, RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (Programmable Read-Only Memory, abbreviated PROM), an erasable PROM (Erasable Programmable Read-Only Memory, abbreviated EPROM), an electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, abbreviated EEPROM), an electrically rewritable ROM (Electrically Alterable Read-Only Memory, abbreviated EAROM), or a FLASH Memory (FLASH), or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or dynamic Random-Access Memory (Dynamic Random Access Memory DRAM), where the DRAM may be a fast page mode dynamic Random-Access Memory (Fast Page Mode Dynamic Random Access Memory FPMDRAM), extended data output dynamic Random-Access Memory (Extended Date Out Dynamic Random Access Memory EDODRAM), synchronous dynamic Random-Access Memory (Synchronous Dynamic Random-Access Memory SDRAM), or the like, as appropriate.
Memory 12 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 11.
The processor 11 reads and executes the computer program instructions stored in the memory 12 to implement the method for intelligent research report writing based on a large language model of any of the above embodiments.
In an embodiment, the electronic device may further comprise a communication interface 13 and a bus 10. As shown in fig. 5, the processor 11, the memory 12, and the communication interface 13 are connected to each other through the bus 10 and perform communication with each other.
The communication interface 13 is used to implement communication between the modules, apparatuses, units and/or devices in the embodiments of the present application. The communication interface 13 can also communicate with other components, such as external devices, image/data acquisition devices, databases, external storage and image/data processing workstations, for data communication.
Bus 10 includes hardware, software, or both, that couple components of an electronic device to each other. Bus 10 includes, but is not limited to, at least one of: data Bus (Data Bus), address Bus (Address Bus), control Bus (Control Bus), expansion Bus (Expansion Bus), local Bus (Local Bus). By way of example, and not limitation, bus 10 may include a graphics acceleration interface (Accelerated Graphics Port), abbreviated AGP, or other graphics Bus, an enhanced industry standard architecture (Extended Industry Standard Architecture, abbreviated EISA) Bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an industry standard architecture (Industry Standard Architecture, ISA) Bus, a wireless bandwidth (InfiniBand) interconnect, a Low Pin Count (LPC) Bus, a memory Bus, a micro channel architecture (Micro Channel Architecture, abbreviated MCa) Bus, a peripheral component interconnect (Peripheral Component Interconnect, abbreviated PCI) Bus, a PCI-Express (PCI-X) Bus, a serial advanced technology attachment (Serial Advanced Technology Attachment, abbreviated SATA) Bus, a video electronics standards association local (Video Electronics Standards Association Local Bus, abbreviated VLB) Bus, or other suitable Bus, or a combination of two or more of the foregoing. Bus 10 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a program is stored, and the program, when executed by a processor, implements the research report intelligent writing method based on a large language model provided in the first aspect.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the present invention may also take the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to carry out the steps of the research report intelligent writing method based on a large language model provided in the first aspect.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on a remote device.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; nevertheless, any combination of technical features that contains no contradiction should be considered within the scope of this specification.
The above examples represent only a few embodiments of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make various modifications and improvements without departing from the spirit of the present application, and all such modifications and improvements fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A research report intelligent writing method based on a large language model, characterized by comprising the following steps:
collecting research report data, and preprocessing the research report data to obtain original report-writing data;
performing information extraction on the original report-writing data to obtain a writing prompt, and forming a writing training corpus from the writing prompt and the corresponding original report-writing data;
training the large language model according to the writing training corpus to obtain a writing model;
acquiring subject information set by a user, and generating a writing text through the writing model and the subject information;
and acquiring market data related to the writing text, and inserting the market data into the writing text in the form of a chart to obtain a final research report.
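For illustration only, and not as part of the claims: a minimal Python sketch of the five recited steps, with each claimed component reduced to a stub; every function name and data value here is hypothetical.

    # Hypothetical end-to-end sketch of claim 1; all helpers are trivial stand-ins.
    def collect_and_preprocess():
        # step 1: gather research report data and strip non-report content (stubbed)
        return ["Rebar demand weakened in Q3 while inventories kept falling."]

    def extract_prompt(doc):
        # step 2: distil a writing prompt from an original report (stubbed)
        return "Write a futures research report about: " + doc[:40]

    def train_writing_model(corpus):
        # step 3: fine-tune the large language model on (prompt, report) pairs;
        # stubbed here as an echo model
        return lambda prompt: "[generated report for] " + prompt

    def build_report(subject):
        docs = collect_and_preprocess()
        corpus = [(extract_prompt(d), d) for d in docs]
        model = train_writing_model(corpus)
        draft = model("subject=" + subject)                 # step 4: generate the writing text
        market = {"rebar spot price": [3890, 3910, 3875]}   # step 5: related market data (stubbed)
        return draft + "\n[chart: " + str(market) + "]"     # insert the data as a chart placeholder

    print(build_report("rebar futures"))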
2. The research report intelligent writing method according to claim 1, wherein performing information extraction on the original report-writing data to obtain the writing prompt comprises:
extracting the relation between a subject and an event/index through a UniLM-UIE model to obtain a subject-event/index relation pair;
acquiring, from current mainstream news and a market database table, the event description text and the associated data index related to the relation pair;
and generating the writing prompt according to the relation pair, the event description text, and the associated data index.
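For illustration only: a sketch of assembling the writing prompt from subject-event/index relation pairs; the UniLM-UIE extraction is mocked by a fixed return value, and the database lookups and all names are assumed.

    def mock_unilm_uie(text):
        # stands in for the UniLM-UIE model's subject / event-index relation extraction
        return [("rebar", "inventory decline")]

    def build_writing_prompt(text, news_db, market_db):
        parts = []
        for subject, event in mock_unilm_uie(text):
            desc = news_db.get(event, "")          # event description text from mainstream news
            index = market_db.get(subject, {})     # associated data index from the market table
            parts.append(f"subject={subject}; event={event}; context={desc}; data={index}")
        return "Write a research report given: " + " | ".join(parts)

    print(build_writing_prompt(
        "Rebar inventories fell for a fourth straight week.",
        news_db={"inventory decline": "social inventories down 5% week on week"},
        market_db={"rebar": {"spot price": 3890}},
    ))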
3. The research report intelligent writing method according to claim 1, wherein the writing training corpus comprises a plurality of sentences, and training the large language model according to the writing training corpus to obtain the writing model comprises:
extracting from any one sentence to generate a plurality of sentence fragments;
replacing the positions of the sentence fragments in the original sentence with markers to obtain a sentence to be trained, and splicing all the sentence fragments after the sentence to be trained to obtain a spliced sentence;
predicting the spliced sentence in an autoregressive training manner to obtain attention weight information of a bidirectional attention mechanism;
and training the large language model according to the writing training corpus and the attention weight information to obtain the writing model.
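For illustration only: a sketch of the claimed sample construction, which resembles GLM-style blank infilling (that reading is an assumption on my part); fragments are cut from a sentence, replaced by markers, and spliced after the corrupted sentence for autoregressive prediction.

    def make_training_sample(tokens, spans=((1, 2), (4, 1))):
        # spans: non-overlapping (start, length) pairs, replaced right-to-left so
        # earlier indices stay valid; each span becomes a [MASK_i] marker and its
        # fragment is spliced after the corrupted sentence
        out = list(tokens)
        tail = []
        for i, (start, length) in sorted(enumerate(spans), key=lambda e: -e[1][0]):
            fragment = out[start:start + length]
            out[start:start + length] = [f"[MASK_{i}]"]
            tail = [f"[SOS_{i}]"] + fragment + tail
        # the corrupted prefix gets bidirectional attention; the spliced tail is
        # predicted autoregressively, token by token
        return out + tail

    print(make_training_sample("steel demand should recover in Q4".split()))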
4. The research report intelligent writing method according to claim 1, wherein generating the writing text through the writing model and the subject information comprises:
collecting news events and data indices related to the subject information;
generating a prompt text according to the subject information, the news events, and the data indices;
and generating, by the writing model, the writing text according to the prompt text.
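For illustration only: a sketch of the generation step, where writing_model is a stand-in for the fine-tuned model and the subject, events, and indices are invented values.

    def generate_report(writing_model, subject, events, indices):
        # assemble the prompt text from subject information, news events, and data indices
        prompt = ("Subject: " + subject + "\n"
                  "Related news events: " + "; ".join(events) + "\n"
                  "Data indices: " + str(indices) + "\n"
                  "Task: write a futures research report.")
        return writing_model(prompt)

    stub_model = lambda p: "[generated report]\n" + p       # stands in for the writing model
    print(generate_report(stub_model,
                          subject="iron ore",
                          events=["port inventories rose 2% week on week"],
                          indices={"DCE iron ore main contract": 812.5}))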
5. The research report intelligent writing method according to claim 1, wherein acquiring the market data related to the writing text comprises:
identifying, through an index recognition model, the index category to which the index information in the writing text belongs;
performing alignment mapping between the index category and the indices stored in an index library, and obtaining, after alignment, the index name corresponding to the index category;
and extracting, from a data index library, the market data corresponding to the index name within a preset period.
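For illustration only: a sketch of index recognition and market-data retrieval, with a keyword test standing in for the index recognition model and in-memory dictionaries standing in for the index and data libraries.

    INDEX_LIBRARY = {"price": "rebar_spot_price", "inventory": "rebar_social_inventory"}
    DATA_STORE = {"rebar_spot_price": [3890, 3910, 3875]}   # assumed daily series

    def recognise_index_category(sentence):
        # stands in for the index recognition model
        return "price" if "price" in sentence else "inventory"

    def fetch_market_data(sentence, window=3):
        category = recognise_index_category(sentence)
        name = INDEX_LIBRARY[category]                      # alignment mapping to an index name
        return name, DATA_STORE.get(name, [])[-window:]     # series for the preset period

    print(fetch_market_data("Spot price pressure persisted this week."))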
6. The research report intelligent writing method according to claim 5, wherein the index library comprises a market index library and a data index library, and performing alignment mapping between the index category and the indices stored in the index library comprises:
traversing all indices in the market index library, calculating their similarity to the index category, and performing type alignment with the index of greatest similarity;
and after type alignment, performing standardized alignment between the data indices in the writing text and the indices in the data index library to obtain the index name.
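For illustration only: a sketch of the traversal-and-similarity alignment, using string similarity from Python's standard library as a stand-in for whatever similarity measure the claim intends.

    from difflib import SequenceMatcher

    def align_category(category, market_index_library):
        # traverse the library and keep the index most similar to the category
        return max(market_index_library,
                   key=lambda idx: SequenceMatcher(None, category, idx).ratio())

    print(align_category("rebar spot px", ["rebar spot price", "hot-rolled coil price"]))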
7. The research report intelligent writing method according to claim 1, wherein inserting the market data into the writing text in the form of a chart to obtain the final research report comprises:
converting the market data into a picture or a table according to preset rules;
performing data assembly and rendering on the writing text and the picture or table to obtain a writing report;
and rendering the content of the writing report according to a style set by the user to obtain the final research report.
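For illustration only: a sketch of chart conversion and report assembly; matplotlib and the placeholder-replacement scheme are my choices here, not the patent's.

    import matplotlib.pyplot as plt

    def render_chart(name, series, path="chart.png"):
        # convert a market data series into a picture according to a preset rule
        fig, ax = plt.subplots()
        ax.plot(range(len(series)), series)
        ax.set_title(name)
        fig.savefig(path)                                   # the picture to embed in the report
        plt.close(fig)
        return path

    def assemble_report(text, chart_path):
        # data assembly: a placeholder in the draft is replaced by a chart reference,
        # after which a user-selected style could be applied in a final rendering pass
        return text.replace("[CHART]", "<img src='" + chart_path + "'/>")

    print(assemble_report("Prices rebounded late in the week. [CHART]",
                          render_chart("rebar spot price", [3890, 3910, 3875])))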
8. The research report intelligent writing method according to claim 1, wherein collecting the research report data and preprocessing the research report data to obtain the original report-writing data comprises:
extracting shared open-source research report data from a report analysis platform, wherein the research report data comprises document data and web page data;
parsing the document data through a parsing tool to obtain first feature information;
performing feature recognition on the web page data through a web page information extraction model to obtain second feature information;
and eliminating non-report-writing information from the first feature information and the second feature information to obtain the original report-writing data.
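For illustration only: a sketch of the collection-and-preprocessing step, with regex tag stripping standing in for the web page information extraction model and a fixed noise list standing in for the non-report-content filter; both are assumptions.

    import re

    def parse_web_page(html):
        # crude tag stripping stands in for the web page information extraction model
        return re.sub(r"<[^>]+>", "\n", html)

    def is_report_content(line):
        # stands in for the non-report-content filter; the noise list is invented
        return bool(line.strip()) and not any(w in line for w in ("subscribe", "copyright"))

    def preprocess(documents, pages):
        texts = documents + [parse_web_page(p) for p in pages]
        return [" ".join(l.strip() for l in t.splitlines() if is_report_content(l))
                for t in texts]

    print(preprocess(["Q3 steel outlook: margins compress."],
                     ["<p>Inventories fell.</p><p>subscribe now</p>"]))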
9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the research report intelligent writing method based on a large language model according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the research report intelligent writing method based on a large language model according to any one of claims 1 to 8.
CN202311630628.5A 2023-12-01 2023-12-01 Method, equipment and storage medium for intelligent writing of research report based on large language model Active CN117332180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311630628.5A CN117332180B (en) 2023-12-01 2023-12-01 Method, equipment and storage medium for intelligent writing of research report based on large language model

Publications (2)

Publication Number Publication Date
CN117332180A true CN117332180A (en) 2024-01-02
CN117332180B CN117332180B (en) 2024-03-12

Family

ID=89277774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311630628.5A Active CN117332180B (en) 2023-12-01 2023-12-01 Method, equipment and storage medium for intelligent writing of research report based on large language model

Country Status (1)

Country Link
CN (1) CN117332180B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442772A (en) * 2019-08-13 2019-11-12 深圳司南数据服务有限公司 A kind of intelligence grinds report generation method and terminal
CN112966097A (en) * 2021-03-09 2021-06-15 华泰证券股份有限公司 NLP-based marketing company financial news-express automatic generation method and system
CN113487024A (en) * 2021-06-29 2021-10-08 任立椋 Alternate sequence generation model training method and method for extracting graph from text
CA3116119A1 (en) * 2020-04-29 2021-10-29 Trueblue, Inc. Recommendation platform for skill development
CN114492327A (en) * 2021-12-28 2022-05-13 中科曙光南京研究院有限公司 Intelligent writing method for official documents
WO2022188584A1 (en) * 2021-03-12 2022-09-15 京东科技控股股份有限公司 Similar sentence generation method and apparatus based on pre-trained language model
CN115358201A (en) * 2022-08-03 2022-11-18 浙商期货有限公司 Processing method and system for delivery and research report in futures field
CN116775872A (en) * 2023-06-19 2023-09-19 中国电信股份有限公司 Text processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Xiaohong; He Ting; Wang Huazhen; Chen Jian: "A data-to-text generation method combining the Transformer model with deep neural networks", Journal of Chongqing University, vol. 43, no. 07

Similar Documents

Publication Publication Date Title
CN109857990B (en) Financial bulletin information extraction method based on document structure and deep learning
CN112801010A (en) Visual rich document information extraction method for actual OCR scene
CN110222188A (en) A kind of the company&#39;s bulletin processing method and server-side of multi-task learning
CN113743097B (en) Emotion triplet extraction method based on span sharing and grammar dependency relationship enhancement
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN115952291B (en) Financial public opinion classification method and system based on multi-head self-attention and LSTM
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN114780582A (en) Natural answer generating system and method based on form question and answer
CN110222338A (en) A kind of mechanism name entity recognition method
CN115048511A (en) Bert-based passport layout analysis method
CN112784580A (en) Financial data analysis method and device based on event extraction
CN112966097A NLP-based automatic generation method and system for listed-company financial news flashes
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN114356924A (en) Method and apparatus for extracting data from structured documents
Douzon et al. Improving information extraction on business documents with specific pre-training tasks
CN112818117A (en) Label mapping method, system and computer readable storage medium
CN117332180B (en) Method, equipment and storage medium for intelligent writing of research report based on large language model
CN117034948A (en) Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN116821351A (en) Span information-based end-to-end power knowledge graph relation extraction method
CN113505207B (en) Machine reading understanding method and system for financial public opinion research report
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement
CN116304023A (en) Method, system and storage medium for extracting bidding elements based on NLP technology
CN112148879A (en) Computer readable storage medium for automatically labeling code with data structure
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant