CN110362797B

CN110362797B - Research report generation method and related equipment

Info

Publication number: CN110362797B
Application number: CN201910513763.9A
Authority: CN
Inventors: 胡文馨
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2019-06-14
Filing date: 2019-06-14
Publication date: 2023-10-13
Anticipated expiration: 2039-06-14
Also published as: CN110362797A

Abstract

The invention discloses a research report generation method and related equipment. The invention builds a research report dictionary from multiple collected research reports, and then automatically outputs the corresponding research report based on an event text, the research report dictionary, an outline generation model, and a report generation model. The outline generation model and the report generation model each select words from the research report dictionary according to the principle of optimal probability to form a word sequence, which serves as the report outline and the research report respectively. This overcomes the technical problem in the prior art that writing reports manually consumes a great deal of effort and labor cost, and research reports generated from a report outline are of higher quality.

Description

Research report generation method and related equipment

Technical field

The present invention relates to the field of report generation, and in particular to a research report generation method and related equipment.

Background

LSTM (Long Short-Term Memory) is a long short-term memory network, a type of recurrent neural network suited to processing and predicting important events separated by relatively long intervals and delays in a time series.

A Variational Auto-Encoder (VAE) is a deep generative model.

The financial field involves a large amount of fixed-format report writing, such as research reports, prospectuses, and letters of investment intent, and reports for different industries and companies have different requirements. Writing these reports often demands timeliness as well as extensive data collection and analysis. In the prior art, the data is generally collected and analyzed and the report written by hand, so labor costs are high and a great deal of human effort is consumed.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, one object of the present invention is to provide a research report generation method and related equipment for automatically generating a research report from an event text.

The technical solution adopted by the present invention is as follows.

In a first aspect, the present invention provides a research report generation method, including:

Research report collection step: obtain multiple research reports from multiple information sources;

Dictionary acquisition step: perform data preprocessing and feature selection on the multiple research reports to build a research report dictionary;

Outline acquisition step: obtain the report outline corresponding to an event text according to the event text, the research report dictionary, and an outline generation model, where the outline generation model selects multiple words from the research report dictionary according to the principle of optimal probability to form a word sequence that serves as the report outline;

Report generation step: obtain a research report according to the event text, the report outline, the research report dictionary, and a report generation model, where the report generation model selects multiple words from the research report dictionary according to the principle of optimal probability to form a word sequence that serves as the research report.

Further, the report generation model selects words from the research report dictionary and outputs them one by one to generate the research report.

Further, the outline generation model updates the report outline according to the event text, the report outline, the research report dictionary, and the previous word output by the report generation step.

Further, the dictionary acquisition step also includes: adding a start tag and an end tag to the research report dictionary.

Further, the outline generation model includes:

Performing vector representation on the event text and the start tag according to the research report dictionary to obtain an event vector and a start tag vector;

Obtaining the hidden state of the event text and the hidden state of the start tag according to the event vector, the start tag vector, and an LSTM network;

Obtaining the report outline according to the hidden state of the event text, the hidden state of the start tag, and an attention mechanism.

Further, the outline generation model also includes:

Performing vector representation on the previous word output by the report generation step according to the research report dictionary to obtain a word vector;

Obtaining the hidden state of the word according to the word vector and the LSTM network;

Updating the report outline according to the hidden state of the word, the hidden state of the event text, and the attention mechanism.

Obtaining the report outline according to the event text, the start tag, and a Transformer model;

Updating the report outline according to the previous word output by the report generation step, the event text, and the Transformer model.

Further, the report generation model includes a VAE generation model.

Further, the research report generation method also includes:

Performing event entity recognition on the research reports obtained in the research report collection step to obtain the corresponding event texts. The multiple research reports and their corresponding event texts form a training data set, which is used to train the outline generation model and the report generation model.

In a second aspect, the present invention provides a research report generation device, including:

A research report collection module, configured to obtain multiple research reports from multiple information sources;

A dictionary acquisition module, configured to perform data preprocessing and feature selection on the multiple research reports to build a research report dictionary;

An outline acquisition module, configured to obtain the report outline corresponding to an event text according to the event text, the research report dictionary, and an outline generation model, where the outline generation model selects multiple words from the research report dictionary according to the principle of optimal probability to form a word sequence that serves as the report outline;

A report generation module, configured to obtain a research report according to the event text, the report outline, the research report dictionary, and a report generation model, where the report generation model selects multiple words from the research report dictionary according to the principle of optimal probability to form a word sequence that serves as the research report.

In a third aspect, the present invention provides a computer-readable storage medium storing computer-executable instructions, the instructions being used to cause a computer to execute the research report generation method described above.

The beneficial effects of the present invention are:

The present invention builds a research report dictionary from multiple collected research reports, and then automatically outputs the corresponding research report based on the event text, the research report dictionary, the outline generation model, and the report generation model. The outline generation model and the report generation model select words from the research report dictionary according to the principle of optimal probability to form word sequences that serve as the report outline and the research report. This overcomes the technical problem in the prior art that writing reports manually consumes a great deal of effort and labor cost, and research reports generated from a report outline are of higher quality.

Description of the drawings

Figure 1 is a flow chart of one embodiment of the research report generation method of the present invention;

Figure 2 is a flow chart of a first specific embodiment of the research report generation method of the present invention;

Figure 3 is a schematic diagram of the training process for Figure 2;

Figure 4 is a schematic diagram of an example of generating a research report with the research report generation method of the present invention;

Figure 5 is a flow chart of a second specific embodiment of the research report generation method of the present invention;

Figure 6 is a schematic diagram of the training process for Figure 5;

Figure 7 is a structural block diagram of one embodiment of the research report generation device of the present invention.

Detailed description

It should be noted that, provided there is no conflict, the embodiments of this application and the features in those embodiments may be combined with one another.

Example 1

The main idea of the research report generation method in this embodiment is as follows: just as one drafts an outline before writing, an outline of the research report is first generated from the event text, and the final research report is then generated from the outline and the event text. Here, event text means event-narrative text, that is, a body of text describing one or more events (or issues). One example is event-driven financial news (hereinafter "news"), on which people base macro analysis to produce macro research reports. Another example is argumentative essay writing in the Chinese-language paper of the college entrance examination: the prompt gives a text about some issue or event and asks for an essay based on it, and the resulting essay corresponds to the research report in this document. Specifically, referring to Figure 1, which is a flow chart of one embodiment of the research report generation method of the present invention, the method includes:

Research report collection step: obtain multiple research reports from multiple information sources. An information source may be the Internet or a paper-based source (such as a library). In the financial field, for example, research reports may be obtained from financial websites such as Sina Finance, East Money, and Tonghuashun, while essays may be obtained from libraries or online essay collections;

Dictionary acquisition step: perform data preprocessing and feature selection on the collected research reports to build the research report dictionary. Data preprocessing includes segmenting the research reports into words; feature selection then counts how many times each word appears in the reports and decides, based on that count, whether the word is placed in the dictionary;

Outline acquisition step: obtain the report outline corresponding to the event text according to the event text, the research report dictionary, and the outline generation model. The outline generation model selects multiple words from the research report dictionary according to the principle of optimal probability to form a word sequence that serves as the report outline. Words may be selected by a global search that outputs the word with the optimal probability, in which case one word is output at a time and the words together form the outline. Alternatively, a beam search algorithm may be used to find a preset number of words with the best probabilities, in which case a preset number of words is output at each step and, from the words output over multiple steps, words are then selected according to the principle of optimal probability to form the outline;

Report generation step: obtain the research report according to the event text, the report outline, the research report dictionary, and the report generation model. The report generation model selects multiple words from the research report dictionary according to the principle of optimal probability to form a word sequence that serves as the research report. As with outline generation, words may be output either by global search for the optimal probability or by beam search, which is not repeated here.

By building a research report dictionary from multiple collected research reports and then automatically outputting the corresponding research report based on the event text, the dictionary, the outline generation model, and the report generation model, the present invention overcomes the technical problem in the prior art that writing reports manually consumes a great deal of effort and labor cost; it frees up manpower, improves the efficiency of producing research reports, and reduces labor costs. Research reports generated from a report outline are also of higher quality. Further, the overall framework of the method is an Encoder-Decoder structure, which takes a sequence as input and produces a sequence as output. Encoding converts the input sequence into a fixed-length vector; decoding converts that fixed-length vector into an output sequence. Here the Encoder turns the event text sequence into a fixed-length vector representation, and the Decoder turns that representation into a variable-length text sequence, namely the research report.

The following takes research reports in the financial field as an example to explain how the research reports are collected and the research report dictionary is built:

First, the texts of research reports in the macroeconomics sections of websites such as Sina Finance, East Money, and Tonghuashun are crawled (110,000 records in total). Chinese characters and Chinese punctuation marks are then extracted from the reports with regular expressions. Digits are discarded because it is difficult to generate them directly with high accuracy. Next, jieba word segmentation splits the text into individual words to obtain a word set. To improve computational efficiency, only words that occur more than 5 times in the word set are placed in the research report dictionary. In addition, the dictionary acquisition step adds a start tag and an end tag to the dictionary. Specifically, the first four keys of the dictionary (a key can be understood as an element of the dictionary) are the mask tag (mask), the unknown tag (unk), the start tag (start), and the end tag (end). The mask tag is used to mask out information that is not needed. The unknown tag stands for words in the word set that are not in the dictionary, typically infrequent but meaningful words such as institution names and personal names. The start tag and end tag are added to the beginning and end of each piece of text data to mark where it starts and ends. The keys of the dictionary are words (including individual punctuation marks), and the values are the words' index numbers.
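The dictionary construction described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the exact regular expression, and the spelling of the four special tokens are assumptions, and the input is assumed to be already segmented into words (for example by jieba, which is not called here so that the sketch needs only the standard library).

```python
import re
from collections import Counter

# The four special tokens occupy the first four indices, in the order the text gives.
# Their exact spelling here is an assumption made for illustration.
SPECIAL_TOKENS = ["<mask>", "<unk>", "<start>", "<end>"]

# Assumed pattern: CJK unified ideographs plus a few common Chinese punctuation marks.
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff，。！？；：、]+")


def keep_chinese(text):
    """Keep Chinese characters and punctuation; drop digits, Latin letters, symbols."""
    return "".join(CHINESE_ONLY.findall(text))


def build_dictionary(tokenized_reports, min_count=5):
    """Map every word occurring more than `min_count` times to an integer index."""
    counts = Counter(word for report in tokenized_reports for word in report)
    dictionary = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    for word, count in counts.most_common():
        if count > min_count:
            dictionary[word] = len(dictionary)
    return dictionary


def encode(words, dictionary):
    """Wrap a word list in <start>/<end> and map out-of-dictionary words to <unk>."""
    unk = dictionary["<unk>"]
    body = [dictionary.get(word, unk) for word in words]
    return [dictionary["<start>"]] + body + [dictionary["<end>"]]
```

With this layout, a rare word such as an institution name falls back to the `<unk>` index at encoding time, which matches the role the text assigns to the unknown tag.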

In addition, the research report generation method also includes:

Event entity recognition is performed on the research reports obtained in the research report collection step to obtain the corresponding event texts. The research reports and their corresponding event texts form a training data set, which is used to train the outline generation model and the report generation model. Taking research reports in the financial field as an example, after the macro research reports have been crawled, they are processed with rule matching and part-of-speech matching to obtain news text: text containing event-related content is extracted as the news, so that the data becomes a set of one-to-one pairs of news and macro research report that serves as the training data set. Specifically, a research report can be split into paragraphs and stored as an array, and the text in the array is then matched against preset event rules to obtain the event paragraphs as the news text.
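The paragraph-splitting and rule-matching step can be sketched as below. The patent does not list its event rules, so the two regular expressions in `EVENT_RULES` are purely hypothetical placeholders, as are the function names; part-of-speech matching is omitted because it would need a tagger outside the standard library.

```python
import re

# Hypothetical event rules for illustration only; the actual rules are not given.
EVENT_RULES = [
    re.compile(r"(发布|公布|宣布|召开)"),   # announcement-style verbs
    re.compile(r"\d{1,2}月\d{1,2}日"),      # an explicit calendar date
]


def split_paragraphs(report):
    """Cut a report into non-empty paragraphs and return them as a list (the 'array')."""
    return [part.strip() for part in report.split("\n") if part.strip()]


def extract_event_text(report):
    """Return the paragraphs that match at least one event rule."""
    return [
        paragraph
        for paragraph in split_paragraphs(report)
        if any(rule.search(paragraph) for rule in EVENT_RULES)
    ]


def build_training_pairs(reports):
    """Pair each report with its extracted event text; skip reports with no match."""
    pairs = []
    for report in reports:
        events = extract_event_text(report)
        if events:
            pairs.append(("".join(events), report))
    return pairs
```

Each resulting pair holds the extracted news text first and the full research report second, giving the one-to-one news and report pairs that the text describes as the training data set.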

This embodiment provides two outline generation models. For the first, refer to Figure 2, which is a flow chart of the first specific embodiment of the research report generation method of the present invention. Taking financial news as the event text, the first outline generation model includes:

Performing vector representation on the event text and on the start tag (<start>) according to the research report dictionary to obtain an event vector and a start tag vector. In this embodiment, word embedding can be used to convert a text into a fixed-length vector representation.

Obtaining the hidden state of the event text from the event vector and a bidirectional LSTM network, and obtaining the hidden state of the start tag from the start tag vector and a unidirectional LSTM network. A unidirectional LSTM network needs less processing time, whereas a bidirectional LSTM network captures semantic dependencies in both directions better;

Obtaining the report outline according to the hidden state of the event text, the hidden state of the start tag, and the attention mechanism;

Then performing vector representation on the previous word output by the report generation step according to the research report dictionary to obtain a word vector; word embedding can likewise be used to convert the word into a word vector;

Obtaining the hidden state of the word from the word vector and the unidirectional LSTM network;

Updating the report outline according to the hidden state of the word, the hidden state of the event text, and the attention mechanism.

In this embodiment, the report generation model includes a VAE generation model. In the first outline generation model, the attention mechanism yields a sequence of probability distributions over words; the outline generation model then uses this probability distribution information to select the word combination with the optimal probability from the research report dictionary and outputs those words as the report outline. Similarly, in the report generation model, processing by the VAE generation model yields another set of probability distribution information, from which the report generation model selects the word combination with the optimal probability from the dictionary and outputs it to form the research report. Specifically, referring to Figure 2, the beginning of the research report is marked with a <start> tag and the end with an <end> tag. Starting from the <start> tag, the first outline generation model and the VAE generation model predict the probability distribution of the next word of the research report, and words are then selected from the dictionary according to that distribution and output to generate the research report. Put simply, when the method runs, the news and a <start> tag are input first, and after processing by the outline generation model and the report generation model the first word of the research report is output. This first output word is fed back to the input in place of the start tag to produce the second output; the second output is fed back to the input in place of the first output, and so on. Words are output one by one until the output is the end tag <end>, at which point generation stops and the final research report is complete. In other words, each output word depends on the previous output word, and the report outline is updated from the previous output word and the input news, which effectively improves the quality of the generated research report. When selecting words from the dictionary according to the probability distribution, the word with the highest probability can be chosen by a global search for the optimal solution (so that each step outputs exactly one word). However, when the dictionary is very large, a global search for the optimal solution has very low space efficiency. A beam search algorithm can therefore be used to improve search efficiency. Beam search uses the beam size parameter to limit the number of candidate words kept at each step, and it considers not only the probability of a single word but also the probability of consecutive words taken together. Accordingly, after processing by the VAE generation model, beam search is used to obtain a preset number of the most probable candidate words. In this embodiment beam size = 3, that is, the preset number is 3, and the three most likely results are kept at each step. Each prediction yields 3 outputs, which are fed back to the input for the next prediction. When prediction ends, one of every 3 outputs is selected as the final output according to the principle of optimal probability, based on all the words output for generating the research report, which gives the final research report.
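The two word-selection strategies can be contrasted with a small sketch. It implements the textbook forms of greedy decoding and beam search, which may differ in detail from the selection procedure of the embodiment. The callable `next_word_probs` stands in for the combined outline and report generation models, and `TOY_MODEL` is an invented table used only to make the example runnable.

```python
import math


def greedy_decode(next_word_probs, max_len=20):
    """Feed each output word back in and always pick the single most probable word."""
    words = ["<start>"]
    for _ in range(max_len):
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)
        if best == "<end>":
            break
        words.append(best)
    return words[1:]


def beam_search(next_word_probs, beam_size=3, max_len=20):
    """Keep the `beam_size` best partial sequences, scored by summed log-probability."""
    beams = [(0.0, ["<start>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, prob in next_word_probs(words).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), words + [word]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, words in candidates[:beam_size]:
            (finished if words[-1] == "<end>" else beams).append((score, words))
        if not beams:
            break
    pool = finished or beams  # prefer sequences that reached <end>
    if not pool:
        return []
    best_words = max(pool, key=lambda item: item[0])[1]
    return [w for w in best_words if w not in ("<start>", "<end>")]


# Invented next-word table keyed on the previous word, for illustration only.
TOY_MODEL = {
    "<start>": {"经济": 0.6, "政策": 0.4},
    "经济": {"增长": 0.55, "放缓": 0.45},
    "政策": {"宽松": 0.9, "<end>": 0.1},
    "增长": {"<end>": 1.0},
    "放缓": {"<end>": 1.0},
    "宽松": {"<end>": 1.0},
}


def toy_next_word_probs(words):
    return TOY_MODEL[words[-1]]
```

On this toy table, greedy decoding commits to the locally best first word and ends with a sequence of probability 0.33, while beam search with beam size 3 keeps the alternative first word alive and finds a sequence of probability 0.36. This illustrates the remark that beam search considers the probability of consecutive words together rather than one word at a time.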

The outline generation model and the VAE generation model must be trained before use. Refer to Figure 3, which is a schematic diagram of the training process for Figure 2. Taking financial news as the event text, the training process of the first outline generation model is described below:

First, the input news is defined as X, where x denotes a word in the news. The news is expressed as follows:

$$X = (x_1, \ldots, x_m) \tag{1}$$

Decoding of the latent vector has two stages. The first stage generates the outline O of the research report, where o denotes a word in the outline; the second stage generates the final research report Y, where y denotes a word in the research report. With the length of the generated text defined as L, the expressions are as follows:

$$O = (o_1, \ldots, o_L) \tag{2}$$

$$Y = (y_1, \ldots, y_L) \tag{3}$$

Next, Chinese characters are extracted from the news and macro research reports in the training data set by regular-expression matching, the text is segmented into words with jieba, and preliminary length statistics are computed in units of words. The resulting length statistics are shown in Table 1:

Table 1 Length statistics of news and research reports

Then, for training purposes, the news and research reports are each truncated or padded to a preset fixed length, for example 30 words for news and 200 words for research reports. The input news text and research report text are converted into vector representations through word embedding to obtain the news vector and the research report vector; a single Embedding layer is used here directly. Because a unidirectional LSTM network cannot encode back-to-front information when modeling a sentence, whereas a bidirectional LSTM network captures semantic dependencies in both directions better, the embedded news vector is then fed into a bidirectional LSTM network. Feeding the news vector into the bidirectional LSTM network yields the hidden state of the news, denoted H, where h is a hidden sub-state and t is the time step. The expression of h is as follows:

To generate high-quality research reports, the encoding and decoding processes of the model need to fully absorb the structure and content of the corresponding research report. The Decoder uses an LSTM network to predict the probability distribution over the vocabulary of the next word of the research report. The hidden state of the macro research report is denoted S; the hidden state at each time step depends on the input at the previous time step and the hidden state at the previous time step. The expression is as follows:

The hidden state of the news and the hidden state of the macro research report are passed through the attention mechanism to compute attention scores (equation (6)); multiplicative attention is used here to accumulate the attention scores. A softmax function then computes the attention weights (equation (7)). The context vector is obtained as the weighted average of the news hidden states under the attention weights (equation (8)), that is, by multiplying the attention weights with the news hidden states. Equations (6), (7), and (8) are as follows:

The context vector and the hidden state of the macro research report are then concatenated and used to update the attention hidden state. The predicted output for an input word is computed from the attention hidden state. The expressions are as follows:

Here, Wc is a model parameter.
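The attention computation described above (scores, softmax weights, context vector, then the attention hidden state) can be sketched in plain Python. The dot-product form of the multiplicative score and the `tanh` applied after the `W_c` projection follow the common Luong-style formulation and are assumptions here, since the equations themselves are not reproduced in this text. All function and variable names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(news_states, decoder_state, W_c):
    """One decoder step of multiplicative attention over the news hidden states.

    news_states   : list of news hidden-state vectors (H in the text)
    decoder_state : current research-report hidden state (one step of S)
    W_c           : weight matrix with len(context) + len(decoder_state) columns
    """
    # Attention scores: assumed dot product between decoder state and each news state.
    scores = [dot(decoder_state, h) for h in news_states]
    # Attention weights via softmax.
    weights = softmax(scores)
    # Context vector: attention-weighted average of the news hidden states.
    dim = len(news_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, news_states)) for k in range(dim)]
    # Concatenate context and decoder state, then project with W_c (tanh is assumed).
    combined = context + decoder_state
    attention_hidden = [math.tanh(dot(row, combined)) for row in W_c]
    return weights, context, attention_hidden
```

In a full model the attention hidden state would then be projected onto the research report dictionary and normalized to give the next-word probability distribution; that final projection is left out of this sketch.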

解码的第一阶段目标函数如下：The first stage objective function of decoding is as follows:

其中，P为概率。Among them, P is the probability.

At this point, the probability distribution over the words in the research report dictionary has been obtained from the input news, and the report outline can be generated from this distribution.

The second decoding stage generates the final research report; it takes as input the news and the outline generated from the news.

The decoding model is a variational autoencoder (VAE). It fuses the two inputs, news X and outline O, to produce the target variable, and learns the posterior probability distribution of the latent variable z. P(z|X) can be rewritten as the following expression:

The posterior distribution is assumed to be a standard normal distribution; a sample is drawn at random from it and decoded back to the original text. Training measures a reconstruction loss and a regularization loss, so the ELBO can be expressed as:

$$\log P(X,O) \ge \mathbb{E}_{q(z \mid x,o)}\left[\log p(x,o \mid z)\right] - \mathrm{KL}\left(q(z \mid x,o)\,\|\,p(z)\right) \tag{12}$$

The right-hand side of the inequality above is the ELBO. The first term samples the attention from P(z|X,O) and feeds the sampled attention to the decoder to compute the cross-entropy loss. The second term uses the KL divergence to measure the similarity of the two probability distributions, keeping the posterior distribution close to the prior distribution.
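The two terms of the ELBO can be made concrete with a small sketch. It assumes, beyond what the text states, that the approximate posterior q is a diagonal Gaussian parameterised by a mean and a log-variance, and that the prior p(z) is a standard normal, in which case the KL term has a closed form. The function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Assumption: q is a diagonal Gaussian and the prior is standard normal.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def negative_elbo(reconstruction_nll, mu, log_var):
    """Training loss: reconstruction (cross-entropy) term plus KL regulariser.

    reconstruction_nll -- cross-entropy of the decoder output against the
                          real words, computed elsewhere
    """
    return reconstruction_nll + kl_to_standard_normal(mu, log_var)
```

When the posterior coincides with the prior (zero mean, unit variance) the KL term vanishes and only the reconstruction loss remains.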

The objective function of the second decoding stage is as follows:

At this point, the word probability distribution has been obtained again, this time from the news and the report outline, and the final research report can be derived from it.

The global objective function is the sum of the loss functions of the two decoding stages:

The one-to-one training data set of news and macro research reports obtained above is input into the model of Figure 3; after 50 rounds of training, the final model parameters are determined. Specifically, taking global optimal search as an example, the model outputs only one word at a time when generating a research report. Each macro research report is fed into the model in the order start marker, macro research report, end marker, one word at a time. For example, the start marker is input first and the model produces one output, which is the predicted first word of the research report; this word is compared with the first word of the real research report and the model parameters are adjusted according to the result. The first word of the macro research report is then input, producing the predicted second word, which is compared with the second word of the real macro research report, and the parameters are adjusted again. Repeating this process steadily reduces the error between the model output and the real words. After the model has been trained on multiple pairs of news and macro research reports, its structure and parameters are saved. The model with the final parameters can then make predictions for newly input news: using the model of Figure 2, inputting news yields the research report shown in Figure 4, which is a schematic diagram of an example research report generated by the research report generation method of the present invention.
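The word-by-word training procedure described above pairs each input word with the real word that should follow it. A minimal sketch of how such pairs could be built is shown below; the `<end>` marker string and the function name are assumptions made for illustration (the text only names a `<start>` marker explicitly).

```python
START, END = "<start>", "<end>"  # END spelling is an assumption


def teacher_forcing_pairs(report_tokens):
    """Build (input word, expected next word) pairs for one report.

    The report is wrapped as start marker, report words, end marker; at each
    step the model receives the real previous word and is compared against
    the real next word.
    """
    sequence = [START] + list(report_tokens) + [END]
    return list(zip(sequence[:-1], sequence[1:]))


pairs = teacher_forcing_pairs(["rates", "rise"])
for model_input, expected_output in pairs:
    print(model_input, "->", expected_output)
```

Each pair supplies one comparison between the model output and the real word, which is the signal used to adjust the model parameters.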

Referring to Figure 5, which is a flow chart of a second specific embodiment of the research report generation method of the present invention: in Figure 5 the event text is exemplified by news in the financial field, and the second outline generation model includes:

obtaining the report outline according to the event text, the start marker and a Transformer model;

updating the report outline according to the last word output by the report generation step, the event text and the Transformer model. The report generation model again uses the VAE generation model. The Transformer model in effect replaces the LSTM network and the attention mechanism; its attention score is given by equation (15):
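The formula image for equation (15) is missing from this text. For reference, the standard scaled dot-product attention of the Transformer is given below; whether the patent's equation (15) has exactly this form is an assumption, and d_k (the dimension of the key vectors) is a symbol not defined in the original.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```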

where Q, K and V are the three matrices obtained by transforming the input X. As with Figure 2, the model of Figure 5 takes the news and a <start> marker as input; after processing by the outline generation model and the report generation model, it outputs the first word of the research report. This word is fed back to the input in place of the start marker, and so on, so that the model of Figure 5 outputs the words of the research report one by one. Likewise, the outline generation model and the report generation model may select words by a global search for the optimal solution, and the beam search algorithm can also be applied to the model of Figure 5 to improve search efficiency.
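Beam search, mentioned above as a way to improve search efficiency, can be illustrated with a toy example. In the sketch below the next-word probability table stands in for the trained model and is entirely made up; the `<end>` marker and all function names are likewise assumptions.

```python
import math


def beam_search(next_word_probs, beam_width, max_len,
                start="<start>", end="<end>"):
    """Keep the `beam_width` most probable partial sequences at each step.

    next_word_probs -- callable mapping a word sequence to a dict
                       {next word: probability}; a stand-in for the model
    """
    beams = [([start], 0.0)]  # (sequence, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:  # finished sequences are carried over as-is
                candidates.append((seq, score))
                continue
            for word, p in next_word_probs(seq).items():
                if p > 0.0:
                    candidates.append((seq + [word], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0][0]


# Toy next-word table (invented for illustration only)
TOY_TABLE = {
    "<start>": {"rates": 0.6, "growth": 0.4},
    "rates": {"rise": 0.5, "fall": 0.5},
    "growth": {"slows": 0.9, "<end>": 0.1},
    "rise": {"<end>": 1.0},
    "fall": {"<end>": 1.0},
    "slows": {"<end>": 1.0},
}


def toy_model(seq):
    return TOY_TABLE[seq[-1]]


greedy = beam_search(toy_model, beam_width=1, max_len=5)
wider = beam_search(toy_model, beam_width=2, max_len=5)
```

With a beam width of 1 the search commits to the locally most probable first word ("rates") and ends with a sequence of probability 0.30, while a beam width of 2 keeps "growth" alive and finds the sequence of probability 0.36. Beam search therefore approximates the global optimum while scoring far fewer sequences than an exhaustive search.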

Referring to Figure 6, which is a schematic diagram of the training process for Figure 5: the training data set is input into the model sequentially for training. As with Figure 4, each macro research report is input in the form start marker, macro research report, end marker, and the model parameters are adjusted by comparing the model output with the words of the real research report.

Embodiment 2

Embodiment 2 is based on Embodiment 1 and provides a research report generation device. Referring to Figure 7, which is a structural block diagram of an embodiment of the research report generation device of the present invention, the device includes:

a dictionary acquisition module, configured to perform data preprocessing and feature selection on multiple research reports to construct a research report dictionary;

an outline acquisition module, configured to obtain the report outline corresponding to an event text according to the event text, the research report dictionary and an outline generation model, where the outline generation model selects multiple words from the research report dictionary, according to the principle of probability optimization, to form a word sequence serving as the report outline;

a report generation module, configured to obtain the research report according to the event text, the report outline, the research report dictionary and a report generation model, where the report generation model selects multiple words from the research report dictionary, according to the principle of probability optimization, to form a word sequence serving as the research report.

For the specific working process of the research report generation device, refer to the description of Embodiment 1; it is not repeated here. The device generates research reports automatically, freeing up manpower and improving the efficiency of research report output.
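As a rough illustration of the dictionary acquisition module, the sketch below builds a word-to-index dictionary from a set of reports and reserves entries for the start and end markers. It rests on assumptions not taken from the text: whitespace tokenisation, a simple minimum-frequency threshold as a stand-in for "feature selection", and hypothetical marker strings and function names.

```python
from collections import Counter


def build_report_dictionary(reports, min_count=1,
                            start="<start>", end="<end>"):
    """Map each retained word to an integer index.

    reports   -- iterable of report texts, assumed to be pre-tokenised by
                 whitespace
    min_count -- keep only words occurring at least this often (a simple
                 stand-in for the feature selection step)
    """
    counts = Counter(word for report in reports for word in report.split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    vocab = [start, end] + words  # markers get the first two indices
    return {word: index for index, word in enumerate(vocab)}


dictionary = build_report_dictionary(
    ["rates rise", "rates fall"], min_count=2)
```

The outline generation model and the report generation model would then select words from this dictionary by index.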

Embodiment 3

Embodiment 3 is based on Embodiment 1 and provides a computer-readable storage medium storing computer-executable instructions that cause a computer to execute the research report generation method described in Embodiment 1. For a specific description of the method, refer to Embodiment 1; it is not repeated here.

The preferred embodiments of the present invention have been described in detail above, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of this application.

Claims

1. A research report generation method, comprising:

research report acquisition step: acquiring a plurality of research reports from a plurality of information sources;

dictionary acquisition step: performing feature selection on the plurality of research reports to construct a research report dictionary, and adding a beginning marker and an ending marker to the research report dictionary;

outline acquisition step: acquiring a report outline corresponding to an event text according to the event text, the research report dictionary and an outline generation model, wherein the outline generation model selects, according to a probability optimal principle, a plurality of words from the research report dictionary to form a word sequence as the report outline; wherein the outline generation model comprises: performing vector representation on the event text and the beginning marker according to the research report dictionary to obtain an event vector and a beginning marker vector; acquiring the hidden layer state of the event text and the hidden layer state of the beginning marker according to the event vector, the beginning marker vector and an LSTM network; acquiring the report outline according to the hidden layer state of the event text, the hidden layer state of the beginning marker and an attention mechanism; performing vector representation on the last word output by the report generation step according to the research report dictionary to obtain a word vector; acquiring the hidden layer state of the word according to the word vector and the LSTM network; updating the report outline according to the hidden layer state of the word, the hidden layer state of the event text and the attention mechanism; acquiring the report outline according to the event text, the beginning marker and a Transformer model; updating the report outline according to the last word output by the report generation step, the event text and the Transformer model;

report generation step: acquiring a research report according to the event text, the report outline, the research report dictionary and a report generation model, wherein the report generation model selects, according to a probability optimal principle, a plurality of words from the research report dictionary to form a word sequence as the research report; wherein the report generation model comprises a VAE generation model that selects words from the research report dictionary for output according to a probability distribution to generate the research report.

2. The research report generation method of claim 1, wherein the report generation model selects and outputs words one by one from the research report dictionary to generate the research report.

3. The research report generation method of claim 2, wherein the outline generation model updates the report outline based on the event text, the report outline, the research report dictionary and the last word output by the report generation step.

4. The research report generation method according to any one of claims 1 to 3, wherein the research report generation method further comprises:

performing event entity recognition on the research reports obtained in the research report acquisition step to obtain corresponding event texts, wherein the plurality of research reports and the corresponding event texts form a training data set, and the training data set is used for training the outline generation model and the report generation model.

5. A research report generation apparatus, comprising:

a research report acquisition module for acquiring a plurality of research reports from a plurality of information sources;

a dictionary acquisition module for performing feature selection on the plurality of research reports to construct a research report dictionary, and adding a beginning marker and an ending marker to the research report dictionary;

an outline acquisition module for acquiring a report outline corresponding to an event text according to the event text, the research report dictionary and an outline generation model, wherein the outline generation model selects, according to a probability optimal principle, a plurality of words from the research report dictionary to form a word sequence as the report outline; wherein the outline generation model comprises: performing vector representation on the event text and the beginning marker according to the research report dictionary to obtain an event vector and a beginning marker vector; acquiring the hidden layer state of the event text and the hidden layer state of the beginning marker according to the event vector, the beginning marker vector and an LSTM network; acquiring the report outline according to the hidden layer state of the event text, the hidden layer state of the beginning marker and an attention mechanism; performing vector representation on the last word output by the report generation step according to the research report dictionary to obtain a word vector; acquiring the hidden layer state of the word according to the word vector and the LSTM network; updating the report outline according to the hidden layer state of the word, the hidden layer state of the event text and the attention mechanism; acquiring the report outline according to the event text, the beginning marker and a Transformer model; updating the report outline according to the last word output by the report generation step, the event text and the Transformer model;

a report generation module for acquiring a research report according to the event text, the report outline, the research report dictionary and a report generation model, wherein the report generation model selects, according to a probability optimal principle, a plurality of words from the research report dictionary to form a word sequence as the research report; wherein the report generation model comprises a VAE generation model that selects words from the research report dictionary for output according to a probability distribution to generate the research report.

6. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the research report generation method of any one of claims 1 to 4.