WO2022141872A1 - Document abstract generation method and apparatus, computer device, and storage medium - Google Patents

Document abstract generation method and apparatus, computer device, and storage medium

Info

Publication number
WO2022141872A1
WO2022141872A1 (PCT/CN2021/084241)
Authority
WO
WIPO (PCT)
Prior art keywords
abstract
target
document
sentences
model
Prior art date
Application number
PCT/CN2021/084241
Other languages
French (fr)
Chinese (zh)
Inventor
颜泽龙
王健宗
吴天博
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022141872A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, apparatus, computer equipment and storage medium for generating a literature abstract.
  • the embodiments of the present application provide a method, device, computer equipment and storage medium for generating a literature abstract, so as to solve the problem of low accuracy of the current method for obtaining abstracts.
  • a method for generating literature abstracts comprising:
  • the database is queried based on the abstract keywords, and N original documents corresponding to the abstract keywords are obtained from the initial documents stored in the database;
  • the original documents are processed by using the pre-trained extractive document abstract model to obtain M target sentences;
  • a device for generating literature abstracts comprising:
  • An abstract generation request acquisition module is used to acquire an abstract generation request, where the abstract generation request includes abstract keywords;
  • An original document acquisition module, configured to query a database based on the abstract keywords, and acquire N original documents corresponding to the abstract keywords from the initial documents stored in the database;
  • A target sentence acquisition module, used to process the original documents by using the pre-trained extractive document abstract model to obtain M target sentences;
  • A directed acyclic graph acquisition module, configured to input the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences;
  • A target abstract acquisition module, configured to obtain the target abstract based on the M*(M-1)/2 directed acyclic graphs.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
  • the database is queried based on the abstract keywords, and N original documents corresponding to the abstract keywords are obtained from the initial documents stored in the database;
  • the original documents are processed by using the pre-trained extractive document abstract model to obtain M target sentences;
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the database is queried based on the abstract keywords, and N original documents corresponding to the abstract keywords are obtained from the initial documents stored in the database;
  • the original documents are processed by using the pre-trained extractive document abstract model to obtain M target sentences;
  • the original documents corresponding to the abstract keywords can be automatically determined, so as to ensure the accuracy of subsequent target abstracts and reduce labor costs.
  • the original documents are processed by using a pre-trained extractive document abstract model to quickly obtain M target sentences, which creates a strong connection between the target sentences and ensures that the subsequently generated target abstract records the important information of the original documents. The M target sentences are input into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences.
  • By determining the sequence between any two target sentences, the subsequent process of sorting the target sentences is kept simple, the accuracy is effectively improved, and the generated target abstract is more coherent. Based on the M*(M-1)/2 directed acyclic graphs, a coherent target abstract can be obtained quickly.
  • FIG. 1 is a schematic diagram of an application environment of the method for generating a document abstract in an embodiment of the present application.
  • FIG. 2 is a flow chart of a method for generating a document abstract in an embodiment of the present application.
  • FIG. 3 is another flow chart of the method for generating a document abstract in an embodiment of the present application.
  • FIG. 4 is another flow chart of the method for generating a document abstract in an embodiment of the present application.
  • FIG. 5 is another flow chart of the method for generating a document abstract in an embodiment of the present application.
  • FIG. 6 is another flow chart of the method for generating a document abstract in an embodiment of the present application.
  • FIG. 7 is another flow chart of the method for generating a document abstract in an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a document abstract generating device in an embodiment of the present application.
  • FIG. 9 is a topology diagram in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the document abstract generation method provided in the embodiment of the present application can be applied in the application environment shown in FIG. 1 .
  • the literature abstract generation method is applied in an abstract generation system
  • the abstract generation system includes a client and a server as shown in FIG. 1. The client and the server communicate through the network and are used to determine, through the target model combination, the sequence between any two target sentences, which improves the accuracy of generating the target abstract and ensures that the generated target abstract has better coherence.
  • the client, also known as the user side, refers to the program that corresponds to the server and provides local services for the user.
  • Clients can be installed on, but not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a method for generating a document abstract is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • the abstract generation request is a request for generating a target abstract.
  • the abstract keyword is a keyword for generating a target abstract required by the user, so that the corresponding original document can be obtained according to the abstract keyword.
  • the abstract keyword may be xx disease, xx medical treatment, or xx financial product.
  • an abstract, also called a summary or synopsis, is a short text that concisely and precisely describes the important content of a document for the purpose of providing an outline of the document's content.
  • Specifically, an original display interface for generating the target abstract is displayed in the client; the user clicks on the original display interface, and an instruction to fill in the abstract keywords is sent to the server.
  • After the server obtains the instruction to fill in the abstract keywords, it controls the client to display an abstract keyword generation interface, where the user fills in at least one abstract keyword. Understandably, in order to ensure that the generated target abstract meets the requirements of the user, when the number of abstract keywords filled in by the user is less than the preset number of keywords, more similar keywords are recommended for the user as the user fills in the abstract keywords.
  • The client is then controlled to display a confirmation button; the user clicks the confirmation button to form an abstract generation request, which is sent to the server. When the server receives the abstract generation request, it parses the request to obtain the abstract keywords, thereby realizing the automatic generation of the target abstract.
  • S202 Query the database based on the abstract keywords, and obtain N original documents corresponding to the abstract keywords from the initial documents stored in the database.
  • the database is the library used to store the original literature.
  • the initial documents are documents pre-stored in the database. It is understood that the initial documents cover various abstract types, and include both documents corresponding to the abstract keywords and documents that do not correspond to the abstract keywords.
  • the initial documents can be documents corresponding to the medical field, the food field, or the financial field, etc.
  • the original documents refer to the documents corresponding to the abstract keywords.
  • the document abstracts in the database are classified in advance, and abstract documents of the same abstract type are obtained, and each abstract document carries at least one abstract type, so as to provide technical support for obtaining the corresponding abstract type according to the abstract keywords in the future.
  • After the server obtains the abstract keywords, it uses a matching algorithm to query the database according to the abstract keywords, so as to obtain the original documents corresponding to the abstract keywords from the document abstracts. This realizes the automatic determination of original documents of the same abstract type, ensures the accuracy of subsequent target abstracts, and reduces labor costs.
  • the abstract type refers to the type corresponding to the literature abstract.
  • the extractive document abstract model refers to a model that directly extracts the required target sentences from N original documents.
  • the extractive document abstract model can better retain the information of the original documents, effectively improve the accuracy of the target abstract, and reduce the grammatical and syntactic error rates of the subsequently generated target abstracts.
  • the extractive document abstract model is the NeuSUM model, which automatically extracts sentences with higher scores in the original documents as target sentences, reducing labor costs.
  • the model uses sentence benefit as a scoring method, taking into account the relationship between sentences, to ensure that the obtained target sentences are highly relevant, and the subsequently generated target summaries are more coherent.
  • the target sentence refers to the sentence used to form the target summary.
  • Specifically, the original document is input into the pre-trained extractive document abstract model. The original document is segmented into multiple abstract sentences, and the abstract sentences are converted into sentence vectors in the word embedding layer so as to convert them into a computer-recognizable format. The sentence vectors are encoded at the target encoding layer to obtain target encoding vectors containing semantic information, so as to retain more information of the abstract sentences.
  • At the scoring coding layer, the target encoding vectors are scored according to sentence benefit to obtain the score corresponding to each abstract sentence. Using sentence benefit as the scoring method, that is, using the ROUGE evaluation metric as the index for scoring the abstract sentences, takes the relationships between the abstract sentences into account; the top M abstract sentences with higher scores are used as the target sentences, so that the target sentences have a strong connection and can be obtained quickly.
  • the training of the extractive document abstract model is to continuously adjust the weight of the initial model by using the back-propagation algorithm until the weight of the initial model converges, and then the extractive document abstract model is obtained.
  • S204 Input the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences.
  • the target model combination is the model used to rank any two target sentences.
  • the target model combination includes a BERT model and an MLP model.
  • the BERT model and the MLP model can be used to accurately determine the sequence of target sentences, and provide technical support for the subsequent generation of target abstracts.
  • the BERT model is used to process any two target sentences, and the transformer structure is used in the encoding layer and decoding layer of the BERT model to ensure that the output of the BERT model is a semantic information vector carrying semantic information. For example, any two target sentences are input into the BERT model in the form "target sentence 1; target sentence 2" to obtain the corresponding output vector with semantic information, and the output vector is input into the MLP model to obtain the sequence between the two target sentences.
  • the training process is as follows: 1. Obtain a training corpus sample.
  • the training corpus sample includes positive sample sentence pairs and negative sample sentence pairs. Understandably, for a positive sample sentence pair there is a contextual relationship between the sentences, while for a negative sample sentence pair there is no contextual relationship between the sentences; 2.
  • the MLP model is a multi-layer perceptron model, which is used to perform binary classification processing on any two sentences to obtain the sequence of any two target sentences.
  • the MLP model is trained before the MLP model is used.
  • the training process is as follows: obtain training samples and the sequence labels corresponding to the training samples, where the training samples are original sentence pairs; input the training samples into the initial model to obtain predicted sentence sequence results; calculate the classification accuracy according to the sequence labels and the sentence sequence results, and when the classification accuracy is greater than a preset value, the MLP model is obtained.
  • a directed acyclic graph is a directed graph without loops.
  • the directed acyclic graph is S1→S2→S3→S4, where S1, S2, S3 and S4 are the target sentences.
  • M target sentences are combined in pairs to obtain M*(M-1)/2 sentence combinations, each sentence combination is input into the BERT model to obtain a semantic information vector, and the semantic information vector is input into the MLP model to obtain the sequence of any two target sentences, and form a directed acyclic graph based on the sequence.
  • by determining the sequence between any two target sentences, the subsequent sorting of the target sentences is kept simple and effective.
  • the target sentences are S1, S2 and S3
  • the three sentences are combined in pairs to obtain three sentence combinations, namely S1 and S3, S1 and S2, and S2 and S3, and each sentence combination is input into the target model combination to obtain the sequence of any two target sentences, so as to ensure that the subsequent sorting process of the target sentences is simple, the accuracy is effectively improved, and the coherence of the generated target summaries is ensured.
  • the target summary refers to the summary required by the user.
  • the M*(M-1)/2 directed acyclic graphs are processed to obtain a topology graph.
  • the breadth-first algorithm is used to process the topology graph to obtain the current in-degree of each target sentence, and the target sentences are sorted according to the current in-degree to obtain the target abstract.
  • the process is relatively simple, and a coherent target summary can be obtained quickly.
  • For example, if the target sentences are S1, S2, and S3, and the directed acyclic graphs are S1→S2, S1→S3, and S2→S3, the topological graph shown in FIG. 9 is obtained by processing the directed acyclic graphs.
  • The target sentence with the smallest current in-degree is pushed into the stack queue as the bottom element; this process is repeated until all target sentences are pushed into the stack queue. The stack queue output by this process is the target abstract. The process is relatively simple, and a coherent target abstract can be obtained quickly.
  • the in-degree, a concept from graph theory, usually refers to the number of times a certain vertex in a directed graph serves as the end point of an edge in the graph.
  • the current in-degree refers to the in-degree corresponding to each target sentence.
  • the method for generating a document abstract queries a database based on the abstract keywords and obtains N original documents corresponding to the abstract keywords from the initial documents stored in the database, which realizes automatic determination of original documents of the same abstract type, ensures the accuracy of subsequent target abstracts, and reduces labor costs.
  • the pre-trained extractive literature abstract model is used to process the original literature, and quickly obtain M target sentences, which makes the target sentences have a strong connection, and ensures that the subsequently generated target abstracts record the important information of the original literature. Input the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences.
  • step S202 is to query the database based on the abstract keywords, and obtain N original documents corresponding to the abstract keywords from the initial documents stored in the database, including:
  • S301 Query the classification table in the database based on the abstract keyword, and obtain the abstract type corresponding to the abstract keyword.
  • the classification table is a preset table, and the classification table is used to indicate the association relationship between preset keywords and abstract types.
  • the preset keywords are words corresponding to the abstract keywords.
  • the summary type refers to the type of summary, for example, the summary type can be medical type, financial type, mechanical type, etc. As an example, if the preset keyword is xx disease, the corresponding abstract type is medical type.
  • Specifically, a matching algorithm is used to match the abstract keywords with the preset keywords in the classification table. If the matching is successful, it means that there is a preset keyword corresponding to the abstract keyword; therefore, the abstract type corresponding to the abstract keyword can be obtained according to the corresponding preset keyword, which provides technical support for the subsequent determination of the original documents.
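As a minimal sketch of the classification-table lookup described above, a mapping from preset keywords to abstract types could look like the following; the table contents, function name, and exact-match strategy are illustrative assumptions, not details from the patent:

```python
# Hypothetical classification table: preset keyword -> abstract type.
# A real implementation would live in the database's classification table.
from typing import Optional

CLASSIFICATION_TABLE = {
    "xx disease": "medical",
    "xx medical treatment": "medical",
    "xx financial product": "financial",
}

def lookup_abstract_type(abstract_keyword: str) -> Optional[str]:
    """Match the abstract keyword against the preset keywords and return
    the associated abstract type, or None when no preset keyword matches."""
    return CLASSIFICATION_TABLE.get(abstract_keyword)
```

A production matching algorithm would likely allow fuzzy or partial matches rather than the exact dictionary lookup shown here.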
  • S302 Query the initial documents in the database based on the abstract type, and determine the N initial documents including the abstract keywords as the N original documents.
  • Specifically, the initial documents in the database are classified in advance; after the abstract type is determined, the abstract keywords are matched against the initial documents of that abstract type to obtain the original documents corresponding to the abstract keywords, which is faster.
  • the method for generating a document abstract queries a classification table in a database based on the abstract keyword, obtains the abstract type corresponding to the abstract keyword, and provides technical support for subsequent determination of the original document.
  • the initial documents in the database are queried based on the abstract type, and the N initial documents containing the abstract keywords are determined as the N original documents, which is faster.
  • a pre-trained extractive document abstract model is used to process the original document to obtain M target sentences, including:
  • S401 Segment the original document to obtain at least two abstract sentences.
  • the segmentation process refers to the process of dividing the original document into multiple sentences, so that the computer can process the abstract sentences.
  • the abstract sentence is a single sentence obtained by segmenting the original document.
  • the division is performed according to commas and periods in the original document. For example, if the original document is xxxx, yyyyy; zzz, the original document is divided into xxxx, yyyyy and zzz as three sentences by searching for commas and periods.
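The segmentation step above can be sketched as a simple punctuation-based split; the function name and the exact punctuation set (here commas, semicolons, periods, and their full-width counterparts) are illustrative assumptions:

```python
import re

def split_into_sentences(document: str) -> list:
    """Segment a document into abstract sentences by splitting at commas,
    semicolons, and periods (including full-width Chinese punctuation)."""
    parts = re.split(r"[,;.\u3002\uff0c\uff1b]", document)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

# Example from the text: "xxxx, yyyyy; zzz" yields three abstract sentences.
```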
  • S402 Input all abstract sentences into the word embedding layer of the extractive document abstract model, and obtain sentence vectors corresponding to each abstract sentence.
  • the sentence vector refers to the vector obtained after the abstract sentence is processed by the word embedding layer.
  • the abstract sentence can be converted into a corresponding vector, which is convenient for computer recognition.
  • the word embedding layer is the layer used to convert summary sentences into computer-recognizable sentence vectors.
  • S403 Input each sentence vector into the target coding layer of the extractive document abstract model to obtain a target coding vector corresponding to each sentence vector.
  • the target encoding layer is used for sentence-level and document-level encoding of sentence vectors.
  • the sentence vector is firstly input into the sentence encoding layer to obtain the original encoding vector represented by the vector of the sentence, and the original encoding vector is input into the document encoding layer to obtain the target encoding vector.
  • the scoring result refers to the result of scoring the target encoding vector corresponding to each summary sentence using the scoring coding layer. Understandably, the summary sentences with higher scores are determined as the target sentences, so that the target sentences contain important information and the subsequently generated target abstracts record the important sentences of the original documents.
  • According to the scoring results, the first M summary sentences are selected in order from high to low and determined as the M target sentences.
  • sentence scoring and sentence selection are combined using an extractive document summary model, so as to associate the information of the sentences and ensure that the target sentence has important information.
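The top-M selection described above can be sketched as follows. Note this shows only the final ranking step under the assumption that scores are already produced; the NeuSUM-style model actually scores and selects jointly using sentence benefit, which this simplified sketch does not reproduce:

```python
def select_top_m(sentences, scores, m):
    """Return the M highest-scoring abstract sentences as target sentences.

    `sentences` and `scores` are parallel lists; `scores` would come from
    the extractive model's scoring coding layer (illustrative names).
    """
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:m]]
```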
  • the original document is segmented to obtain at least two abstract sentences, so that the computer can process the abstract sentences.
  • According to the scoring results of the multiple target encoding vectors, the first M summary sentences are selected in order from high to low and determined as M target sentences.
  • The extractive document abstract model combines sentence scoring and sentence selection, correlating the information of the sentences to ensure that the target sentences contain important information.
  • step S403 is to input each sentence vector into the target coding layer of the extractive document abstract model to obtain the target coding vector corresponding to each sentence vector, including:
  • S501 Input each sentence vector into the sentence coding layer of the extractive document abstract model for coding, and obtain the original coding vector corresponding to the sentence vector;
  • S502 Input the original encoding vector into the document encoding layer of the extractive document abstract model for re-encoding to obtain a target encoding vector.
  • the sentence encoding layer is a bidirectional GRU sentence encoding layer, and the sentence-level encoding is obtained by using the bidirectional GRU sentence encoding layer.
  • the document encoding layer refers to the bidirectional GRU document encoding layer, and the document-level encoding is obtained by using the bidirectional GRU document encoding layer.
  • the target model combination includes a pre-trained BERT model and an MLP model; as shown in FIG. 6, step S204, that is, inputting the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs, includes:
  • the sentence combination refers to the combination formed by any two target sentences, so that the contextual relationship between the two target sentences can be obtained subsequently.
  • the target sentences are combined in pairs to obtain M*(M-1)/2 sentence combinations, which is conducive to simplifying the subsequent step of determining the contextual relationship between any two target sentences, ensuring the accuracy of the contextual relationships between any two target sentences, and ensuring a coherent target abstract.
  • For example, if the number of target sentences is 3, assuming the target sentences are S1, S2 and S3, combining the target sentences in pairs yields the sentence combinations S1 and S2, S1 and S3, and S2 and S3.
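The pairwise combination step above is exactly the set of unordered pairs, of which there are M*(M-1)/2; a minimal sketch (function name illustrative):

```python
from itertools import combinations

def sentence_pairs(target_sentences):
    """All unordered pairs of target sentences: M*(M-1)/2 combinations.

    Each pair would subsequently be fed to the BERT model and then the MLP
    model to decide which sentence precedes the other.
    """
    return list(combinations(target_sentences, 2))

pairs = sentence_pairs(["S1", "S2", "S3"])
# 3 target sentences -> 3*(3-1)/2 = 3 combinations:
# ("S1", "S2"), ("S1", "S3"), ("S2", "S3")
```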
  • S602 Input each sentence combination into the BERT model, and obtain a semantic information vector corresponding to each sentence combination.
  • the role of BERT is to obtain a vector representation of sentence combinations.
  • BERT mainly includes word embedding layer, encoding layer and decoding layer.
  • the role of the word embedding layer is to map documents to vectors, the input is a document, and the output is a vector.
  • Both the encoding layer and the decoding layer use the transformer structure to obtain semantic information vectors with semantic information.
  • S603 Input the semantic information vector into the MLP model to obtain a directed acyclic graph of any two target sentences.
  • the BERT model and the MLP model are used to extract and classify the abstract sentences, so as to obtain the target sentences and determine the contextual dependencies between the target sentences, thereby solving the problem in the prior art that the classification accuracy is low when only the BERT model is used.
  • target sentences are combined in pairs to obtain M*(M-1)/2 sentence combinations, which is beneficial to simplify the subsequent steps of determining the contextual relationship between any two target sentences And ensure the accuracy of determining the contextual relationship between any two target sentences, and ensure a coherent target summary.
  • Input each sentence combination into the BERT model to obtain the semantic information vector corresponding to each sentence combination; input the semantic information vector into the MLP model to obtain the directed acyclic graph of any two target sentences. This obtains the target sentence order and determines the contextual dependencies between the target sentences, so as to solve the problem in the prior art that the classification accuracy is low when only the BERT model is used.
  • step S205, that is, obtaining the target abstract based on the M*(M-1)/2 directed acyclic graphs, includes:
  • S701 Process M*(M-1)/2 directed acyclic graphs to obtain a topology graph.
  • the topological graph refers to the graph formed by the collection of all directed acyclic graphs, so that the subsequent breadth-first traversal can be performed to obtain the current in-degree of each target sentence.
  • S702 Use the breadth-first algorithm to traverse the topology map to obtain the current in-degree of each target sentence.
  • the breadth-first algorithm, also known as breadth-first search (BFS) or horizontal-first search, is a graph search algorithm; breadth here refers to layer-by-layer traversal.
  • the breadth-first algorithm is used to process the topology map to obtain the current in-degree of each target sentence, and the target sentences are sorted according to the current in-degree to obtain the target abstract.
  • the process is relatively simple, and a coherent target abstract can be obtained quickly.
  • S703 Push all target sentences into the stack according to the current in-degree to obtain a stack queue.
  • Specifically, the target sentences are pushed into the stack queue in ascending order of current in-degree, with the sentence of smallest in-degree as the bottom element of the stack; this process is repeated until all target sentences are pushed into the stack queue.
  • The stack queue formed by this process is the target abstract. The process is relatively simple, and a coherent target abstract can be obtained quickly.
  • For example, the target sentence S1 is before the target sentence S2, the target sentence S2 is before the target sentence S3, and the target sentence S3 is before the target sentence S4; then S1 points to S2, S3 and S4 respectively, S2 points to S3 and S4 respectively, and S3 points to S4. Therefore, the current in-degree of S1 is 0, the current in-degree of S2 is 1, the current in-degree of S3 is 2, and the current in-degree of S4 is 3.
  • Correspondingly, the stack queue is S1→S2→S3→S4.
  • the target abstract is obtained according to the order of each target sentence in the stack queue, so as to ensure that the generated target abstract is coherent.
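The in-degree-based ordering described in steps S701 to S703 amounts to a breadth-first topological sort of the pairwise orderings. A minimal sketch, assuming each directed acyclic graph contributes an edge (a, b) meaning sentence a precedes sentence b (names and data representation are illustrative):

```python
from collections import deque

def order_by_indegree(sentences, edges):
    """Topologically order target sentences by breadth-first traversal.

    `edges` is a list of (a, b) pairs, each meaning `a` precedes `b`,
    as produced by the BERT+MLP combination for every sentence pair.
    """
    indegree = {s: 0 for s in sentences}
    successors = {s: [] for s in sentences}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    # Start from sentences with current in-degree 0, layer by layer.
    queue = deque(s for s in sentences if indegree[s] == 0)
    ordered = []
    while queue:
        s = queue.popleft()
        ordered.append(s)
        for nxt in successors[s]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return ordered

# With edges S1->S2, S1->S3, S1->S4, S2->S3, S2->S4, S3->S4
# (in-degrees 0, 1, 2, 3), the order is S1, S2, S3, S4 as in the example.
```

Because the M*(M-1)/2 pairwise decisions give an ordering between every pair of sentences, the traversal produces a single total order, matching the S1→S2→S3→S4 stack queue in the text.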
  • the method for generating literature abstracts processes M*(M-1)/2 directed acyclic graphs to obtain a topology graph, so that breadth-first traversal can be performed subsequently to obtain the current in-degree of each target sentence.
  • the breadth-first algorithm is used to traverse the topology map to obtain the current in-degree of each target sentence; according to the current in-degree, all target sentences are pushed into the stack to obtain the stack queue; based on the stack queue, the target abstract is obtained to ensure that the generated target abstract is coherent.
  • an apparatus for generating a document abstract is provided, and the apparatus for generating a document abstract corresponds one-to-one with the method for generating a document abstract in the above-mentioned embodiment.
  • the document abstract generation device includes an abstract generation request acquisition module 801 , an original document acquisition module 802 , a target sentence acquisition module 803 , a directed acyclic graph acquisition module 804 and a target abstract acquisition module 805 .
  • the detailed description of each functional module is as follows:
  • the abstract generation request acquiring module 801 is configured to acquire an abstract generation request, where the abstract generation request includes abstract keywords.
  • the original document obtaining module 802 is configured to query the database based on the abstract keywords, and obtain N original documents corresponding to the abstract keywords from the initial documents stored in the database.
  • the target sentence obtaining module 803 is used to process the original document by using the pre-trained extractive document abstract model to obtain M target sentences.
  • the directed acyclic graph obtaining module 804 is used for inputting the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences.
  • the target abstract obtaining module 805 is configured to obtain the target abstract based on M*(M-1)/2 directed acyclic graphs.
  • the original document acquisition module 802 includes: an abstract type acquisition unit and an original document acquisition unit.
  • the abstract type obtaining unit is used to query the classification table in the database based on the abstract keywords, and obtain the abstract types corresponding to the abstract keywords.
  • the original document acquisition unit is used to query the document abstracts in the database according to the abstract type, and determine N document abstracts including abstract keywords as N original documents.
  • the target sentence obtaining module 803 includes: a segmentation processing unit, a sentence vector obtaining unit, a target coding vector obtaining unit, a scoring result obtaining unit, and a target sentence obtaining unit.
  • the segmentation processing unit is used to segment the original document to obtain at least two abstract sentences.
  • the sentence vector obtaining unit is used to input all the abstract sentences into the word embedding layer of the extractive document abstract model, and obtain the sentence vector corresponding to each abstract sentence.
  • the target coding vector obtaining unit is used to input each sentence vector into the target coding layer of the extractive document abstract model, and obtain the target coding vector corresponding to each sentence vector.
  • the scoring result obtaining unit is used for inputting the target coding vector into the scoring coding layer of the extractive document summarization model to obtain the scoring result corresponding to each summary sentence.
  • the target sentence obtaining unit is used to select the first M summary sentences in order from high to low according to the scoring results of multiple target encoding vectors, and determine them as M target sentences.
  • the target coding vector obtaining unit includes: a first coding subunit and a second coding subunit.
  • the first encoding subunit is used to input each sentence vector into the sentence encoding layer of the extractive document abstract model for encoding, and obtain the original encoding vector corresponding to the sentence vector.
  • the second encoding subunit is used for inputting the original encoding vector into the document encoding layer of the extractive document abstract model for re-encoding to obtain the target encoding vector.
  • the target model combination includes a BERT model and an MLP model.
  • the directed acyclic graph obtaining module 804 includes: a sentence combination obtaining unit, a semantic information vector obtaining unit and a directed acyclic graph obtaining unit.
  • the sentence combination acquisition unit is used to combine the target sentences in pairs to obtain M*(M-1)/2 sentence combinations.
  • the semantic information vector obtaining unit is used to input each sentence combination into the BERT model, and obtain the semantic information vector corresponding to each sentence combination.
  • the directed acyclic graph acquisition unit is used to input the semantic information vector into the MLP model to obtain the directed acyclic graph of any two target sentences.
  • the target digest obtaining module 805 includes: a topology map obtaining unit, an in-degree obtaining unit, a stack queue obtaining unit and a target digest obtaining unit.
  • the topology map obtaining unit is used for processing M*(M-1)/2 directed acyclic graphs to obtain a topology map.
  • the in-degree obtaining unit is used to traverse the topology map using the breadth-first algorithm to obtain the current in-degree of each target sentence.
  • the stack queue obtaining unit is used to push all target sentences into the stack according to the current in-degree, and obtain the stack queue.
  • the target digest obtaining unit is used to obtain the target digest based on the stack queue.
  • Each module in the above-mentioned document abstract generating apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10 .
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store the directed acyclic graphs.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions, when executed by a processor, implement a method for generating a document abstract.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of the document abstract generation method in the above embodiments are implemented, such as steps S201-S205 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 7, which are not repeated here to avoid repetition.
  • when the processor executes the computer-readable instructions, the functions of each module/unit in this embodiment of the document abstract generating apparatus are implemented, for example, the functions of the abstract generation request acquisition module 801, the original document acquisition module 802, the target sentence acquisition module 803, the directed acyclic graph obtaining module 804 and the target abstract obtaining module 805 shown in FIG. 8, which are not repeated here to avoid repetition.
  • the readable storage medium provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • one or more readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by a processor, the steps of the method for generating a document abstract in the above embodiments are implemented, such as steps S201-S205 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 7, which are not repeated here to avoid repetition.
  • when the processor executes the computer-readable instructions, the functions of each module/unit in this embodiment of the document abstract generating apparatus are implemented, for example, the functions of the abstract generation request acquisition module 801, the original document acquisition module 802, the target sentence acquisition module 803, the directed acyclic graph obtaining module 804 and the target abstract obtaining module 805 shown in FIG. 8, which are not repeated here to avoid repetition.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
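The in-degree bookkeeping and stack-queue ordering described in the embodiments above amount to a standard topological sort. The following is an illustrative sketch (Kahn's algorithm), not the patented implementation; the sentence names and edge list follow the S1→S4 example given earlier.

```python
from collections import deque

def order_sentences(sentences, edges):
    """Topologically order sentences by repeatedly emitting those whose
    current in-degree is 0, as in the stack-queue construction above.

    sentences: list of sentence identifiers.
    edges: (u, v) pairs meaning u precedes v in the target abstract.
    """
    in_degree = {s: 0 for s in sentences}
    successors = {s: [] for s in sentences}
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Sentences whose current in-degree is 0 are ready to be emitted.
    queue = deque(s for s in sentences if in_degree[s] == 0)
    stack_queue = []
    while queue:
        s = queue.popleft()
        stack_queue.append(s)
        for nxt in successors[s]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    return stack_queue

# The S1..S4 example: S1 precedes S2, S3, S4; S2 precedes S3, S4; S3 precedes S4,
# so the initial in-degrees are 0, 1, 2 and 3 respectively.
edges = [("S1", "S2"), ("S1", "S3"), ("S1", "S4"),
         ("S2", "S3"), ("S2", "S4"), ("S3", "S4")]
print(order_sentences(["S1", "S2", "S3", "S4"], edges))  # ['S1', 'S2', 'S3', 'S4']
```

The target abstract is then the sentences concatenated in stack-queue order.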

Abstract

A document abstract generation method and apparatus, a computer device, and a storage medium, relating to the technical field of artificial intelligence. The method comprises: obtaining an abstract generation request, the abstract generation request comprising abstract keywords (S201); querying a database on the basis of the abstract keywords, and obtaining N original documents corresponding to the abstract keywords from initial documents stored in the database (S202); processing the original documents by using a pre-trained extractive document abstract model to obtain M target sentences (S203); inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences (S204); and obtaining a target abstract on the basis of the M*(M-1)/2 directed acyclic graphs (S205). The method uses the target model combination to determine the order of any two target sentences, thereby improving the accuracy of generating the target abstract and ensuring good coherence of the generated target abstract.

Description

Method, apparatus, computer device and storage medium for generating a document abstract
This application claims priority to Chinese patent application No. 202011623844.3, filed with the Chinese Patent Office on December 30, 2020 and entitled "Method, Apparatus, Computer Device and Storage Medium for Generating a Document Abstract", the entire contents of which are incorporated herein by reference.
  
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a method, apparatus, computer device and storage medium for generating a document abstract.
Background
As the Internet generates more and more document data, the problem of document information overload is becoming increasingly serious. Users need to spend a large amount of time extracting key information from numerous documents, and reading efficiency is low. The inventors found that, at present, a single model is usually used to extract sentences from a document to obtain an abstract, but the accuracy of current methods for obtaining an abstract is low.
Technical Problem
Embodiments of the present application provide a method, apparatus, computer device and storage medium for generating a document abstract, so as to solve the problem that current methods for obtaining an abstract have low accuracy.
Technical Solution
A method for generating a document abstract, comprising:
obtaining an abstract generation request, the abstract generation request including abstract keywords;
querying a database based on the abstract keywords, and obtaining, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
processing the original documents by using a pre-trained extractive document abstract model to obtain M target sentences;
inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences; and
obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs.
An apparatus for generating a document abstract, comprising:
an abstract generation request acquisition module, configured to obtain an abstract generation request, the abstract generation request including abstract keywords;
an original document acquisition module, configured to query a database based on the abstract keywords and obtain, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
a target sentence acquisition module, configured to process the original documents by using a pre-trained extractive document abstract model to obtain M target sentences;
a directed acyclic graph acquisition module, configured to input the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences; and
a target abstract acquisition module, configured to obtain a target abstract based on the M*(M-1)/2 directed acyclic graphs.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining an abstract generation request, the abstract generation request including abstract keywords;
querying a database based on the abstract keywords, and obtaining, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
processing the original documents by using a pre-trained extractive document abstract model to obtain M target sentences;
inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences; and
obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs.
One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining an abstract generation request, the abstract generation request including abstract keywords;
querying a database based on the abstract keywords, and obtaining, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
processing the original documents by using a pre-trained extractive document abstract model to obtain M target sentences;
inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences; and
obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs.
Beneficial Effects
In the above method, apparatus, computer device and storage medium for generating a document abstract, the database is queried based on the abstract keywords, and N original documents corresponding to the abstract keywords are obtained from the initial documents stored in the database, so that original documents of the same abstract type are determined automatically, the accuracy of the subsequent target abstract is ensured, and labor costs are reduced. The original documents are processed by a pre-trained extractive document abstract model to quickly obtain M target sentences, so that the target sentences are strongly related and the subsequently generated target abstract records the important information of the original documents. The M target sentences are input into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences; by determining the order between any two target sentences, the subsequent sorting of the target sentences is kept simple, the accuracy is effectively improved, and the coherence of the generated target abstract is ensured. Based on the M*(M-1)/2 directed acyclic graphs, a coherent target abstract can be obtained quickly.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the method for generating a document abstract in an embodiment of the present application;
FIG. 2 is a flowchart of the method for generating a document abstract in an embodiment of the present application;
FIG. 3 is another flowchart of the method for generating a document abstract in an embodiment of the present application;
FIG. 4 is another flowchart of the method for generating a document abstract in an embodiment of the present application;
FIG. 5 is another flowchart of the method for generating a document abstract in an embodiment of the present application;
FIG. 6 is another flowchart of the method for generating a document abstract in an embodiment of the present application;
FIG. 7 is another flowchart of the method for generating a document abstract in an embodiment of the present application;
FIG. 8 is a schematic block diagram of the apparatus for generating a document abstract in an embodiment of the present application;
FIG. 9 is a topology diagram in an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The method for generating a document abstract provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in an abstract generation system, which includes a client and a server as shown in FIG. 1. The client communicates with the server through a network, and the system uses the target model combination to determine the order of any two target sentences, improving the accuracy of the generated target abstract and ensuring its coherence. The client, also called the user terminal, refers to a program that corresponds to the server and provides local services for the user. The client can be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a method for generating a document abstract is provided. The method is described by taking its application to the server in FIG. 1 as an example, and includes the following steps:
S201: Obtain an abstract generation request, where the abstract generation request includes abstract keywords.
The abstract generation request is a request for generating a target abstract. The abstract keywords are keywords for generating the target abstract required by the user, so that the corresponding original documents can subsequently be obtained according to the abstract keywords; for example, an abstract keyword may be "xx disease", "xx medical treatment" or "xx financial product". An abstract, also called a summary, is a short text that concisely and accurately records the important content of a document for the purpose of providing an outline of its content.
Specifically, the client displays an original display interface for generating the target abstract. The user clicks the original display interface, which sends an abstract-keyword filling instruction to the server. After obtaining the instruction, the server controls the client to enter an abstract keyword generation interface, where the user fills in at least one abstract keyword. Understandably, to ensure that the generated target abstract meets the user's requirements, when the number of abstract keywords filled in by the user is less than a preset keyword number, more similar keywords are recommended to the user according to the keywords already filled in. When the user has filled in no fewer than the preset number of abstract keywords, the client is controlled to display a confirmation button; the user clicks the confirmation button, and an abstract generation request is formed and sent to the server. When the server receives the abstract generation request, it parses the request to obtain the abstract keywords, thereby realizing automatic generation of the target abstract.
S202: Query a database based on the abstract keywords, and obtain N original documents corresponding to the abstract keywords from the initial documents stored in the database.
The database is a repository for storing the initial documents. The initial documents are documents pre-stored in the database; understandably, they include abstracts of various abstract types, and include both documents corresponding to the abstract keywords and documents not corresponding to them. For example, the initial documents may be documents in the medical field, the food field or the financial field. The original documents refer to the abstracts corresponding to the abstract keywords.
Specifically, the document abstracts in the database are classified in advance to obtain abstract documents of the same abstract type, and each abstract document carries at least one abstract type, which provides technical support for subsequently obtaining the corresponding abstract type according to the abstract keywords. When the server obtains the abstract keywords, a matching algorithm is used to query the database according to the abstract keywords, so as to obtain the original documents corresponding to the abstract keywords from the document abstracts. This automatically determines original documents of the same abstract type, ensures the accuracy of the subsequent target abstract, and reduces labor costs. The abstract type refers to the type corresponding to a document abstract.
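As a minimal sketch of the keyword lookup in S202 (the table layout, field names and matching rule here are illustrative assumptions, not taken from the application):

```python
def find_original_documents(keyword, database, n):
    """Return up to n stored document abstracts whose type matches the
    abstract keyword and whose text contains the keyword."""
    # Classification table prepared in advance: abstract keyword -> abstract type.
    abstract_type = database["classification"].get(keyword)
    if abstract_type is None:
        return []
    # Keep document abstracts of that type which also contain the keyword.
    matches = [doc for doc in database["documents"]
               if abstract_type in doc["types"] and keyword in doc["text"]]
    return matches[:n]

# Toy database with a hypothetical keyword and two stored abstracts.
database = {
    "classification": {"diabetes": "medical"},
    "documents": [
        {"types": ["medical"], "text": "A study of diabetes treatment ..."},
        {"types": ["finance"], "text": "Quarterly earnings ..."},
    ],
}
print(find_original_documents("diabetes", database, 5))
```

A real system would replace the dictionary with a database query, but the classify-then-match flow is the same.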
S203: Process the original documents by using a pre-trained extractive document abstract model to obtain M target sentences.
The extractive document abstract model is a model that extracts the required target sentences directly from the N original documents. It preserves the information of the original documents well, can effectively improve the accuracy of the target abstract, and reduces the grammatical and syntactic error rate of the subsequently generated target abstract. In this embodiment, the extractive document abstract model is a NeuSUM model, which automatically extracts the sentences with higher scores from the original documents as target sentences, reducing labor costs. The NeuSUM model uses sentence benefit as the scoring method and takes the relationships between sentences into account, ensuring that the obtained target sentences are highly related and the subsequently generated target abstract is more coherent. A target sentence is a sentence used to form the target abstract.
Specifically, the original documents are input into the pre-trained extractive document abstract model. First, each original document is segmented into multiple abstract sentences. The word embedding layer converts each abstract sentence into a sentence vector, that is, into a computer-recognizable format. The target coding layer encodes each sentence vector to obtain a target coding vector containing semantic information, preserving more information of the abstract sentence. The scoring coding layer scores each target coding vector according to sentence benefit to obtain the score of each abstract sentence; sentence benefit is used as the scoring method, that is, the ROUGE evaluation metric is used to score the abstract sentences so as to consider the relationships between them. The top M abstract sentences with the highest scores are taken as the target sentences, so that the target sentences are strongly related and can be obtained quickly.
In this embodiment, the extractive document abstract model is trained by using the back-propagation algorithm to continuously adjust the weights of an initial model until the weights converge, thereby obtaining the extractive document abstract model.
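The selection step in S203 (score every abstract sentence, then keep the top M) can be sketched as follows. The scoring function here is a trivial stand-in for the model's learned sentence-benefit score, which the application bases on the ROUGE metric:

```python
def select_target_sentences(abstract_sentences, score_fn, m):
    """Keep the M highest-scoring abstract sentences as target sentences."""
    scored = [(score_fn(s), s) for s in abstract_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # high to low
    return [s for _, s in scored[:m]]

# Toy stand-in score: longer sentences score higher.
sentences = ["Short.",
             "A much longer and more informative sentence.",
             "Medium length sentence."]
print(select_target_sentences(sentences, len, 2))
```

In the real model, `score_fn` would be the scoring coding layer's output for each target coding vector rather than a hand-written function.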
S204: Input the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences.
The target model combination is a model combination for ordering any two target sentences. It includes a BERT model and an MLP model; in this embodiment, using the BERT model and the MLP model can accurately determine the order of the target sentences, providing technical support for the subsequent generation of the target abstract. The BERT model processes any two target sentences; a transformer structure is used in the encoding and decoding layers of the BERT model to ensure that the BERT model outputs a semantic information vector carrying semantic information. For example, any two target sentences are input into the BERT model in the form "target sentence 1; target sentence 2" to obtain the corresponding output vector with semantic information, and the output vector is input into the MLP model to obtain the order of the two target sentences. Before the BERT model is used, it is trained as follows: 1. Obtain training corpus samples, including positive sample sentence pairs and negative sample sentence pairs; understandably, a positive sample sentence pair has a contextual relationship between its sentences, while a negative sample sentence pair does not. 2. Use a [SEP] tag to join each sentence pair into a connected sentence, for example, "sentence 1 [SEP] sentence 2"; use [CLS] as a tag at the beginning of the connected sentence and [SEP] as a tag at the end, so that these tags mark the position of each sentence and the contextual relationship between sentences, which helps the initial BERT model learn these features during training. 3. Randomly mask the connected sentences to obtain the training corpus. 4. Input the training corpus into the initial BERT model for training to obtain the BERT model.
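The construction of the connected training sentences with [CLS] and [SEP] tags described above can be sketched as below. Tokenization is simplified to whitespace splitting; real BERT preprocessing uses a subword tokenizer, so this is only an illustration of the tagging and random-masking steps:

```python
import random

def build_training_example(sentence_a, sentence_b, mask_prob=0.15, mask_token="[MASK]"):
    """Join a sentence pair with [CLS]/[SEP] tags and randomly mask word tokens."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    # Special tags are never masked; ordinary tokens are masked with probability mask_prob.
    masked = [mask_token if t not in ("[CLS]", "[SEP]") and random.random() < mask_prob else t
              for t in tokens]
    return masked

random.seed(0)
print(build_training_example("the model reads sentence one", "then sentence two follows"))
```

Positive pairs (adjacent sentences) and negative pairs (unrelated sentences) would both be passed through this step to form the training corpus.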
The MLP model, i.e., the multi-layer perceptron model, performs binary classification on any two sentences to obtain the relative order of the two target sentences. In this embodiment, the MLP model is trained before it is used. The training process is as follows: obtain training samples and the order labels corresponding to the training samples, where a training sample is an original sentence pair; input the training samples into the initial model to obtain predicted sentence-order results; calculate the classification accuracy from the order labels and the sentence-order results; and when the classification accuracy exceeds a preset value, the MLP model is obtained.
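The binary ordering decision described above can be illustrated with a minimal pure-Python two-layer perceptron sketch. The vector dimension, weights, and 0.5 decision threshold below are illustrative assumptions for exposition, not values from this embodiment:

```python
import math

def mlp_order(vec, w1, b1, w2, b2):
    """Minimal MLP forward pass: one tanh hidden layer, then a sigmoid
    output interpreted as P(sentence A precedes sentence B)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative weights for a hypothetical 3-dimensional semantic vector.
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
w2 = [1.2, -0.7]
b2 = 0.05

p = mlp_order([0.9, 0.1, -0.3], w1, b1, w2, b2)
order = "A before B" if p > 0.5 else "B before A"
```

In the embodiment itself the input would be the semantic information vector produced by the BERT model, and the weights would come from the training process described above.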
A directed acyclic graph is a directed graph without cycles. Suppose the directed acyclic graph is S1→S2→S3→S4, where S1, S2, S3, and S4 are target sentences.
In the prior art, when target sentences are sorted to form a target abstract, determining the sentence at the current position requires first predicting the sentence at the previous position. This approach makes model training complex, computationally expensive, and time-consuming, and yields target abstracts of low accuracy. In this embodiment, the M target sentences are combined in pairs to obtain M*(M-1)/2 sentence combinations; each sentence combination is input into the BERT model to obtain a semantic information vector, the semantic information vector is input into the MLP model to obtain the relative order of the two target sentences, and a directed acyclic graph is formed based on that order. By determining the relative order between any two target sentences, this embodiment keeps the subsequent sorting of target sentences simple, effectively improves accuracy, and ensures good coherence of the generated target abstract. As an example, when M equals 3, i.e., the target sentences are S1, S2, and S3, combining these three sentences in pairs yields three sentence combinations: S1 and S2, S1 and S3, and S2 and S3. Each sentence combination is input into the target model combination to obtain the relative order of the two target sentences, so that the subsequent sorting of target sentences remains simple, accuracy is effectively improved, and the coherence of the generated target abstract is ensured.
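The pairwise combination step can be sketched in a few lines; the sentence labels follow the S1–S3 example above:

```python
from itertools import combinations

target_sentences = ["S1", "S2", "S3"]
M = len(target_sentences)

# All unordered pairs of target sentences: M*(M-1)/2 combinations.
pairs = list(combinations(target_sentences, 2))
# pairs == [("S1", "S2"), ("S1", "S3"), ("S2", "S3")]
```

For M = 3 this yields 3*(3-1)/2 = 3 pairs, matching the example; each pair would then be fed to the target model combination.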
S205: Obtain the target abstract based on the M*(M-1)/2 directed acyclic graphs.
The target abstract refers to the abstract required by the user.
In this embodiment, the M*(M-1)/2 directed acyclic graphs are processed to obtain a topological graph, and a breadth-first algorithm is applied to the topological graph to obtain the current in-degree of each target sentence. The target sentences are sorted according to their current in-degrees to obtain the target abstract; the process is simple, and a coherent target abstract can be obtained quickly. As an example, assume the target sentences are S1, S2, and S3, and the directed acyclic graphs are S1→S2, S1→S3, and S2→S3. Processing the directed acyclic graphs yields the topological graph shown in FIG. 9, in which the current in-degree of S1 is 0, that of S2 is 1, and that of S3 is 2. The target sentence whose current in-degree is 0, i.e., S1, is pushed onto the stack queue as the bottom element, and the in-degree of every other target sentence pointed to by S1 is decreased by 1, so the current in-degree of S2 becomes 1-1=0 and that of S3 becomes 2-1=1; target sentence S2 is then pushed onto the stack queue, and this process is repeated until all target sentences have been pushed onto the stack queue. The stack queue output by this process is the target abstract; the process is simple, and a coherent target abstract can be obtained quickly. In-degree, a concept from graph-theoretic algorithms, usually refers to the number of times a vertex of a directed graph serves as the end point of an edge in the graph. The current in-degree refers to the in-degree corresponding to each target sentence.
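The in-degree-based ordering described above is, in essence, Kahn's topological sort. A minimal sketch over the example edges S1→S2, S1→S3, S2→S3 (the stack queue of this embodiment corresponds to the `ordered` list below):

```python
from collections import deque

def order_sentences(sentences, edges):
    """Kahn's algorithm: repeatedly emit a sentence whose current
    in-degree is 0, then decrement the in-degrees of its successors."""
    indegree = {s: 0 for s in sentences}
    successors = {s: [] for s in sentences}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(s for s in sentences if indegree[s] == 0)
    ordered = []
    while queue:
        s = queue.popleft()
        ordered.append(s)
        for nxt in successors[s]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return ordered

# Example from this embodiment: S1 precedes S2 and S3, S2 precedes S3.
summary_order = order_sentences(
    ["S1", "S2", "S3"],
    [("S1", "S2"), ("S1", "S3"), ("S2", "S3")])
# summary_order == ["S1", "S2", "S3"]
```

The emitted order is the sentence order of the target abstract.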
In the document abstract generation method provided in this embodiment, the database is queried based on the abstract keyword, and N original documents corresponding to the abstract keyword are obtained from the initial documents stored in the database, so that original documents of the same abstract type are determined automatically, the accuracy of the subsequent target abstract is ensured, and labor costs are reduced. A pre-trained extractive document abstract model processes the original documents to quickly obtain M target sentences with strong connections between them, ensuring that the subsequently generated target abstract records the important information of the original documents. The M target sentences are input into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences; by determining the relative order between any two target sentences, the subsequent sorting of target sentences remains simple, accuracy is effectively improved, and good coherence of the generated target abstract is ensured. Based on the M*(M-1)/2 directed acyclic graphs, a coherent target abstract can be obtained quickly.
In one embodiment, as shown in FIG. 3, step S202, i.e., querying the database based on the abstract keyword and obtaining, from the initial documents stored in the database, N original documents corresponding to the abstract keyword, includes:
S301: Query the classification table in the database based on the abstract keyword to obtain the abstract type corresponding to the abstract keyword.
The classification table is a preset table used to indicate the association between preset keywords and abstract types. A preset keyword is a word corresponding to an abstract keyword. The abstract type refers to the type of abstract; for example, an abstract type may be a medical type, a financial type, a mechanical type, and so on. As an example, if the preset keyword is "xx disease", the corresponding abstract type is the medical type.
In this embodiment, a matching algorithm matches the abstract keyword against the preset keywords in the classification table. If the match succeeds, a preset keyword corresponding to the abstract keyword exists, and the abstract type corresponding to the abstract keyword can therefore be obtained from that preset keyword, providing technical support for the subsequent determination of the original documents.
S302: Query the initial documents in the database based on the abstract type, and determine the N initial documents containing the abstract keyword as the N original documents.
In this embodiment, the initial documents in the database are classified in advance. Once the abstract type is determined, the abstract keyword is matched against the initial documents of that abstract type to obtain the initial documents containing the abstract keyword, which is fast.
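Steps S301 and S302 can be sketched with an in-memory stand-in for the classification table and the typed document store. All names and contents below are hypothetical illustrations, not the actual database schema of this embodiment:

```python
# Hypothetical classification table: preset keyword -> abstract type.
classification_table = {"xx disease": "medical", "stock": "financial"}

# Hypothetical initial documents, pre-classified by abstract type.
documents_by_type = {
    "medical": ["... xx disease treatment ...", "... surgery notes ..."],
    "financial": ["... stock report ..."],
}

def find_original_documents(keyword):
    # S301: look up the abstract type for the keyword.
    abstract_type = classification_table.get(keyword)
    if abstract_type is None:
        return []
    # S302: within that type, keep only documents containing the keyword.
    return [doc for doc in documents_by_type[abstract_type]
            if keyword in doc]

docs = find_original_documents("xx disease")
```

Restricting the keyword match to documents of one abstract type is what makes the lookup fast relative to scanning the whole store.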
In the document abstract generation method provided in this embodiment, the classification table in the database is queried based on the abstract keyword to obtain the abstract type corresponding to the abstract keyword, providing technical support for the subsequent determination of the original documents. The initial documents in the database are queried based on the abstract type, and the N initial documents containing the abstract keyword are determined as the N original documents, which is fast.
In one embodiment, as shown in FIG. 4, step S203, i.e., processing the original documents with the pre-trained extractive document abstract model to obtain M target sentences, includes:
S401: Segment the original document to obtain at least two abstract sentences.
Segmentation refers to dividing the original document into multiple sentences so that a computer can process the abstract sentences. An abstract sentence is a single sentence obtained by segmenting the original document.
As an example, the document is split on punctuation such as commas and periods; for example, if the original document is "xxxx,yyyyy;zzz", searching for the punctuation divides it into "xxxx", "yyyyy", and "zzz" as three sentences.
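A minimal sketch of this segmentation, assuming splits on commas, periods, and semicolons (both ASCII and fullwidth) as in the example above:

```python
import re

def split_sentences(text):
    """Split a document into sentences on commas, periods, and
    semicolons, discarding empty fragments."""
    parts = re.split(r"[,.;，。；]", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("xxxx,yyyyy;zzz")
# sentences == ["xxxx", "yyyyy", "zzz"]
```

A production segmenter would handle abbreviations and quoted punctuation; the character-class split is enough to illustrate step S401.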
S402: Input all abstract sentences into the word embedding layer of the extractive document abstract model to obtain the sentence vector corresponding to each abstract sentence.
A sentence vector is the vector obtained after an abstract sentence is processed by the word embedding layer; the word embedding layer converts the abstract sentence into a corresponding vector that a computer can recognize. The word embedding layer is the layer used to convert abstract sentences into computer-recognizable sentence vectors.
S403: Input each sentence vector into the target encoding layer of the extractive document abstract model to obtain the target encoding vector corresponding to each sentence vector.
The target encoding layer performs sentence-level and document-level encoding of the sentence vectors. Specifically, a sentence vector is first input into the sentence encoding layer to obtain the original encoding vector, i.e., the vector representation of the sentence; the original encoding vector is then input into the document encoding layer to obtain the target encoding vector.
S404: Input the target encoding vectors into the scoring encoding layer of the extractive document abstract model to obtain the scoring result corresponding to each abstract sentence.
The scoring result is the result of scoring the target encoding vector corresponding to each abstract sentence with the scoring encoding layer. Understandably, the abstract sentences with higher scores are determined as target sentences, so that the target sentences are sentences containing important information, ensuring that the subsequently generated target abstract records the important sentences of the original documents.
S405: According to the scoring results of the target encoding vectors, select the top M abstract sentences in descending order of score, and determine them as the M target sentences.
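The top-M selection of step S405 can be sketched as a sort over (sentence, score) pairs; the sentences and scores below are illustrative:

```python
def top_m_sentences(scored, m):
    """Return the m sentences with the highest scores,
    ordered from highest to lowest score."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:m]]

scored = [("sent A", 0.42), ("sent B", 0.91),
          ("sent C", 0.77), ("sent D", 0.10)]
targets = top_m_sentences(scored, 2)
# targets == ["sent B", "sent C"]
```

In the embodiment the scores would be the outputs of the scoring encoding layer rather than hard-coded values.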
In this embodiment, the extractive document abstract model combines sentence scoring and sentence selection, so that the information of the sentences is associated and the target sentences are guaranteed to carry important information.
In the document abstract generation method provided in this embodiment, the original document is segmented to obtain at least two abstract sentences so that a computer can process them. All abstract sentences are input into the word embedding layer of the extractive document abstract model to obtain the sentence vector corresponding to each abstract sentence, which is convenient for computer recognition. Each sentence vector is input into the target encoding layer of the extractive document abstract model to obtain the corresponding target encoding vector; the target encoding vectors are input into the scoring encoding layer to obtain the scoring result corresponding to each abstract sentence; and the top M abstract sentences, selected in descending order of score, are determined as the M target sentences. The extractive document abstract model combines sentence scoring and sentence selection, associating the information of the sentences and ensuring that the target sentences carry important information.
In one embodiment, as shown in FIG. 5, step S403, i.e., inputting each sentence vector into the target encoding layer of the extractive document abstract model to obtain the target encoding vector corresponding to each sentence vector, includes:
S501: Input each sentence vector into the sentence encoding layer of the extractive document abstract model for encoding, and obtain the original encoding vector corresponding to the sentence vector.
S502: Input the original encoding vector into the document encoding layer of the extractive document abstract model for re-encoding, and obtain the target encoding vector.
The sentence encoding layer is a bidirectional GRU sentence encoding layer, which is used to obtain sentence-level encodings. The document encoding layer is a bidirectional GRU document encoding layer, which is used to obtain document-level encodings.
In one embodiment, the target model combination includes a pre-trained BERT model and an MLP model. As shown in FIG. 6, step S204, i.e., inputting the M target sentences into the trained target model combination to obtain the M*(M-1)/2 directed acyclic graphs corresponding to the target sentences, includes:
S601: Combine the target sentences in pairs to obtain M*(M-1)/2 sentence combinations.
A sentence combination is the combination formed by any two target sentences, so that the relative order of the two target sentences can subsequently be obtained.
In this embodiment, combining the target sentences in pairs to obtain M*(M-1)/2 sentence combinations helps simplify the subsequent step of determining the relative order between any two target sentences, guarantees the accuracy of that determination, and ensures a coherent target abstract. As an example, when M is 3 and the target sentences are S1, S2, and S3, combining them in pairs yields the sentence combinations S1 and S2, S1 and S3, and S2 and S3.
S602: Input each sentence combination into the BERT model to obtain the semantic information vector corresponding to each sentence combination.
In this embodiment, the role of BERT is to obtain a vector representation of each sentence combination. BERT mainly comprises a word embedding layer, an encoding layer, and a decoding layer. The word embedding layer maps text to vectors: its input is the text and its output is a vector. Both the encoding layer and the decoding layer use the transformer structure in order to obtain a semantic information vector carrying semantic information.
S603: Input the semantic information vector into the MLP model to obtain the directed acyclic graph of the two target sentences.
In this embodiment, the BERT model together with the MLP model is used to extract and classify the abstract sentences, obtain the target sentences, and determine the order dependencies between the target sentences, thereby solving the problem in the prior art of low classification accuracy when the BERT model is used alone.
In the document abstract generation method provided in this embodiment, the target sentences are combined in pairs to obtain M*(M-1)/2 sentence combinations, which helps simplify the subsequent step of determining the relative order between any two target sentences, guarantees the accuracy of that determination, and ensures a coherent target abstract. Each sentence combination is input into the BERT model to obtain its semantic information vector; the semantic information vector is input into the MLP model to obtain the directed acyclic graph of the two target sentences, so that the target sentences and the order dependencies between them are obtained, solving the problem in the prior art of low classification accuracy when the BERT model is used alone.
In one embodiment, as shown in FIG. 7, step S205, i.e., obtaining the target abstract based on the M*(M-1)/2 directed acyclic graphs, includes:
S701: Process the M*(M-1)/2 directed acyclic graphs to obtain a topological graph.
The topological graph is the graph formed by combining all the directed acyclic graphs, so that a subsequent breadth-first traversal can obtain the current in-degree of each target sentence.
S702: Traverse the topological graph using a breadth-first algorithm to obtain the current in-degree of each target sentence.
The breadth-first algorithm, also known as breadth-first search or level-order search, is a graph search algorithm; "breadth-first" means traversing downward level by level.
In this implementation, the breadth-first algorithm processes the topological graph to obtain the current in-degree of each target sentence, and the target sentences are sorted by their current in-degrees to obtain the target abstract; the process is simple, and a coherent target abstract can be obtained quickly. Suppose target sentence S1 precedes S2, S2 precedes S3, and S3 precedes S4; then S1 points to S2, S3, and S4, S2 points to S3 and S4, and S3 points to S4. Therefore the current in-degree of S1 is 0, that of S2 is 1, that of S3 is 2, and that of S4 is 3.
S703: Push all target sentences onto the stack according to their current in-degrees to obtain the stack queue.
Specifically, the first target sentence whose in-degree is 0 is pushed onto the stack queue as the bottom element, and the in-degree of each target sentence it points to is decreased by 1; a target sentence whose in-degree was originally 1 then has an in-degree of 0 and is pushed onto the stack queue in turn. This process is repeated until all target sentences have been pushed onto the stack queue; the stack queue formed by this process is the target abstract. The process is simple, and a coherent target abstract can be obtained quickly.
Suppose target sentence S1 precedes S2, S2 precedes S3, and S3 precedes S4; then S1 points to S2, S3, and S4, S2 points to S3 and S4, and S3 points to S4, so the current in-degree of S1 is 0, that of S2 is 1, that of S3 is 2, and that of S4 is 3. S1 is first pushed onto the stack queue as the bottom element; the current in-degree of S2 then becomes 0, that of S3 becomes 1, and that of S4 becomes 2, so S2 is pushed onto the stack queue next, and so on, yielding the stack queue S1→S2→S3→S4.
S704: Obtain the target abstract based on the stack queue.
In this embodiment, the target abstract is obtained according to the order of the target sentences in the stack queue, ensuring that the generated target abstract is coherent and fluent.
In the document abstract generation method provided in this embodiment, the M*(M-1)/2 directed acyclic graphs are processed to obtain a topological graph, so that a subsequent breadth-first traversal can obtain the current in-degree of each target sentence. The topological graph is traversed with the breadth-first algorithm to obtain the current in-degree of each target sentence; all target sentences are pushed onto the stack according to their current in-degrees to obtain the stack queue; and the target abstract is obtained based on the stack queue, ensuring that the generated target abstract is coherent and fluent.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a document abstract generation apparatus is provided, and the apparatus corresponds one-to-one to the document abstract generation method in the above embodiments. As shown in FIG. 8, the document abstract generation apparatus includes an abstract generation request acquisition module 801, an original document acquisition module 802, a target sentence acquisition module 803, a directed acyclic graph acquisition module 804, and a target abstract acquisition module 805. The functional modules are described in detail as follows:
The abstract generation request acquisition module 801 is configured to acquire an abstract generation request, where the abstract generation request includes an abstract keyword.
The original document acquisition module 802 is configured to query the database based on the abstract keyword and obtain, from the initial documents stored in the database, N original documents corresponding to the abstract keyword.
The target sentence acquisition module 803 is configured to process the original documents with the pre-trained extractive document abstract model to obtain M target sentences.
The directed acyclic graph acquisition module 804 is configured to input the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences.
The target abstract acquisition module 805 is configured to obtain the target abstract based on the M*(M-1)/2 directed acyclic graphs.
Preferably, the original document acquisition module 802 includes an abstract type acquisition unit and an original document acquisition unit.
The abstract type acquisition unit is configured to query the classification table in the database based on the abstract keyword to obtain the abstract type corresponding to the abstract keyword.
The original document acquisition unit is configured to query the document abstracts in the database according to the abstract type, and determine N document abstracts containing the abstract keyword as the N original documents.
Preferably, the target sentence acquisition module 803 includes a segmentation processing unit, a sentence vector acquisition unit, a target encoding vector acquisition unit, and a scoring result acquisition unit.
The segmentation processing unit is configured to segment the original document to obtain at least two abstract sentences.
The sentence vector acquisition unit is configured to input all abstract sentences into the word embedding layer of the extractive document abstract model to obtain the sentence vector corresponding to each abstract sentence.
The target encoding vector acquisition unit is configured to input each sentence vector into the target encoding layer of the extractive document abstract model to obtain the target encoding vector corresponding to each sentence vector.
The scoring result acquisition unit is configured to input the target encoding vectors into the scoring encoding layer of the extractive document abstract model to obtain the scoring result corresponding to each abstract sentence.
The target sentence acquisition unit is configured to select, according to the scoring results of the target encoding vectors, the top M abstract sentences in descending order of score, and determine them as the M target sentences.
Preferably, the target encoding vector acquisition unit includes a first encoding subunit and a second encoding subunit.
The first encoding subunit is configured to input each sentence vector into the sentence encoding layer of the extractive document abstract model for encoding, and obtain the original encoding vector corresponding to the sentence vector.
The second encoding subunit is configured to input the original encoding vector into the document encoding layer of the extractive document abstract model for re-encoding, and obtain the target encoding vector.
Preferably, the target model combination includes a BERT model and an MLP model. The directed acyclic graph acquisition module 804 includes a sentence combination acquisition unit, a semantic information vector acquisition unit, and a directed acyclic graph acquisition unit.
The sentence combination acquisition unit is configured to combine the target sentences in pairs to obtain M*(M-1)/2 sentence combinations.
The semantic information vector acquisition unit is configured to input each sentence combination into the BERT model to obtain the semantic information vector corresponding to each sentence combination.
The directed acyclic graph acquisition unit is configured to input the semantic information vector into the MLP model to obtain the directed acyclic graph of the two target sentences.
Preferably, the target abstract acquisition module 805 includes a topological graph acquisition unit, an in-degree acquisition unit, a stack queue acquisition unit, and a target abstract acquisition unit.
The topological graph acquisition unit is configured to process the M*(M-1)/2 directed acyclic graphs to obtain the topological graph.
The in-degree acquisition unit is configured to traverse the topological graph using the breadth-first algorithm to obtain the current in-degree of each target sentence.
The stack queue acquisition unit is configured to push all target sentences onto the stack according to their current in-degrees to obtain the stack queue.
目标摘要获取单元,用于基于栈队列,获取目标摘要。The target digest obtaining unit is used to obtain the target digest based on the stack queue.
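The in-degree-driven traversal these units describe corresponds to a standard topological sort of the merged topology graph. The sketch below uses a queue-based Kahn's algorithm as an assumed equivalent of the "stack queue" wording; both approaches emit sentences in an order that respects every directed edge.

```python
from collections import deque

def topological_order(sentences, edges):
    # Kahn's algorithm: repeatedly emit sentences whose current in-degree is
    # zero, decrementing the in-degrees of their successors. This mirrors the
    # breadth-first traversal of the topology graph described above.
    indegree = {s: 0 for s in sentences}
    successors = {s: [] for s in sentences}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(s for s in sentences if indegree[s] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

# Hypothetical three-sentence topology graph (edges point from earlier to later).
sentences = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]
print(" ".join(topological_order(sentences, edges)))  # A B C
```

Joining the ordered sentences then yields the target abstract text.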
For the specific limitations of the document abstract generating apparatus, reference may be made to the limitations of the document abstract generation method above, which are not repeated here. Each module in the above document abstract generating apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store the directed acyclic graphs. The network interface of the computer device communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement a document abstract generation method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, it implements the steps of the document abstract generation method in the above embodiments, such as steps S201-S205 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 7; to avoid repetition, they are not described again here. Alternatively, when the processor executes the computer-readable instructions, it implements the functions of each module/unit in the embodiment of the document abstract generating apparatus, such as the abstract generation request acquisition module 801, the original document acquisition module 802, the target sentence acquisition module 803, the directed acyclic graph acquisition module 804, and the target abstract acquisition module 805 shown in FIG. 8; to avoid repetition, these are not described again here. The readable storage medium provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by a processor, they implement the steps of the document abstract generation method in the above embodiments, such as steps S201-S205 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 7; to avoid repetition, they are not described again here. Alternatively, when executed by the processor, the computer-readable instructions implement the functions of each module/unit in the embodiment of the document abstract generating apparatus, such as the abstract generation request acquisition module 801, the original document acquisition module 802, the target sentence acquisition module 803, the directed acyclic graph acquisition module 804, and the target abstract acquisition module 805 shown in FIG. 8; to avoid repetition, these are not described again here.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing relevant hardware; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. A document abstract generation method, comprising:
    obtaining an abstract generation request, wherein the abstract generation request includes abstract keywords;
    querying a database based on the abstract keywords, and obtaining, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
    processing the original documents with a pre-trained extractive document abstract model to obtain M target sentences;
    inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences;
    obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs.
  2. The document abstract generation method according to claim 1, wherein the querying a database based on the abstract keywords and obtaining, from the initial documents stored in the database, N original documents corresponding to the abstract keywords comprises:
    querying a classification table in the database based on the abstract keywords to obtain an abstract type corresponding to the abstract keywords;
    querying document abstracts in the database according to the abstract type, and determining N document abstracts containing the abstract keywords as the N original documents.
  3. The document abstract generation method according to claim 1, wherein the processing the original documents with a pre-trained extractive document abstract model to obtain M target sentences comprises:
    segmenting the original documents to obtain at least two abstract sentences;
    inputting all the abstract sentences into a word embedding layer of the extractive document abstract model to obtain a sentence vector corresponding to each abstract sentence;
    inputting each sentence vector into a target encoding layer of the extractive document abstract model to obtain a target encoding vector corresponding to each sentence vector;
    inputting the target encoding vectors into a scoring encoding layer of the extractive document abstract model to obtain a scoring result corresponding to each abstract sentence;
    selecting, from the scoring results of the plurality of target encoding vectors, the top M abstract sentences in descending order of score, and determining them as the M target sentences.
  4. The document abstract generation method according to claim 3, wherein the inputting each sentence vector into the target encoding layer of the extractive document abstract model to obtain the target encoding vector corresponding to each sentence vector comprises:
    inputting each sentence vector into a sentence encoding layer of the extractive document abstract model for encoding, to obtain an original encoding vector corresponding to that sentence vector;
    inputting the original encoding vector into a document encoding layer of the extractive document abstract model for re-encoding, to obtain the target encoding vector.
  5. The document abstract generation method according to claim 1, wherein the target model combination includes a BERT model and an MLP model;
    the inputting the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs for any two of the target sentences comprises:
    combining the target sentences in pairs to obtain M*(M-1)/2 sentence combinations;
    inputting each sentence combination into the BERT model to obtain a semantic information vector corresponding to each sentence combination;
    inputting the semantic information vectors into the MLP model to obtain a directed acyclic graph for any two of the target sentences.
  6. The document abstract generation method according to claim 1, wherein the obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs comprises:
    processing the M*(M-1)/2 directed acyclic graphs to obtain a topology graph;
    traversing the topology graph using a breadth-first algorithm to obtain the current in-degree of each target sentence;
    pushing all the target sentences onto the stack according to the current in-degrees to obtain a stack queue;
    obtaining the target abstract based on the stack queue.
  7. A document abstract generating apparatus, comprising:
    an abstract generation request acquisition module, configured to obtain an abstract generation request, wherein the abstract generation request includes abstract keywords;
    an original document acquisition module, configured to query a database based on the abstract keywords and obtain, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
    a target sentence acquisition module, configured to process the original documents with a pre-trained extractive document abstract model to obtain M target sentences;
    a directed acyclic graph acquisition module, configured to input the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences;
    a target abstract acquisition module, configured to obtain a target abstract based on the M*(M-1)/2 directed acyclic graphs.
  8. The document abstract generating apparatus according to claim 7, wherein the original document acquisition module comprises:
    an abstract type acquisition unit, configured to query a classification table in the database based on the abstract keywords to obtain an abstract type corresponding to the abstract keywords;
    an original document acquisition unit, configured to query document abstracts in the database according to the abstract type, and determine N document abstracts containing the abstract keywords as the N original documents.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    obtaining an abstract generation request, wherein the abstract generation request includes abstract keywords;
    querying a database based on the abstract keywords, and obtaining, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
    processing the original documents with a pre-trained extractive document abstract model to obtain M target sentences;
    inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences;
    obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs.
  10. The computer device according to claim 9, wherein the querying a database based on the abstract keywords and obtaining, from the initial documents stored in the database, N original documents corresponding to the abstract keywords comprises:
    querying a classification table in the database based on the abstract keywords to obtain an abstract type corresponding to the abstract keywords;
    querying document abstracts in the database according to the abstract type, and determining N document abstracts containing the abstract keywords as the N original documents.
  11. The computer device according to claim 9, wherein the processing the original documents with a pre-trained extractive document abstract model to obtain M target sentences comprises:
    segmenting the original documents to obtain at least two abstract sentences;
    inputting all the abstract sentences into a word embedding layer of the extractive document abstract model to obtain a sentence vector corresponding to each abstract sentence;
    inputting each sentence vector into a target encoding layer of the extractive document abstract model to obtain a target encoding vector corresponding to each sentence vector;
    inputting the target encoding vectors into a scoring encoding layer of the extractive document abstract model to obtain a scoring result corresponding to each abstract sentence;
    selecting, from the scoring results of the plurality of target encoding vectors, the top M abstract sentences in descending order of score, and determining them as the M target sentences.
  12. The computer device according to claim 11, wherein the inputting each sentence vector into the target encoding layer of the extractive document abstract model to obtain the target encoding vector corresponding to each sentence vector comprises:
    inputting each sentence vector into a sentence encoding layer of the extractive document abstract model for encoding, to obtain an original encoding vector corresponding to that sentence vector;
    inputting the original encoding vector into a document encoding layer of the extractive document abstract model for re-encoding, to obtain the target encoding vector.
  13. The computer device according to claim 9, wherein the target model combination includes a BERT model and an MLP model;
    the inputting the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs for any two of the target sentences comprises:
    combining the target sentences in pairs to obtain M*(M-1)/2 sentence combinations;
    inputting each sentence combination into the BERT model to obtain a semantic information vector corresponding to each sentence combination;
    inputting the semantic information vectors into the MLP model to obtain a directed acyclic graph for any two of the target sentences.
  14. The computer device according to claim 9, wherein the obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs comprises:
    processing the M*(M-1)/2 directed acyclic graphs to obtain a topology graph;
    traversing the topology graph using a breadth-first algorithm to obtain the current in-degree of each target sentence;
    pushing all the target sentences onto the stack according to the current in-degrees to obtain a stack queue;
    obtaining the target abstract based on the stack queue.
  15. One or more readable storage media storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
    obtaining an abstract generation request, wherein the abstract generation request includes abstract keywords;
    querying a database based on the abstract keywords, and obtaining, from initial documents stored in the database, N original documents corresponding to the abstract keywords;
    processing the original documents with a pre-trained extractive document abstract model to obtain M target sentences;
    inputting the M target sentences into a trained target model combination to obtain M*(M-1)/2 directed acyclic graphs corresponding to the target sentences;
    obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs.
  16. The readable storage medium according to claim 15, wherein the querying a database based on the abstract keywords and obtaining, from the initial documents stored in the database, N original documents corresponding to the abstract keywords comprises:
    querying a classification table in the database based on the abstract keywords to obtain an abstract type corresponding to the abstract keywords;
    querying document abstracts in the database according to the abstract type, and determining N document abstracts containing the abstract keywords as the N original documents.
  17. The readable storage medium according to claim 15, wherein the processing the original documents with a pre-trained extractive document abstract model to obtain M target sentences comprises:
    segmenting the original documents to obtain at least two abstract sentences;
    inputting all the abstract sentences into a word embedding layer of the extractive document abstract model to obtain a sentence vector corresponding to each abstract sentence;
    inputting each sentence vector into a target encoding layer of the extractive document abstract model to obtain a target encoding vector corresponding to each sentence vector;
    inputting the target encoding vectors into a scoring encoding layer of the extractive document abstract model to obtain a scoring result corresponding to each abstract sentence;
    selecting, from the scoring results of the plurality of target encoding vectors, the top M abstract sentences in descending order of score, and determining them as the M target sentences.
  18. The readable storage medium according to claim 17, wherein the inputting each sentence vector into the target encoding layer of the extractive document abstract model to obtain the target encoding vector corresponding to each sentence vector comprises:
    inputting each sentence vector into a sentence encoding layer of the extractive document abstract model for encoding, to obtain an original encoding vector corresponding to that sentence vector;
    inputting the original encoding vector into a document encoding layer of the extractive document abstract model for re-encoding, to obtain the target encoding vector.
  19. The readable storage medium according to claim 15, wherein the target model combination includes a BERT model and an MLP model;
    the inputting the M target sentences into the trained target model combination to obtain M*(M-1)/2 directed acyclic graphs for any two of the target sentences comprises:
    combining the target sentences in pairs to obtain M*(M-1)/2 sentence combinations;
    inputting each sentence combination into the BERT model to obtain a semantic information vector corresponding to each sentence combination;
    inputting the semantic information vectors into the MLP model to obtain a directed acyclic graph for any two of the target sentences.
  20. The readable storage medium according to claim 15, wherein the obtaining a target abstract based on the M*(M-1)/2 directed acyclic graphs comprises:
    processing the M*(M-1)/2 directed acyclic graphs to obtain a topology graph;
    traversing the topology graph using a breadth-first algorithm to obtain the current in-degree of each target sentence;
    pushing all the target sentences onto the stack according to the current in-degrees to obtain a stack queue;
    obtaining the target abstract based on the stack queue.
      
PCT/CN2021/084241 2020-12-30 2021-03-31 Document abstract generation method and apparatus, computer device, and storage medium WO2022141872A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011623844.3 2020-12-30
CN202011623844.3A CN112732898A (en) 2020-12-30 2020-12-30 Document abstract generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022141872A1 true WO2022141872A1 (en) 2022-07-07

Family

ID=75609644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084241 WO2022141872A1 (en) 2020-12-30 2021-03-31 Document abstract generation method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112732898A (en)
WO (1) WO2022141872A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809329A (en) * 2023-01-30 2023-03-17 医智生命科技(天津)有限公司 Method for generating abstract of long text
CN116912047A (en) * 2023-09-13 2023-10-20 湘潭大学 Patent structure perception similarity detection method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407710A (en) * 2021-06-07 2021-09-17 维沃移动通信有限公司 Information display method and device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100228693A1 (en) * 2009-03-06 2010-09-09 phiScape AG Method and system for generating a document representation
US20150339288A1 (en) * 2014-05-23 2015-11-26 Codeq Llc Systems and Methods for Generating Summaries of Documents
CN108280112A (en) * 2017-06-22 2018-07-13 腾讯科技(深圳)有限公司 Abstraction generating method, device and computer equipment
CN109657054A (en) * 2018-12-13 2019-04-19 北京百度网讯科技有限公司 Abstraction generating method, device, server and storage medium
CN111414471A (en) * 2020-03-20 2020-07-14 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN111858913A (en) * 2020-07-08 2020-10-30 北京嘀嘀无限科技发展有限公司 Method and system for automatically generating text abstract

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100228693A1 (en) * 2009-03-06 2010-09-09 phiScape AG Method and system for generating a document representation
US20150339288A1 (en) * 2014-05-23 2015-11-26 Codeq Llc Systems and Methods for Generating Summaries of Documents
CN108280112A (en) * 2017-06-22 2018-07-13 腾讯科技(深圳)有限公司 Abstraction generating method, device and computer equipment
CN109657054A (en) * 2018-12-13 2019-04-19 北京百度网讯科技有限公司 Abstraction generating method, device, server and storage medium
CN111414471A (en) * 2020-03-20 2020-07-14 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN111858913A (en) * 2020-07-08 2020-10-30 北京嘀嘀无限科技发展有限公司 Method and system for automatically generating text abstract

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809329A (en) * 2023-01-30 2023-03-17 医智生命科技(天津)有限公司 Method for generating abstract of long text
CN116912047A (en) * 2023-09-13 2023-10-20 湘潭大学 Patent structure perception similarity detection method
CN116912047B (en) * 2023-09-13 2023-11-28 湘潭大学 Patent structure perception similarity detection method

Also Published As

Publication number Publication date
CN112732898A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
US20200342177A1 (en) Capturing rich response relationships with small-data neural networks
US11645314B2 (en) Interactive information retrieval using knowledge graphs
US10832011B2 (en) Question answering system using multilingual information sources
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
US10025819B2 (en) Generating a query statement based on unstructured input
US20200184307A1 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
US9318027B2 (en) Caching natural language questions and results in a question and answer system
WO2022141872A1 (en) Document abstract generation method and apparatus, computer device, and storage medium
WO2021114810A1 (en) Graph structure-based official document recommendation method, apparatus, computer device, and medium
CN111026319B (en) Intelligent text processing method and device, electronic equipment and storage medium
JP2021152963A (en) Word meaning feature generating method, model training method, apparatus, device, medium, and program
US11551437B2 (en) Collaborative information extraction
WO2020244065A1 (en) Character vector definition method, apparatus and device based on artificial intelligence, and storage medium
WO2022088671A1 (en) Automated question answering method and apparatus, device, and storage medium
CN111026320B (en) Multi-mode intelligent text processing method and device, electronic equipment and storage medium
US11263400B2 (en) Identifying entity attribute relations
US20220300543A1 (en) Method of retrieving query, electronic device and medium
US11861918B2 (en) Image analysis for problem resolution
WO2023045187A1 (en) Semantic search method and apparatus based on neural network, device, and storage medium
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
WO2023231331A1 (en) Knowledge extraction method, system and device, and storage medium
WO2020106644A1 (en) Transliteration of data records for improved data matching
US20220114361A1 (en) Multi-word concept tagging for images using short text decoder
CN111142728B (en) Vehicle-mounted environment intelligent text processing method and device, electronic equipment and storage medium
CN112632258A (en) Text data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912642

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912642

Country of ref document: EP

Kind code of ref document: A1