WO2022262266A1 - Text abstract generation method and apparatus, and computer device and storage medium - Google Patents

Text abstract generation method and apparatus, and computer device and storage medium

Info

Publication number
WO2022262266A1
WO2022262266A1 PCT/CN2022/071791 CN2022071791W WO2022262266A1 WO 2022262266 A1 WO2022262266 A1 WO 2022262266A1 CN 2022071791 W CN2022071791 W CN 2022071791W WO 2022262266 A1 WO2022262266 A1 WO 2022262266A1
Authority
WO
WIPO (PCT)
Prior art keywords
clauses
clause
sentence
similarity
recommended
Application number
PCT/CN2022/071791
Other languages
French (fr)
Chinese (zh)
Inventor
李夏昕
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022262266A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The present application relates to the field of artificial intelligence. Provided are a text abstract generation method and apparatus, and a computer device and a storage medium. The defects whereby a traditional TextRank algorithm, when calculating the similarity of plain text, does not distinguish the importance of different terms and also does not filter out unimportant words according to the parts of speech to which said words belong can be effectively overcome, thereby improving the possibility of sentences having high business relevance being selected for an abstract; the before-after proximity relationship between sentences and the positions of the sentences in an original article are fully considered during a modeling process, thereby effectively overcoming the problem of inaccurate text abstract generation caused by the fact that the importance of the positional sequence of sentences in an article is not considered in a traditional mode; and post-processing is added on the basis of traditional text abstract generation, and an abstract result acquired by means of a graph algorithm is corrected, thereby improving the quality of a finally-output abstract, and thus realizing more accurate text abstract generation on the basis of artificial intelligence. The present application further relates to blockchain technology, and an abstract sentence may be stored in a blockchain node.

Description

Text summary generation method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 202110679639.7, entitled "Text summary generation method, apparatus, computer device, and storage medium" and filed with the China Patent Office on June 18, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a text summary generation method, apparatus, computer device, and storage medium.
Background
Text summarization is an important technology in the field of artificial intelligence. For humans, reading a long text and distilling its core content is a natural ability. For computers, however, it represents progress and breakthroughs in some of the most challenging technology in artificial intelligence. Today's Internet carries a massive amount of text, much of it medium-length or long. Having machines understand these texts and extract core summaries can support many applications that benefit society, such as media monitoring, search engine marketing and optimization, financial and legal text analysis, social media marketing, indexing of books and literature, video conference summaries, and automatic content creation.
Existing text summarization techniques can be divided along one axis into supervised and unsupervised approaches, and along another into extractive and generative (abstractive) approaches. Supervised summarization requires a large amount of manually labeled data. Labeling summaries by hand is laborious and costly, and different annotators judge the core content of an article differently, so industrial deployments generally adopt unsupervised schemes. Extractive summarization generally extracts important content from the original article sentence by sentence and concatenates it into a summary. Generative summarization produces the summary directly with deep-learning seq2seq (Sequence to Sequence) models, which involves semantic representation, inference, and natural language generation, all of which are hard to deploy. Generative summarization is therefore more of a research focus in academia, and its results in industry are not ideal.
At present, the approach most commonly deployed in industry is unsupervised extractive summarization, with specific methods based on graphs, topic models, centrality, and information redundancy. Among these, the graph-based TextRank algorithm is the most classic and the most widely used. TextRank is fairly general and suits texts from many domains as well as medium-length and long texts, but it has several defects: (1) In TextRank, the edge between two graph nodes is a single undirected edge with a single weight, so as far as that edge is concerned the sentences at its two ends carry equal weight. Yet when any two sentences of an article are compared on their own, one should generally be more important than the other. (2) In TextRank, every pair of nodes is connected by an edge, which amounts to modeling all sentences of the article mixed together, without considering which sentences are adjacent or where they sit in the original article. When extracting a summary, however, a sentence's position and its relation to neighboring sentences matter a great deal: sentences that open or close an article or paragraph, and sentences that summarize, are very likely to be summary sentences. (3) When computing edge weights, TextRank considers only the plain-text similarity between two sentences and ignores semantic similarity, so it misses sentences that are worded differently but mean similar things. (4) When computing plain-text similarity, TextRank neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, so the accuracy of the plain-text similarity leaves room for improvement.
The inventor realized that these defects degrade the final summary. Existing text summary generation techniques also lack any correction of the generated summary, and the summaries output by TextRank generally have problems of their own, so the generated summary is not ideal.
Summary of the Invention
The embodiments of the present application provide a text summary generation method, apparatus, computer device, and storage medium that can achieve more accurate text summary generation based on artificial intelligence.
In a first aspect, an embodiment of the present application provides a text summary generation method, which includes:
in response to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;
obtaining a dictionary according to the task scenario and segmenting the data to be processed with it to obtain a plurality of clauses;
calculating the mutual recommendation degree between every two clauses in the plurality of clauses;
calculating the semantic similarity between every two clauses in the plurality of clauses;
calculating the positional similarity between every two clauses in the plurality of clauses;
fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
screening the clauses according to their importance to obtain candidate clauses;
post-processing the candidate clauses to obtain summary sentences.
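As an illustration of the importance-scoring and screening steps listed above, the following is a minimal sketch of a PageRank-style power iteration over a weighted adjacency matrix. It is not taken from the application: the function names `textrank_scores` and `top_k_clauses`, the damping factor of 0.85, the convention that `adj[i][j]` is the weight with which clause i recommends clause j, and the top-k screening rule are all assumptions made for the example. How the three similarity matrices are fused into `adj` is not covered here.

```python
from typing import List


def textrank_scores(adj: List[List[float]], damping: float = 0.85,
                    iterations: int = 100, tol: float = 1e-8) -> List[float]:
    """PageRank-style scores for a weighted graph.

    Assumption: adj[i][j] is the weight with which clause i recommends clause j.
    """
    n = len(adj)
    if n == 0:
        return []
    out_weight = [sum(row) for row in adj]  # total outgoing weight per node
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Weighted votes that node j receives from every node i.
            incoming = sum(
                scores[i] * adj[i][j] / out_weight[i]
                for i in range(n) if out_weight[i] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


def top_k_clauses(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring clauses, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because the matrix may be asymmetric, a clause can pass more weight to a neighbor than it receives back, which is one way to address the single undirected edge criticized in the background section.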
In a second aspect, an embodiment of the present application provides a text summary generation apparatus, which includes:
an acquisition unit, configured to acquire, in response to a text summary generation instruction, data to be processed according to the text summary generation instruction;
a segmentation unit, configured to obtain a dictionary according to the task scenario and segment the data to be processed with it to obtain a plurality of clauses;
a calculation unit, configured to calculate the mutual recommendation degree between every two clauses in the plurality of clauses;
the calculation unit being further configured to calculate the semantic similarity between every two clauses in the plurality of clauses;
the calculation unit being further configured to calculate the positional similarity between every two clauses in the plurality of clauses;
a fusion unit, configured to fuse the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
the calculation unit being further configured to input the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
a screening unit, configured to screen the clauses according to their importance to obtain candidate clauses;
a post-processing unit, configured to post-process the candidate clauses to obtain summary sentences.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
in response to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;
obtaining a dictionary according to the task scenario and segmenting the data to be processed with it to obtain a plurality of clauses;
calculating the mutual recommendation degree between every two clauses in the plurality of clauses;
calculating the semantic similarity between every two clauses in the plurality of clauses;
calculating the positional similarity between every two clauses in the plurality of clauses;
fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
screening the clauses according to their importance to obtain candidate clauses;
post-processing the candidate clauses to obtain summary sentences.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
in response to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;
obtaining a dictionary according to the task scenario and segmenting the data to be processed with it to obtain a plurality of clauses;
calculating the mutual recommendation degree between every two clauses in the plurality of clauses;
calculating the semantic similarity between every two clauses in the plurality of clauses;
calculating the positional similarity between every two clauses in the plurality of clauses;
fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
screening the clauses according to their importance to obtain candidate clauses;
post-processing the candidate clauses to obtain summary sentences.
The embodiments of the present application provide a text summary generation method, apparatus, computer device, and storage medium. They overcome the defect that the traditional TextRank algorithm, when computing plain-text similarity, neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, which raises the likelihood that sentences with strong business relevance are selected for the summary. The modeling process takes into account the adjacency of sentences and their positions in the original article, which addresses the inaccurate summaries that result when the traditional approach ignores sentence position and order. Post-processing is added on top of traditional summary generation to correct the summary produced by the graph algorithm, which improves the quality of the final output and thus achieves more accurate text summary generation based on artificial intelligence.
Brief Description of the Drawings
To describe the technical solutions of the embodiments more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the text summary generation method provided by an embodiment of the present application;
FIG. 2 is a schematic block diagram of the text summary generation apparatus provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Refer to FIG. 1, which is a schematic flowchart of the text summary generation method provided by an embodiment of the present application.
S10: In response to a text summary generation instruction, acquire data to be processed according to the text summary generation instruction.
In this embodiment, the text summary generation instruction may be triggered by relevant staff, such as media monitors or online educators.
In at least one embodiment of the present application, acquiring the data to be processed according to the text summary generation instruction includes:
detecting the information uploaded together with the triggering of the text summary generation instruction;
obtaining an address from the information as a target address;
connecting to the target address and retrieving the data stored there as the data to be processed.
The target address may include, but is not limited to, a web page address, a folder address, or a database address.
In other embodiments, when the uploaded information itself contains the data to be processed, the data is extracted directly. For example, if the user uploads the data to be processed while triggering the text summary generation instruction, the data can be obtained directly from the instruction.
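The two acquisition paths of S10 can be sketched as follows. This is a minimal sketch under stated assumptions: the instruction is modeled as a mapping with hypothetical keys `data` and `address`, and the only address type handled is a local file path. Web page and database addresses, which the text also mentions, would need their own clients and are left out.

```python
from pathlib import Path
from typing import Mapping


def get_pending_data(instruction: Mapping[str, str]) -> str:
    """Return the text to summarize for one instruction.

    Assumed layout: {"data": "..."} when the text is uploaded with the
    instruction, otherwise {"address": "<path to a UTF-8 text file>"}.
    """
    if "data" in instruction:
        # The data to be processed was uploaded with the instruction.
        return instruction["data"]
    target = instruction.get("address")
    if target is None:
        raise ValueError("instruction carries neither data nor an address")
    # Treat the target address as a local file path (assumption).
    return Path(target).read_text(encoding="utf-8")
```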
S11: Obtain a dictionary according to the task scenario and segment the data to be processed with it to obtain a plurality of clauses.
In this embodiment, this step includes:
identifying the current task scenario;
retrieving the dictionary that matches the current task scenario as the target dictionary;
segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
For example, when the current task scenario is a financial scenario, a financial dictionary matching that scenario is obtained as the target dictionary and used to split the data to be processed into sentences and words, yielding finance-related terms: sents = [s_1, s_2, …, s_i], where s_i = [w_1/t_1, w_2/t_2, …, w_n/t_n]. Here sents is the list of clauses of the data to be processed, s_i is the i-th clause in sents, w_n is the n-th token of a clause, t_n is the part of speech of that token, and i and n are positive integers.
This embodiment may split the data to be processed at sentence punctuation such as periods, question marks, and exclamation marks, and may load the target dictionary into a word segmentation tool (for example, a Chinese word segmentation tool) so that business-related terms are segmented correctly.
With this implementation, sentences are segmented using a dictionary specific to the task scenario, so that business-related terms are segmented more reliably.
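The segmentation in S11 can be illustrated with a small, dependency-free sketch. It is not the applicant's implementation: the scenario dictionary `SCENE_DICTS`, its entries and part-of-speech tags, and the forward-maximum-matching tokenizer are stand-ins chosen for the example, where a production system would more likely load the target dictionary into an existing word segmentation tool. The output has the shape described above, a list of clauses, each a list of (word, part of speech) pairs.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical scenario dictionaries: term -> part-of-speech tag.
SCENE_DICTS: Dict[str, Dict[str, str]] = {
    "finance": {"净利润": "n", "同比": "d", "增长": "v"},
}


def split_clauses(text: str) -> List[str]:
    """Split text after sentence-ending punctuation marks."""
    parts = re.split(r"(?<=[。？！?!])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(clause: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Forward maximum matching; characters not in the lexicon get tag 'x'."""
    max_len = max((len(term) for term in lexicon), default=1)
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(clause):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(clause) - pos), 0, -1):
            piece = clause[pos:pos + size]
            if piece in lexicon or size == 1:
                tokens.append((piece, lexicon.get(piece, "x")))
                pos += size
                break
    return tokens


def segment(text: str, scene: str) -> List[List[Tuple[str, str]]]:
    """Return sents = [s_1, ...], each s_i a list of (word, tag) pairs."""
    lexicon = SCENE_DICTS[scene]
    return [tokenize(clause, lexicon) for clause in split_clauses(text)]
```

The ASCII period is deliberately left out of the split pattern so that decimals such as 3.5 are not cut in half; English text would need a more careful rule.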
S12: Calculate the mutual recommendation degree between every two clauses in the plurality of clauses.
In this embodiment, this calculation includes:
configuring a word weight for each word in the plurality of clauses according to received configuration requirements;
for the plurality of clauses, taking the words that occur in both clauses of each pair as target words;
determining the word weight and part of speech of each target word;
calculating the text similarity between every two clauses from the word weights and parts of speech of the target words to obtain a recommendation degree matrix;
applying L2 normalization to the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
The configuration requirements may be uploaded by the user.
The text similarity is calculated with the following formula:
[Formula image: Figure PCTCN2022071791-appb-000001]
Here mat_t(S_i, S_j) denotes the mutual recommendation degree between any two clauses S_i and S_j, W_k denotes a word that occurs in both S_i and S_j, TermWeight denotes the weight of a word, T_k denotes the part of speech of W_k, and valid_postags denotes the set of valid parts of speech.
The valid parts of speech are nouns, verbs, adjectives, and adverbs, the four parts of speech most closely tied to sentence semantics.
When the common-word (W_k) score of two sentences is calculated, business terms of different importance receive different weights. For example, a product-name term may be weighted at 2 times a general term, and a disease-name or competitor-company-name term at 1.5 times a general term. The specific weight values can be found by a parameter search guided by regression-test results.
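The weighting and part-of-speech filtering described above can be sketched as follows. The exact formula is given only as an image in the source, so this sketch assumes the simplest reading: the score of a clause pair is the sum of TermWeight(W_k) over shared words W_k whose tag T_k is in valid_postags. Any further terms in the original formula, such as a length normalization or an asymmetry between the two directions, are not reproduced. The tag names `n`, `v`, `a`, `d` and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumed tag names for nouns, verbs, adjectives, and adverbs.
VALID_POSTAGS = {"n", "v", "a", "d"}


def common_word_score(si: List[Token], sj: List[Token],
                      term_weight: Dict[str, float],
                      default_weight: float = 1.0) -> float:
    """Sum the weights of shared words whose tag is a valid part of speech."""
    tags_i = {word: tag for word, tag in si}
    words_j = {word for word, _ in sj}
    return float(sum(
        term_weight.get(word, default_weight)
        for word, tag in tags_i.items()
        if word in words_j and tag in VALID_POSTAGS
    ))


def recommendation_matrix(clauses: List[List[Token]],
                          term_weight: Dict[str, float]) -> List[List[float]]:
    """Pairwise scores; the diagonal is set to 0 (assumption)."""
    n = len(clauses)
    return [
        [0.0 if i == j
         else common_word_score(clauses[i], clauses[j], term_weight)
         for j in range(n)]
        for i in range(n)
    ]
```

With `term_weight = {"产品A": 2.0}`, a shared product name contributes twice as much as a shared general term, matching the 2 times example in the text.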
Further, L2 normalization is applied to mat_t(S_i, S_j); that is, every element of the matrix is divided by norm_val, where (formula image Figure PCTCN2022071791-appb-000002)

$$\mathrm{norm\_val} = \sqrt{\sum_{i,j} \mathrm{mat}_t(S_i, S_j)^2}$$

The expression under the root sign is the sum of squares of all elements of the matrix mat_t(S_i, S_j).
L2 is also known as a regularization term or penalty term: a term appended to a loss function to constrain the model's parameters and prevent overfitting. The L2 penalty corresponds to a Gaussian prior on the parameters and is fully differentiable.
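The normalization step, dividing every element by the square root of the sum of squares of all elements, can be written directly. This is a minimal sketch; the function name `l2_normalize` and the handling of an all-zero matrix (returned unchanged to avoid dividing by zero) are assumptions, since the source does not discuss that case.

```python
import math
from typing import List


def l2_normalize(mat: List[List[float]]) -> List[List[float]]:
    """Divide every element by norm_val, the L2 norm over all elements."""
    norm_val = math.sqrt(sum(x * x for row in mat for x in row))
    if norm_val == 0.0:
        # Assumption: leave an all-zero matrix as it is.
        return [list(row) for row in mat]
    return [[x / norm_val for x in row] for row in mat]
```

After this step the squared elements of the matrix sum to 1, which puts the recommendation scores on a scale that does not depend on how many words the clauses share in absolute terms.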
In this implementation, only nouns, verbs, adjectives, and adverbs, the four parts of speech most closely tied to sentence semantics, are retained, and business terms of different importance receive different weights when the common-word (W_k) score is calculated. Combined with the normalization, this overcomes the defect that the traditional TextRank algorithm, when computing plain-text similarity, neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, and it raises the likelihood that sentences with strong business relevance are selected for the summary.
S13: Calculate the semantic similarity between every two clauses in the plurality of clauses.
In this embodiment, this calculation includes:
vectorizing each clause to obtain its embedding vector representation;
calculating the cosine similarity between every two clauses from their embedding vector representations;
taking the cosine similarity between every two clauses as their semantic similarity.
Specifically, the semantic similarity between sentences is calculated with the following formula:
mat_s(S_i, S_j) = cosine_similarity(s_i-embed, s_j-embed)
Here mat_s(S_i, S_j) denotes the semantic similarity between any two clauses S_i and S_j, s_i-embed denotes the embedding vector representation of clause S_i, s_j-embed denotes the embedding vector representation of clause S_j, and cosine_similarity denotes the cosine similarity computation.
This implementation avoids the defect of the traditional algorithm, which considers only the plain-text similarity between two sentences and ignores semantic similarity.
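The cosine similarity used for mat_s can be sketched in plain Python. The embedding model is not specified at this point in the text, so the sketch takes the clause vectors as given; the function names and the choice to return 0 for a zero vector are assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def semantic_matrix(embeddings: List[Sequence[float]]) -> List[List[float]]:
    """mat_s[i][j] = cosine similarity of clause embeddings i and j."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]
```

Unlike the common-word score, this measure can be high for two clauses that share no words at all, provided the embedding places them close together.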
S14. Calculate the positional similarity between every two clauses in the plurality of clauses.
In this embodiment, calculating the positional similarity between every two clauses in the plurality of clauses includes:
determining every two clauses as a group of clauses, wherein the two clauses in each group serve as the recommending sentence and the recommended sentence for each other;
when any clause is the recommended sentence, determining the position of the recommended sentence in its paragraph, and when the recommended sentence is ranked within the front preset positions or the rear preset positions of its paragraph, determining that the corresponding matrix cell value is a first value;
when any clause is the recommending sentence, determining the position of the recommending sentence in its paragraph, and when the recommending sentence is ranked within the front preset positions or the rear preset positions of its paragraph, determining that the corresponding matrix cell value is a second value;
when both the recommending sentence and the recommended sentence in any group are ranked within the front preset positions or the rear preset positions of their paragraphs, determining that the corresponding matrix cell value is a third value;
when neither the recommending sentence nor the recommended sentence in any group is ranked within the front preset positions or the rear preset positions of its paragraph, determining that the corresponding matrix cell value is a fourth value;
when any clause is the recommended sentence and has a specified attribute, determining that the corresponding matrix cell value is the first value;
performing matrix conversion according to the matrix cell values to obtain the positional similarity between every two clauses.
The first value, the second value, the third value, and the fourth value can be custom-configured. For example, in this embodiment, the first value may be configured as 2, the second value as 1.5, the third value as 2.5, and the fourth value as 1.
The front preset positions and the rear preset positions can also be custom-configured. For example, the front preset positions may be configured as the first 5%, and correspondingly, the rear preset positions as the last 5%.
The specified attribute may be a summarizing attribute, that is, a sentence with the specified attribute is a summarizing sentence.
Through the above implementation, the adjacency relationships between sentences and their positions in the original article are fully considered during modeling, which effectively overcomes the problem of inaccurate text summary generation caused in traditional approaches by ignoring the importance of sentence position and order in the article.
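One possible reading of the positional rules above is sketched below. Several points are assumptions made for illustration and are not stated in the text: rows index the recommending clause and columns the recommended clause, the "both at an edge" rule takes precedence over the single-clause rules, at least one slot at each end of a paragraph counts as an edge position, and the diagonal is left at 0. The names `is_edge` and `position_matrix` are hypothetical.

```python
import math

# Example values given in the text for the four configurable cell values.
FIRST, SECOND, THIRD, FOURTH = 2.0, 1.5, 2.5, 1.0

def is_edge(pos, length, frac=0.05):
    """True if 0-based index `pos` lies in the first or last `frac` of a paragraph."""
    k = max(1, math.ceil(frac * length))  # assumption: at least one slot per end
    return pos < k or pos >= length - k

def position_matrix(positions, is_summary=None):
    """positions[i] = (index_in_paragraph, paragraph_length) for clause i."""
    n = len(positions)
    is_summary = is_summary or [False] * n
    edge = [is_edge(p, length) for p, length in positions]
    mat_o = [[0.0] * n for _ in range(n)]
    for i in range(n):          # i: recommending clause (assumed row index)
        for j in range(n):      # j: recommended clause (assumed column index)
            if i == j:
                continue        # no self-recommendation in this sketch
            if edge[i] and edge[j]:
                mat_o[i][j] = THIRD
            elif edge[j] or is_summary[j]:
                mat_o[i][j] = FIRST
            elif edge[i]:
                mat_o[i][j] = SECOND
            else:
                mat_o[i][j] = FOURTH
    return mat_o

# One paragraph of four clauses: indices 0 and 3 are the edge positions.
mat_o = position_matrix([(0, 4), (1, 4), (2, 4), (3, 4)])
```

Because the first and second values differ, `mat_o[0][1]` and `mat_o[1][0]` are not equal, so the matrix is asymmetric by construction.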
S15. Fuse the mutual recommendation degree between every two clauses, the semantic similarity between every two clauses, and the positional similarity between every two clauses to obtain a graph adjacency matrix.
In at least one embodiment of the present application, the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses are fused using the following formula to obtain the graph adjacency matrix:
$$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s) \odot \mathrm{mat}_o$$
where mat_adjc denotes the graph adjacency matrix, mat_t denotes the mutual recommendation degree between every two clauses, mat_s denotes the semantic similarity between every two clauses, mat_o denotes the positional similarity between every two clauses, α denotes the weight of the mutual recommendation degree, β denotes the weight of the semantic similarity, α>0, β>0, and α+β=1.
In this embodiment, (α mat_t + β mat_s) is multiplied element-wise by mat_o, after which the symmetric matrix (α mat_t + β mat_s) is no longer symmetric. As a result, in mat_adjc the similarity between sentences is affected by the positions of the sentences in the text.
It should be noted that in traditional summary extraction schemes, the edge connecting two graph nodes is a single undirected edge with a single weight; seen from this single undirected edge, the sentences at its two end nodes carry equal weight. However, when any two sentences of an article are compared in isolation, their importance should also differ, so treating the two sentences as equally important is clearly incorrect.
In this implementation, by fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses, the resulting graph adjacency matrix models the connection between two graph nodes as two directed edges instead of a single undirected edge, which overcomes the defect of traditional schemes that use only a single undirected edge.
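The fusion step can be illustrated with a short sketch. The matrices below are toy values chosen for this example, and the function name `fuse` is hypothetical; only the formula itself, with element-wise multiplication and α+β=1, comes from the text.

```python
def fuse(mat_t, mat_s, mat_o, alpha=0.5):
    """Element-wise (alpha * mat_t + beta * mat_s) * mat_o, with beta = 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha
    n = len(mat_t)
    return [[(alpha * mat_t[i][j] + beta * mat_s[i][j]) * mat_o[i][j]
             for j in range(n)] for i in range(n)]

# Symmetric toy inputs and an asymmetric position matrix (illustrative values).
mat_t = [[0.0, 0.4], [0.4, 0.0]]
mat_s = [[0.0, 0.8], [0.8, 0.0]]
mat_o = [[0.0, 1.5], [2.0, 0.0]]
mat_adjc = fuse(mat_t, mat_s, mat_o, alpha=0.5)
```

Although `mat_t` and `mat_s` are symmetric, the two off-diagonal entries of `mat_adjc` differ (about 0.9 and 1.2), which is the pair of directed edge weights described above.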
S16. Input the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause.
In this embodiment, after the graph adjacency matrix is input into the TextRank algorithm, the TextRank value of each node is calculated iteratively and used as the importance of the corresponding clause; the details are not repeated here.
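Since the iteration itself is not spelled out in the text, the following sketch uses the standard weighted TextRank update on a directed graph. The damping factor 0.85, the convergence tolerance, and the convention that `adj[j][i]` is the weight of the edge from node j to node i are assumptions of this sketch, not values taken from the application.

```python
def textrank(adj, damping=0.85, max_iter=100, tol=1e-6):
    """Iterate weighted TextRank scores; adj[j][i] is the weight of edge j -> i."""
    n = len(adj)
    out_sum = [sum(row) for row in adj]  # total outgoing weight of each node
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each node j passes on its score in proportion to the edge weight j -> i.
            incoming = sum(adj[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0.0)
            new_scores.append((1.0 - damping) + damping * incoming)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores

# Toy asymmetric adjacency matrix for three clauses (illustrative values).
adj = [[0.0, 1.5, 1.0],
       [2.0, 0.0, 1.0],
       [2.0, 1.0, 0.0]]
importance = textrank(adj)
```

In this toy graph, clause 0 receives the heaviest incoming edges and therefore ends up with the highest score.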
S17. Screen the clauses according to the importance of each clause to obtain candidate clauses.
In at least one embodiment of the present application, screening according to the importance of each clause to obtain candidate clauses includes:
obtaining a preset threshold;
obtaining the clauses whose importance is greater than or equal to the preset threshold as the candidate clauses.
The preset threshold can be custom-configured, for example, as 95%.
In at least one embodiment of the present application, screening according to the importance of each clause to obtain candidate clauses further includes:
sorting the clauses by importance in descending order;
obtaining a preset position;
determining the clauses ranked before the preset position as the candidate clauses.
The preset position can be custom-configured, for example, as the 20th position.
The preset position is equivalent to a hyperparameter and can be obtained through experiments or tuning. For example, based on a regression test set, a hyperparameter search is performed on the preset position using the ROUGE value of the summary as the metric, and the value that yields the best ROUGE value is selected as the preset position.
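Both screening variants can be sketched in a few lines. The function names are hypothetical, the threshold is treated here as an absolute score (the text does not say whether 95% is a score or a quantile), and returning the selected indices in document order is a design choice of this sketch.

```python
def select_by_threshold(importance, threshold):
    """Indices of clauses whose importance is >= threshold, in document order."""
    return [i for i, score in enumerate(importance) if score >= threshold]

def select_top_k(importance, k):
    """Indices of the k highest-scoring clauses, returned in document order."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.9, 1.4, 0.6, 1.1]          # toy importance values
by_threshold = select_by_threshold(scores, 1.0)
top_two = select_top_k(scores, 2)
```

Keeping document order makes the later post-processing step simpler, because "the next adjacent clause" can then be found by index.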
S18. Post-process the candidate clauses to obtain summary sentences.
It should be noted that the candidate clauses constitute a preliminary summary, but they may include sentence patterns such as questions, results, progression, transition, and leading clauses. Such sentences should not appear independently of their context, so if their context has not been selected as summary sentences, further correction is required.
Specifically, post-processing the candidate clauses to obtain summary sentences includes:
identifying the type of each clause among the candidate clauses;
when the type of a target clause among the candidate clauses is an interrogative sentence, obtaining the next clause adjacent to the target clause and adding the obtained clause to the summary sentences;
when one constituent word of a specified correlative phrase is found in the candidate clauses, obtaining the clause containing the word paired with that constituent word and adding the obtained clause to the summary sentences.
The types of clauses may include, but are not limited to, interrogative sentences and sentences formed with correlative phrases.
For example, the type of each candidate clause may be determined from keywords or symbols obtained through text recognition. For instance, when "?" is recognized, the clause is determined to be an interrogative sentence.
For example, if a summary sentence is a question, the next adjacent sentence should usually also be treated as part of the summary; if a summary sentence is one component of a pattern such as "although... but..." or "because... so...", the other component should usually also be treated as part of the summary.
Through the above implementation, post-processing is added on top of traditional text summary generation to correct the summary result obtained by the graph algorithm, which improves the quality of the final summary.
It should be noted that, to further ensure data security and prevent the data from being maliciously tampered with, the summary sentences may be stored on a blockchain node.
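The two post-processing rules can be sketched as follows. The connective pairs are English stand-ins for the patterns named in the text, the helper names are hypothetical, and looking only at the immediately adjacent clause for the other half of a correlative pair is a simplifying assumption of this sketch.

```python
import re

# English stand-ins for "although ... but ..." and "because ... so ...".
CONNECTIVE_PAIRS = [("although", "but"), ("because", "so")]

def words(text):
    """Lower-cased set of alphabetic tokens in a clause."""
    return set(re.findall(r"[a-z]+", text.lower()))

def post_process(clauses, selected):
    """Extend the selected clause indices with the context they depend on."""
    result = set(selected)
    for i in selected:
        # Rule 1: a selected question pulls in the clause that follows it.
        if clauses[i].rstrip().endswith(("?", "?")) and i + 1 < len(clauses):
            result.add(i + 1)
        # Rule 2: one half of a correlative pair pulls in the adjacent other half.
        here = words(clauses[i])
        for first, second in CONNECTIVE_PAIRS:
            if first in here and second not in here and i + 1 < len(clauses):
                if second in words(clauses[i + 1]):
                    result.add(i + 1)
            if second in here and first not in here and i > 0:
                if first in words(clauses[i - 1]):
                    result.add(i - 1)
    return sorted(result)

clauses = [
    "Revenue grew last year.",
    "Why did margins fall?",
    "Because raw material costs rose,",
    "so the company raised prices.",
    "The outlook is stable.",
]
summary_ids = post_process(clauses, [1, 3])
```

Here the question at index 1 and the "so" clause at index 3 both pull in the "because" clause at index 2. Clauses added by a rule are not themselves re-examined in this sketch.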
It can be seen from the above technical solutions that the present application can, in response to a text summary generation instruction, obtain data to be processed according to the instruction and segment the data to be processed to obtain a plurality of clauses. Segmentation is performed according to a specific dictionary associated with the specific task scenario, so that business-related terms are segmented more accurately. The mutual recommendation degree between every two clauses is calculated; only the four parts of speech closely related to sentence semantics (nouns, verbs, adjectives, and adverbs) are retained, and business words of differing importance are given different weights when the common-word score is calculated. Combined with regularization, this effectively overcomes the defects of the traditional TextRank algorithm, which neither distinguishes the importance of different terms nor filters out unimportant words by part of speech when calculating plain-text similarity, and it increases the likelihood that sentences with strong business relevance are selected for the summary. The semantic similarity between every two clauses is calculated, which avoids the defect of traditional algorithms that consider only the plain-text similarity between two sentences and ignore semantic similarity. The positional similarity between every two clauses is calculated, so that the adjacency relationships between sentences and their positions in the original article are fully considered during modeling, which effectively overcomes the problem of inaccurate text summary generation caused in traditional approaches by ignoring the importance of sentence position and order. The mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses are fused to obtain a graph adjacency matrix in which the connection between two graph nodes is modeled as two directed edges instead of a single undirected edge, which overcomes the defect of traditional schemes that use only a single undirected edge. The graph adjacency matrix is input into the TextRank algorithm to calculate the importance of each clause, the clauses are screened according to their importance to obtain candidate clauses, and the candidate clauses are post-processed to obtain summary sentences. Adding post-processing on top of traditional text summary generation corrects the summary result obtained by the graph algorithm and improves the quality of the final summary, thereby achieving more accurate text summary generation based on artificial intelligence.
An embodiment of the present application further provides a text summary generation apparatus configured to execute any embodiment of the foregoing text summary generation method. Specifically, refer to FIG. 2, which is a schematic block diagram of the text summary generation apparatus provided by an embodiment of the present application.
As shown in FIG. 2, the text summary generation apparatus 100 includes an acquisition unit 101, a segmentation unit 102, a calculation unit 103, a fusion unit 104, a screening unit 105, and a post-processing unit 106.
In response to a text summary generation instruction, the acquisition unit 101 obtains data to be processed according to the text summary generation instruction.
In this embodiment, the text summary generation instruction may be triggered by relevant staff, such as media monitors or online educators.
In at least one embodiment of the present application, the acquisition unit 101 obtaining the data to be processed according to the text summary generation instruction includes:
detecting the information uploaded together with the text summary generation instruction when the instruction is triggered;
obtaining an address from the information as a target address;
linking to the target address and obtaining the data stored at the target address as the data to be processed.
The target address may include, but is not limited to, a web page address, a folder address, a database address, and the like.
Of course, in other embodiments, when the uploaded information already includes the data to be processed, the data to be processed is extracted directly. For example, if the user uploads the data to be processed while triggering the text summary generation instruction, the data to be processed can be obtained directly from the text summary generation instruction.
The segmentation unit 102 obtains a dictionary according to the task scenario and segments the data to be processed to obtain a plurality of clauses.
In this embodiment, the segmentation unit 102 obtaining a dictionary according to the task scenario and segmenting the data to be processed to obtain a plurality of clauses includes:
identifying the current task scenario;
retrieving a dictionary matching the current task scenario as a target dictionary;
segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
For example, when the current task scenario is a financial scenario, a financial dictionary matching the financial scenario is obtained as the target dictionary, and the data to be processed is segmented into sentences and words using the financial dictionary to obtain finance-related terms: sents = [s_1, s_2, …, s_i], where s_i = [w_1/t_1, w_2/t_2, …, w_n/t_n]. Here, sents is the list of clauses of the data to be processed, s_i is the i-th clause in sents, w_n is the n-th segmented word of the clause, t_n is the part of speech of the n-th segmented word, and i and n are positive integers.
In this embodiment, the data to be processed may be split at sentence punctuation marks, such as periods, question marks, and exclamation marks. In this embodiment, a word segmentation tool (such as a Chinese word segmentation tool) may be used to load the target dictionary, so that business-related terms are segmented well.
Through the above implementation, sentences can be segmented according to a specific dictionary associated with the specific task scenario, so that business-related terms are segmented more accurately.
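The segmentation step can be sketched with the standard library alone. The clause splitter below follows the punctuation-based description in the text. The word segmenter is a simple forward maximum matching routine swapped in for illustration, since the text does not name a specific segmentation tool; the dictionary entries are hypothetical, and part-of-speech tagging is omitted.

```python
import re

def split_clauses(text):
    """Split text into clauses after sentence-ending punctuation marks."""
    parts = re.split(r"(?<=[。?!.?!])\s*", text)
    return [p.strip() for p in parts if p.strip()]

def segment(clause, dictionary, max_len=6):
    """Forward maximum matching: take the longest dictionary entry at each position."""
    tokens, i = [], 0
    while i < len(clause):
        for size in range(min(max_len, len(clause) - i), 0, -1):
            piece = clause[i:i + size]
            # Fall back to a single character when no dictionary entry matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical domain dictionary containing a business term.
domain_dictionary = {"购买", "重疾险", "产品"}
sentences = split_clauses("Revenue grew. Why did margins fall? Costs rose!")
tokens = segment("购买重疾险产品", domain_dictionary)
```

With the domain term in the dictionary, "重疾险" stays together as one token instead of being broken into single characters. Note that this simple splitter would also split at a decimal point such as "3.5", so a production splitter would need extra rules.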
The calculation unit 103 calculates the mutual recommendation degree between every two clauses in the plurality of clauses.
In this embodiment, the calculation unit 103 calculating the mutual recommendation degree between every two clauses in the plurality of clauses includes:
configuring a word weight for each word in the plurality of clauses according to a received configuration requirement;
for the plurality of clauses, obtaining the words that appear in both clauses of every pair as target words;
determining the word weight and the part of speech of each target word;
calculating the text similarity between every two clauses according to the word weights and parts of speech of the target words to obtain a recommendation degree matrix;
performing L2 normalization on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
The configuration requirement may be uploaded by the user.
The text similarity is calculated using the following formula:
Figure PCTCN2022071791-appb-000003
where mat_t(S_i, S_j) denotes the mutual recommendation degree between any two clauses S_i and S_j, W_k denotes a word that appears in both S_i and S_j, TermWeight denotes the weight of a word, T_k denotes the part of speech of W_k, and valid_postags denotes the valid parts of speech.
The valid parts of speech are the four parts of speech closely related to sentence semantics: nouns, verbs, adjectives, and adverbs.
In addition, when the common-word (W_k) score of two sentences is calculated, business words of differing importance are given different weights. For example, the weight of a product name term may be 2 times that of a general term, and the weight of a disease name term or a competitor company name term may be 1.5 times that of a general term. The specific weight values can be obtained by a parameter search based on regression test results.
Further, L2 normalization is applied to mat_t(S_i, S_j), that is, every element of the matrix is divided by norm_val, where
$$\mathrm{norm\_val} = \sqrt{\sum_{i,j} \mathrm{mat}_t(S_i, S_j)^2}$$
that is, the quantity under the square root is the sum of the squares of all elements of the matrix mat_t(S_i, S_j).
The L2 norm is also commonly used as a regularization term, also called a penalty term, which is added after the loss function to constrain the parameters of a model and prevent overfitting. The L2 penalty corresponds to a Gaussian prior and is fully differentiable.
In the above implementation, only the four parts of speech closely related to sentence semantics (nouns, verbs, adjectives, and adverbs) are retained, and business words of differing importance are given different weights when the common-word (W_k) score is calculated. Combined with regularization, this effectively overcomes the defects of the traditional TextRank algorithm, which neither distinguishes the importance of different terms nor filters out unimportant words by part of speech when calculating plain-text similarity, and it increases the likelihood that sentences with strong business relevance are selected for the summary.
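The weighting, part-of-speech filtering, and normalization described above can be sketched as follows. The exact similarity formula is given only as a figure in the original, so the raw score used here (the sum of the weights of the shared words whose part of speech is valid) is an assumption, as are the tag names in `VALID_POSTAGS`, the function name, and the toy clauses.

```python
import math

# Assumed tag names for noun, verb, adjective, and adverb.
VALID_POSTAGS = {"n", "v", "a", "d"}

def recommendation_matrix(clauses, term_weight, default_weight=1.0):
    """clauses[i] is a list of (word, postag) pairs; returns the normalized mat_t."""
    n = len(clauses)
    # Keep only words whose part of speech is one of the valid tags.
    kept = [{w for w, t in clause if t in VALID_POSTAGS} for clause in clauses]
    mat_t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = kept[i] & kept[j]
                mat_t[i][j] = sum(term_weight.get(w, default_weight) for w in shared)
    # L2 normalization: divide every element by the root of the sum of squares.
    norm_val = math.sqrt(sum(x * x for row in mat_t for x in row))
    if norm_val == 0.0:
        return mat_t
    return [[x / norm_val for x in row] for row in mat_t]

toy_clauses = [
    [("product_a", "n"), ("sells", "v"), ("well", "d"), ("the", "x")],
    [("product_a", "n"), ("price", "n"), ("rises", "v"), ("the", "x")],
    [("price", "n"), ("falls", "v")],
]
# A product name weighted at 2 times a general term, as in the example in the text.
mat_t = recommendation_matrix(toy_clauses, {"product_a": 2.0})
```

The shared word "the" is ignored because its tag is not valid, and the shared product name contributes twice as much as the shared general word "price".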
The calculation unit 103 calculates the semantic similarity between every two clauses in the plurality of clauses.
In this embodiment, the calculation unit 103 calculating the semantic similarity between every two clauses in the plurality of clauses includes:
vectorizing each clause to obtain an embedding vector representation of each clause;
calculating the cosine similarity between every two clauses according to the embedding vector representation of each clause;
determining the cosine similarity between every two clauses as the semantic similarity between every two clauses.
Specifically, the semantic similarity between sentences is calculated using the following formula:
$$\mathrm{mat}_s(S_i, S_j) = \mathrm{cosine\_similarity}(s_{i\text{-embed}}, s_{j\text{-embed}})$$
where mat_s(S_i, S_j) denotes the semantic similarity between any two clauses S_i and S_j, s_i-embed denotes the embedding vector representation of clause S_i, s_j-embed denotes the embedding vector representation of clause S_j, and cosine_similarity denotes computing the cosine similarity.
The above implementation avoids the defect of traditional algorithms, which consider only the plain-text similarity between two sentences and ignore their semantic similarity.
The calculation unit 103 calculates the positional similarity between every two clauses in the plurality of clauses.
In this embodiment, the calculation unit 103 calculating the positional similarity between every two clauses in the plurality of clauses includes:
determining every two clauses as a group of clauses, wherein the two clauses in each group serve as the recommending sentence and the recommended sentence for each other;
when any clause is the recommended sentence, determining the position of the recommended sentence in its paragraph, and when the recommended sentence is ranked within the front preset positions or the rear preset positions of its paragraph, determining that the corresponding matrix cell value is a first value;
when any clause is the recommending sentence, determining the position of the recommending sentence in its paragraph, and when the recommending sentence is ranked within the front preset positions or the rear preset positions of its paragraph, determining that the corresponding matrix cell value is a second value;
when both the recommending sentence and the recommended sentence in any group are ranked within the front preset positions or the rear preset positions of their paragraphs, determining that the corresponding matrix cell value is a third value;
when neither the recommending sentence nor the recommended sentence in any group is ranked within the front preset positions or the rear preset positions of its paragraph, determining that the corresponding matrix cell value is a fourth value;
when any clause is the recommended sentence and has a specified attribute, determining that the corresponding matrix cell value is the first value;
performing matrix conversion according to the matrix cell values to obtain the positional similarity between every two clauses.
The first value, the second value, the third value, and the fourth value can be custom-configured. For example, in this embodiment, the first value may be configured as 2, the second value as 1.5, the third value as 2.5, and the fourth value as 1.
The front preset positions and the rear preset positions can also be custom-configured. For example, the front preset positions may be configured as the first 5%, and correspondingly, the rear preset positions as the last 5%.
The specified attribute may be a summarizing attribute, that is, a sentence with the specified attribute is a summarizing sentence.
Through the above implementation, the adjacency relationships between sentences and their positions in the original article are fully considered during modeling, which effectively overcomes the problem of inaccurate text summary generation caused in traditional approaches by ignoring the importance of sentence position and order in the article.
The fusion unit 104 fuses the mutual recommendation degree between every two clauses, the semantic similarity between every two clauses, and the positional similarity between every two clauses to obtain a graph adjacency matrix.
In at least one embodiment of the present application, the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses are fused using the following formula to obtain the graph adjacency matrix:
$$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_t + \beta\,\mathrm{mat}_s) \odot \mathrm{mat}_o$$
where mat_adjc denotes the graph adjacency matrix, mat_t denotes the mutual recommendation degree between every two clauses, mat_s denotes the semantic similarity between every two clauses, mat_o denotes the positional similarity between every two clauses, α denotes the weight of the mutual recommendation degree, β denotes the weight of the semantic similarity, α>0, β>0, and α+β=1.
In this embodiment, (α mat_t + β mat_s) is multiplied element-wise by mat_o, after which the symmetric matrix (α mat_t + β mat_s) is no longer symmetric. As a result, in mat_adjc the similarity between sentences is affected by the positions of the sentences in the text.
It should be noted that in traditional summary extraction schemes, the edge connecting two graph nodes is a single undirected edge with a single weight; seen from this single undirected edge, the sentences at its two end nodes carry equal weight. However, when any two sentences of an article are compared in isolation, their importance should also differ, so treating the two sentences as equally important is clearly incorrect.
In this implementation, by fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses, the resulting graph adjacency matrix models the connection between two graph nodes as two directed edges instead of a single undirected edge, which overcomes the defect of traditional schemes that use only a single undirected edge.
The calculation unit 103 inputs the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause.
In this embodiment, after the graph adjacency matrix is input into the TextRank algorithm, the TextRank value of each node is calculated iteratively and used as the importance of the corresponding clause; the details are not repeated here.
The screening unit 105 screens the clauses according to the importance of each clause to obtain candidate clauses.
In at least one embodiment of the present application, the screening performed by the screening unit 105 according to the importance of each clause to obtain candidate clauses includes:
acquiring a preset threshold;
acquiring the clauses whose importance is greater than or equal to the preset threshold as the candidate clauses.
The preset threshold can be configured as required, for example 95%.
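A minimal sketch of this threshold screening follows. The text does not state how a threshold such as 95% relates to the scale of the importance values, so the sketch assumes the threshold is expressed in the same units as the scores; the function name `select_by_threshold` is hypothetical.

```python
def select_by_threshold(importance, threshold):
    """Return the indices of clauses whose importance is >= threshold."""
    return [i for i, score in enumerate(importance) if score >= threshold]


candidates = select_by_threshold([0.20, 0.95, 0.97, 0.50], threshold=0.95)
# candidates == [1, 2]
```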
In at least one embodiment of the present application, the screening performed by the screening unit 105 according to the importance of each clause to obtain candidate clauses further includes:
sorting the clauses by importance in descending order;
acquiring a preset position;
determining the clauses ranked before the preset position as the candidate clauses.
The preset position can be configured as required, for example the 20th position.
The preset position is effectively a hyperparameter and can be obtained through experiment or tuning. For example, on a regression test set, a hyperparameter search over the preset position is performed with the ROUGE score of the summary as the metric, and the value that yields the best ROUGE score is selected as the preset position.
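The ranking-based screening and the search for the preset position can be sketched as follows. The sketch treats the preset position as a top-k cut-off, which is one reading of "ranked before the preset position". The names `select_top_k` and `search_best_k` are hypothetical, and `score_fn` is a caller-supplied stand-in for a ROUGE evaluation on a regression test set, which is not implemented here.

```python
def select_top_k(importance, k):
    """Return the indices of the k most important clauses, in document order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


def search_best_k(importance, candidate_ks, score_fn):
    """Pick the k whose selection maximises score_fn (e.g. a ROUGE-based score)."""
    return max(candidate_ks, key=lambda k: score_fn(select_top_k(importance, k)))


top2 = select_top_k([0.1, 0.9, 0.5, 0.7], k=2)
# top2 == [1, 3]

# Toy scoring function (assumption): pretends a two-clause summary is ideal.
best_k = search_best_k([0.1, 0.9, 0.5, 0.7], [1, 2, 3], lambda sel: -abs(len(sel) - 2))
```

Returning the selected indices in document order keeps the candidate clauses readable as a summary and makes neighbour lookups in later steps straightforward.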
The post-processing unit 106 performs post-processing on the candidate clauses to obtain summary sentences.
It should be noted that the candidate clauses constitute a preliminary summary, but they may include questions as well as clauses expressing result, progression, concession, or introduction. Such clauses should not appear detached from their context, so if their context has not been selected as a summary sentence, further correction is required.
Specifically, the post-processing performed by the post-processing unit 106 on the candidate clauses to obtain summary sentences includes:
identifying the type of each of the candidate clauses;
when a target clause among the candidate clauses is an interrogative sentence, acquiring the next clause adjacent to the target clause and adding the acquired clause to the summary sentences;
when one constituent word of a specified correlative phrase is found in the candidate clauses, acquiring the clause containing the word associated with that constituent word and adding the acquired clause to the summary sentences.
The types of clauses may include, but are not limited to, interrogative sentences and sentences formed with correlative phrases.
For example, the type of each candidate clause may be determined from keywords or symbols obtained through text recognition; when a "?" is recognized, the clause is determined to be an interrogative sentence.
For example, if a summary sentence is a question, the adjacent following sentence should usually also be included in the summary; if a summary sentence is one component of a construction such as "although ... but ..." (虽然……但是……) or "because ... therefore ..." (因为……所以……), the other component should usually also be included in the summary.
Through the above embodiment, post-processing is added on top of conventional text summary generation to correct the summary produced by the graph algorithm, which improves the quality of the final output summary.
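The post-processing rules above can be sketched as follows. The function name `post_process`, the list of connective pairs, the substring matching, and the choice to pair a connective with the nearest clause containing its counterpart are assumptions for illustration; the text only gives the question mark and the two correlative constructions as examples.

```python
# Correlative pairs taken from the examples in the text:
# "although ... but ..." and "because ... therefore ...".
CONNECTIVE_PAIRS = [("虽然", "但是"), ("因为", "所以")]


def post_process(clauses, selected):
    """Extend the selected clause indices with context they depend on.

    clauses: all clauses in document order; selected: candidate clause indices.
    """
    result = set(selected)
    for i in selected:
        text = clauses[i]
        # Rule 1: a question pulls in the adjacent following clause.
        if text.rstrip().endswith(("?", "?")) and i + 1 < len(clauses):
            result.add(i + 1)
        # Rule 2: one half of a correlative construction pulls in the other half.
        for first, second in CONNECTIVE_PAIRS:
            partner = None
            if first in text and second not in text:
                partner = next(
                    (j for j in range(i + 1, len(clauses)) if second in clauses[j]), None
                )
            elif second in text and first not in text:
                partner = next(
                    (j for j in range(i - 1, -1, -1) if first in clauses[j]), None
                )
            if partner is not None:
                result.add(partner)
    return sorted(result)


clauses = ["今天天气如何?", "天气很好。", "虽然有风,", "但是不冷。", "我们出门了。"]
summary_idx = post_process(clauses, [0, 3])
# summary_idx == [0, 1, 2, 3]
```

Clause 0 is a question, so clause 1 is added; clause 3 contains "但是" without "虽然", so the nearest preceding clause containing "虽然" (clause 2) is added.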
It should be noted that, to further ensure data security and prevent malicious tampering, the summary sentences may be stored on a blockchain node.
As can be seen from the above technical solution, the present application responds to a text summary generation instruction, acquires the data to be processed according to that instruction, and segments the data to be processed into a plurality of clauses. Segmentation is performed with a dictionary associated with the specific task scenario so that business-related terms are segmented more accurately. The mutual recommendation degree between every two of the clauses is then calculated: only nouns, verbs, adjectives, and adverbs, the four parts of speech most closely tied to sentence semantics, are retained, and business words of differing importance are given different weights when the common-word score is calculated. Combined with regularization, this overcomes the defect that the conventional TextRank algorithm, when calculating plain-text similarity, neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, and it raises the likelihood that sentences with strong business relevance are selected for the summary. The semantic similarity between every two clauses is calculated, which avoids the defect that conventional algorithms consider only the plain-text similarity between two sentences and ignore semantic similarity. The positional similarity between every two clauses is calculated, so that the modeling takes into account the adjacency of sentences and their positions in the original article, which overcomes the inaccurate summary generation that results when the position and order of sentences in the article are ignored. The mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses are fused into a graph adjacency matrix that models the connection between graph nodes as two directed edges rather than a single undirected edge, which overcomes the single-undirected-edge defect of the conventional scheme. The graph adjacency matrix is input into the TextRank algorithm to calculate the importance of each clause, the clauses are screened by importance to obtain candidate clauses, and the candidate clauses are post-processed to obtain summary sentences. Adding post-processing on top of conventional text summary generation corrects the summary produced by the graph algorithm and improves the quality of the final output, thereby achieving more accurate text summary generation by artificial-intelligence means.
The above text summary generation apparatus may be implemented as a computer program that can run on the computer device shown in FIG. 3.
Please refer to FIG. 3, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server, which may be a standalone server or a server cluster composed of multiple servers.
Referring to FIG. 3, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the text summary generation method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform the text summary generation method.
The network interface 505 is used for network communication, such as the transmission of data. Those skilled in the art will understand that the structure shown in FIG. 3 is merely a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the text summary generation method disclosed in the embodiments of the present application.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 3 does not limit the specific composition of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are the same as in the embodiment shown in FIG. 3 and are not described again here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that, when executed by a processor, implements the text summary generation method disclosed in the embodiments of the present application.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each specific application, but such implementations should not be regarded as going beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.

Claims (20)

  1. A text summary generation method, comprising:
    in response to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;
    acquiring a dictionary according to a task scenario and segmenting the data to be processed to obtain a plurality of clauses;
    calculating the mutual recommendation degree between every two of the plurality of clauses;
    calculating the semantic similarity between every two of the plurality of clauses;
    calculating the positional similarity between every two of the plurality of clauses;
    fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
    inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
    screening the clauses according to the importance of each clause to obtain candidate clauses;
    post-processing the candidate clauses to obtain summary sentences.
  2. The text summary generation method according to claim 1, wherein acquiring a dictionary according to a task scenario and segmenting the data to be processed to obtain a plurality of clauses comprises:
    identifying the current task scenario;
    retrieving a dictionary matching the current task scenario as a target dictionary;
    segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
  3. The text summary generation method according to claim 1, wherein calculating the mutual recommendation degree between every two of the plurality of clauses comprises:
    configuring a word weight for each word in the plurality of clauses according to a received configuration requirement;
    for the plurality of clauses, acquiring the words that appear in both clauses of each pair as target words;
    determining the word weight and part of speech of each target word;
    calculating the text similarity between every two clauses according to the word weights and parts of speech of the target words to obtain a recommendation degree matrix;
    performing L2 regularization on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
  4. The text summary generation method according to claim 1, wherein calculating the semantic similarity between every two of the plurality of clauses comprises:
    vectorizing each clause to obtain an embedding vector representation of each clause;
    calculating the cosine similarity between every two clauses according to the embedding vector representation of each clause;
    determining the cosine similarity between every two clauses as the semantic similarity between every two clauses.
  5. The text summary generation method according to claim 1, wherein calculating the positional similarity between every two of the plurality of clauses comprises:
    determining every two clauses as a group of clauses, wherein the two clauses in each group serve as the recommending sentence and the recommended sentence with respect to each other;
    when any clause is the recommended sentence, determining the position of the recommended sentence in its paragraph, and when the recommended sentence is ranked within the first preset number of positions or the last preset number of positions of its paragraph, determining the corresponding matrix cell value to be a first value;
    when the clause is the recommending sentence, determining the position of the recommending sentence in its paragraph, and when the recommending sentence is ranked within the first preset number of positions or the last preset number of positions of its paragraph, determining the corresponding matrix cell value to be a second value;
    when both the recommending sentence and the recommended sentence of any group are ranked within the first preset number of positions or the last preset number of positions of their paragraphs, determining the corresponding matrix cell value to be a third value;
    when neither the recommending sentence nor the recommended sentence of the group is ranked within the first preset number of positions or the last preset number of positions of its paragraph, determining the corresponding matrix cell value to be a fourth value;
    when the clause is the recommended sentence and has a specified attribute, determining the corresponding matrix cell value to be the first value;
    performing matrix conversion according to the matrix cell values to obtain the positional similarity between every two clauses.
  6. The text summary generation method according to claim 1, wherein the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses are fused using the following formula to obtain the graph adjacency matrix:
    $$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_{t} + \beta\,\mathrm{mat}_{s}) \odot \mathrm{mat}_{o}$$
    where $\mathrm{mat}_{adjc}$ denotes the graph adjacency matrix, $\mathrm{mat}_{t}$ denotes the mutual recommendation degree between every two clauses, $\mathrm{mat}_{s}$ denotes the semantic similarity between every two clauses, $\mathrm{mat}_{o}$ denotes the positional similarity between every two clauses, $\alpha$ denotes the weight of the mutual recommendation degree, $\beta$ denotes the weight of the semantic similarity, $\alpha > 0$, $\beta > 0$, and $\alpha + \beta = 1$.
  7. The text summary generation method according to claim 1, wherein post-processing the candidate clauses to obtain summary sentences comprises:
    identifying the type of each of the candidate clauses;
    when a target clause among the candidate clauses is an interrogative sentence, acquiring the next clause adjacent to the target clause and adding the acquired clause to the summary sentences;
    when one constituent word of a specified correlative phrase is found in the candidate clauses, acquiring the clause containing the word associated with that constituent word and adding the acquired clause to the summary sentences.
  8. A text summary generation apparatus, comprising:
    an acquisition unit, configured to respond to a text summary generation instruction and acquire data to be processed according to the text summary generation instruction;
    a segmentation unit, configured to acquire a dictionary according to a task scenario and segment the data to be processed to obtain a plurality of clauses;
    a calculation unit, configured to calculate the mutual recommendation degree between every two of the plurality of clauses;
    the calculation unit being further configured to calculate the semantic similarity between every two of the plurality of clauses;
    the calculation unit being further configured to calculate the positional similarity between every two of the plurality of clauses;
    a fusion unit, configured to fuse the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
    the calculation unit being further configured to input the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
    a screening unit, configured to screen the clauses according to the importance of each clause to obtain candidate clauses;
    a post-processing unit, configured to post-process the candidate clauses to obtain summary sentences.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    in response to a text summary generation instruction, acquiring data to be processed according to the text summary generation instruction;
    acquiring a dictionary according to a task scenario and segmenting the data to be processed to obtain a plurality of clauses;
    calculating the mutual recommendation degree between every two of the plurality of clauses;
    calculating the semantic similarity between every two of the plurality of clauses;
    calculating the positional similarity between every two of the plurality of clauses;
    fusing the mutual recommendation degree, the semantic similarity, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
    inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
    screening the clauses according to the importance of each clause to obtain candidate clauses;
    post-processing the candidate clauses to obtain summary sentences.
  10. The computer device according to claim 9, wherein acquiring a dictionary according to a task scenario and segmenting the data to be processed to obtain a plurality of clauses comprises:
    identifying the current task scenario;
    retrieving a dictionary matching the current task scenario as a target dictionary;
    segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
  11. The computer device according to claim 9, wherein calculating the mutual recommendation degree between every two of the plurality of clauses comprises:
    configuring a word weight for each word in the plurality of clauses according to a received configuration requirement;
    for the plurality of clauses, acquiring the words that appear in both clauses of each pair as target words;
    determining the word weight and part of speech of each target word;
    calculating the text similarity between every two clauses according to the word weights and parts of speech of the target words to obtain a recommendation degree matrix;
    performing L2 regularization on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
  12. The computer device according to claim 9, wherein calculating the semantic similarity between every two of the plurality of clauses comprises:
    vectorizing each clause to obtain an embedding vector representation of each clause;
    calculating the cosine similarity between every two clauses according to the embedding vector representation of each clause;
    determining the cosine similarity between every two clauses as the semantic similarity between every two clauses.
  13. The computer device according to claim 9, wherein calculating the positional similarity between every two clauses in the plurality of clauses comprises:
    treating every two clauses as a group of clauses, wherein the two clauses in each group serve as the recommending sentence and the recommended sentence for each other;
    when any clause is the recommended sentence, determining the position of the recommended sentence in its paragraph, and when the recommended sentence ranks within the first preset positions or the last preset positions of its paragraph, determining that the corresponding matrix cell value is a first value;
    when the clause is the recommending sentence, determining the position of the recommending sentence in its paragraph, and when the recommending sentence ranks within the first preset positions or the last preset positions of its paragraph, determining that the corresponding matrix cell value is a second value;
    when both the recommending sentence and the recommended sentence in any group of clauses rank within the first preset positions or the last preset positions of their paragraphs, determining that the corresponding matrix cell value is a third value;
    when neither the recommending sentence nor the recommended sentence in the group of clauses ranks within the first preset positions or the last preset positions of its paragraph, determining that the corresponding matrix cell value is a fourth value;
    when the clause is the recommended sentence and has a specified attribute, determining that the corresponding matrix cell value is the first value;
    performing matrix conversion according to the matrix cell values to obtain the positional similarity between every two clauses.
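The cell-value rules of this claim can be illustrated with the Python sketch below. Several points are assumptions rather than anything the claim states: the four numeric values, the preset count `k`, the order in which overlapping rules are checked (both at an edge first, then the recommended sentence, then the recommending sentence), and the choice of rows for recommending sentences and columns for recommended sentences. The final "matrix conversion" step is not specified in the claim and is left out.

```python
def is_edge(index_in_paragraph, paragraph_length, k=1):
    """True if the clause is among the first k or last k clauses of its paragraph."""
    return index_in_paragraph < k or index_in_paragraph >= paragraph_length - k

def position_matrix(positions, flagged=(), k=1, values=(1.5, 1.2, 2.0, 1.0)):
    """positions: one (index_in_paragraph, paragraph_length) pair per clause.
    flagged: indices of clauses that have the 'specified attribute'.
    values: hypothetical (first, second, third, fourth) cell values.
    Row i is the recommending sentence, column j the recommended sentence.
    """
    first, second, third, fourth = values
    n = len(positions)
    edge = [is_edge(p, length, k) for p, length in positions]
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a clause does not recommend itself
            if edge[i] and edge[j]:
                mat[i][j] = third
            elif edge[j] or j in flagged:
                mat[i][j] = first
            elif edge[i]:
                mat[i][j] = second
            else:
                mat[i][j] = fourth
    return mat
```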
  14. The computer device according to claim 9, wherein the mutual recommendation degree between every two clauses, the semantic similarity between every two clauses, and the positional similarity between every two clauses are fused using the following formula to obtain the graph adjacency matrix:
    $$\mathrm{mat}_{adjc} = (\alpha\,\mathrm{mat}_{t} + \beta\,\mathrm{mat}_{s}) \odot \mathrm{mat}_{o}$$
    where $\mathrm{mat}_{adjc}$ denotes the graph adjacency matrix, $\mathrm{mat}_{t}$ denotes the mutual recommendation degree between every two clauses, $\mathrm{mat}_{s}$ denotes the semantic similarity between every two clauses, $\mathrm{mat}_{o}$ denotes the positional similarity between every two clauses, $\alpha$ denotes the weight of the mutual recommendation degree, $\beta$ denotes the weight of the semantic similarity, $\alpha>0$, $\beta>0$, and $\alpha+\beta=1$.
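The fusion formula is a weighted sum followed by an element-wise (Hadamard) product. A short Python sketch, with the function name `fuse` and the default `alpha=0.5` chosen only for illustration:

```python
def fuse(mat_t, mat_s, mat_o, alpha=0.5):
    """Compute (alpha * mat_t + beta * mat_s) element-wise times mat_o,
    with beta = 1 - alpha, so that alpha > 0, beta > 0 and alpha + beta = 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1 - alpha
    return [
        [(alpha * t + beta * s) * o for t, s, o in zip(row_t, row_s, row_o)]
        for row_t, row_s, row_o in zip(mat_t, mat_s, mat_o)
    ]
```

Because the positional matrix multiplies the weighted sum rather than being added to it, a cell value above 1 amplifies an edge and a value below 1 dampens it.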
  15. The computer device according to claim 9, wherein post-processing the candidate clauses to obtain the summary sentences comprises:
    identifying the type of each clause among the candidate clauses;
    when a target clause among the candidate clauses is an interrogative sentence, obtaining the next clause adjacent to the target clause and adding the obtained clause to the summary sentences;
    when one constituent word of a specified correlative phrase is found in the candidate clauses, obtaining the clause containing the word paired with that constituent word and adding the obtained clause to the summary sentences.
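The two post-processing rules can be sketched in Python as below. The sketch makes simplifying assumptions that are not in the claim: a clause counts as interrogative if it ends with a question mark, clauses are tokenized by whitespace, the correlative pair `("although", "but")` is a placeholder, and the partner clause is taken to be the nearest clause containing the paired word. The name `post_process` is hypothetical.

```python
def post_process(clauses, selected, paired_words=(("although", "but"),)):
    """clauses: all clauses in document order; selected: indices of candidates.
    Returns the summary clauses in document order."""
    keep = set(selected)
    for i in selected:
        text = clauses[i]
        # Rule 1: a question pulls in the clause that follows it.
        if text.rstrip().endswith(("?", "？")) and i + 1 < len(clauses):
            keep.add(i + 1)
        # Rule 2: one half of a correlative pair pulls in the clause
        # holding the other half (nearest match, by assumption).
        tokens = text.lower().split()
        for a, b in paired_words:
            for word, partner in ((a, b), (b, a)):
                if word not in tokens:
                    continue
                matches = [
                    j for j, other in enumerate(clauses)
                    if j != i and partner in other.lower().split()
                ]
                if matches:
                    keep.add(min(matches, key=lambda j: abs(j - i)))
    return [clauses[i] for i in sorted(keep)]
```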
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following steps:
    in response to a text summary generation instruction, obtaining data to be processed according to the text summary generation instruction;
    obtaining a dictionary according to a task scenario and segmenting the data to be processed to obtain a plurality of clauses;
    calculating the mutual recommendation degree between every two clauses in the plurality of clauses;
    calculating the semantic similarity between every two clauses in the plurality of clauses;
    calculating the positional similarity between every two clauses in the plurality of clauses;
    fusing the mutual recommendation degree between every two clauses, the semantic similarity between every two clauses, and the positional similarity between every two clauses to obtain a graph adjacency matrix;
    inputting the graph adjacency matrix into the TextRank algorithm to calculate the importance of each clause;
    filtering according to the importance of each clause to obtain candidate clauses;
    post-processing the candidate clauses to obtain summary sentences.
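The TextRank and filtering steps can be illustrated with a plain power iteration over the weighted adjacency matrix. This is a generic weighted PageRank sketch, not necessarily the exact variant used in the application; the damping factor 0.85, the iteration limit, the tolerance, and the top-k selection rule are all assumptions.

```python
def textrank(adj, damping=0.85, iterations=100, tol=1e-8):
    """adj[i][j] is the edge weight from clause i to clause j.
    Returns one importance score per clause."""
    n = len(adj)
    if n == 0:
        return []
    out_sum = [sum(row) for row in adj]  # total outgoing weight per clause
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(
                adj[i][j] / out_sum[i] * scores[i]
                for i in range(n) if out_sum[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break  # scores have stopped changing
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring clauses, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Returning the selected indices in document order keeps the candidate clauses readable as a summary before post-processing is applied.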
  17. The computer-readable storage medium according to claim 16, wherein obtaining a dictionary according to the task scenario and segmenting the data to be processed to obtain a plurality of clauses comprises:
    identifying the current task scenario;
    retrieving a dictionary matching the current task scenario as a target dictionary;
    segmenting the data to be processed according to the target dictionary to obtain the plurality of clauses.
  18. The computer-readable storage medium according to claim 16, wherein calculating the mutual recommendation degree between every two clauses in the plurality of clauses comprises:
    configuring a word weight for each word in the plurality of clauses according to a received configuration requirement;
    for the plurality of clauses, obtaining the words that appear in both clauses of every pair of clauses as target words;
    determining the word weight and part of speech of each target word;
    calculating the text similarity between every two clauses according to the word weights and parts of speech of the target words, to obtain a recommendation degree matrix;
    performing L2 regularization on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
  19. The computer-readable storage medium according to claim 16, wherein calculating the semantic similarity between every two clauses in the plurality of clauses comprises:
    vectorizing each clause to obtain an embedding vector representation of each clause;
    calculating the cosine similarity between every two clauses according to the embedding vector representation of each clause;
    determining the cosine similarity between every two clauses as the semantic similarity between those two clauses.
  20. The computer-readable storage medium according to claim 16, wherein calculating the positional similarity between every two clauses in the plurality of clauses comprises:
    treating every two clauses as a group of clauses, wherein the two clauses in each group serve as the recommending sentence and the recommended sentence for each other;
    when any clause is the recommended sentence, determining the position of the recommended sentence in its paragraph, and when the recommended sentence ranks within the first preset positions or the last preset positions of its paragraph, determining that the corresponding matrix cell value is a first value;
    when the clause is the recommending sentence, determining the position of the recommending sentence in its paragraph, and when the recommending sentence ranks within the first preset positions or the last preset positions of its paragraph, determining that the corresponding matrix cell value is a second value;
    when both the recommending sentence and the recommended sentence in any group of clauses rank within the first preset positions or the last preset positions of their paragraphs, determining that the corresponding matrix cell value is a third value;
    when neither the recommending sentence nor the recommended sentence in the group of clauses ranks within the first preset positions or the last preset positions of its paragraph, determining that the corresponding matrix cell value is a fourth value;
    when the clause is the recommended sentence and has a specified attribute, determining that the corresponding matrix cell value is the first value;
    performing matrix conversion according to the matrix cell values to obtain the positional similarity between every two clauses.
PCT/CN2022/071791 2021-06-18 2022-01-13 Text abstract generation method and apparatus, and computer device and storage medium WO2022262266A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110679639.7 2021-06-18
CN202110679639.7A CN113254593B (en) 2021-06-18 2021-06-18 Text abstract generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022262266A1 true WO2022262266A1 (en) 2022-12-22

Family

ID=77188647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071791 WO2022262266A1 (en) 2021-06-18 2022-01-13 Text abstract generation method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN113254593B (en)
WO (1) WO2022262266A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628186A (en) * 2023-07-17 2023-08-22 乐麦信息技术(杭州)有限公司 Text abstract generation method and system

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254593B (en) * 2021-06-18 2021-10-19 平安科技(深圳)有限公司 Text abstract generation method and device, computer equipment and storage medium
CN113590811A (en) * 2021-08-19 2021-11-02 平安国际智慧城市科技股份有限公司 Text abstract generation method and device, electronic equipment and storage medium
CN113779978A (en) * 2021-09-26 2021-12-10 上海一者信息科技有限公司 Method for realizing unsupervised cross-language sentence alignment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016125949A1 (en) * 2015-02-02 2016-08-11 숭실대학교 산학협력단 Automatic document summarizing method and server
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN110781291A (en) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method, device, server and readable storage medium
CN111858912A (en) * 2020-07-03 2020-10-30 黑龙江阳光惠远知识产权运营有限公司 Abstract generation method based on single long text
CN112347241A (en) * 2020-11-10 2021-02-09 华夏幸福产业投资有限公司 Abstract extraction method, device, equipment and storage medium
CN113254593A (en) * 2021-06-18 2021-08-13 平安科技(深圳)有限公司 Text abstract generation method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955772B (en) * 2011-08-17 2015-11-25 北京百度网讯科技有限公司 A kind of similarity calculating method based on semanteme and device
CN112347240A (en) * 2020-10-16 2021-02-09 小牛思拓(北京)科技有限公司 Text abstract extraction method and device, readable storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI NANA, LIU PEIYU; LIU WENFENG; LIU WEITONG: "Automatic digest optimization algorithm based on TextRank", APPLICATION RESEARCH OF COMPUTERS, CHENGDU, CN, vol. 36, no. 4, 30 April 2019 (2019-04-30), CN , pages 1045 - 1050, XP093015759, ISSN: 1001-3695, DOI: 10.19734/j.issn.1001-3695.2017.11.0786 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628186A (en) * 2023-07-17 2023-08-22 乐麦信息技术(杭州)有限公司 Text abstract generation method and system
CN116628186B (en) * 2023-07-17 2023-10-24 乐麦信息技术(杭州)有限公司 Text abstract generation method and system

Also Published As

Publication number Publication date
CN113254593B (en) 2021-10-19
CN113254593A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
WO2022262266A1 (en) Text abstract generation method and apparatus, and computer device and storage medium
CN110993081B (en) Doctor online recommendation method and system
CN107193959B (en) Pure text-oriented enterprise entity classification method
CN105824922B (en) A kind of sensibility classification method merging further feature and shallow-layer feature
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
RU2686000C1 (en) Retrieval of information objects using a combination of classifiers analyzing local and non-local signs
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
WO2022126810A1 (en) Text clustering method
RU2679988C1 (en) Extracting information objects with the help of a classifier combination
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
US11593557B2 (en) Domain-specific grammar correction system, server and method for academic text
WO2022042297A1 (en) Text clustering method, apparatus, electronic device, and storage medium
JP7281905B2 (en) Document evaluation device, document evaluation method and program
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN109829151B (en) Text segmentation method based on hierarchical dirichlet model
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN112966508B (en) Universal automatic term extraction method
CN109325122A (en) Vocabulary generation method, file classification method, device, equipment and storage medium
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
US11650996B1 (en) Determining query intent and complexity using machine learning
CN115248890B (en) User interest portrait generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22823762

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE