WO2023035330A1 - Long text event extraction method and apparatus, and computer device and storage medium - Google Patents


Publication number
WO2023035330A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/120030
Other languages
French (fr)
Chinese (zh)
Inventor
谢翀
罗伟杰
陈永红
黄开梅
Original Assignee
深圳前海环融联易信息科技服务有限公司
Application filed by 深圳前海环融联易信息科技服务有限公司
Publication of WO2023035330A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

Disclosed in the present application are a long text event extraction method and apparatus, and a computer device and a storage medium. The method comprises: acquiring a trigger word in long text of an event to be extracted, and performing text truncation on the long text according to the trigger word, so as to obtain truncated text; using a deep learning model to classify and predict a plurality of event types corresponding to the truncated text; in combination with machine reading comprehension technology and a pointer network model, extracting corresponding event role information for each event type; and on the basis of a sequence generation algorithm, combining all event role information into a target event, and outputting the target event as an event extraction result. In the present application, by means of performing event classification, event role extraction and event combination on long text, the event extraction efficiency and extraction accuracy of the long text are improved.

Description

Long text event extraction method and apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese patent application No. 202111065602.1 filed on September 13, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a long text event extraction method and apparatus, a computer device, and a storage medium.
Background
At present, major news media, official accounts, bloggers, and the like generate a large amount of information every day, including but not limited to news reports, commentary and predictions, and analysis and interpretation. These texts are often long, with complex content and divergent viewpoints, and service companies frequently need to monitor such text information to obtain industry dynamics and event information in a timely manner. Traditional event extraction methods mainly rely on rules formulated by domain experts together with extensive manual screening and verification; this approach involves a heavy workload and yields low efficiency and accuracy. The present application is therefore based on deep learning technology, which enables fully automated event extraction, greatly improves efficiency, and surpasses manual verification in accuracy.
Existing event extraction methods for long text generally adopt a relatively simple definition of an event. For example, some financial public opinion analysis platforms mainly extract the principal event roles from financial texts, present them as keywords and the like, and additionally evaluate the sentiment of the entire text. Such platforms mainly apply simple event classification and NER (Named Entity Recognition) to extract events from long text. Event classification assigns classification labels to the original text, and a single text may carry multiple labels; named entity recognition identifies and extracts keyword information that may be present in the original text, such as companies and times.
A second, broadly similar approach is relation extraction for shorter texts, mainly article titles, abstracts, summaries, and the like, focusing on the subject, the object, and the relationship between them. This type of method mainly applies relation extraction technology and broadly has two implementations: the first uses named entity recognition to identify the subjects in the text and then jointly extracts the objects and their relationships with other models; the second uses named entity recognition to extract subjects and objects simultaneously and, when multiple subjects or objects exist, pairs and groups them with a binary classification model.
Regarding the first existing method mentioned above, the extracted event information is limited. For example, in a long text of the "company listing" type, existing methods mainly attend to the specific listed company and the time, while other important information such as "financing scale", "listed market value", and "financing rounds" is neither extracted nor displayed. In addition, existing methods only alert users at the level of sentiment classification, with no prompts concerning importance, timeliness, or authority.
Regarding the second method mentioned above, extracting only the subject, object, and their relationship is likewise rather simplistic. Moreover, the method's applicability is narrow: because only simple information is extracted, it is generally used only for information extraction from short texts, which greatly restricts its practical scope. Furthermore, relation extraction requires both a subject and an object to be present, whereas real-world text often lacks one of them. For "Company A goes public", for instance, there is only the subject "Company A" and no corresponding object, so the method cannot be applied. The second relation extraction method therefore has significant limitations.
Summary of the Application
Embodiments of the present application provide a long text event extraction method and apparatus, a computer device, and a storage medium, aiming to improve the efficiency and accuracy of event extraction for long text.
In a first aspect, an embodiment of the present application provides a long text event extraction method, including:
acquiring a trigger word in the long text of an event to be extracted, and performing text truncation on the long text according to the trigger word to obtain truncated text;
using a deep learning model to classify and predict a plurality of event types corresponding to the truncated text;
combining machine reading comprehension technology and a pointer network model to extract corresponding event role information for each of the event types; and
based on a sequence generation algorithm, combining all the event role information into one target event, and outputting the target event as an event extraction result.
In a second aspect, an embodiment of the present application provides a long text event extraction apparatus, including:
a first truncation unit, configured to acquire a trigger word in the long text of an event to be extracted, and perform text truncation on the long text according to the trigger word to obtain truncated text;
a first classification prediction unit, configured to use a deep learning model to classify and predict a plurality of event types corresponding to the truncated text;
a first extraction unit, configured to combine machine reading comprehension technology and a pointer network model to extract corresponding event role information for each of the event types; and
a result output unit, configured to combine all the event role information into one target event based on a sequence generation algorithm, and output the target event as an event extraction result.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the long text event extraction method described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the long text event extraction method described in the first aspect.
Embodiments of the present application provide a long text event extraction method and apparatus, a computer device, and a storage medium. The method includes: acquiring a trigger word in the long text of an event to be extracted, and performing text truncation on the long text according to the trigger word to obtain truncated text; using a deep learning model to classify and predict a plurality of event types corresponding to the truncated text; combining machine reading comprehension technology and a pointer network model to extract corresponding event role information for each event type; and, based on a sequence generation algorithm, combining all the event role information into one target event and outputting the target event as the event extraction result. By performing event classification, event role extraction, and event combination on long text, the embodiments of the present application improve the efficiency and accuracy of event extraction for long text.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a long text event extraction method provided by an embodiment of the present application;
Fig. 2 is a schematic sub-flowchart of a long text event extraction method provided by an embodiment of the present application;
Fig. 3 is a schematic sub-flowchart of a long text event extraction method provided by an embodiment of the present application;
Fig. 4 is a schematic block diagram of a long text event extraction apparatus provided by an embodiment of the present application;
Fig. 5 is a schematic sub-block diagram of a long text event extraction apparatus provided by an embodiment of the present application;
Fig. 6 is a schematic sub-block diagram of a long text event extraction apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Evidently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a long text event extraction method provided by an embodiment of the present application, which specifically includes steps S101 to S104.
S101. Acquire a trigger word in the long text of an event to be extracted, and perform text truncation on the long text according to the trigger word to obtain truncated text.
S102. Use a deep learning model to classify and predict a plurality of event types corresponding to the truncated text.
S103. Combine machine reading comprehension technology and a pointer network model to extract corresponding event role information for each event type.
S104. Based on a sequence generation algorithm, combine all the event role information into one target event, and output the target event as the event extraction result.
In this embodiment, the event extraction process is divided into three stages: event classification, event role extraction, and event combination. In the event classification stage, trigger words are first used to truncate the long text, and a deep learning model then classifies and predicts the truncated text. In the event role extraction stage, since the truncated texts and the event classification information of all truncated texts were obtained in the event classification stage, the event role information belonging to each event type must be extracted; this is done with an MRC (Machine Reading Comprehension) plus pointer network strategy. In the event combination stage, the model extraction of the first two stages has produced, for each truncated text, all the event roles under a given event type, so this stage combines all the event role information into one complete event (i.e., the target event) by generating a sequence, and outputs it externally.
By performing event classification, event role extraction, and event combination on long text, this embodiment improves the efficiency and accuracy of event extraction for long text. The long text in this embodiment may be academic papers, news reports, magazines and periodicals, and so on. For news reports, for example, the event extraction is more detailed, supports finer-grained queries, and reduces the time users spend reading the original text. An importance ranking of event roles is also provided, enabling users to selectively focus on key points. In addition, this embodiment adopts deep learning techniques, which greatly reduces the workload of later operations and review.
It should be noted that, in the event classification stage, although existing text truncation techniques such as random truncation and head-and-tail truncation exist, both incur information loss to varying degrees. For multi-label classification, although schemes such as multiple binary classifiers can be used, they may suffer from sample imbalance, and prediction performance is poor for texts containing few actual events.
In the event role extraction stage, the effectiveness of existing techniques under large data volumes and complex, variable event types remains unverified, whereas this embodiment has reached an end-to-end F1 of 0.7+. The evaluation metric is the end-to-end F1: starting from the initial text input, n events are output, each with m event roles. F1 is computed as 2 * (p * r) / (p + r), where p is the precision, i.e., the proportion of the m * n extracted event roles that are correct, and r is the recall, i.e., the number of correct event roles among the m * n extracted ones as a proportion of the total number of labels.
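The end-to-end F1 above can be sketched as a small scoring function. The set-of-triples representation of extracted roles is an assumption for illustration; the application does not prescribe a data structure.

```python
def full_pipeline_f1(predicted_roles, gold_roles):
    """End-to-end F1 over extracted event roles.

    Both arguments are sets of (event_type, role, value) triples; this
    representation is illustrative, not taken from the application.
    """
    correct = len(predicted_roles & gold_roles)
    p = correct / len(predicted_roles) if predicted_roles else 0.0  # precision
    r = correct / len(gold_roles) if gold_roles else 0.0            # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```

With two of three predicted roles correct against three gold roles, both precision and recall are 2/3, giving F1 = 2/3.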
In the event combination stage, existing solutions rely on business staff continually updating a rule engine to perform pairing, which is inefficient, inaccurate, and costly; this embodiment overcomes these drawbacks.
In an embodiment, as shown in Fig. 2, step S101 includes steps S201 to S204.
S201. Select trigger words in the long text using a trigger word dictionary, and pre-truncate the long text using the trigger words.
S202. Based on the pre-truncated long text, count the number of sentences and the total number of characters between different trigger words.
S203. Construct discrete intervals from the total character counts between different trigger words, and select the interval accounting for the largest share of the distribution.
S204. Take the mode within the selected interval as the character count threshold, and truncate the long text using that threshold.
In this embodiment, the event classification stage faces two major pain points of news reports: excessive text length and diverse event types. For pain point 1 (excessive text length), a trigger word dictionary compiled by domain experts is provided first. A trigger word means that if the corresponding keyword appears in the text, there is some probability that an event of the corresponding type exists. This stage truncates the text around event trigger words: all trigger words present in the text are located, and the sentences within a certain character count threshold around each trigger word are cut out, the threshold being determined mainly by statistics. Because Chinese pre-trained models generally limit the maximum input text length to guarantee performance, the original text must be truncated. The specific process is as follows:
Statistics on the long text are gathered separately by event dimension. The long text is first split at periods, question marks, exclamation marks, and the like.
The number of sentences and the total number of characters between different trigger words are counted. For example, if the trigger word "上市" (listing) appears in a "company listing" event and the trigger word "退市" (delisting) appears in a subsequent "company delisting" event, this stage counts the characters between "上市" and "退市" as the following-context character count of the "上市" trigger word; the preceding-context count is handled in the same way.
After counting, the specific character counts are discretized into intervals, such as (under 50 characters), (50-100 characters), and so on; the distribution over the intervals is tallied, and within the interval accounting for the largest share of the distribution, the mode is chosen as the character count threshold before and after the trigger word for text segmentation.
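The threshold selection in steps S203 to S204 can be sketched as follows, assuming the character counts between trigger words have already been gathered; the 50-character bucket width mirrors the example intervals above and is not fixed by the application.

```python
from collections import Counter

def pick_char_threshold(char_counts, bucket_width=50):
    """Discretize the between-trigger character counts into intervals,
    keep the interval holding the largest share of the distribution,
    and return the mode of the counts inside it as the threshold."""
    intervals = Counter(c // bucket_width for c in char_counts)
    top_interval = intervals.most_common(1)[0][0]
    in_top = [c for c in char_counts if c // bucket_width == top_interval]
    return Counter(in_top).most_common(1)[0][0]
```

For counts [40, 60, 60, 70, 120], the 50-100 interval dominates and its mode, 60, becomes the truncation threshold.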
In an embodiment, as shown in Fig. 3, step S102 includes steps S301 to S304.
S301. Obtain a training set containing truncated training texts and event types, and splice the truncated training texts in the training set with the event labels.
S302. Perform convolution on the spliced truncated training texts with a deep learning model augmented with additional convolution kernels.
S303. Optimize and update the improved deep learning model using the focal loss function.
S304. Use the updated deep learning model to perform event classification prediction on the truncated text.
In this embodiment, to address pain point 2 of the event classification stage (diverse event types), the training and prediction structure of the deep learning model is modified and a multi-label classification technique is applied, ensuring that each truncated text can be predicted as multiple event types. The specific process is as follows:
In the training stage, this embodiment splices the truncated text with each event type, separated by special characters. For example, with 10 event types, a single original training text becomes 10 training texts, and the corresponding training label becomes a binary label: the model's training objective is reduced to judging whether the text belongs to the given event label, which effectively mitigates the problem of small sample sizes. At the model level, some changes were also made to accommodate this: the model no longer convolves the original text alone but convolves the original text with the event label spliced on. Since the spliced segments may be semantically distant from each other, this embodiment adds a small number of convolution kernels with stride 2 while retaining the original stride-1 kernels, improving the ability to extract information from distant text.
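The training-time expansion described here can be sketched as follows; the `[SEP]` separator is an assumed special character, since the application does not name the one it actually uses.

```python
def expand_training_sample(text, event_labels, gold_labels, sep="[SEP]"):
    """Turn one truncated text into one (text + label, 0/1) pair per event
    type, so the multi-label problem becomes binary classification on each
    spliced sample. `sep` stands in for the unspecified special character."""
    return [(f"{text}{sep}{label}", 1 if label in gold_labels else 0)
            for label in event_labels]
```

A text with 10 candidate event types thus yields 10 binary training samples, one per label.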
In addition, this embodiment also modifies the final loss computation. Because the original model handled multi-label text, the original loss computation is no longer suitable for the present binary classification model; moreover, to counter the large number of negative samples produced by the binary reformulation, this embodiment adopts the focal loss function, which effectively prevents an excess of negative samples from biasing the model toward fitting them.
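A sketch of focal loss in this binary setting; the application names the loss but not its hyperparameters, so the gamma = 2 and alpha = 0.25 defaults commonly used with focal loss are assumptions here.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy (mostly negative) samples so
    the abundant negatives produced by label splicing do not dominate.
    gamma and alpha are assumed defaults, not values from the application."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    pt = np.where(labels == 1, probs, 1 - probs)      # prob of the true class
    weight = np.where(labels == 1, alpha, 1 - alpha)  # class-balancing factor
    return float(np.mean(-weight * (1 - pt) ** gamma * np.log(pt)))
```

The (1 - pt)^gamma factor shrinks the contribution of confidently correct samples, which is what keeps the many easy negatives from dominating the gradient.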
In the prediction stage, all event types are likewise spliced after the original text. For example, a single prediction text is likewise expanded into 10 prediction texts; the model performs the same inference to obtain a binary result for each event type, and post-processing aggregates all event types predicted as 1 to obtain all the event types of the text. At the model level, the prediction stage mirrors the training stage: the feed-forward computation is identical and likewise includes a small number of stride-2 convolution kernels, chiefly to ensure that the parameters from the training stage are fully reproduced at prediction time. Also, outputting the prediction result does not require the focal loss computation; the activation output of the preceding layer is emitted directly.
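The prediction-time post-processing in this paragraph, sketched with a stand-in classifier; both `classify` and the `[SEP]` token are placeholders for the trained model and its unspecified separator.

```python
def predict_event_types(text, event_labels, classify, sep="[SEP]"):
    """Splice every candidate event type after the text, run the binary
    classifier on each spliced sample, and keep the labels predicted as 1.
    `classify` stands in for the trained model; `sep` is an assumed token."""
    return [label for label in event_labels
            if classify(f"{text}{sep}{label}") == 1]
```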
In an embodiment, step S103 includes:
using a question-answering architecture to splice a question after each event type of the truncated text;
using the pointer network model to construct label lists from the spliced question, and using the label lists to predict the probability values of the start position and end position, within the truncated text, of the answer to the question; and
selecting the start position and end position with the largest probability values, and taking the text content between the start position and the end position as the event role information under the corresponding event type.
In this embodiment, event role extraction also has several pain points, such as role labels being diverse, overlapping, or split, and some roles being unidentifiable under event constraints; traditional NER technology cannot resolve these. To address them, this embodiment adopts an MRC (Machine Reading Comprehension) plus pointer network strategy. The MRC component uses an overall question-answering architecture, i.e., a question is spliced after the input truncated text. This greatly enriches the truncated text and, with the question added, focuses the model on extracting the role information of the current event. For example, appending the question "在事件公司上市中，上市企业是什么？" (In the company-listing event, what is the listed company?) to the truncated text "A公司于今年10月上市。" (Company A went public in October this year.) forms a new truncated text "A公司于今年10月上市。在事件公司上市中，上市企业是什么？". From this input, the model can learn the co-occurrence relationship between "上市企业" (listed company) and Company A, which is very important for training.
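The question splicing in the example above can be written as a small helper; the Chinese question template follows the example sentence in this paragraph and is not a general template given by the application.

```python
def splice_question(truncated_text, event_type, role):
    """Append one MRC-style question per event role after the truncated
    text, mirroring the example in the description."""
    return f"{truncated_text}在事件{event_type}中，{role}是什么？"
```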
另外，还需要预测拼接的问句的答案在截断文本中的起始位置和终止位置。并且，针对每种事件类型下的每个事件角色都设置单独的问题，即如果一种事件类型下存在10个事件角色，则原始文本会被拼接10个问句组成10条训练样本进行训练。In addition, the model must predict the start and end positions, within the truncated text, of the answer to the appended question. A separate question is set for each event role under each event type; that is, if an event type has 10 event roles, the original text is concatenated with 10 questions to form 10 training samples.
事件角色识别（即事件角色信息获取）最重要的训练目标是获得该角色在截断文本中的起始位置和终止位置，但是如果起始位置和终止位置之间同时也存在其他的事件角色，例如“深圳华为科技公司”中的“深圳”即是公司名称，也是所在地区，传统的事件角色识别技术并不能很好的解决这个问题。而指针网络主要是通过两组标签值来分别拟合起始位置和终止位置，同时针对每个事件角色都有独立的两组标签列表进行隔离，模型需要单独给每个事件角色预测两组预测值，分别与两组标签列表计算损失，最终保证在每个事件角色下都能得到最优解。指针网络的输入仍然是MRC结构下的拼接有问句的截断文本。The most important training objective of event role recognition (i.e., acquiring event role information) is to obtain each role's start and end positions in the truncated text. However, other event roles may also lie within or overlap a role's span: for example, "深圳" (Shenzhen) in "深圳华为科技公司" (Shenzhen Huawei Technology Company) is both part of the company name and the company's region, a case that traditional event role recognition cannot handle well. The pointer network fits the start and end positions with two separate sets of label values, and each event role has its own independent pair of label lists for isolation: the model predicts two sets of values for each role separately, computes the loss against the corresponding pair of label lists, and thus guarantees an optimal solution for every event role. The input to the pointer network is still the question-augmented truncated text under the MRC structure.
例如，拼接有问句的截断文本的长度为100，则指针网络会构建两个长度为100的标签列表。第一标签列表主要负责预测事件角色的起始位置，每个位置都会输出是否为起始位置的概率值，找到概率值最大的位置作为事件角色的起始位置。具体过程可以有多种基本网络，在本实施例中可以采用transformer的编码器进行处理，transformer在NLP领域应用十分广泛，拥有强大的特征变换及处理能力，能够很好抽取输入文本的表层句法结构信息和深层语义信息。整体过程类似于指针在长度为100的文本上前后移动，直到找到起始位置。第二个标签列表与第一个标签列表处理过程的原理相同，只是将拟合目标（即起始位置）变换为事件角色的终止位置。For example, if the question-augmented truncated text has a length of 100, the pointer network constructs two label lists of length 100. The first list predicts the event role's start position: every position outputs a probability of being the start, and the position with the highest probability is taken as the start. Various base networks can implement this; this embodiment uses a transformer encoder, which is widely applied in NLP, has powerful feature-transformation and processing capabilities, and can extract both the surface syntactic structure and the deep semantics of the input text. The overall process resembles a pointer moving back and forth over the length-100 text until the start position is found. The second label list works on the same principle as the first, except that the fitting target is changed from the start position to the event role's end position.
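A toy version of the two-label-list mechanism can be written as below. The probability values are hand-picked for illustration; in the embodiment they would come from a transformer encoder over the question-augmented text:

```python
# Two label lists, each as long as the input text: one scores every
# position as a candidate start, the other as a candidate end. The argmax
# of each list is taken and the span between them is the role text.

def extract_span(text, start_probs, end_probs):
    assert len(start_probs) == len(end_probs) == len(text)
    start = max(range(len(text)), key=start_probs.__getitem__)
    end = max(range(len(text)), key=end_probs.__getitem__)
    return text[start:end + 1]

text = "A公司于今年10月上市。"                 # 12 characters
start_probs = [0.8] + [0.02] * 11             # peak at index 0
end_probs = [0.05, 0.10, 0.70] + [0.01] * 9   # peak at index 2
print(extract_span(text, start_probs, end_probs))  # A公司
```

With an independent pair of lists per event role, overlapping spans such as a region name inside a company name can each be recovered without label conflicts.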
针对同一个实体有多个事件角色标签，同一个实体前半部分和后半部分属于不同类型的标签等问题，本实施例采用指针网络，将多标签识别的问题转化成大量单标签的二分类问题，避免信息混杂。针对事件约束下的部分角色不能被识别的问题，本实施例采用MRC技术，MRC技术主要是将原始文本进行转化，将原始文本拼接问题文本一起送入预训练的语言模型中。模型需要预测问题文本的答案的所在位置，其中的问题文本与事件类型强相关，因此能够实现事件类型对于事件角色的强约束，保证每个事件下的事件角色信息都符合领域专家制定的规则。For the problems that one entity can carry multiple event role labels, or that the first and second halves of one entity belong to different label types, this embodiment uses the pointer network to convert multi-label recognition into many single-label binary classification problems, avoiding information confusion. For the problem that some roles cannot be identified under event constraints, this embodiment adopts MRC, which transforms the original text by concatenating it with a question text and feeding both into a pre-trained language model. The model must predict the location of the answer to the question text; because the question text is strongly related to the event type, the event type strongly constrains the event roles, ensuring that the role information under each event conforms to the rules formulated by domain experts.
在一实施例中,所述序列生成算法为DOC2EDAG算法。In one embodiment, the sequence generation algorithm is DOC2EDAG algorithm.
本实施例中，EDAG全称为Entity-based Directed Acyclic Graph，意为基于实体的有向无环图，即将长文本中抽取得到的一系列事件角色构建成一个有向无环图，也就是生成一个由事件角色组成的序列作为单一事件。In this embodiment, EDAG stands for Entity-based Directed Acyclic Graph: the series of event roles extracted from the long text is built into a directed acyclic graph, i.e., a sequence composed of event roles is generated as a single event.
在一实施例中,所述步骤S104包括:In one embodiment, the step S104 includes:
基于所述事件角色信息对每一事件类型下属的所有事件角色进行排序;sorting all event roles under each event type based on the event role information;
通过一状态变量对每一事件类型下属的事件角色进行状态更新;Update the state of the event role under each event type through a state variable;
根据排序结果和状态更新结果,通过DOC2EDAG算法为所有的事件角色构建有向无环图,得到所有的所述事件角色信息组合的序列,并将所述序列作为所述目标事件输出。According to the sorting result and the state update result, construct a directed acyclic graph for all event roles through the DOC2EDAG algorithm, obtain a sequence of all event role information combinations, and output the sequence as the target event.
本实施例中，在事件组合阶段的痛点在于任何事件的任何事件角色都有可能是一个实体，多个实体，甚至没有实体，因此在配对组合上会面对极其复杂的逻辑处理。目前该痛点在工业界主要通过规则处理，学术界存在一定的模型实现。而本实施例则基于DOC2EDAG算法，将事件组合转化成序列生成的任务。具体的，对于每种事件类型，为下属的所有事件角色定义一个顺序，并逐步更新每个事件角色。定义顺序的标准可以由领域知识专家确定，标准为单一事件维度下的角色重要性排序。如“公司上市”事件中的角色重要性为：上市公司，上市环节，上市证券所，上市时间等等。In this embodiment, the pain point of the event combination stage is that any event role of any event may correspond to one entity, multiple entities, or even no entity, so pairing and combination face extremely complex logic. In industry this pain point is currently handled mainly by rules, while some model-based implementations exist in academia. This embodiment, based on the DOC2EDAG algorithm, converts event combination into a sequence generation task. Specifically, for each event type, an order is defined over all of its subordinate event roles, and each role is updated step by step. The criterion for the order can be determined by domain experts, namely role importance under the single-event dimension; for example, the role importance in a "company listing" event is: listed company, listing stage, listing stock exchange, listing time, and so on.
同时，通过所述状态变量m，记录每一事件类型更新到某个事件角色时整个事件的状态，在扩展下一个事件角色节点时，会根据此时的状态变量m和新加入事件角色节点的特征e进行综合判断。Meanwhile, the state variable m records the state of the whole event as each event type is updated to a given event role; when expanding the next event role node, a comprehensive judgment is made from the current state variable m and the feature e of the newly added event role node.
然后根据排序结果和状态更新结果,对事件角色信息生成序列组合,并以此作为事件抽取结果输出。Then, according to the sorting result and the state update result, a sequence combination is generated for the event role information, which is output as the event extraction result.
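The sequence-generation view of event combination can be sketched as below. The role order and the candidate lists are invented for illustration, and the learned pruning via the state variable is omitted; every candidate is kept:

```python
# Roles are expanded in a fixed, expert-defined order; each partial event
# is extended with every accepted candidate (or None when the role is
# absent), so events sharing a prefix grow from a common branch, mirroring
# the DAG construction. The real model prunes extensions with the learned
# state variable; here every candidate is accepted.

ROLE_ORDER = ["上市公司", "上市证券所", "上市时间"]  # assumed order

def combine_events(candidates_per_role):
    events = [{}]
    for role in ROLE_ORDER:
        extended = []
        for partial in events:
            for value in candidates_per_role.get(role) or [None]:
                new_event = dict(partial)
                new_event[role] = value
                extended.append(new_event)
        events = extended
    return events

events = combine_events({"上市公司": ["A公司", "B公司"], "上市时间": ["10月"]})
print(len(events))  # 2: one event per listed company
```

Merging the shared prefix (here the common listing time and the absent exchange) is what turns this enumeration into the directed acyclic graph described above.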
在一实施例中,所述通过一状态变量对每一事件类型下属的事件角色进行状态更新,包括:In an embodiment, updating the state of the event role subordinate to each event type through a state variable includes:
获取至少一新增事件角色节点,并利用全连接层对每一所述事件角色节点进行特征变换;Obtaining at least one newly added event role node, and performing feature transformation on each of the event role nodes by using a fully connected layer;
将特征变换结果与所述状态变量进行拼接,并将拼接结果依次输入值全连接层和激活函数,得到每一所述事件角色节点与对应事件角色的匹配概率值;Splicing the feature transformation result with the state variable, and inputting the splicing result into a fully connected layer and an activation function in turn to obtain a matching probability value between each event role node and the corresponding event role;
选择匹配概率值最大的事件角色节点作为对应事件角色的预测结果,并更新对应的事件类型。Select the event role node with the largest matching probability value as the prediction result of the corresponding event role, and update the corresponding event type.
本实施例中，综合判断主要由神经网络的全连接层决定，主要流程是新加入的事件角色节点的节点特征e经过全连接层进行特征变换，再与此时的状态变量进行拼接，然后经过一层全连接层和激活函数，得到该事件角色节点与该事件角色匹配的概率值。选取匹配概率值最高的事件角色节点作为该事件角色的预测结果。In this embodiment, the comprehensive judgment is made mainly by fully connected layers of the neural network: the node feature e of the newly added event role node is transformed by a fully connected layer, concatenated with the current state variable, and then passed through another fully connected layer and an activation function to obtain the probability that the node matches the event role. The node with the highest matching probability is selected as the prediction result for that role.
每个事件角色节点可能是真实的实体,也可能是空值,最终把公共前缀进行合并,形成每个单独的事件。Each event role node may be a real entity, or it may be a null value, and finally the common prefixes are merged to form each individual event.
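A hand-rolled numeric sketch of that matching score follows. The tiny weight matrices are fixed examples, not trained parameters, and sigmoid stands in for the unspecified activation function:

```python
import math

# Node feature e is linearly transformed, the result is concatenated with
# the state variable m, and a second linear layer plus a sigmoid yields
# the probability that the candidate node fills the current event role.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def match_probability(e, m, W1, W2, b2):
    h = matvec(W1, e)                                # feature transform of e
    z = h + m                                        # list concat = splicing with m
    logit = sum(w * x for w, x in zip(W2, z)) + b2   # second fully connected layer
    return sigmoid(logit)

W1 = [[1.0, 0.0], [0.0, 1.0]]   # identity transform, for illustration only
W2 = [1.0, 1.0, 1.0, 1.0]
p = match_probability([1.0, 0.0], [0.0, 1.0], W1, W2, 0.0)
print(round(p, 4))  # 0.8808  (= sigmoid(2.0))
```

In the embodiment this score is computed for every candidate node and the highest-scoring node is taken as the role's prediction.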
还需注意的是，由于事件抽取的整体流程过长，因此需要进行流程拆解后利用不同模型的组合来分而治之。在不同的阶段也存在不同的痛点，而本实施例则可以解决存在的痛点。各个阶段之间的联系主要通过串联实现，以输入一条长文本为例，一阶段（即事件分类阶段）主要输出该长文本的所有截断文本的事件类型（多分类）；二阶段（即事件角色抽取阶段）输入这些截断文本，主要输出每条截断文本的每个事件类型下识别得到的所有事件角色；三阶段（即事件组合阶段）输入所有事件角色，通过序列生成模型获得包含一批事件角色的所有事件，最终实现事件抽取的需求。It should also be noted that because the overall event extraction flow is long, it is decomposed and handled by a combination of different models in a divide-and-conquer manner. Different stages have different pain points, and this embodiment resolves each of them. The stages are connected in series. Taking one long text as input: the first stage (event classification) outputs the event types (multi-class) of all truncated texts of the long text; the second stage (event role extraction) takes these truncated texts as input and outputs all event roles identified under each event type of each truncated text; the third stage (event combination) takes all event roles as input and, through the sequence generation model, obtains all events composed of these roles, finally fulfilling the event extraction requirement.
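The three-stage serial flow just described can be outlined schematically. Each stage is stubbed as a plain callable standing in for the classification model, the MRC + pointer-network extractor, and the sequence-generation model; all names and stub behaviors are illustrative:

```python
# Stage 1 classifies each truncated text into event types; stage 2
# extracts event roles per (chunk, event type); stage 3 combines the
# roles into complete events. The stubs below only show the data flow.

def extract_events(long_text, truncate, classify, extract_roles, combine):
    events = []
    for chunk in truncate(long_text):
        for event_type in classify(chunk):              # stage 1
            roles = extract_roles(chunk, event_type)    # stage 2
            events.extend(combine(event_type, roles))   # stage 3
    return events

events = extract_events(
    "A公司于今年10月上市。",
    truncate=lambda text: [text],
    classify=lambda chunk: ["公司上市"],
    extract_roles=lambda chunk, et: {"上市企业": "A公司"},
    combine=lambda et, roles: [{"type": et, **roles}],
)
print(events)  # [{'type': '公司上市', '上市企业': 'A公司'}]
```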
图4为本申请实施例提供的一种长文本事件抽取装置400的示意性框图,该装置400包括:FIG. 4 is a schematic block diagram of a long text event extraction device 400 provided in an embodiment of the present application. The device 400 includes:
第一截断单元401,用于获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本;The first truncation unit 401 is configured to obtain trigger words in the long text of the event to be extracted, and perform text truncation on the long text according to the trigger words to obtain the truncated text;
第一分类预测单元402,用于利用深度学习模型分类预测所述截断文本对应的多个事件类型;The first classification prediction unit 402 is configured to use a deep learning model to classify and predict multiple event types corresponding to the truncated text;
第一抽取单元403,用于结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息;The first extraction unit 403 is used to combine machine reading comprehension technology and pointer network model to extract corresponding event role information for each event type;
结果输出单元404,用于基于序列生成算法,将所有的所述事件角色信息组合为一目标事件,并将所述目标事件作为事件抽取结果输出。The result output unit 404 is configured to combine all the event role information into a target event based on a sequence generation algorithm, and output the target event as an event extraction result.
在一实施例中,如图5所示,所述第一截断单元401包括:In an embodiment, as shown in FIG. 5, the first truncation unit 401 includes:
触发词选取单元501,用于通过触发词词典在长文本中选取触发词,并利用触发词对长文本进行预截断;The trigger word selection unit 501 is used to select the trigger word in the long text through the trigger word dictionary, and utilize the trigger word to pre-truncate the long text;
统计单元502,用于基于预截断的长文本,统计不同触发词之间的句子数量和总字数;A statistical unit 502, configured to count the number of sentences and the total number of words between different trigger words based on the pre-truncated long text;
区间选取单元503,用于根据不同触发词之间的总字数构建离散区间,并基于所述离散区间选取分布占比最多的字数区间;The interval selection unit 503 is used to construct a discrete interval according to the total number of words between different trigger words, and select the interval with the largest number of words distributed based on the discrete interval;
字数阈值设置单元504,用于在所述字数区间中选取众数作为字数阈值,并利用所述字数阈值对长文本进行文本截断。The word count threshold setting unit 504 is configured to select the mode number in the word count interval as the word count threshold, and use the word count threshold to perform text truncation on long texts.
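The threshold-selection logic of units 502 to 504 can be sketched as follows. The bucket width of 100 characters is an assumed value for illustration; the patent does not fix the interval size:

```python
from collections import Counter

# Gap lengths (total character counts between adjacent trigger words) are
# bucketed into discrete intervals; the most populated interval is chosen
# and the mode of the lengths inside it becomes the truncation threshold.

def pick_length_threshold(gap_lengths, bucket_width=100):
    buckets = Counter(n // bucket_width for n in gap_lengths)
    best_bucket, _ = buckets.most_common(1)[0]
    in_best = [n for n in gap_lengths if n // bucket_width == best_bucket]
    mode, _ = Counter(in_best).most_common(1)[0]
    return mode

print(pick_length_threshold([120, 130, 130, 450, 980]))  # 130
```

Long texts would then be cut at trigger words so that each chunk stays near this threshold.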
在一实施例中，如图6所示，所述第一分类预测单元402包括：In one embodiment, as shown in FIG. 6, the first classification prediction unit 402 includes:
标签拼接单元601,用于获取包含截断训练文本和事件类型的训练集,并对训练集中的截断训练文本按照事件标签拼接;A label splicing unit 601, configured to obtain a training set comprising truncated training text and event types, and stitch the truncated training text in the training set according to the event label;
卷积处理单元602，用于通过增加卷积核的深度学习模型对拼接后的截断训练文本进行卷积处理；The convolution processing unit 602 is configured to perform convolution processing on the concatenated truncated training text using a deep learning model with added convolution kernels;
优化更新单元603,用于采用focal-loss损失函数对改进的深度学习模型进行优化更新;An optimization update unit 603, configured to optimize and update the improved deep learning model using a focal-loss loss function;
第二分类预测单元604，用于利用更新后的深度学习模型对截断文本进行事件分类预测。The second classification prediction unit 604 is configured to use the updated deep learning model to perform event classification prediction on the truncated text.
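As a reference for the optimization-update unit, a minimal single-label focal-loss sketch is given below, following the standard formulation FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t). The alpha and gamma values are the commonly used defaults, not values taken from the patent:

```python
import math

# Focal loss down-weights easy, well-classified examples so training
# focuses on hard ones, which helps when event-type labels are imbalanced.

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of the positive class; y: gold label 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less loss than a hard one:
print(focal_loss(0.9, 1) < focal_loss(0.6, 1))  # True
```

In the multi-label classifier this would be summed over the per-event-type sigmoid outputs.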
在一实施例中,所述第一抽取单元403包括:In an embodiment, the first extraction unit 403 includes:
问句拼接单元,用于采用问答式架构在所述截断文本的每一事件类型后拼接问句;A question splicing unit for splicing questions after each event type of the truncated text using a question-and-answer structure;
概率预测单元,用于通过指针网络模型,根据拼接问句构建标签列表,并利用所述标签列表预测所述问句在所述截断文本中的起始位置概率值和终止位置概率值;The probability prediction unit is used to construct a label list according to the concatenated questions through the pointer network model, and use the label list to predict the starting position probability value and the ending position probability value of the question sentence in the truncated text;
位置选取单元,用于选取概率值最大的起始位置和终止位置,并将所述起始位置和终止位置之间的文本内容作为对应事件类型下属的事件角色信息。The position selection unit is configured to select the start position and the end position with the highest probability value, and use the text content between the start position and the end position as the event role information under the corresponding event type.
在一实施例中,所述序列生成算法为DOC2EDAG算法。In one embodiment, the sequence generation algorithm is DOC2EDAG algorithm.
在一实施例中,所述结果输出单元404包括:In one embodiment, the result output unit 404 includes:
角色排序单元,用于基于所述事件角色信息对每一事件类型下属的所有事件角色进行排序;a role sorting unit, configured to sort all event roles under each event type based on the event role information;
状态更新单元,用于通过一状态变量对每一事件类型下属的事件角色进行状态更新;A state update unit, configured to update the state of the event roles under each event type through a state variable;
序列输出单元,用于根据排序结果和状态更新结果,通过DOC2EDAG算法为所有的事件角色构建有向无环图,得到所有的所述事件角色信息组合的序列,并将所述序列作为所述目标事件输出。The sequence output unit is used to construct a directed acyclic graph for all event roles through the DOC2EDAG algorithm according to the sorting result and the state update result, to obtain a sequence of all the event role information combinations, and use the sequence as the target event output.
在一实施例中,所述状态更新单元包括:In one embodiment, the status update unit includes:
特征变换单元,用于获取至少一新增事件角色节点,并利用全连接层对每一所述事件角色节点进行特征变换;A feature transformation unit, configured to obtain at least one newly added event role node, and perform feature transformation on each of the event role nodes by using a fully connected layer;
特征拼接单元,用于将特征变换结果与所述状态变量进行拼接,并将拼接结果依次输入值全连接层和激活函数,得到每一所述事件角色节点与对应事件角色的匹配概率值;The feature splicing unit is used to splice the feature transformation result and the state variable, and input the splicing result into the fully connected layer and the activation function in turn to obtain the matching probability value between each of the event role nodes and the corresponding event role;
节点选择单元,用于选择匹配概率值最大的事件角色节点作为对应事件角色的预测结果,并更新对应的事件类型。The node selection unit is configured to select the event role node with the highest matching probability value as the prediction result of the corresponding event role, and update the corresponding event type.
由于装置部分的实施例与方法部分的实施例相互对应,因此装置部分的实施例请参见方法部分的实施例的描述,这里暂不赘述。Since the embodiment of the device part corresponds to the embodiment of the method part, please refer to the description of the embodiment of the method part for the embodiment of the device part, and details will not be repeated here.
本申请实施例还提供了一种计算机可读存储介质,其上存有计算机程序,该计算机程序被执行时可以实现上述实施例所提供的步骤。该存储介质可以包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed, the steps provided in the above-mentioned embodiments can be realized. The storage medium may include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk, and other media capable of storing program codes.
本申请实施例还提供了一种计算机设备,可以包括存储器和处理器,存储器中存有计算机程序,处理器调用存储器中的计算机程序时,可以实现上述实施例所提供的步骤。当然计算机设备还可以包括各种网络接口,电源等组件。The embodiment of the present application also provides a computer device, which may include a memory and a processor. A computer program is stored in the memory. When the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented. Of course, the computer equipment may also include components such as various network interfaces and power supplies.
说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的系统而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。Each embodiment in the description is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other. As for the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and for the related information, please refer to the description of the method part. It should be pointed out that those skilled in the art can make some improvements and modifications to the application without departing from the principles of the application, and these improvements and modifications also fall within the protection scope of the claims of the application.
还需要说明的是，在本说明书中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的状况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising said element.

Claims (10)

  1. 一种长文本事件抽取方法,其特征在于,包括:A method for extracting long text events, comprising:
    获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本;Acquiring the trigger word in the long text of the event to be extracted, and performing text truncation on the long text according to the trigger word to obtain the truncated text;
    利用深度学习模型分类预测所述截断文本对应的多个事件类型;Using a deep learning model to classify and predict multiple event types corresponding to the truncated text;
    结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息;Combining machine reading comprehension technology and pointer network model to extract corresponding event role information for each event type;
    基于序列生成算法,将所有的所述事件角色信息组合为一目标事件,并将所述目标事件作为事件抽取结果输出。Based on a sequence generation algorithm, all the event role information is combined into a target event, and the target event is output as an event extraction result.
  2. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本,包括:The long text event extraction method according to claim 1, wherein said acquiring the trigger word in the long text of the event to be extracted, and performing text truncation on the long text according to the trigger word, to obtain the truncated text includes:
    通过触发词词典在长文本中选取触发词,并利用触发词对长文本进行预截断;Select trigger words in the long text through the trigger word dictionary, and use the trigger words to pre-truncate the long text;
    基于预截断的长文本,统计不同触发词之间的句子数量和总字数;Based on the pre-truncated long text, count the number of sentences and the total number of words between different trigger words;
    根据不同触发词之间的总字数构建离散区间,并基于所述离散区间选取分布占比最多的字数区间;Construct discrete intervals according to the total word count between different trigger words, and select the word count interval with the largest distribution ratio based on the discrete intervals;
    在所述字数区间中选取众数作为字数阈值,并利用所述字数阈值对长文本进行文本截断。Select the mode number in the word count interval as the word count threshold, and use the word count threshold to truncate the long text.
  3. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述利用深度学习模型分类预测所述截断文本对应的多个事件类型,包括:The long text event extraction method according to claim 1, wherein said utilizing a deep learning model to classify and predict multiple event types corresponding to said truncated text comprises:
    获取包含截断训练文本和事件类型的训练集,并对训练集中的截断训练文本按照事件标签拼接;Obtain a training set containing truncated training texts and event types, and stitch the truncated training texts in the training set according to event labels;
    通过增加卷积核的深度学习模型对拼接后的截断训练文本进行卷积处理；performing convolution processing on the concatenated truncated training text using a deep learning model with added convolution kernels;
    采用focal-loss损失函数对改进的深度学习模型进行优化更新;Optimize and update the improved deep learning model by using the focal-loss loss function;
    利用更新后的深度学习模型对截断文本进行事件分类预测。using the updated deep learning model to perform event classification prediction on the truncated text.
  4. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息,包括:The long text event extraction method according to claim 1, characterized in that, the combination of machine reading comprehension technology and pointer network model extracts corresponding event role information for each of the event types, including:
    采用问答式架构在所述截断文本的每一事件类型后拼接问句;Using a question-and-answer structure to splice question sentences after each event type of the truncated text;
    通过指针网络模型,根据拼接问句构建标签列表,并利用所述标签列表预测所述问句在所述截断文本中的起始位置概率值和终止位置概率值;Constructing a label list according to the concatenated questions through the pointer network model, and using the label list to predict the starting position probability value and the ending position probability value of the question sentence in the truncated text;
    选取概率值最大的起始位置和终止位置,并将所述起始位置和终止位置之间的文本内容作为对应事件类型下属的事件角色信息。Select the start position and the end position with the highest probability value, and use the text content between the start position and the end position as the event role information of the corresponding event type.
  5. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述序列生成算法为DOC2EDAG算法。The long text event extraction method according to claim 1, wherein the sequence generation algorithm is a DOC2EDAG algorithm.
  6. 根据权利要求5所述的长文本事件抽取方法，其特征在于，所述基于序列生成算法，将所有的所述事件角色信息组合为一目标事件，并将所述目标事件作为事件抽取结果输出，包括：The long text event extraction method according to claim 5, wherein said combining all the event role information into a target event based on the sequence generation algorithm, and outputting the target event as an event extraction result, comprises:
    基于所述事件角色信息对每一事件类型下属的所有事件角色进行排序;sorting all event roles under each event type based on the event role information;
    通过一状态变量对每一事件类型下属的事件角色进行状态更新;Update the state of the event role under each event type through a state variable;
    根据排序结果和状态更新结果,通过DOC2EDAG算法为所有的事件角色构建有向无环图,得到所有的所述事件角色信息组合的序列,并将所述序列作为所述目标事件输出。According to the sorting result and the state update result, construct a directed acyclic graph for all event roles through the DOC2EDAG algorithm, obtain a sequence of all event role information combinations, and output the sequence as the target event.
  7. 根据权利要求6所述的长文本事件抽取方法,其特征在于,所述通过一状态变量对每一事件类型下属的事件角色进行状态更新,包括:The method for extracting long text events according to claim 6, wherein said updating the state of the event roles subordinate to each event type through a state variable includes:
    获取至少一新增事件角色节点,并利用全连接层对每一所述事件角色节点进行特征变换;Obtaining at least one newly added event role node, and performing feature transformation on each of the event role nodes by using a fully connected layer;
    将特征变换结果与所述状态变量进行拼接,并将拼接结果依次输入值全连接层和激活函数,得到每一所述事件角色节点与对应事件角色的匹配概率值;Splicing the feature transformation result with the state variable, and inputting the splicing result into a fully connected layer and an activation function in turn to obtain a matching probability value between each event role node and the corresponding event role;
    选择匹配概率值最大的事件角色节点作为对应事件角色的预测结果,并更新对应的事件类型。Select the event role node with the largest matching probability value as the prediction result of the corresponding event role, and update the corresponding event type.
  8. 一种长文本事件抽取装置,其特征在于,包括:A long text event extraction device is characterized in that it comprises:
    第一截断单元,用于获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本;The first truncation unit is configured to obtain trigger words in the long text of the event to be extracted, and perform text truncation on the long text according to the trigger words to obtain the truncated text;
    第一分类预测单元,用于利用深度学习模型分类预测所述截断文本对应的多个事件类型;A first classification prediction unit, configured to use a deep learning model to classify and predict multiple event types corresponding to the truncated text;
    第一抽取单元,用于结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息;The first extraction unit is used to combine machine reading comprehension technology and pointer network model to extract corresponding event role information for each event type;
    结果输出单元,用于基于序列生成算法,将所有的所述事件角色信息组合为一目标事件,并将所述目标事件作为事件抽取结果输出。The result output unit is configured to combine all the event role information into a target event based on a sequence generation algorithm, and output the target event as an event extraction result.
  9. 一种计算机设备，其特征在于，包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如权利要求1至7任一项所述的长文本事件抽取方法。A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the long text event extraction method according to any one of claims 1 to 7.
  10. 一种计算机可读存储介质，其特征在于，所述计算机可读存储介质上存储有计算机程序，所述计算机程序被处理器执行时实现如权利要求1至7任一项所述的长文本事件抽取方法。A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the long text event extraction method according to any one of claims 1 to 7.
PCT/CN2021/120030 2021-09-13 2021-09-24 Long text event extraction method and apparatus, and computer device and storage medium WO2023035330A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111065602.1A CN113535963B (en) 2021-09-13 2021-09-13 Long text event extraction method and device, computer equipment and storage medium
CN202111065602.1 2021-09-13

Publications (1)

Publication Number Publication Date
WO2023035330A1 true WO2023035330A1 (en) 2023-03-16

Family

ID=78093162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120030 WO2023035330A1 (en) 2021-09-13 2021-09-24 Long text event extraction method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN113535963B (en)
WO (1) WO2023035330A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501898A (en) * 2023-06-29 2023-07-28 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN117648397A (en) * 2023-11-07 2024-03-05 中译语通科技股份有限公司 Chapter event extraction method, system, equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292568B (en) * 2022-03-02 2023-11-17 内蒙古工业大学 Civil news event extraction method based on joint model
CN114996434B (en) * 2022-08-08 2022-11-08 深圳前海环融联易信息科技服务有限公司 Information extraction method and device, storage medium and computer equipment
CN115982339A (en) * 2023-03-15 2023-04-18 上海蜜度信息技术有限公司 Method, system, medium and electronic device for extracting emergency

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009205372A (en) * 2008-02-27 2009-09-10 Mitsubishi Electric Corp Information processor, information processing method and program
US20200226218A1 (en) * 2019-01-14 2020-07-16 International Business Machines Corporation Automatic classification of adverse event text fragments
CN111522915A (en) * 2020-04-20 2020-08-11 北大方正集团有限公司 Extraction method, device and equipment of Chinese event and storage medium
CN112861527A (en) * 2021-03-17 2021-05-28 合肥讯飞数码科技有限公司 Event extraction method, device, equipment and storage medium
CN112905868A (en) * 2021-03-22 2021-06-04 京东方科技集团股份有限公司 Event extraction method, device, equipment and storage medium
CN113312916A (en) * 2021-05-28 2021-08-27 北京航空航天大学 Financial text event extraction method and device based on triggered word morphological learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004006133A1 (en) * 2002-07-03 2004-01-15 Iotapi., Com, Inc. Text-machine code, system and method
CN110210027B (en) * 2019-05-30 2023-01-24 杭州远传新业科技股份有限公司 Fine-grained emotion analysis method, device, equipment and medium based on ensemble learning
CN111090763B (en) * 2019-11-22 2024-04-05 北京视觉大象科技有限公司 Picture automatic labeling method and device


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501898A (en) * 2023-06-29 2023-07-28 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116501898B (en) * 2023-06-29 2023-09-01 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN116776886B (en) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN117648397A (en) * 2023-11-07 2024-03-05 中译语通科技股份有限公司 Chapter event extraction method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN113535963A (en) 2021-10-22
CN113535963B (en) 2021-12-21

Similar Documents

Publication Publication Date Title
Huq et al. Sentiment analysis on Twitter data using KNN and SVM
WO2023035330A1 (en) Long text event extraction method and apparatus, and computer device and storage medium
US20210382878A1 (en) Systems and methods for generating a contextually and conversationally correct response to a query
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
CN101140588A (en) Method and apparatus for ordering incidence relation search result
Sun et al. Pre-processing online financial text for sentiment classification: A natural language processing approach
CN112214614A (en) Method and system for mining risk propagation path based on knowledge graph
CN112036842A (en) Intelligent matching platform for scientific and technological services
Shekhawat Sentiment classification of current public opinion on brexit: Naïve Bayes classifier model vs Python’s Textblob approach
CN110826315B (en) Method for identifying timeliness of short text by using neural network system
Kim et al. Business environmental analysis for textual data using data mining and sentence-level classification
Zhao RETRACTED ARTICLE: Application of deep learning algorithm in college English teaching process evaluation
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111859955A (en) Public opinion data analysis model based on deep learning
CN113379432B (en) Sales system customer matching method based on machine learning
Khekare et al. Design of Automatic Key Finder for Search Engine Optimization in Internet of Everything
Zhu et al. Sentiment analysis methods: Survey and evaluation
Rizinski et al. Sentiment Analysis in Finance: From Transformers Back to eXplainable Lexicons (XLex)
CN113536772A (en) Text processing method, device, equipment and storage medium
CN111967251A (en) Intelligent customer sound insight system
Bharadi Sentiment Analysis of Twitter Data Using Named Entity Recognition
CN117852553B (en) Language processing system for extracting component transaction scene information based on chat record
CN117708308B (en) RAG natural language intelligent knowledge base management method and system
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21956506

Country of ref document: EP

Kind code of ref document: A1