CN114281998A - Multi-level annotator-oriented event annotation system construction method based on crowdsourcing technology - Google Patents


Info

Publication number
CN114281998A
Authority
CN
China
Prior art keywords
event
text
entity
labeling
crowdsourcing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111624377.0A
Other languages
Chinese (zh)
Other versions
CN114281998B (en)
Inventor
纪婉婷
马宇航
张磊
李冬
宋宝燕
武子涵
鲁闻一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University
Priority to CN202111624377.0A
Priority claimed from CN202111624377.0A
Publication of CN114281998A
Application granted
Publication of CN114281998B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for constructing a multi-level annotator-oriented event annotation system based on crowdsourcing technology, comprising the following steps: 1. collecting domain data and constructing a complete entity library and event information library; 2. preprocessing the corpus and constructing a complete access mechanism for the corpus to be annotated, including filtering out invalid texts and splitting texts into sentences; 3. constructing a complete annotation mechanism that performs entity annotation and event annotation; 4. constructing a complete crowdsourcing task allocation mechanism and crowdsourcing result aggregation mechanism; 5. constructing a complete data set export mechanism that dynamically builds the required event extraction data set according to the data set format required by the downstream event extraction model. The invention applies different crowdsourcing techniques to the annotation workflows of annotators with different levels of expertise, thereby making effective use of the annotators' background knowledge and maximizing the benefit of crowdsourcing.

Description

Multi-level annotator-oriented event annotation system construction method based on crowdsourcing technology
Technical Field
The invention belongs to the technical field of natural language processing and relates to a method for constructing a multi-level annotator-oriented event annotation system based on crowdsourcing technology, in particular to a method for storing a large-scale corpus, querying it visually, allocating multi-level crowdsourcing tasks, annotating event information in the corpus, and exporting a data set.
Background
Text annotation is a fundamental task in the field of natural language processing. Because most existing natural language processing models are based on machine learning or deep learning, a large amount of high-quality text (a corpus) is required for model training to ensure that the models are accurate and effective across natural language processing tasks. Manual text annotation produces standardized corpora, reduces the difficulty of training natural language processing models, and helps improve their accuracy and effectiveness.
In recent years, manual text annotation has usually been combined with crowdsourcing technology to improve the quality and efficiency of domain data annotation and to reduce annotation cost. A corpus is typically created from domain data by inviting domain experts to act as annotators; for example, financial experts are invited to manually annotate financial texts (such as financial news or announcements) to build a financial-domain corpus. An annotation system combined with crowdsourcing technology can therefore be used to annotate the text data of a given domain at scale, improving annotation efficiency, reducing annotation cost and ensuring annotation quality, and thereby producing a high-quality domain corpus.
Existing manual annotation systems based on crowdsourcing have two main problems. First, the crowdsourcing technique is too uniform: existing systems usually apply the same annotation workflow to annotators with different levels of expertise, so the background knowledge of domain experts is not fully used during annotation, and manpower and material resources are not well balanced. The key to a successful crowdsourcing task is to jointly optimize annotation quality, manual annotation cost and annotation efficiency. Second, existing systems cannot annotate multiple events in a long text. A long text usually contains many entity nouns and complex sentence patterns, and it may also contain several events (an event is a series of activities carried out around a certain theme, in a specific time and place, with one or more participating roles, i.e. event subjects). For example, one piece of financial news may contain several events of the same financial event type or several events of different financial event types. Existing manual annotation systems cannot process document-level text and annotate the multiple events in it at the same time. It is therefore necessary to construct a multi-level annotator-oriented event annotation system based on crowdsourcing technology that supports efficient access to, visual display of and accurate annotation of a large-scale corpus.
Disclosure of Invention
In order to overcome the defects of the conventional crowdsourcing data annotation system, the invention provides a method for constructing a multilevel annotator-oriented event annotation system based on a crowdsourcing technology.
The technical scheme adopted by the invention is as follows:
Step 1, collecting domain data and constructing a complete entity library and event information library. The entity library contains the entity types that are common in named entity recognition tasks (such as person name, place name and organization name). The event information library contains a number of financial event types, the set of event argument roles (also called event element roles) corresponding to each event type, and their auxiliary information. Each event type has a predefined set of trigger words; the occurrence of a trigger word indicates the occurrence of an event of the corresponding type, so the corpus can be filtered and classified by event type in the data preprocessing stage. Each event type has a predefined set of event argument roles; the task of event annotation is to label, in the text, the event arguments that match these roles so as to fill in the complete event information. Each event type has a predefined set of key event argument roles, which is a subset of the event argument role set; the key roles must be labeled during annotation, otherwise the annotation result cannot be submitted. Each event type has a predefined submission threshold; the number of event arguments labeled must exceed the submission threshold, otherwise the result cannot be submitted. Setting key event argument roles and a submission threshold is a simple and effective way to improve annotation quality.
For the event annotation task, an authoritative entity library and event information library approved by experts must be constructed in advance. An entity is a word or phrase in the text that has a specific meaning or strong referential value. Entities are generally divided into three major categories (entity, time and number) or seven minor categories (person name, place name, organization name, time, date, currency and percentage).
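The submission rule in Step 1 can be illustrated with a minimal Python sketch (the experiment of the invention itself was developed in Java). The class name `EventType`, the function `can_submit`, the role names and the threshold value are all hypothetical, and "must exceed" is read here as a strict comparison, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventType:
    """Hypothetical schema entry for one event type in the event information library."""
    name: str
    triggers: frozenset       # predefined trigger words
    roles: frozenset          # all event argument roles
    key_roles: frozenset      # subset of roles that must be labeled
    submit_threshold: int     # gate on the number of labeled arguments


def can_submit(event_type: EventType, labeled: dict) -> bool:
    """Return True if a labeling result {role: argument_text} may be submitted."""
    filled = {role for role, arg in labeled.items() if arg and role in event_type.roles}
    has_key_roles = event_type.key_roles <= filled
    # "Must exceed the submission threshold" is read as a strict comparison (assumption).
    return has_key_roles and len(filled) > event_type.submit_threshold


# Illustrative values only; the real role sets are defined in the event information library.
ASSET_REORG = EventType(
    name="asset reorganization",
    triggers=frozenset({"acquire", "acquisition"}),
    roles=frozenset({"acquirer", "target", "amount", "date"}),
    key_roles=frozenset({"acquirer", "target"}),
    submit_threshold=2,
)
```

With these illustrative values, a result that labels the acquirer, the target and the amount passes the check, while a result that omits the target is rejected regardless of how many other roles are filled.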
Table 1 Entity type library
[Table 1 is an image in the original publication (BDA0003438461320000021) and is not reproduced here.]
Event extraction is the process of extracting events of interest to a user from unstructured text and presenting them in structured form. Every event has an event type and matching event arguments (an event argument is an entity in the text that matches an event argument role under the corresponding event type). A professional event information library must therefore be constructed for the event annotation task. The invention constructs a complete event information library for financial-domain data and predefines fourteen event types, including asset reorganization, debt default, equity transfer/equity acquisition, bankruptcy liquidation, major asset loss, major external compensation, major safety accident, equity freeze, equity pledge, shareholder share increase and shareholder share reduction; the number of event argument roles differs from one event type to another. A schematic diagram of the financial event information library constructed in the experiment of the invention is shown in fig. 3.
Step 2, preprocessing the corpus and constructing a complete access mechanism for the corpus to be annotated. Before being uploaded, the corpus is preprocessed: invalid texts (texts that contain no event) are filtered out, and texts are split into sentences (most deep learning models in natural language processing currently vectorize text sentence by sentence).
Before being uploaded, the corpus is classified by event type and invalid texts (texts that contain no event) are filtered out. In the experiment of the invention, this step is implemented by querying user-defined trigger words, i.e. the core words that most directly signal the occurrence of an event. Consider the example text: "On December 30, 2019, Shenyang Company A announced that it would acquire Shenyang Company B for RMB 550 million. The project constitutes a major asset reorganization, and the related matters are still moving forward! On November 29, 2020, it announced a planned share increase of RMB 150 million. The news caused quite a stir in the industry as soon as it was released." In this example, "acquire" and "share increase" trigger the two financial event types "asset reorganization" and "shareholder share increase" respectively, so the text can be classified and filtered according to its trigger words. The trigger word for a financial event type is not unique; synonyms or near-synonyms may also serve as triggers. Although filtering texts by user-defined trigger words can lead to a high discard rate, it improves the quality of the remaining texts to some extent, makes subsequent annotation easier for annotators, and improves the quality of the generated data set.
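A minimal Python sketch of trigger-word classification and filtering follows. The English trigger words, the dictionary `TRIGGERS` and the function names are illustrative assumptions; the original system works on Chinese text with its own trigger sets, and plain substring matching is used here only for simplicity.

```python
# Hypothetical trigger sets; the real ones come from the event information library.
TRIGGERS = {
    "asset reorganization": {"acquire", "acquisition", "merge"},
    "shareholder share increase": {"increase its holding", "share increase"},
}


def classify_text(text: str, triggers: dict = TRIGGERS) -> list:
    """Return the event types whose trigger words occur in the text.

    An empty list means the text contains no known trigger and is treated as invalid.
    """
    lowered = text.lower()
    return [etype for etype, words in triggers.items()
            if any(word in lowered for word in words)]


def route_corpus(texts: list) -> dict:
    """Group texts by event type; a multi-event text is stored once per matching type."""
    buckets = {etype: [] for etype in TRIGGERS}
    for text in texts:
        for etype in classify_text(text):
            buckets[etype].append(text)
    return buckets
```

A text that matches triggers of two event types ends up in both buckets, which mirrors the repeated storage of multi-event texts in the corpus files of different event types.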
Before being uploaded, the corpus is also split into sentences. In the experiment of the invention, sentences are split at full-width terminators such as "。", "?" and "!"; half-width symbols in the text are converted to full-width symbols, and the many spaces and other blank characters in the text are removed. When a sentence delimited by terminators is still too long, it is further split according to the maximum sentence length, which can be adjusted dynamically to suit the model used in the downstream event extraction task.
After preprocessing, the corpus can be stored in a database (i.e. uploaded). Because the texts in the corpus are unstructured data with no fixed format, each text is converted to JSON: a JSON object is created that holds the preprocessed text content together with related information such as the text title and text time, forming a complete JSON text object. The database used in the experiment of the invention is MySQL; the processed JSON object is stored as a variable-length string attribute and, combined with the other text attributes, forms a tuple in the database table.
The text table designed in the experiment of the invention is shown in table 2. "id" is the primary key. "text_num" is the text number; unlike the primary key, this field is not unique (for example, the "text_num" values in the first and fourth rows of the example text table are identical). The field exists for texts that contain multiple events and is used by the crowdsourcing task allocation and data set export modules described later. "type_name" is the event type corresponding to the text; one text may correspond to several event types. "text_ner_status" and "text_event_status" represent the annotation state of the text; they are needed under the single annotation mechanism, and their usage is described under mechanism one in step 4. "text_content" is the text information, stored as the JSON object described above; it includes text attributes such as the text content, text time, text title and text number. For example, for the text whose "id" is 1 in table 2, the "text_content" field is {"sentences": ["On 30/12/2014, Shenyang A company announced that it would acquire Shenyang B company for RMB 550 million.", "The project has constituted a major asset reorganization, and the related matters are still moving forward!"], "id": "1-1-1", "time": "2021-08-30 14:14:00", "title": "Shenyang A company announces asset reorganization"}. "text_event_times" is the number of times the text has been event-annotated; its usage is described under mechanisms two and three in step 4.
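The sentence-splitting step can be sketched in Python as follows. This is a sketch under stated assumptions: the punctuation map covers only a small subset of symbols, the default `max_len` of 128 is arbitrary, and all whitespace is dropped, which suits Chinese text but not languages that separate words with spaces.

```python
import re

# Half-width to full-width punctuation (a small, assumed subset).
# The half-width "." is left alone here because it also appears in numbers such as "5.5".
HALF_TO_FULL = str.maketrans({"?": "?", "!": "!", ",": ",", ";": ";"})
TERMINATORS = "。?!"


def split_sentences(text: str, max_len: int = 128) -> list:
    """Normalize punctuation, drop whitespace, split on terminators, then cap sentence length."""
    text = re.sub(r"\s+", "", text.translate(HALF_TO_FULL))
    # Keep each terminator attached to the sentence it ends.
    pieces = re.findall(rf"[^{TERMINATORS}]+[{TERMINATORS}]*", text)
    sentences = []
    for piece in pieces:
        # Over-long sentences are cut into chunks of at most max_len characters.
        sentences.extend(piece[i:i + max_len] for i in range(0, len(piece), max_len))
    return sentences
```

The hard cut at `max_len` characters is the simplest possible policy; a real system might prefer to cut at commas or clause boundaries.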
Table 2 Text table
[Table 2 is an image in the original publication (BDA0003438461320000031) and is not reproduced here.]
When an annotator views a text, the system first reads a tuple from the database text table, takes the "text_content" field to obtain the complete JSON text object, and then parses the JSON object to obtain the information it contains. The experiment of the invention was developed in Java; the JSON object is parsed with Alibaba's fastjson toolkit, and a corresponding text class is designed to hold the parsed text information (such as the text title, text time, text content and text number).
It should be noted that the corpus is classified by event type during preprocessing, and a text containing multiple events must be stored repeatedly in the corpus files of the different event types, so the database text table also contains repeated text information (only the primary key IDs differ). This is because annotation tasks are crowdsourced by event type. A multi-event text containing different event types is therefore annotated repeatedly by different annotators/accounts (one account is responsible for only one event type) so that the event information of each event type is labeled, while a multi-event text containing several events of the same type is handled by the same-type multi-event annotation mechanism. In the experiment of the invention, the corpus processing flow is shown schematically in fig. 4, and the upload format of the corpus file for the financial event type "asset reorganization" is shown schematically in fig. 5.
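The storage and read-back path described above (a JSON text object serialized into the "text_content" column, then parsed again for display) might look like the following Python sketch. The original implementation uses Java, fastjson and MySQL; here the standard `json` module stands in for fastjson, a plain dictionary stands in for a database row, and the function names are hypothetical.

```python
import json


def build_text_row(row_id, text_num, type_name, sentences, title, time):
    """Assemble one text-table row; the JSON object is serialized into text_content."""
    text_object = {"sentences": sentences, "id": text_num, "time": time, "title": title}
    return {
        "id": row_id,
        "text_num": text_num,
        "type_name": type_name,
        "text_ner_status": 0,     # 0 = entity annotation task idle
        "text_event_status": 0,   # 0 = event annotation task idle
        "text_event_times": 0,    # number of completed event annotations
        # ensure_ascii=False keeps Chinese characters readable in the stored string.
        "text_content": json.dumps(text_object, ensure_ascii=False),
    }


def read_text_object(row):
    """Parse the stored JSON string back into a dictionary for display."""
    return json.loads(row["text_content"])
```

Storing the text as one serialized string keeps the table schema fixed even when texts carry different optional attributes, at the cost of not being able to query inside the JSON with ordinary column predicates.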
Step 3, constructing a complete annotation mechanism: entity annotation is performed first, followed by event annotation, and the event arguments labeled in the event annotation task are entities already labeled in the entity annotation task.
In the experiment of the invention, entity annotation and event annotation share one annotation table, "db_mark". Each tuple in the table corresponds to one entity and stores the information of that labeled entity. The table mainly comprises the attributes "entity content", "number of the text containing the entity", "array index of the sentence containing the entity" (sentence splitting turns each text into an array of sentences), "start position of the entity in the sentence", "end position of the entity in the sentence", "entity type", "event type of the text", "event argument role" and "same-type multi-event index". The "same-type multi-event index" indicates which event of a given event type in the text (the first, the second, and so on) the labeled event argument belongs to. Entity annotation fills in most of the information of an event argument, and the event annotation task fills in the remaining fields.
Step 3-1, entity annotation: extracting entities from the text and matching (labeling) them with the entity types shown in table 1. In the experiment of the invention, the information stored for one labeled entity comprises "entity content", "number of the text containing the entity", "event type of the text", "array index of the sentence containing the entity", "start position of the entity in the sentence", "end position of the entity in the sentence" and "entity type". A schematic diagram of the entity annotation information is shown in fig. 6.
Step 3-2, event annotation: extracting event information from the text. An event is a set of relationships formed by several entities around a certain theme; put simply, a sentence of the form "who did what" (subject, predicate, object) describes an event. The event information library is shown in fig. 3. Because event information consists of event arguments that fill specific event argument roles, the core of event annotation is to label event arguments in the text and match them with the corresponding roles. Note that only entities labeled in the entity annotation task can be labeled as event arguments. In the experiment of the invention, all character spans other than labeled entities are locked during event annotation, which improves annotation efficiency. The only information stored when an event argument is labeled is the "event argument role" and the "same-type multi-event index"; the remaining information was already filled in during entity annotation and is not transmitted or stored again. A schematic diagram of the event annotation information is shown in fig. 7.
As for multiple events, the annotation process only handles multiple events of the same type, because multiple events of different types are handled by the algorithm logic at data set export. In the experiment of the invention, a "continue annotating" button and a "submit" button are provided so that "same-type multi-event" texts and "single-event" texts can be submitted differently. The annotator labels the events one at a time. After finishing one event of the corresponding type, the annotator clicks "continue annotating": the annotation information is saved, the "same-type multi-event index" value in the table is incremented by 1 to record which event the subsequently labeled arguments belong to, and the same text is shown again so that the remaining events can be labeled. After the last event is labeled, the annotator clicks "submit" to submit the results and move to the next text. A schematic diagram of same-type multi-event submission is shown in fig. 8.
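The two-stage filling of one annotation record can be sketched in Python as below. The class `MarkRecord`, its field names and the two helper functions are illustrative stand-ins for the "db_mark" table attributes, and the end position is treated as exclusive, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkRecord:
    """One row of the shared annotation table (field names are illustrative)."""
    content: str
    text_num: str
    sentence_index: int
    start: int
    end: int
    entity_type: str
    event_type: str
    argument_role: Optional[str] = None          # filled by event annotation
    same_type_event_index: Optional[int] = None  # filled by event annotation


def label_entity(sentences, text_num, event_type, sentence_index, start, end, entity_type):
    """Entity annotation: record a span (end exclusive, by assumption) and its entity type."""
    content = sentences[sentence_index][start:end]
    return MarkRecord(content, text_num, sentence_index, start, end, entity_type, event_type)


def label_event_argument(record, role, event_index):
    """Event annotation: only the role and the same-type event index are added."""
    record.argument_role = role
    record.same_type_event_index = event_index
    return record
```

Because event annotation only updates two fields of an existing record, an entity that never receives a role simply keeps `argument_role` empty and is skipped at export time.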
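The difference between the two buttons can be sketched as a small Python state holder. The class and method names are hypothetical, and starting the index at 1 for the first event is an assumption; the source only states that the index is incremented by 1 on each continuation.

```python
class SameTypeEventSession:
    """Tracks which same-type event the annotator is currently labeling in one text."""

    def __init__(self):
        self.event_index = 1      # assumed to start at 1 for the first event
        self.saved = []           # (event_index, role, argument) tuples
        self.finished = False

    def _save(self, labeled):
        self.saved.extend((self.event_index, role, arg) for role, arg in labeled.items())

    def continue_annotating(self, labeled):
        """Save the current event and stay on the same text for the next event."""
        self._save(labeled)
        self.event_index += 1

    def submit(self, labeled):
        """Save the last event and release the text so the next one can be shown."""
        self._save(labeled)
        self.finished = True
```

For a text with two asset reorganization events, the annotator would call `continue_annotating` once and `submit` once, producing arguments tagged with indices 1 and 2.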
Step 4, constructing a complete crowdsourcing task allocation mechanism and crowdsourcing result aggregation mechanism: for annotators with different levels of expertise, different allocation mechanisms (single annotation and multiple annotation) and aggregation mechanisms (a majority voting algorithm and a group cognition aggregation model algorithm) are used to control annotation quality.
Using different task allocation and result aggregation mechanisms for annotators with different levels of expertise balances annotation efficiency and annotation quality as far as possible. The task allocation and result aggregation mechanisms fall into the following three types:
Mechanism one: "single annotation". This mechanism is suited to experts in the relevant field and other highly qualified personnel. Each text is assigned at random to just one of the annotators responsible for its event type. Note that the event type handled by each annotator account is fixed, i.e. event types and accounts are in a one-to-many relationship; if an annotator needs to annotate corpora of several event types, several accounts are allocated to that annotator. The advantage of this design is that an annotator can work through the corpus of one event type in batches, which helps concentration and improves annotation efficiency within a given period. Because each text is annotated only once, no crowdsourcing aggregation algorithm is needed, and the data set is later exported directly from the annotation table.
In the experiment of the invention, "single annotation" in mechanism one means that each text can be annotated only once. Two fields, "entity annotation state" and "event annotation state", are added to the text table to record the state of each text, which prevents repeated annotation and gives mutually exclusive access. The "entity annotation state" values "0", "1", "2" and "3" represent "entity annotation task idle", "entity annotation in progress", "entity annotation completed" and "invalid text discarded" respectively. When fetching the next text for the entity annotation task, the system therefore only queries texts whose "entity annotation state" is "0". State "1" acts as a mutual exclusion lock on the text: while one annotator is performing entity annotation on a text, other annotators cannot access it. State "2" means that the text has been entity-annotated and event annotation can proceed. State "3" means that the text contains no entities (i.e. it is an invalid text) and has been discarded by the annotator. Similarly, the "event annotation state" values "0", "1", "2" and "3" represent "event annotation task idle", "event annotation in progress", "event annotation completed" and "invalid text discarded". Note that when the event annotation task queries the next idle text, it must check not only that the "event annotation state" is "0" but also that the "entity annotation state" is "2"; that is, a text fetched for event annotation must already have completed entity annotation.
In the experiment of the invention, because mechanism one does not use repeated crowdsourced annotation, no crowdsourcing aggregation algorithm is needed, and data set export only has to traverse the current annotation table.
Mechanism two: "multiple annotation" + "majority vote crowdsourcing aggregation algorithm". This mechanism is suited to annotators who have some knowledge of the relevant field. Unlike mechanism one, because the annotators are not authoritative enough, the same text must be annotated repeatedly by several people, and the best annotation result is obtained with the majority vote crowdsourcing aggregation algorithm. The core idea of the algorithm is that, among the annotation results for a given event argument role, the result that accounts for more than half of all annotation results is selected as the final annotation for that role.
In the experiment of the invention, "multiple annotation" in mechanism two means that each text is given to several annotators, i.e. several accounts are allocated to each event type, and each account annotates all texts of its event type in turn (so each text is annotated repeatedly) without affecting the others. An "annotation count" field is added to the text table to record how many times each text has been annotated; for example, a value of "3" means the text has been annotated three times. A "maximum annotation count" field is also set in the times table and can be modified dynamically by an administrator in the back end. When fetching the next text for an annotator, the system queries, within the corresponding event type, texts whose "annotation count" is smaller than the "maximum annotation count". In addition, to prevent one text from being annotated several times by the same user, a "current event-annotated text ID" field is added to the user table; texts are fetched one at a time in primary key order, and this field is updated automatically after each annotation is completed.
In the experiment of the invention, the "majority vote crowdsourcing aggregation algorithm" in mechanism two simply takes the result agreed on by most annotators as the aggregated result and as the estimate of the gold-standard annotation. The algorithm is equivalent to finding the element that occupies more than half of the positions of an array. In the experiment, an array is created for each event argument role under the event type of the text; its elements are the event arguments labeled by the different annotators for that role, which may or may not agree. The system applies the majority vote algorithm to each array to select the final event argument for the role. Note that if no annotator labeled an event argument for a given role, the final result for that role is also empty. The aggregated results are stored in a new annotation table, annotation table II, so data set export under this mechanism only has to traverse annotation table II.
Mechanism three: "multiple annotation" + "group cognition crowdsourcing aggregation algorithm". This mechanism is suited to annotators from outside the relevant field. Unlike mechanisms one and two, because the annotators have no knowledge of the relevant field, or have only learned a little of it for the annotation task, a more rigorous crowdsourcing aggregation algorithm is needed to aggregate the annotation results. The group cognition crowdsourcing aggregation algorithm (used in part) computes credibility mainly through a social trust evaluation mechanism. This mechanism requires dedicated evaluation users in the system to evaluate the labeled event arguments. The system assigns each evaluation user a trust value UT according to that user's level of expertise, and for each labeled event argument an evaluation user can cast an approving or an opposing vote. The system creates an array for each event argument role in the event to store all labeled event arguments, and finally selects the event argument with the largest social credibility F(x) in the array as the final annotation for the role. The social credibility F(x) is computed as follows:
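The state-based mutual exclusion described above can be sketched with Python's built-in `sqlite3` module standing in for MySQL. The table name `db_text`, the function name and the use of a conditional UPDATE to take the lock are assumptions of this sketch; the column names follow the text table fields described earlier.

```python
import sqlite3

IDLE, IN_PROGRESS, DONE, DISCARDED = 0, 1, 2, 3


def claim_next_event_text(conn, type_name):
    """Claim one idle text for event annotation; return its id, or None if nothing is free."""
    row = conn.execute(
        "SELECT id FROM db_text WHERE type_name = ? AND text_event_status = ? "
        "AND text_ner_status = ? ORDER BY id LIMIT 1",
        (type_name, IDLE, DONE),
    ).fetchone()
    if row is None:
        return None
    # The status check is repeated in the UPDATE so that two annotators
    # cannot both move the same row from idle to in-progress.
    cur = conn.execute(
        "UPDATE db_text SET text_event_status = ? WHERE id = ? AND text_event_status = ?",
        (IN_PROGRESS, row[0], IDLE),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```

The query combines both conditions from the text: the event annotation state must be idle and the entity annotation state must be completed. A caller that receives `None` after a lost race can simply call the function again.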
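Text selection under multiple annotation can be sketched in Python as follows. The rows are plain dictionaries, and the key names `text_event_times` and `current_event_text_id` as well as both function names are illustrative assumptions; the sketch assumes that a user's pointer starts at 0 before any text has been annotated.

```python
def next_text_for_user(texts, max_times, last_done_id):
    """Pick the next text an account should label under the multiple-annotation mechanism.

    texts: rows of one event type, each a dict with "id" and "text_event_times".
    last_done_id: the user's "current event-annotated text ID" (0 if none yet).
    """
    for row in sorted(texts, key=lambda r: r["id"]):
        if row["id"] > last_done_id and row["text_event_times"] < max_times:
            return row
    return None


def complete_annotation(row, user):
    """After labeling: bump the text's count and advance the user's pointer."""
    row["text_event_times"] += 1
    user["current_event_text_id"] = row["id"]
```

Because each account walks the texts in primary key order and only ever moves forward, no account can label the same text twice, and texts that already reached the maximum count are skipped.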
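The per-role majority vote can be sketched in Python as below. The function names are hypothetical, and `None` is used to represent "this annotator did not label the role", which is an assumption about how missing labels are encoded.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by more than half of the annotators, else None.

    labels: one entry per annotator for a single event argument role;
    None means the annotator did not label that role.
    """
    if not labels:
        return None
    winner, count = Counter(labels).most_common(1)[0]
    return winner if count * 2 > len(labels) else None


def aggregate_event(annotations, roles):
    """annotations: list of {role: argument} dicts, one per annotator."""
    return {role: majority_vote([a.get(role) for a in annotations]) for role in roles}
```

When no label reaches a strict majority, this sketch returns an empty result for the role; how ties are resolved in the original system is not stated in the text.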
[The formula is an image in the original publication (BDA0003438461320000061) and is not reproduced here.]
wherein F(x) is the social credibility of annotation x, n is the number of voting users, UT_i (i = 1, …, n) is the trust value (weight) of the i-th voting user, and K is that user's opinion (1 for an approving vote, -1 for an opposing vote).
In the experiment of the invention, "multiple annotation" in mechanism three is identical to "multiple annotation" in mechanism two and is not described again.
In the experiment of the invention, the group cognition crowdsourcing aggregation algorithm (used in part) in mechanism three creates an array for each event argument role of each event in a text; the elements of the array are all event arguments labeled by the different annotators for that role. After the evaluation users have voted on every event argument in the array, F(x) is computed for each event argument, and the one with the largest F(x) is selected as the final annotation, i.e. the annotation regarded as socially recognized. The aggregated results are stored in a new annotation table, annotation table III, so data set export under this mechanism only has to traverse annotation table III.
Note that the entity annotation task does not require particular domain knowledge, so it can be completed efficiently with the single annotation mechanism alone, whereas the event annotation task requires different crowdsourcing task allocation and result aggregation mechanisms depending on the annotators' expertise in the relevant field. A schematic diagram of the crowdsourcing mechanisms is shown in fig. 9.
Step 5, constructing a complete data set export mechanism that dynamically builds the required event extraction data set according to the data set format required by the downstream event extraction model.
After the annotators have completed all annotation, the required event extraction data set is assembled dynamically according to the format required by the downstream event extraction model. In the experiment of the invention, export traverses the data in the annotation table, queries the event argument annotations that belong to each text, text by text, and generates the data set in JSON format using the JSONArray and JSONObject classes of Alibaba's fastjson toolkit. As described above, the three crowdsourcing mechanisms traverse different annotation tables: mechanism one uses the original annotation table, mechanism two uses the aggregated annotation table II, and mechanism three uses the aggregated annotation table III.
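The selection by social credibility can be sketched in Python as follows. Since the formula itself is only available as an image, this sketch assumes that F(x) is the trust-weighted sum of the votes, i.e. the sum over all voting users of UT_i multiplied by that user's opinion K; this form is a guess consistent with the symbol definitions, not the confirmed formula, and the function names are hypothetical.

```python
def social_credibility(votes):
    """votes: list of (UT_i, K_i) pairs with K_i in {+1, -1}.

    Returns the assumed trust-weighted sum of votes.
    """
    return sum(trust * opinion for trust, opinion in votes)


def pick_argument(candidates):
    """candidates: {argument_text: [(UT_i, K_i), ...]}.

    Returns the candidate with the largest assumed F(x), or None if there are no candidates.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda arg: social_credibility(candidates[arg]))
```

Under this assumed form, a single approving vote from a highly trusted evaluator can outweigh several opposing votes from evaluators with low trust values.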
Multi-event export is controlled mainly by three fields of the annotation table: "number of the text where the entity is located", "event type to which the text belongs" and "same-type multi-event index". When event arguments are exported, the multi-event information of each text is queried in turn through the "number of the text where the entity is located" field in the history table. Combining the "event type to which the text belongs" and "same-type multi-event index" fields shows how many events (including multiple events of the same type and multiple events of different types) each text contains. Each event is then traversed in turn, and its event arguments are filled in by traversing all event argument roles corresponding to the current event type. That is, entries in the annotation table or aggregation table whose event argument role attribute is not empty are filled into the corresponding event information in turn; an empty event argument role attribute means that the entity does not serve as an event argument role in that event.
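By way of non-limiting illustration, the grouping logic described above may be sketched as follows. The sketch uses Python's standard json module rather than fastjson, and the field names (text_id, event_type, event_index, role, entity) as well as the output keys are hypothetical stand-ins for the three annotation-table fields named in the text.

```python
import json


def export_dataset(texts, rows):
    """Build a JSON data set with one record per text.

    texts: {text_id: text content}
    rows:  annotation-table rows as dicts (field names are assumptions)
    """
    dataset = []
    for text_id, content in texts.items():
        events = {}  # (event_type, event_index) -> event record
        for row in rows:
            if row["text_id"] != text_id:
                continue
            if not row["role"]:
                continue  # empty role: the entity is not an argument of an event
            key = (row["event_type"], row["event_index"])
            event = events.setdefault(
                key, {"event_type": row["event_type"], "arguments": []}
            )
            event["arguments"].append(
                {"role": row["role"], "argument": row["entity"]}
            )
        dataset.append(
            {"id": text_id, "text": content, "event_list": list(events.values())}
        )
    return json.dumps(dataset, ensure_ascii=False, indent=2)


texts = {1: "In 2021 Company A reorganized assets with Company B and with Company C."}
rows = [
    {"text_id": 1, "entity": "Company A", "event_type": "asset reorganization",
     "event_index": 1, "role": "initiator"},
    {"text_id": 1, "entity": "Company B", "event_type": "asset reorganization",
     "event_index": 1, "role": "counterparty"},
    {"text_id": 1, "entity": "Company A", "event_type": "asset reorganization",
     "event_index": 2, "role": "initiator"},
    {"text_id": 1, "entity": "Company C", "event_type": "asset reorganization",
     "event_index": 2, "role": "counterparty"},
    # Entity that was labeled but plays no role in any event.
    {"text_id": 1, "entity": "2021", "event_type": None,
     "event_index": None, "role": ""},
]
exported = export_dataset(texts, rows)
print(exported)
```

Keying each event on the pair (event type, same-type index) keeps two events of the same type in one text apart, which is the purpose of the multi-event index field.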
To further improve the quality of the data set, texts with unqualified labeling quality still need to be filtered out after aggregation is finished, for example texts in which too few event arguments are matched or in which core event argument roles are unmatched. In the experiment of the invention, to improve efficiency, this step is converted into a submission limit at event annotation time. Submission is restricted by two conditions: whether the number of labeled event arguments reaches the submission threshold of the corresponding event type, and whether all key argument roles of the corresponding event type are labeled. This replaces the filtering step, saves query and export time, and greatly improves export efficiency.
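By way of non-limiting illustration, the two submission conditions may be checked as sketched below. The rule table, role names and threshold value are hypothetical; the text only states that each event type has a submission threshold and a set of key argument roles.

```python
# Hypothetical per-event-type rules: submission threshold and key argument roles.
EVENT_RULES = {
    "asset reorganization": {
        "threshold": 3,
        "key_roles": {"initiator", "counterparty"},
    },
}


def can_submit(event_type, labeled, rules=EVENT_RULES):
    """Return True if an event annotation may be submitted.

    labeled: {argument role: labeled event argument}; empty values count
    as unlabeled.
    """
    rule = rules[event_type]
    filled = {role for role, arg in labeled.items() if arg}
    enough_arguments = len(filled) >= rule["threshold"]
    key_roles_present = rule["key_roles"] <= filled  # subset test
    return enough_arguments and key_roles_present


complete = {"initiator": "Company A", "counterparty": "Company B", "date": "2021"}
too_few = {"initiator": "Company A", "counterparty": "Company B"}
missing_key = {"initiator": "Company A", "date": "2021", "amount": "100 million"}

print(can_submit("asset reorganization", complete))
print(can_submit("asset reorganization", too_few))
print(can_submit("asset reorganization", missing_key))
```

Rejecting an incomplete annotation at submission time means the export step can trust every stored event and does not need a separate filtering pass.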
The invention has the following beneficial effects: by adopting the above scheme, the invention can effectively apply various crowdsourcing techniques to the annotation process of annotators with different degrees of expertise, thereby making effective use of the annotators' background knowledge and maximizing the benefit of crowdsourcing. Meanwhile, the invention constructs a complete entity library and event information library (namely an event type library and its event argument role library), an annotation mechanism and an access mechanism, and provides an effective method for constructing an event annotation system.
Drawings
FIG. 1 is a flow chart of the construction of the event annotation system according to the present invention.
FIG. 2 is a flow chart of the operation of the event annotation system of the present invention.
FIG. 3 is a diagram of a financial event information repository.
FIG. 4 is a corpus processing flow diagram.
FIG. 5 is a diagram of an "asset reorganization" corpus upload format.
FIG. 6 is a schematic diagram of entity tagging information.
FIG. 7 is a schematic diagram of event annotation information.
FIG. 8 is a schematic diagram of a same-type multiple event submission case.
FIG. 9 is a schematic diagram of a crowdsourcing mechanism.
FIG. 10 is a schematic diagram of text table information.
FIG. 11 is a schematic diagram of an entity annotation page.
FIG. 12 is a diagram illustrating entity tag table information.
FIG. 13 is a schematic diagram of an event annotation page.
FIG. 14 is a schematic diagram of an event annotation page.
FIG. 15 is a schematic diagram of an event annotation page.
FIG. 16 is a schematic diagram of event annotation table information.
FIG. 17 is a single event text dataset derivation diagram.
FIG. 18 is a diagram of multiple event text dataset derivation.
Detailed Description
The invention is further described with reference to the accompanying drawings. The invention constructs a multi-level annotator-oriented event annotation system based on crowdsourcing technology. First, domain data is collected and a complete entity library and event information library are constructed. Then, the collected data is preprocessed, and a complete access mechanism for the corpus to be annotated is constructed. Next, a complete annotation mechanism is constructed, comprising an entity annotation mechanism and an event annotation mechanism. After that, a complete crowdsourcing task allocation mechanism and crowdsourcing result aggregation mechanism are constructed, that is, the annotator information is completed and the annotation quality is controlled by adopting different allocation and aggregation mechanisms for annotators with different degrees of expertise. Finally, a complete data set export mechanism is constructed, and the required event extraction data set is dynamically integrated according to the data set format required by the downstream event extraction model. The system construction flow chart is shown in fig. 1, and the system operation flow chart is shown in fig. 2.
To test the feasibility of the proposed method in its modules for large-scale corpus storage, visual querying, multi-level crowdsourcing task allocation, event information labeling and data set export, as well as the quality of the generated data set, a crowdsourcing-based, multi-level annotator-oriented event annotation system for the financial field was fully implemented on financial-field data. To maximize overall efficiency and data set quality, mechanism one was adopted in the crowdsourcing module in view of the annotators' degree of expertise, and a high-quality financial-field event extraction data set was finally generated for downstream tasks.
An example of the invention is given below with reference to the accompanying drawings:
(1) Construction of the entity library and event information library
Before testing on the financial-field data, a large amount of preliminary research was carried out, and a professional entity library and event information library were constructed. The entity library adopts a seven-category classification scheme, and the event information library is a self-defined event information library (comprising fourteen event types and their corresponding argument role libraries).
(2) Preprocessing of a corpus
Text classification and screening
We collected a large amount of document-level financial text from the web and defined trigger words for the fourteen financial event types; based on these trigger words, the collected text is classified and most of the invalid text is filtered out. Although this method may lose some valid corpus, it improves the quality of the remaining corpus.
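By way of non-limiting illustration, trigger-word classification and screening may be sketched as follows. The event type names and trigger words are hypothetical examples, and simple substring matching is an assumption; the text does not specify how trigger words are matched.

```python
# Hypothetical trigger words for two of the event types.
TRIGGERS = {
    "asset reorganization": ["reorganization", "restructuring"],
    "shareholder holding change": ["increased its stake", "reduced its stake"],
}


def classify_texts(texts, triggers):
    """Group texts by event type; texts matching no trigger word are dropped."""
    classified = {event_type: [] for event_type in triggers}
    for text in texts:
        for event_type, words in triggers.items():
            if any(word in text for word in words):
                classified[event_type].append(text)
    return classified


texts = [
    "Company A announced an asset restructuring plan.",
    "A major shareholder reduced its stake in Company B.",
    "The weather in the city was sunny today.",  # no trigger word: filtered out
]
classified = classify_texts(texts, TRIGGERS)
print(classified)
```

A text that mentions an event without using any listed trigger word is dropped by this scheme, which is the loss of valid corpus noted above.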
Text cleaning
The collected texts mostly exist as unstructured data and contain symbols irrelevant to annotation, such as spaces and half-width separators, so the texts obtained by screening are cleaned.
Sentence segmentation
To meet the requirements of the downstream event extraction model, each text in the corpus is split into sentences. Specifically, the text is first split into an array of sentences at Chinese full-width sentence terminators; any resulting sentence that is still longer than a predefined maximum sentence length is then further split according to predefined rules.
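By way of non-limiting illustration, the two-stage splitting may be sketched as follows. The set of full-width terminators and the secondary rule (split at full-width commas, then cut at the maximum length) are assumptions, since the text does not list the terminators or the predefined rules.

```python
import re

# Assumed full-width sentence terminators: 。！？；
SENTENCE_RE = re.compile(r"[^。！？；]*[。！？；]|[^。！？；]+")
# Assumed secondary rule: split over-long sentences at full-width commas.
CLAUSE_RE = re.compile(r"[^，]*，|[^，]+")


def split_long(sentence, max_len):
    """Split one over-long sentence into chunks of at most max_len characters."""
    if len(sentence) <= max_len:
        return [sentence]
    chunks, current = [], ""
    for clause in CLAUSE_RE.findall(sentence):
        if current and len(current) + len(clause) > max_len:
            chunks.append(current)
            current = ""
        current += clause
    if current:
        chunks.append(current)
    # Last resort: hard cut any chunk that is still too long.
    return [c[i:i + max_len] for c in chunks for i in range(0, len(c), max_len)]


def split_sentences(text, max_len):
    """Split text at terminators, then enforce the maximum sentence length."""
    result = []
    for sentence in SENTENCE_RE.findall(text):
        result.extend(split_long(sentence, max_len))
    return result


text = "甲公司收购乙公司。交易金额为一亿元，已获批准！"
sentences = split_sentences(text, max_len=10)
print(sentences)
```

Both patterns keep every character of the input, so joining the output reproduces the original text and character offsets of labeled entities can still be mapped back.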
(3) Corpus uploading
After corpus preprocessing is completed, the corpus of each event type is uploaded to the annotation system by category, and the JSON objects are dumped into the database. A schematic diagram of the text table information is shown in fig. 10.
(4) Crowdsourcing task allocation
During the test, we created 28 annotator accounts; on average, every two accounts are responsible for annotating one event type. Since the experiment uses mechanism one (single annotation), no text needs to be annotated repeatedly. It is noted that if another crowdsourcing mechanism is adopted, the information in the annotation table needs to be aggregated into a new aggregation table after event annotation is completed, and the data set is finally exported from the aggregation table. The new aggregation table has the same structure as the old table; the aggregation algorithm selects the best labeling result and dumps it into the new table.
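By way of non-limiting illustration, aggregation under the multiple-labeling mechanism with Majority Vote may be sketched as follows. The data layout (one role-to-argument mapping per annotator) and the tie-breaking rule (the label seen first wins) are assumptions not stated in the text.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def aggregate_event(annotations):
    """Aggregate several annotators' labels for one event, role by role.

    annotations: list of {argument role: event argument}, one per annotator.
    """
    by_role = defaultdict(list)
    for labels in annotations:
        for role, argument in labels.items():
            by_role[role].append(argument)
    return {role: majority_vote(args) for role, args in by_role.items()}


# Three annotators label the same event of the same text.
annotations = [
    {"initiator": "Company A", "counterparty": "Company B"},
    {"initiator": "Company A", "counterparty": "Company C"},
    {"initiator": "Company D", "counterparty": "Company B"},
]
aggregated = aggregate_event(annotations)
print(aggregated)
```

The aggregated mapping corresponds to the rows that would be dumped into the new aggregation table for this event.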
(5) Entity tagging
Entity annotation is not divided by event type, so during annotation any of the 28 annotator accounts may be freely allocated any text that has not yet undergone entity annotation. The task of entity annotation is to mark entity fields in the text and match them with the self-defined entity types; after all entities in the text are marked, the annotator clicks submit to store the annotation result. Fig. 11 shows a schematic diagram of an entity annotation page, and fig. 12 shows a schematic diagram of the annotation table information after entity annotation is completed.
(6) Event annotation
In the experiment of the invention, the event annotation task fills in event information using the entities marked in the entity annotation task. Unlike entity annotation, each annotator account is responsible for event annotation of one event type, and the two annotators in charge of an event type are randomly allocated only texts of that event type. Single event annotation means selecting, from the entities marked in the text, the event arguments (entities) that can serve in the corresponding event argument roles. If a text contains multiple events of the same type, each event must be labeled in turn, using the "continuation" button to submit the first n-1 events and the "submit" button to submit the nth (last) event. Schematic diagrams of the event annotation pages are shown in fig. 13, 14 and 15. The example text contains three events: the event information annotated in fig. 13 and 14 belongs to multiple events of the "asset reorganization" type, and the event information annotated in fig. 15 belongs to a single event of the "shareholders holding" type. Fig. 16 shows a schematic diagram of the annotation table information after event annotation is completed.
(7) Data set derivation
After the annotators complete all annotation tasks, the data set can be exported in the back-end system. JSON generation classes such as JSONArray and JSONObject provided by Alibaba's fastjson toolkit are used as technical support, and the data in the annotation table is dynamically traversed according to the data set format required downstream. A single-event text data set export diagram is shown in fig. 17, and a multi-event text data set export diagram is shown in fig. 18.

Claims (8)

1. The construction method of the event annotation system based on the crowdsourcing technology and oriented to the multi-level annotators is characterized by comprising the following steps of:
step 1, collecting field data and constructing a complete entity library and an event information library, wherein the entity library comprises entity types common in the named entity recognition task, such as person names, place names and organization names, and the event information library comprises a plurality of financial event types, a set of event argument roles corresponding to each event type and its accessory information;
step 2, preprocessing the corpus and constructing a complete corpus access mechanism to be annotated: preprocessing a corpus before uploading the corpus, filtering invalid texts and performing sentence division processing on the corpus;
step 3, constructing a complete labeling mechanism, namely, firstly carrying out entity labeling and then carrying out event labeling, wherein the labeled event argument in the event labeling task is the labeled entity in the entity labeling task;
step 4, constructing a complete crowdsourcing task distribution mechanism and a crowdsourcing result aggregation mechanism, and controlling the annotation quality by adopting different distribution mechanisms and aggregation mechanisms for annotators with different professional degrees;
step 5, constructing a complete data set export mechanism, and dynamically integrating and constructing the required event extraction data set according to the data set format required by the downstream event extraction model.
2. The method according to claim 1, wherein the entity types in the entity library comprise seven types in total: person name, place name, organization name, time, date, currency and percentage.
3. The method according to claim 1, wherein the event information base comprises a plurality of financial event types, a set of event argument roles corresponding to each event type and associated information: the event type is associated with a trigger word, an event argument role and a submission threshold value; the event type and the event argument role are in one-to-many relationship, the event type and the trigger word are in one-to-many relationship, and the event type and the submission threshold value are in one-to-one relationship.
4. The construction method according to claim 1, wherein the step 2 is implemented as follows: classifying the corpus according to event types and filtering out invalid texts which do not contain events, namely classifying the corpus and screening texts by querying self-defined trigger words; then splitting the screened texts into sentences at sentence terminators; and finally processing each text into JSON format by creating a JSON object to which, besides the preprocessed text content, text-related information such as the text title and text time is added, so as to form a complete JSON text object and construct a complete corpus to be annotated.
5. The construction method according to claim 4, wherein the terminator is a full-width terminator, and when a sentence obtained by splitting at terminators is too long, the text is further split according to the value of the "maximum sentence length", wherein the "maximum sentence length" can be dynamically controlled according to the requirements of the model in the event extraction task.
6. The construction method according to claim 1, wherein the construction of the complete annotation mechanism in step 3 includes entity annotation and event annotation, and the entity annotation is performed first and then the event annotation is performed; an entity in entity annotation refers to a word in the text with a specific meaning or strong referential force, and entity annotation means extracting entities from the text and matching and labeling them with entity types; an event in event annotation refers to a relationship formed by a plurality of entities around a certain theme; event annotation means marking event arguments in the text and matching and labeling them with the corresponding event argument roles; wherein an event argument marked in the event annotation task must be an entity marked in the entity annotation task.
7. The construction method according to claim 1, wherein the step 4 is implemented as follows:
the task allocation and result aggregation mechanism is divided into the following three mechanisms:
mechanism one: "single labeling": each text only needs to be randomly assigned to one of the annotators in charge of its event type; because each text is labeled only once, no crowdsourcing aggregation algorithm is needed;
mechanism two: "multiple labeling" + "Majority Vote crowdsourcing aggregation algorithm": the same text is labeled repeatedly by several annotators, and the best labeling result is finally obtained through the Majority Vote crowdsourcing aggregation algorithm;
mechanism three: "multiple labeling" + "group cognition crowdsourcing aggregation algorithm": the group cognition crowdsourcing aggregation algorithm mainly calculates credibility through a social trust evaluation mechanism; the social trust evaluation mechanism requires that some evaluation users be specially set up in the system to evaluate the labeled event argument information; the system establishes a trust evaluation value UT for each evaluation user according to the user's degree of expertise, and the evaluation users can vote for or against each labeled event argument; the system creates an array for each event argument role in the event to store all labeled event arguments, and finally selects the event argument with the largest social credibility F(x) value in the array as the final labeling result of that event argument role; the social credibility F(x) is calculated as follows:
Figure FDA0003438461310000021
wherein F(x) is the social credibility of a label, n is the number of voting users, UT_i (i = 1, …, n) is the credibility, i.e. the weight, of the i-th voting user, and K represents the user's opinion, taking the value 1 for a vote in favor and -1 for a vote against.
8. The construction method according to claim 1, characterized in that the complete data set export mechanism in step 5 is constructed as follows: exporting the data set means traversing the data in the annotation table, dynamically querying, text by text, the event argument annotations matched with each text, and dynamically generating a JSON-format data set through the JSONArray and JSONObject classes provided by Alibaba's fastjson toolkit.
CN202111624377.0A 2021-12-28 Event labeling system construction method for multi-level labeling person based on crowdsourcing technology Active CN114281998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111624377.0A CN114281998B (en) 2021-12-28 Event labeling system construction method for multi-level labeling person based on crowdsourcing technology

Publications (2)

Publication Number Publication Date
CN114281998A true CN114281998A (en) 2022-04-05
CN114281998B CN114281998B (en) 2024-09-24

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561317A (en) * 2023-05-25 2023-08-08 暨南大学 Personality prediction method, labeling method, system and equipment based on text guidance

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325228A (en) * 2018-09-19 2019-02-12 苏州大学 English event trigger word abstracting method and system
CN112000791A (en) * 2020-08-26 2020-11-27 哈电发电设备国家工程研究中心有限公司 Motor fault knowledge extraction system and method
US20210026835A1 (en) * 2019-07-22 2021-01-28 Kpmg Llp System and semi-supervised methodology for performing machine driven analysis and determination of integrity due diligence risk associated with third party entities and associated individuals and stakeholders
US20210065569A1 (en) * 2014-08-28 2021-03-04 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘炜;王旭;张雨嘉;刘宗田;: "一种面向突发事件的文本语料自动标注方法", 中文信息学报, no. 02, 15 March 2017 (2017-03-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant