CN106897364B

CN106897364B - Chinese reference corpus construction method based on events

Info

Publication number: CN106897364B
Application number: CN201710020573.4A
Authority: CN
Inventors: 张亚军; 刘宗田; 李强; 周文; 刘炜
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2017-01-12
Filing date: 2017-01-12
Publication date: 2021-02-23
Anticipated expiration: 2037-01-12
Also published as: CN106897364A

Abstract

The invention relates to an event-based method for constructing a Chinese reference corpus. The method mainly comprises the following steps: (1) selecting the CEC2.0 corpus as the construction basis; (2) determining the targets of reference annotation and the annotation forms; (3) formulating an annotation specification for each specific reference target; (4) preprocessing the CEC2.0 corpus text; (5) automatically annotating event elements and event references; (6) further refining the annotation result through manual annotation; and (7) performing a consistency check to ensure the quality of the corpus annotation. The invention overcomes the shortcomings of existing reference resolution corpora. The method covers all events in the corpus, is based on Chinese syntactic and semantic analysis and so fits the characteristics of Chinese, and checks the consistency of the annotated corpus to ensure annotation quality.

Description

Chinese reference corpus construction method based on events

Technical Field

The invention belongs to the field of natural language processing and relates to an event-based method for constructing a Chinese reference corpus.

Background

Reference is a common linguistic phenomenon that occurs frequently in everyday conversation and text. It makes expression concise and coherent, which benefits both spoken communication and writing. However, heavy use of reference makes it harder for computers to understand language and text. The main task of reference resolution is to identify the different expressions in a discourse that describe the same entity. Most earlier research concentrated on non-event texts and achieved some results. With the rise of the concept of "events", more and more scholars have turned to event-oriented research. An event involves multiple factors and is a knowledge representation unit of larger granularity than a static concept. Treating the event as a basic unit of human knowledge is closer to the human cognitive process and better matches objective reality. Events are therefore receiving attention in more and more fields and are gradually being adopted in knowledge processing fields such as computational linguistics, artificial intelligence, information retrieval, information extraction, and automatic summarization.

Since the 1980s, international evaluation conferences for information extraction, such as the Message Understanding Conference (MUC) and Automatic Content Extraction (ACE), have emerged. These conferences provide unified test corpora and evaluation methods for natural language processing technologies such as information extraction and reference resolution, and they have greatly promoted the development of reference resolution. The test corpora in particular shifted reference resolution systems from methods based on heuristic rules to data-driven methods. For example, the MUC corpus uses SGML annotation, in which <COREF ID="x"> and <COREF ID="x" REF="y"> mark the left boundary of an entity and of a referring expression respectively, and </COREF> marks the right boundary of both. The value x increases strictly monotonically from 1 and gives the sequence number of the entity in the text. REF gives the antecedent of the entity: if y equals the ID of some entity, that entity is the antecedent of the referring expression, and if there is no REF value, the entity has no antecedent. The ACE corpus differs from the MUC corpus. Taking ACE2005 as an example, the text is described through reference chains, and expressions pointing to the same entity are placed in a chain with the same number. Notably, the ACE corpus has included Chinese material since ACE2003 and now comprises a training corpus of 300,000 characters and a test corpus of 50,000 characters, and it has added the evaluation of event mentions. It was the earliest international evaluation corpus resource aimed at Chinese reference resolution and has greatly promoted the development of Chinese reference resolution.
In 2011, CoNLL provided the English OntoNotes 4.0 corpus, which labels coreference relations between event nouns and verbs, and in 2012 it provided the OntoNotes 5.0 corpus in English, Chinese and Arabic for multilingual coreference resolution evaluation. In recent years, research on reference resolution in China has gradually increased, and related corpora have been constructed. Examples include the information-extraction-oriented Chinese cross-text reference corpus built on the ACE2005 Chinese corpus by Zhao Zhi Wen et al., and the entity linking corpus built on the ACE2005 Chinese corpus and Chinese Wikipedia by Shujiagen et al.
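The MUC-style COREF markup described above can be illustrated with a small parser. This is a minimal sketch, not part of the invention: it assumes non-nested COREF tags that carry only the ID and REF attributes, and the sample sentence and the function names `parse_coref` and `antecedent_of` are hypothetical.

```python
import re

# Matches an opening COREF tag, its (non-nested) text, and the closing tag.
# Assumption: only ID and an optional REF attribute are present.
COREF_RE = re.compile(
    r'<COREF\s+ID="(?P<id>\d+)"(?:\s+REF="(?P<ref>\d+)")?\s*>(?P<text>.*?)</COREF>',
    re.DOTALL,
)


def parse_coref(sgml):
    """Return {ID: {"text": mention text, "ref": antecedent ID or None}}."""
    mentions = {}
    for m in COREF_RE.finditer(sgml):
        mentions[m.group("id")] = {"text": m.group("text"), "ref": m.group("ref")}
    return mentions


def antecedent_of(mentions, mention_id):
    """Return the antecedent text of a mention, or None if it has no REF."""
    ref = mentions[mention_id]["ref"]
    return mentions[ref]["text"] if ref is not None else None


# Hypothetical sample text, not taken from the MUC corpus.
sample = ('<COREF ID="1">The city government</COREF> said that '
          '<COREF ID="2" REF="1">it</COREF> would respond.')
mentions = parse_coref(sample)
print(antecedent_of(mentions, "2"))
```

A mention without REF, such as ID 1 here, is treated as having no antecedent, which matches the convention stated in the text.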

However, most of these corpora are not annotated on the basis of events. Although the ACE corpus defines 8 types of events and evaluates event mentions, its treatment of events stays at the discourse level, is not refined to specific sentences, and cannot cover all events, and its evaluation of event mentions does not address coreference resolution. The OntoNotes corpus provides coreference relations for events, but only for English, and is not suitable for analyzing Chinese sentences. Most corpora built in China are likewise based on Chinese corpora such as ACE and do not annotate events as knowledge representation units. The entities involved in the various aspects of an event, called elements, exhibit a large number of reference phenomena, just as static concepts do in traditional text, and events themselves are also referred to. For event-oriented applications these references introduce many uncertainties that need to be processed and studied, which requires the support of a corpus. So far, however, there is no event-oriented Chinese reference corpus.

Disclosure of Invention

To remedy the shortcomings of existing reference resolution corpora, the invention provides an event-based method for constructing a Chinese reference corpus. The method builds an event-oriented Chinese reference corpus on the basis of the CEC2.0 corpus, containing reference annotations for existing elements, default elements and events. The method covers all events in the corpus, is based on Chinese syntactic and semantic analysis and so fits the characteristics of Chinese, and performs a consistency check on the annotated corpus to ensure annotation quality.

The following three definitions introduce the concepts used in the invention:

Definition 1. Antecedent element and anaphoric element: if a reference relationship exists between elements in an event-oriented Chinese text, the element with the more concrete expression is called the antecedent element, and the element with the more abstract expression is called the anaphoric element.

Definition 2. Antecedent event and anaphoric event: if a reference relationship exists between events in an event-oriented Chinese text, the event with the more concrete expression is called the antecedent event, and the event with the more abstract expression is called the anaphoric event. Whether an event is concrete or abstract depends on whether the elements it contains are complete, that is, on whether its object, environment and time elements are default.

Definition 3. Event-oriented reference resolution: the process of finding the relationships between antecedent elements (or antecedent events) and anaphoric elements (or anaphoric events) in an event-oriented text and explicitly giving the antecedent element (or antecedent event) to which each anaphoric element (or anaphoric event) points.

In order to achieve the purpose, the invention adopts the following technical scheme: a Chinese reference corpus construction method based on events is characterized by comprising the following operation steps:

(1) Select the CEC2.0 corpus as the construction basis.

A. Select CEC2.0 as the base corpus for construction.

B. Check the accuracy of the event and event element annotations against the CEC2.0 corpus annotation specification.

C. Supplement the annotations of incompletely annotated texts, and correct texts with wrong annotations.

(2) Determine the targets of reference annotation and the annotation forms.

A. The targets of reference annotation fall into two broad categories: references of event elements (object, environment and time) and references of events. References of event elements are further divided into references of existing elements and references of default elements.

B. All types of reference annotation use XML format to facilitate computer processing. Because event elements are divided into existing elements and default elements, references are annotated in two forms. The first form is attribute annotation, which applies only to references of elements and not to events; its purpose is to annotate default elements in an event. The second form is tag annotation, in which a separate tag is used for the reference annotation; its purpose is to annotate existing elements and events.

(3) Formulate an annotation specification for each specific reference target.

A. Annotation specification for default elements: 1. Object elements are annotated in the attribute sid (subject number) or oid (object number), which identifies a Participant or an Object. 2. Environment elements are annotated in the attribute lid, which identifies a Location. 3. Time elements are annotated in the attribute tid, which identifies a Time.

B. Annotation specification for existing elements: 1. Object elements have two semantic types, which are annotated in the corpus with two tags, Participant and Object. The former relates to persons and the latter to things, so the two do not belong to the same semantic type and cannot refer to each other. 2. For environment elements, not only are elements pointing to the same geographical location annotated, but references of the referential type are annotated as well; that is, the antecedent environment element can be used to make the geographical location of the anaphoric environment element concrete. 3. Time elements are similar to environment elements: in addition to elements pointing to the same time, references of the referential type are also annotated.

C. Event annotation specification: First, compare whether the trigger words of the two events are identical or synonymous. If so, proceed to the next step; otherwise the two events have no reference relationship. Then compare the elements of the two events. Every event must contain a trigger word, but other elements may be default and not appear, so the default elements are first completed from the context before judging whether the two events have a reference relationship. The corresponding elements of two events in a reference relationship must be consistent, that is, point to the same real-world entity.
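The two-stage event comparison in the specification above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the `Event` class, the `SYNONYM_GROUPS` table and the `context` dictionaries are hypothetical, and element values are assumed to be entity identifiers so that equal identifiers mean the same real-world entity.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym groups for trigger words; a real system would
# consult a thesaurus or another synonym resource.
SYNONYM_GROUPS = [{"earthquake", "quake"}, {"announce", "issue"}]


@dataclass
class Event:
    trigger: str
    participant: Optional[str] = None  # None marks a default (omitted) element
    location: Optional[str] = None
    time: Optional[str] = None


def triggers_match(a, b):
    """Stage 1: trigger words must be identical or synonymous."""
    return a == b or any(a in g and b in g for g in SYNONYM_GROUPS)


def complete(event, context):
    """Fill default elements from context (entity ids supplied by the caller)."""
    return Event(
        event.trigger,
        event.participant or context.get("participant"),
        event.location or context.get("location"),
        event.time or context.get("time"),
    )


def events_corefer(e1, ctx1, e2, ctx2):
    """Stage 2: after completion, all elements must point to the same entities."""
    if not triggers_match(e1.trigger, e2.trigger):
        return False
    c1, c2 = complete(e1, ctx1), complete(e2, ctx2)
    return (c1.participant, c1.location, c1.time) == (
        c2.participant, c2.location, c2.time)


first = Event("earthquake", location="l7", time="t2")
second = Event("quake")  # location and time are default elements
print(events_corefer(first, {}, second, {"location": "l7", "time": "t2"}))
```

Completing default elements from context is the hard part in practice; here the context is simply handed in, which is why the text leaves default elements to manual annotation.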

(4) Preprocess the CEC2.0 corpus text.

A. ReportTime (the report time) is not numbered in the CEC2.0 corpus, yet as a time element it can be referred to as the reference time during annotation. Preprocessing therefore adds an identifier attribute tid with the value t0.

B. Object elements in the CEC corpus are annotated at a coarse granularity, so their modifying components need to be annotated at a finer granularity, which allows more abstract object elements to be made concrete.

(5) Automatically annotate event elements and event references.

A. Default elements are not annotated automatically, because they are complex and difficult to annotate and automatic annotation of them has low accuracy.

B. Existing elements are annotated in tag form using a simple string matching rule.

C. Events are annotated in tag form by applying synonym detection to the trigger words.

(6) Further refine the annotation result through manual annotation.

A. Two annotators correct the reference chains generated in the automatic annotation stage and manually complete the references that could not be identified automatically. The two annotators must work independently.

B. Differences between the two independent annotators are arbitrated by a third person. The arbitrator resolves the disagreement according to the annotation specification, or draws on external knowledge, to determine the final annotation result.

(7) Perform a consistency check to ensure the quality of the corpus annotation:

A. To ensure the quality of the corpus annotation, the annotation results of the two independent annotators are checked for consistency.

B. The reliability calculation method for reference annotation proposed by Passonneau is adopted. The method expresses the similarity between reference chains through a distance metric, whose main principles are as follows:

1. when the two reference chains are identical, the distance is 0;

2. when one reference chain is a subset of the other, the distance is 0.33;

3. when neither reference chain contains the other but they share a common non-empty subset, the distance is 0.67;

4. when the intersection of the two reference chains is the empty set, the distance is 1.

C. Krippendorff's alpha coefficient is computed from the distances between reference chains to check the consistency between annotators. If the alpha coefficient is below 67%, the annotation result is considered unreliable; the process returns to the automatic annotation of step (5), and annotation is repeated until the consistency exceeds the threshold.
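The distance rules and the alpha check can be sketched together. This is a minimal sketch under several assumptions that the text does not state: each markable is assigned one reference chain (a non-empty set of mention ids) by each of exactly two annotators, the distance values are used directly as the disagreement weights, and alpha is computed as 1 minus observed over expected disagreement. The names `chain_distance`, `krippendorff_alpha` and `RELIABILITY_THRESHOLD` are hypothetical.

```python
from itertools import combinations

RELIABILITY_THRESHOLD = 0.67  # threshold stated in the text


def chain_distance(a, b):
    """Distance between two reference chains given as frozensets."""
    if a == b:
        return 0.0           # identical chains
    if a < b or b < a:
        return 0.33          # one chain is a proper subset of the other
    if a & b:
        return 0.67          # overlapping, but neither contains the other
    return 1.0               # disjoint chains


def krippendorff_alpha(coder1, coder2, distance=chain_distance):
    """Alpha for two coders; coder1[i] and coder2[i] annotate the same markable."""
    if len(coder1) != len(coder2) or not coder1:
        raise ValueError("both coders must annotate the same non-empty item list")
    # Observed disagreement: mean distance within each markable.
    observed = sum(distance(a, b) for a, b in zip(coder1, coder2)) / len(coder1)
    # Expected disagreement: mean distance over all pairs of pooled values.
    pooled = list(coder1) + list(coder2)
    pairs = list(combinations(pooled, 2))
    expected = sum(distance(a, b) for a, b in pairs) / len(pairs)
    if expected == 0:
        return 1.0  # no variation at all, so there is nothing to disagree on
    return 1.0 - observed / expected


A = frozenset({"m1", "m2"})
B = frozenset({"m3"})
C = frozenset({"m3", "m4"})
alpha = krippendorff_alpha([A, B], [A, C])
print(alpha, alpha >= RELIABILITY_THRESHOLD)
```

In this toy case the annotators agree on the first markable and differ by a subset relation on the second, so alpha falls between 0 and 1.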

Compared with traditional methods for constructing Chinese reference corpora, the event-based method of the invention has the following prominent substantive features and notable technical progress: (1) the corpus is built on events and uses the event as the knowledge representation unit, which reflects the dynamic nature of things, better matches objective reality, and makes it easier for a computer to simulate how the brain works; (2) traditional reference annotation divides entities into too many categories, whereas event-based reference annotation depends only on events and event elements, so it has fewer categories and a clearer structure; (3) event-based reference annotation labels not only elements pointing to the same entity but also references of the referential type, and through such reference relationships abstract elements can be made concrete; (4) with event-based annotation, zero-reference resolution in the traditional sense is converted into reference resolution of default elements, entities are treated as event elements, and the linguistic regularities of event expression help in identifying and resolving default elements; (5) traditional reference resolution is easily limited by a lack of the discourse knowledge needed for resolution, whereas event-based reference annotation can be combined with event relations to mine more discourse knowledge and thereby improve the performance of reference resolution systems.

Drawings

FIG. 1 is a flowchart of a method for constructing an event-based Chinese reference corpus according to the present invention.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings:

Example one:

Referring to FIG. 1, the method for constructing an event-based Chinese reference corpus mainly comprises the following steps:

(1) selecting the CEC2.0 corpus as the construction basis,

(2) determining the targets of reference annotation and the annotation forms,

(3) formulating an annotation specification for each specific reference target,

(4) preprocessing the CEC2.0 corpus text,

(5) automatically annotating event elements and event references,

(6) further refining the annotation result through manual annotation,

(7) performing a consistency check to ensure the quality of the corpus annotation.

Example two:

This embodiment is substantially the same as the first embodiment, with the following particular features:

Step (1), selecting the CEC2.0 corpus as the construction basis, comprises:

(1-1) selecting CEC2.0 as the base corpus for construction;

(1-2) checking the accuracy of the event and event element annotations against the CEC2.0 corpus annotation specification;

(1-3) supplementing the annotations of incompletely annotated texts, and correcting texts with wrong annotations.

Step (2), determining the targets of reference annotation and the annotation forms, comprises:

(2-1) the targets of reference annotation fall into two broad categories: references of event elements (object, environment and time) and references of events, where references of event elements are further divided into references of existing elements and references of default elements;

(2-2) all types of reference annotation use XML format to facilitate computer processing. Because event elements are divided into existing elements and default elements, references are annotated in two forms. The first form is attribute annotation, which applies only to references of elements and not to events; its purpose is to annotate default elements in an event. The second form is tag annotation, in which a separate tag is used for the reference annotation; its purpose is to annotate existing elements and events;

Example 1: object element attribute annotation

<Participant sid="s2,s3">Shanghai City government News >

<Time type="relTime" tid="t2">15:45 on the 12th</Time> issued

<Denoter type="status" did="d2">message</Denoter>

</Event>

<Denoter type="status" did="d3">call</Denoter>

</Event>

Example 2: location element tag annotation

<eAnaph anaType="Loc" aid="l3" antecedent="Wenchuan, Sichuan" rid="l7" anaphor="Wenchuan county"/>

<eAnaph anaType="Loc" aid="l7" antecedent="Wenchuan county, Sichuan province" rid="l13" anaphor="disaster area"/>
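Tag annotations of this kind can be collected into reference chains. The following is a minimal sketch: the attribute names anaType, aid and rid follow Example 2, but reading aid as the id of the antecedent and rid as the id of the anaphor is an assumption, and the `<Doc>` wrapper element and the function name `build_chains` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document around two eAnaph tags.
SAMPLE = """<Doc>
  <eAnaph anaType="Loc" aid="l3" antecedent="Wenchuan, Sichuan"
          rid="l7" anaphor="Wenchuan county"/>
  <eAnaph anaType="Loc" aid="l7" antecedent="Wenchuan county, Sichuan province"
          rid="l13" anaphor="disaster area"/>
</Doc>"""


def build_chains(xml_text):
    """Group element ids linked by eAnaph tags into reference chains."""
    root = ET.fromstring(xml_text)
    chain_of = {}  # element id -> the chain (a set of ids) that contains it
    for tag in root.iter("eAnaph"):
        a, r = tag.get("aid"), tag.get("rid")
        # Merge the chains of both ends; unseen ids start as singleton chains.
        merged = chain_of.get(a, {a}) | chain_of.get(r, {r})
        for member in merged:
            chain_of[member] = merged
    unique = {frozenset(chain) for chain in chain_of.values()}
    return sorted(sorted(chain) for chain in unique)


print(build_chains(SAMPLE))
```

Because l7 appears in both tags, the two pairwise links merge into a single chain containing l3, l7 and l13.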

Step (3), formulating an annotation specification for each specific reference target, comprises:

(3-1) annotation specification for default elements: A. Object elements are annotated in the attribute sid (subject number) or oid (object number), which identifies a Participant or an Object. B. Environment elements are annotated in the attribute lid, which identifies a Location. C. Time elements are annotated in the attribute tid, which identifies a Time.

(3-2) annotation specification for existing elements: A. Object elements have two semantic types, which are annotated in the corpus with two tags, Participant and Object. The former relates to persons and the latter to things, so the two do not belong to the same semantic type and cannot refer to each other. B. For environment elements, not only are elements pointing to the same geographical location annotated, but references of the referential type are annotated as well; that is, the antecedent environment element can be used to make the geographical location of the anaphoric environment element concrete. C. Time elements are similar to environment elements: in addition to elements pointing to the same time, references of the referential type are also annotated.

(3-3) event annotation specification: First, compare whether the trigger words of the two events are identical or synonymous. If so, proceed to the next step; otherwise the two events have no reference relationship. Then compare the elements of the two events. Every event must contain a trigger word, but other elements may be default and not appear, so the default elements are first completed from the context before judging whether the two events have a reference relationship. The corresponding elements of two events in a reference relationship must be consistent, that is, point to the same real-world entity.

Step (4), preprocessing the CEC2.0 corpus text, comprises:

(4-1) ReportTime (the report time) is not numbered in the CEC2.0 corpus, yet as a time element it can be referred to as the reference time during annotation. Preprocessing therefore adds an identifier attribute tid with the value t0;

Example 3: ReportTime numbering

<ReportTime type="absTime" tid="t0">May 12, 2008, 16:25</ReportTime>
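The ReportTime numbering step lends itself to a short script. The sketch below is one possible way to do it with the Python standard library, assuming the corpus file is well-formed XML; the `<Body>` wrapper in the sample and the function name `number_report_time` are hypothetical.

```python
import xml.etree.ElementTree as ET


def number_report_time(xml_text):
    """Add tid="t0" to every ReportTime element that has no tid yet."""
    root = ET.fromstring(xml_text)
    for report_time in root.iter("ReportTime"):
        if report_time.get("tid") is None:
            report_time.set("tid", "t0")
    return ET.tostring(root, encoding="unicode")


# Hypothetical input fragment without a tid attribute.
doc = '<Body><ReportTime type="absTime">May 12, 2008, 16:25</ReportTime></Body>'
result = number_report_time(doc)
print(result)
```

Elements that already carry a tid are left unchanged, so the script can be run repeatedly on the same file.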

(4-2) object elements in the CEC corpus are annotated at a coarse granularity, so their modifying components need to be annotated at a finer granularity, which allows more abstract object elements to be made concrete.

Step (5), automatically annotating event elements and event references, comprises:

(5-1) default elements are not annotated automatically, because they are complex and difficult to annotate and automatic annotation of them has low accuracy;

(5-2) existing elements are annotated in tag form using a simple string matching rule;

(5-3) events are annotated in tag form by applying synonym detection to the trigger words.
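The text does not spell out the string matching rule of step (5-2). As a hedged illustration, the sketch below assumes one plausible rule: an element is linked to the nearest earlier element whose surface text is identical to it or contains it. The function name `link_by_string_match` and the sample element ids are hypothetical.

```python
def link_by_string_match(elements):
    """Link each element to the nearest earlier element that matches it.

    elements: list of (element_id, surface_text) in document order.
    Returns a list of (antecedent_id, anaphor_id) pairs.
    Assumed rule: the later text equals, or is contained in, the earlier text.
    """
    links = []
    for i, (cur_id, cur_text) in enumerate(elements):
        # Search backwards so that the nearest earlier match wins.
        for prev_id, prev_text in reversed(elements[:i]):
            if cur_text in prev_text:  # also true when the texts are equal
                links.append((prev_id, cur_id))
                break
    return links


elements = [
    ("l3", "Wenchuan county, Sichuan province"),
    ("l5", "Beijing"),
    ("l7", "Wenchuan county"),
    ("l9", "Beijing"),
]
print(link_by_string_match(elements))
```

A rule this simple cannot link "disaster area" to "Wenchuan county", which is why the automatic pass is followed by manual correction.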

Step (6), further refining the annotation result through manual annotation, comprises:

(6-1) two annotators correct the reference chains generated in the automatic annotation stage and manually complete the references that could not be identified automatically. The two annotators must work independently.

(6-2) differences between the two independent annotators are arbitrated by a third person. The arbitrator resolves the disagreement according to the annotation specification, or draws on external knowledge, to determine the final annotation result.
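Before arbitration, the two independent annotations have to be compared to find the items that need a third opinion. The sketch below shows one simple way to list them, assuming each annotator's result is stored as a mapping from an element id to the id of its antecedent; this storage format and the function name `find_disagreements` are hypothetical.

```python
def find_disagreements(annotator_a, annotator_b):
    """List the elements on which two annotators disagree.

    Each argument maps an element id to the id of its antecedent, or to
    None (or no entry) when the annotator marked no antecedent.
    Returns (element_id, antecedent_a, antecedent_b) tuples.
    """
    ids = sorted(set(annotator_a) | set(annotator_b))
    return [
        (i, annotator_a.get(i), annotator_b.get(i))
        for i in ids
        if annotator_a.get(i) != annotator_b.get(i)
    ]


a = {"l7": "l3", "l13": "l7", "t5": "t0"}
b = {"l7": "l3", "l13": "l3", "t5": None}
for item in find_disagreements(a, b):
    print(item)  # each tuple is one case for the arbitrator
```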

Step (7), performing a consistency check to ensure the quality of the corpus annotation, comprises:

(7-1) to ensure the quality of the corpus annotation, the annotation results of the two independent annotators are checked for consistency.

(7-2) the reliability calculation method for reference annotation proposed by Passonneau is adopted. The method expresses the similarity between reference chains through a distance metric, whose main principles are as follows:

A. when the two reference chains are identical, the distance is 0;

B. when one reference chain is a subset of the other, the distance is 0.33;

C. when neither reference chain contains the other but they share a common non-empty subset, the distance is 0.67;

D. when the intersection of the two reference chains is the empty set, the distance is 1.

(7-3) Krippendorff's alpha coefficient is computed from the distances between reference chains to check the consistency between annotators. If the alpha coefficient is below 67%, the annotation result is considered unreliable; the process returns to the automatic annotation of step (5), and annotation is repeated until the consistency exceeds the threshold.

Claims

1. A Chinese reference corpus construction method based on events is characterized by comprising the following operation steps:

(1) selecting the CEC2.0 corpus as the construction basis,

(2) determining the targets of reference annotation and the annotation forms,

(4) preprocessing the CEC2.0 corpus text,

(5) automatically annotating event elements and event references,

(6) further refining the annotation result through manual annotation,

(7) performing a consistency check to ensure the quality of the corpus annotation;

(2-1) the targets of reference annotation fall into two broad categories: references of event elements, namely object, environment and time, and references of events, wherein references of event elements are divided into references of existing elements and references of default elements;

(2-2) to facilitate computer processing, all types of reference annotation use XML format; because event elements are divided into existing elements and default elements, references are annotated in two forms: the first form is attribute annotation, which applies only to references of elements and not to events, and whose purpose is to annotate default elements in an event; the second form is tag annotation, in which a separate tag is used for the reference annotation, and whose purpose is to annotate existing elements and events;

(3-1) annotation specification for default elements:

A. object elements are annotated in the attribute subject number sid or object number oid, which identifies a Participant or an Object; B. environment elements are annotated in the attribute lid, which identifies a Location; C. time elements are annotated in the attribute tid, which identifies a Time;

(3-2) annotation specification for existing elements:

A. object elements have two semantic types, which are annotated in the corpus with two tags, Participant and Object, wherein the former relates to persons and the latter to things, so the two do not belong to the same semantic type and cannot refer to each other; B. for environment elements, in addition to elements pointing to the same geographical location, references of the referential type are also annotated, that is, the antecedent environment element can be used to make the geographical location of the anaphoric environment element concrete; C. time elements are similar to environment elements: in addition to elements pointing to the same time, references of the referential type are also annotated;

(3-3) event annotation specification:

first, comparing whether the trigger words of the two events are identical or synonymous, and if so, proceeding to the next step, otherwise the two events have no reference relationship; then comparing the elements of the two events, wherein every event must contain a trigger word but other elements may be default and not appear, so the default elements are first completed from the context before judging whether the two events have a reference relationship, and the corresponding elements of two events in a reference relationship must be consistent, that is, point to the same real-world entity;

the step (5) automatically marks event elements and event references:

2. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein the step (1) selects CEC2.0 corpus as a construction basis:

(1-1) selecting CEC2.0 as a constructed basic corpus;

3. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein the step (4) CEC2.0 corpus text preprocessing:

(4-1) the report time ReportTime is not numbered in the CEC2.0 corpus, yet as a time element it can be referred to as the reference time during annotation; preprocessing therefore adds an identifier attribute tid with the value t0;

(4-2) object elements in the CEC corpus are annotated at a coarse granularity, so their modifying components need to be annotated at a finer granularity, which allows more abstract object elements to be made concrete.

4. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein the step (6) further optimizes the annotation result by manual annotation:

(6-1) two annotators correct the reference chains generated in the automatic annotation stage and manually complete the references that could not be identified automatically, wherein the two annotators must work independently;

(6-2) differences between the two independent annotators are arbitrated by a third person; the arbitrator resolves the disagreement according to the annotation specification, or draws on external knowledge, to determine the final annotation result.

5. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein the step (7) sets a consistency check step to ensure the quality of corpus labeling:

(7-1) to ensure the quality of the corpus annotation, the annotation results of the two independent annotators are checked for consistency;

(7-2) the reliability calculation method for reference annotation proposed by Passonneau is adopted, wherein the method expresses the similarity between reference chains through a distance metric, whose main principles are as follows:

A. when the two reference chains are identical, the distance is 0;

B. when one reference chain is a subset of the other, the distance is 0.33;

D. when the intersection of the two reference chains is the empty set, the distance is 1;

(7-3) Krippendorff's alpha coefficient is computed from the distances between reference chains to check the consistency between annotators; if the alpha coefficient is below the threshold of 67%, the annotation result is considered unreliable, the process returns to the automatic annotation of step (5), and annotation is repeated until the consistency exceeds the threshold of 67%.