CN106897364B - Chinese reference corpus construction method based on events - Google Patents

Info

Publication number
CN106897364B
Authority
CN
China
Prior art keywords
corpus
elements
event
labeling
events
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710020573.4A
Other languages
Chinese (zh)
Other versions
CN106897364A (en
Inventor
张亚军
刘宗田
李强
周文
刘炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201710020573.4A
Publication of CN106897364A
Application granted
Publication of CN106897364B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an event-based method for constructing a Chinese reference corpus. The method mainly comprises the following steps: (1) selecting the CEC2.0 corpus as the construction basis; (2) determining the targets and the labeling mode of reference annotation; (3) formulating a labeling specification for each specific reference target; (4) preprocessing the CEC2.0 corpus text; (5) automatically labeling event-element references and event references; (6) refining the labeling result through manual labeling; and (7) performing a consistency check to ensure the quality of the corpus annotation. The invention overcomes the shortcomings of existing reference resolution corpora: it covers all events in the corpus, it is based on Chinese syntactic and semantic analysis and therefore suits the characteristics of Chinese, and it checks the consistency of the labeled corpus to ensure annotation quality.

Description

Chinese reference corpus construction method based on events
Technical Field
The invention belongs to the field of natural language processing (NLP) and relates to a method for constructing an event-based Chinese reference corpus.
Background
Reference is a common linguistic phenomenon that occurs frequently in everyday conversation and in text. It makes expression concise and coherent, which benefits both communication and writing. However, heavy use of reference makes it harder for computers to understand language and text. The main task of reference resolution is to identify the different expressions in a discourse that describe the same entity. Much past research concentrated on non-event text and achieved some results. With the rise of the concept of the "event", more and more scholars have turned to event-oriented research. An event involves multiple elements and is a knowledge representation unit of larger granularity than a static concept. Taking the event as the basic unit of human knowledge is closer to the human cognitive process and more consistent with objective reality. Events have therefore attracted attention in a growing number of fields and are gradually being adopted in knowledge processing fields such as computational linguistics, artificial intelligence, information retrieval, information extraction, and automatic summarization.
Since the 1980s, international evaluation conferences on information extraction have emerged, such as the Message Understanding Conference (MUC) and Automatic Content Extraction (ACE). These conferences provide unified test corpora and evaluation methods for natural language processing technologies such as information extraction and reference resolution, and they have greatly promoted the development of reference resolution. In particular, the test corpora they provide shifted reference resolution systems from methods based on heuristic rules to data-driven methods. For example, the MUC corpus uses SGML annotation, in which <COREF ID="x"> and <COREF ID="x" REF="y"> mark the left boundary of an entity and of a referring expression respectively, and </COREF> marks the right boundary of both. The value x increases strictly monotonically from 1 and gives the sequence number of the entity in the text. REF carries the antecedent information of the entity: if y equals some value of x, the antecedent of the referring expression is the entity whose ID is x; if there is no REF value, the entity has no antecedent. The ACE corpus differs from the MUC corpus. Taking ACE2005 as an example, the text is described through reference chains, and expressions pointing to the same entity are placed in a reference chain with the same number. Notably, ACE has included Chinese corpora since ACE2003, which now comprise a training corpus of 300,000 characters and a test corpus of 50,000 characters, and it has added the evaluation of event mentions. It is the earliest international evaluation corpus resource for Chinese reference resolution and has greatly promoted the development of Chinese reference resolution.
In 2011, CoNLL provided the English OntoNotes 4.0 corpus, which labels coreference relationships between event nouns and verbs. In 2012, it provided the OntoNotes 5.0 corpus in English, Chinese, and Arabic for multilingual coreference resolution evaluation. In recent years, research on reference resolution in China has gradually increased, and related corpora have been constructed. Examples include the information-extraction-oriented Chinese cross-text reference corpus constructed by Zhao Zhi Wen et al. on the basis of the ACE2005 Chinese corpus, and the entity linking corpus constructed by Shujiagen et al. on the basis of the ACE2005 Chinese corpus and Chinese Wikipedia.
However, most of these corpora are not based on event annotation. Although the ACE corpus defines 8 types of events and evaluates event mentions, its treatment of events remains at the discourse level: it is not refined to specific sentences, it cannot cover all events, and its evaluation of event mentions does not address coreference resolution. The OntoNotes corpus provides coreference relationships for events, but only for English, and is not suitable for analyzing Chinese sentences. Most domestic corpora are likewise built on Chinese corpora such as ACE and do not label events as knowledge representation units. The entities involved in the various aspects of an event, called elements, exhibit a large number of reference phenomena, just as static concepts do in traditional text. Events themselves are also frequently referred to. For event-oriented applications, these references introduce considerable uncertainty that must be processed and studied, and this work requires a corpus. So far, however, there has been no event-oriented Chinese reference corpus.
Disclosure of Invention
To remedy the shortcomings of existing reference resolution corpora, the invention provides an event-based method for constructing a Chinese reference corpus. The method constructs an event-oriented Chinese reference corpus on the basis of the CEC2.0 corpus, including reference annotations of existing elements, default (that is, omitted) elements, and events. The method covers all events in the corpus, is based on Chinese syntactic and semantic analysis and therefore suits the characteristics of Chinese, and checks the consistency of the labeled corpus to ensure annotation quality.
The invention involves the following three definitions:
Definition 1. Antecedent element and anaphoric element: if a reference relationship exists between elements in an event-oriented Chinese text, the element with the more concrete expression is called the antecedent element, and the element with the more abstract expression is called the anaphoric element.
Definition 2. Antecedent event and anaphoric event: if a reference relationship exists between events in an event-oriented Chinese text, the event with the more concrete expression is called the antecedent event, and the event with the more abstract expression is called the anaphoric event. Whether an event is concrete or abstract depends on whether the elements it contains are complete, that is, whether its object, environment, and time elements are default (omitted).
Definition 3. Event-oriented reference resolution: the process of finding, in an event-oriented text, the relationship between an antecedent element (or antecedent event) and an anaphoric element (or anaphoric event), and of explicitly giving the antecedent element (or antecedent event) to which the anaphoric element (or anaphoric event) points.
To achieve this purpose, the invention adopts the following technical scheme. An event-based method for constructing a Chinese reference corpus comprises the following operation steps:
(1) Select the CEC2.0 corpus as the construction basis.
A. Select CEC2.0 as the base corpus for construction.
B. Check the accuracy of the event and event element labels against the CEC2.0 corpus labeling specification.
C. Supplement the labels of texts whose labeling is incomplete, and correct texts whose labeling is wrong.
(2) Determine the targets and the labeling mode of reference annotation.
A. The targets of reference annotation fall into two broad categories: reference annotation of event elements (object, environment, and time) and reference annotation of events. Reference annotation of event elements is further divided into reference annotation of existing elements and reference annotation of default elements.
B. All types of reference annotation use XML format to facilitate computer processing. Because event elements are divided into existing elements and default elements, reference annotation takes two forms. The first form is attribute (Attribute) labeling, which applies only to element references and not to events; its purpose is to label default elements in an event. The second form is tag (Tag) labeling, in which a separate tag carries the reference annotation; its purpose is to label existing elements and events.
(3) Formulate a labeling specification for each specific reference target.
A. Labeling specification for default elements: 1. an object element is labeled in the attribute sid (subject number) or oid (object number), which identifies a Participant or Object; 2. an environment element is labeled in the attribute lid, which identifies a Location; 3. a time element is labeled in the attribute tid, which identifies a Time.
B. Labeling specification for existing elements: 1. Object elements have two semantic types, which are labeled in the corpus with two tags, Participant and Object. The former relates to persons and the latter to things, so the two do not belong to one semantic type and cannot refer to each other. 2. For environment elements, not only are elements pointing to the same geographical position labeled, but the reference type is labeled as well, so that the geographical position of an anaphoric environment element can be made specific through its antecedent environment element. 3. Time elements are similar to environment elements: in addition to elements pointing to the same time, the reference type is also labeled.
C. Labeling specification for events: first, compare whether the trigger words of the two events are the same or synonymous. If so, proceed to the next step; otherwise, the two events have no reference relationship. Then compare the elements of the two events. Every event must contain a trigger word, but other elements may be default and not appear, so default elements are first completed from the context before judging whether the two events have a reference relationship. The corresponding elements of two events in a reference relationship must be consistent, that is, they must point to the same real-world entity.
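The two-stage event comparison described in the event labeling specification can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Event` class, the `SYNONYMS` table, and the slot names are hypothetical, the text does not say which synonym resource is used, and default elements are assumed to have been completed from context before the comparison.

```python
from dataclasses import dataclass, field

# Hypothetical synonym table; the source does not name the lexical resource.
SYNONYMS = {frozenset({"earthquake", "quake"})}


@dataclass
class Event:
    eid: str
    trigger: str
    # Slot name -> id of the real-world entity the element points to.
    # Assumption: default (omitted) elements were already completed from context.
    elements: dict = field(default_factory=dict)


def triggers_match(a: str, b: str) -> bool:
    """Stage 1: trigger words must be identical or synonymous."""
    return a == b or frozenset({a, b}) in SYNONYMS


def events_corefer(e1: Event, e2: Event) -> bool:
    """Stage 2: every element slot must point to the same entity."""
    if not triggers_match(e1.trigger, e2.trigger):
        return False
    slots = set(e1.elements) | set(e2.elements)
    return all(e1.elements.get(s) == e2.elements.get(s) for s in slots)


a = Event("e1", "earthquake", {"location": "l7", "time": "t1"})
b = Event("e2", "quake", {"location": "l7", "time": "t1"})
c = Event("e3", "quake", {"location": "l9", "time": "t1"})
print(events_corefer(a, b), events_corefer(a, c))
```

Comparing over the union of slots is a deliberately strict choice: an element that is still missing in one event after completion counts as a mismatch.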
(4) Preprocess the CEC2.0 corpus text.
A. ReportTime (the reporting time) is not numbered in the CEC2.0 corpus, although as a time element it can serve as a reference time in the annotation. The preprocessing therefore adds the identification attribute tid to it, with the attribute value t0.
B. Because object elements in the CEC corpus are labeled at a coarse granularity, the modifiers of object elements need to be labeled at a finer granularity, so that more abstract object elements can be made concrete.
(5) Automatically label event-element references and event references.
A. Default elements are not labeled automatically, because automatic labeling of them is complex, difficult, and of low accuracy.
B. Existing elements are labeled in tag (Tag) form using simple string matching rules.
C. Events are labeled in tag (Tag) form by detecting synonymous trigger words.
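The string matching rule for existing elements can be illustrated with a short Python sketch. It assumes that mentions arrive in document order as (id, type, text) triples and that a repeated mention is linked to the most recent identical mention of the same type. The function names are hypothetical, and the attribute names of the emitted tag follow the eAnaph tag form used in the patent's site element example.

```python
from xml.sax.saxutils import quoteattr


def match_existing_elements(mentions):
    """Link each mention to the latest earlier mention with the same type and text.

    mentions: list of (element_id, ana_type, text) in document order.
    Keying on the type keeps Participant and Object mentions apart.
    """
    last_seen = {}  # (ana_type, text) -> id of the most recent mention
    links = []
    for element_id, ana_type, text in mentions:
        key = (ana_type, text)
        if key in last_seen:
            links.append({
                "anaType": ana_type,
                "aid": last_seen[key],   # antecedent id
                "antecedent": text,
                "rid": element_id,       # anaphor id
                "anaphor": text,
            })
        last_seen[key] = element_id
    return links


def to_tag(link):
    """Serialize one link as a standalone eAnaph tag."""
    attrs = " ".join(f"{name}={quoteattr(value)}" for name, value in link.items())
    return f"<eAnaph {attrs}/>"


mentions = [
    ("l1", "Loc", "Wenchuan"),
    ("o1", "Obj", "Wenchuan"),   # same text, different type: not linked
    ("l2", "Loc", "Wenchuan"),
    ("l3", "Loc", "Chengdu"),
]
links = match_existing_elements(mentions)
print(to_tag(links[0]))
```

Exact matching only catches repeated surface forms; references such as "disaster area" still need the manual stage.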
(6) Refine the labeling result through manual labeling.
A. Two annotators correct the reference chains generated in the automatic labeling stage and manually add the references that could not be identified automatically. The two annotators must work independently.
B. Differences between the labels of the two independent annotators are arbitrated by a third person. The arbitrator resolves the disagreement according to the labeling specification, or by introducing external knowledge, and determines the final labeling result.
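One way to prepare the arbitration step is to separate the reference chains on which both annotators agree from those that only one of them produced. The Python sketch below assumes each chain is a collection of mention identifiers; the function name and this representation are illustrative and not part of the patent.

```python
def split_for_arbitration(chains_a, chains_b):
    """Return (agreed, only_a, only_b) as sets of frozenset chains."""
    a = {frozenset(chain) for chain in chains_a}
    b = {frozenset(chain) for chain in chains_b}
    return a & b, a - b, b - a


annotator_1 = [{"l3", "l7", "l13"}, {"s2", "s5"}]
annotator_2 = [{"l3", "l7", "l13"}, {"s2", "s5", "s9"}]
agreed, only_1, only_2 = split_for_arbitration(annotator_1, annotator_2)
print(len(agreed), only_1, only_2)
```

Only the chains in the last two sets would need to go to the third person.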
(7) Perform a consistency check to ensure the quality of the corpus annotation:
A. To ensure the quality of the corpus annotation, the labeling results of the two independent annotators are checked for consistency.
B. The reliability calculation method for reference annotation proposed by Passonneau is adopted. The method expresses the similarity between reference chains as a distance metric, with the following main rules:
1. when the two reference chains are identical, the distance is 0;
2. when one reference chain is a subset of the other, the distance is 0.33;
3. when neither reference chain contains the other but they share a non-empty common subset, the distance is 0.67;
4. when the intersection of the two reference chains is the empty set, the distance is 1.
C. Krippendorff's alpha coefficient is computed from the distances between reference chains to check the consistency between annotators. If the alpha coefficient is lower than 67 percent, the labeling result is unreliable; the process returns to the automatic labeling step (5) and labeling is repeated until the consistency exceeds the threshold.
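The four distance rules translate directly into code. The Python sketch below is one minimal reading of them, with each reference chain modeled as a set of mention identifiers. The function name `chain_distance` and the set representation are assumptions, and "subset" is read as proper subset, since identical chains are already covered by the first rule.

```python
def chain_distance(chain_a, chain_b):
    """Distance between two reference chains under the four rules above."""
    a, b = frozenset(chain_a), frozenset(chain_b)
    if a == b:
        return 0.0    # rule 1: identical chains
    if a < b or b < a:
        return 0.33   # rule 2: one chain is a proper subset of the other
    if a & b:
        return 0.67   # rule 3: overlap without containment
    return 1.0        # rule 4: empty intersection


print(chain_distance({"l3", "l7"}, {"l3", "l7"}))
print(chain_distance({"l3"}, {"l3", "l7"}))
print(chain_distance({"l3", "l7"}, {"l7", "l13"}))
print(chain_distance({"l3"}, {"l13"}))
```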
Compared with traditional methods for constructing Chinese reference corpora, the event-based method of the invention has the following prominent substantive features and notable technical progress: (1) the event-based Chinese reference corpus is built on events and uses the event as the knowledge representation unit, which reflects the dynamic nature of things, better matches objective reality, and makes it easier for a computer to simulate how the human brain works; (2) traditional reference annotation divides entities into too many categories, whereas event-based reference annotation depends only on events and event elements, so there are fewer categories and the structure is clear; (3) event-based reference annotation labels not only elements pointing to the same entity but also the reference type, and abstract elements can be made concrete through the reference relationship; (4) with event-based annotation, zero-reference resolution in traditional reference is converted into reference resolution of default elements, entities are treated as elements, and the language expression rules of events help to identify and resolve default elements; (5) traditional reference resolution is easily limited by a lack of the discourse knowledge needed for resolution, whereas event-based reference annotation, combined with event relations, makes it possible to mine more discourse knowledge and thereby improve the performance of reference resolution systems.
Drawings
FIG. 1 is a flowchart of a method for constructing an event-based Chinese reference corpus according to the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings:
Embodiment one:
Referring to FIG. 1, the event-based method for constructing a Chinese reference corpus mainly includes the following steps:
(1) selecting the CEC2.0 corpus as the construction basis,
(2) determining the targets and the labeling mode of reference annotation,
(3) formulating a labeling specification for each specific reference target,
(4) preprocessing the CEC2.0 corpus text,
(5) automatically labeling event-element references and event references,
(6) refining the labeling result through manual labeling,
(7) performing a consistency check to ensure the quality of the corpus annotation.
Embodiment two:
This embodiment is substantially the same as embodiment one, with the following particular features:
Step (1), selecting the CEC2.0 corpus as the construction basis:
(1-1) select CEC2.0 as the base corpus for construction;
(1-2) check the accuracy of the event and event element labels against the CEC2.0 corpus labeling specification;
(1-3) supplement the labels of texts whose labeling is incomplete, and correct texts whose labeling is wrong.
Step (2), determining the targets and the labeling mode of reference annotation:
(2-1) the targets of reference annotation fall into two broad categories: reference annotation of event elements (object, environment, and time) and reference annotation of events. Reference annotation of event elements is further divided into reference annotation of existing elements and reference annotation of default elements;
(2-2) all types of reference annotation use XML format to facilitate computer processing. Because event elements are divided into existing elements and default elements, reference annotation takes two forms. The first form is attribute (Attribute) labeling, which applies only to element references and not to events; its purpose is to label default elements in an event. The second form is tag (Tag) labeling, in which a separate tag carries the reference annotation; its purpose is to label existing elements and events;
Example 1: Object element attribute (Attribute) labeling
<Event eid="e2" type="thoughtevent">
<Participant sid="s2, s3">Shanghai City government News</Participant>
<Time type="relTime" tid="t2">15:45 on the 12th</Time> issued
<Denoter type="status" did="d2">message</Denoter>
</Event>
<Event type="thoughtevent" eid="e3">
<Denoter type="status" did="d3">call</Denoter>
</Event>
Example 2: Location element tag (Tag) labeling
<eAnaph anaType="Loc" aid="l3" antecedent="Wenchuan, Sichuan" rid="l7" anaphor="Wenchuan County, Sichuan Province"/>
<eAnaph anaType="Loc" aid="l7" antecedent="Wenchuan County, Sichuan Province" rid="l13" anaphor="disaster area"/>
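Tags of this form can be read back and merged into reference chains. The Python sketch below uses the standard `xml.etree.ElementTree` module and a small union-find structure. It assumes that `aid` holds the identifier of the antecedent and `rid` the identifier of the anaphor, as the example suggests, and it wraps the tags in a hypothetical `<Anaphora>` root element that is not part of the corpus format.

```python
import xml.etree.ElementTree as ET

# The <Anaphora> wrapper is a hypothetical root added only to make the sample parseable.
SAMPLE = """<Anaphora>
<eAnaph anaType="Loc" aid="l3" antecedent="Wenchuan, Sichuan" rid="l7" anaphor="Wenchuan County, Sichuan Province"/>
<eAnaph anaType="Loc" aid="l7" antecedent="Wenchuan County, Sichuan Province" rid="l13" anaphor="disaster area"/>
</Anaphora>"""


def build_chains(xml_text):
    """Merge pairwise eAnaph links into reference chains of element ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    root = ET.fromstring(xml_text)
    for tag in root.iter("eAnaph"):
        antecedent_root = find(tag.get("aid"))
        anaphor_root = find(tag.get("rid"))
        parent[anaphor_root] = antecedent_root  # union the two groups

    chains = {}
    for element_id in parent:
        chains.setdefault(find(element_id), set()).add(element_id)
    return sorted(sorted(chain) for chain in chains.values())


print(build_chains(SAMPLE))
```

Because the two tags share `l7`, the sketch places `l3`, `l7`, and `l13` in a single chain.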
Step (3), formulating a labeling specification for each specific reference target:
(3-1) labeling specification for default elements: A. an object element is labeled in the attribute sid (subject number) or oid (object number), which identifies a Participant or Object; B. an environment element is labeled in the attribute lid, which identifies a Location; C. a time element is labeled in the attribute tid, which identifies a Time.
(3-2) labeling specification for existing elements: A. Object elements have two semantic types, which are labeled in the corpus with two tags, Participant and Object. The former relates to persons and the latter to things, so the two do not belong to one semantic type and cannot refer to each other. B. For environment elements, not only are elements pointing to the same geographical position labeled, but the reference type is labeled as well, so that the geographical position of an anaphoric environment element can be made specific through its antecedent environment element. C. Time elements are similar to environment elements: in addition to elements pointing to the same time, the reference type is also labeled.
(3-3) labeling specification for events: first, compare whether the trigger words of the two events are the same or synonymous. If so, proceed to the next step; otherwise, the two events have no reference relationship. Then compare the elements of the two events. Every event must contain a trigger word, but other elements may be default and not appear, so default elements are first completed from the context before judging whether the two events have a reference relationship. The corresponding elements of two events in a reference relationship must be consistent, that is, they must point to the same real-world entity.
Step (4), preprocessing the CEC2.0 corpus text:
(4-1) ReportTime (the reporting time) is not numbered in the CEC2.0 corpus, although as a time element it can serve as a reference time in the annotation. The preprocessing therefore adds the identification attribute tid to it, with the attribute value t0;
Example 3: ReportTime numbering
<ReportTime type="absTime" tid="t0">16:25, May 12, 2008</ReportTime>
(4-2) because object elements in the CEC corpus are labeled at a coarse granularity, the modifiers of object elements need to be labeled at a finer granularity, so that more abstract object elements can be made concrete.
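The ReportTime numbering in step (4-1) is a small XML transformation. A minimal Python sketch with the standard `xml.etree.ElementTree` module is shown below. The function name and the `<Title>` wrapper in the sample are assumptions made for illustration, and a ReportTime element that already carries a tid is left unchanged.

```python
import xml.etree.ElementTree as ET


def number_report_time(xml_text):
    """Add tid="t0" to every ReportTime element that has no tid yet."""
    root = ET.fromstring(xml_text)
    for report_time in root.iter("ReportTime"):
        if report_time.get("tid") is None:
            report_time.set("tid", "t0")
    return ET.tostring(root, encoding="unicode")


# Hypothetical fragment; the enclosing <Title> element is an assumption.
raw = '<Title><ReportTime type="absTime">16:25, May 12, 2008</ReportTime></Title>'
print(number_report_time(raw))
```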
Step (5), automatically labeling event-element references and event references:
(5-1) default elements are not labeled automatically, because automatic labeling of them is complex, difficult, and of low accuracy;
(5-2) existing elements are labeled in tag (Tag) form using simple string matching rules;
(5-3) events are labeled in tag (Tag) form by detecting synonymous trigger words.
Step (6), refining the labeling result through manual labeling:
(6-1) two annotators correct the reference chains generated in the automatic labeling stage and manually add the references that could not be identified automatically. The two annotators must work independently.
(6-2) differences between the labels of the two independent annotators are arbitrated by a third person. The arbitrator resolves the disagreement according to the labeling specification, or by introducing external knowledge, and determines the final labeling result.
Step (7), performing a consistency check to ensure the quality of the corpus annotation:
(7-1) to ensure the quality of the corpus annotation, the labeling results of the two independent annotators are checked for consistency.
(7-2) the reliability calculation method for reference annotation proposed by Passonneau is adopted. The method expresses the similarity between reference chains as a distance metric, with the following main rules:
A. when the two reference chains are identical, the distance is 0;
B. when one reference chain is a subset of the other, the distance is 0.33;
C. when neither reference chain contains the other but they share a non-empty common subset, the distance is 0.67;
D. when the intersection of the two reference chains is the empty set, the distance is 1.
(7-3) Krippendorff's alpha coefficient is computed from the distances between reference chains to check the consistency between annotators. If the alpha coefficient is lower than 67 percent, the labeling result is unreliable; the process returns to the automatic labeling step (5) and labeling is repeated until the consistency exceeds the threshold.
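Krippendorff's alpha has the general form alpha = 1 - D_o / D_e, where D_o is the mean observed distance between the two annotators' values for the same unit and D_e is the mean distance expected by chance, taken over all pairs of values pooled across units and annotators. The Python sketch below applies this form to the chain distance of step (7-2). It assumes exactly two annotators, no missing annotations, and one reference chain per annotator for each unit (for example, each mention); the function names are illustrative.

```python
from itertools import combinations


def chain_distance(chain_a, chain_b):
    """Four-rule distance between two reference chains (sets of ids)."""
    a, b = frozenset(chain_a), frozenset(chain_b)
    if a == b:
        return 0.0
    if a < b or b < a:
        return 0.33
    return 0.67 if a & b else 1.0


def krippendorff_alpha(pairs, distance=chain_distance):
    """alpha = 1 - D_o / D_e for two annotators.

    pairs: list of (chain_from_annotator_1, chain_from_annotator_2), one per unit.
    """
    d_o = sum(distance(a, b) for a, b in pairs) / len(pairs)
    pooled = [chain for pair in pairs for chain in pair]
    n = len(pooled)
    d_e = sum(distance(x, y) for x, y in combinations(pooled, 2)) / (n * (n - 1) / 2)
    if d_e == 0:
        return 1.0  # every value is identical, so there is nothing to disagree on
    return 1.0 - d_o / d_e


THRESHOLD = 0.67  # threshold stated in step (7-3)
pairs = [({"l3", "l7"}, {"l3", "l7"}), ({"s2"}, {"s2"}), ({"t1", "t4"}, {"t1"})]
alpha = krippendorff_alpha(pairs)
print(alpha, "re-annotate" if alpha < THRESHOLD else "accept")
```

An alpha of 1 means perfect agreement, 0 means agreement at chance level, and negative values indicate systematic disagreement.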

Claims (5)

1. An event-based method for constructing a Chinese reference corpus, characterized by comprising the following operation steps:
(1) selecting the CEC2.0 corpus as the construction basis,
(2) determining the targets and the labeling mode of reference annotation,
(3) formulating a labeling specification for each specific reference target,
(4) preprocessing the CEC2.0 corpus text,
(5) automatically labeling event-element references and event references,
(6) refining the labeling result through manual labeling,
(7) performing a consistency check to ensure the quality of the corpus annotation;
wherein step (2), determining the targets and the labeling mode of reference annotation, comprises:
(2-1) the targets of reference annotation fall into two broad categories: reference annotation of event elements, namely object, environment, and time, and reference annotation of events, wherein reference annotation of event elements is divided into reference annotation of existing elements and reference annotation of default elements;
(2-2) to facilitate computer processing, all types of reference annotation use XML format; because event elements are divided into existing elements and default elements, reference annotation takes two forms: the first form is attribute (Attribute) labeling, which applies only to element references and not to events, and whose purpose is to label default elements in an event; the second form is tag (Tag) labeling, in which a separate tag carries the reference annotation, and whose purpose is to label existing elements and events;
wherein step (3), formulating a labeling specification for each specific reference target, comprises:
(3-1) labeling specification for default elements:
A. an object element is labeled in the attribute subject number sid or object number oid, which identifies a Participant or Object; B. an environment element is labeled in the attribute lid, which identifies a Location; C. a time element is labeled in the attribute tid, which identifies a Time;
(3-2) labeling specification for existing elements:
A. object elements have two semantic types, which are labeled in the corpus with two tags, Participant and Object, wherein the former relates to persons and the latter to things, so that the two do not belong to one semantic type and cannot refer to each other; B. for environment elements, not only are elements pointing to the same geographical position labeled, but the reference type is labeled as well, so that the geographical position of an anaphoric environment element can be made specific through its antecedent environment element; C. time elements are similar to environment elements: in addition to elements pointing to the same time, the reference type is also labeled;
(3-3) labeling specification for events:
first, comparing whether the trigger words of the two events are the same or synonymous, and if so, proceeding to the next step, otherwise the two events have no reference relationship; then comparing the elements of the two events, wherein, because every event must contain a trigger word but other elements may be default and not appear, default elements are first completed from the context before judging whether the two events have a reference relationship, and the corresponding elements of two events in a reference relationship must be consistent, that is, they must point to the same real-world entity;
wherein step (5), automatically labeling event-element references and event references, comprises:
(5-1) default elements are not labeled automatically, because automatic labeling of them is complex, difficult, and of low accuracy;
(5-2) existing elements are labeled in tag (Tag) form using simple string matching rules;
(5-3) events are labeled in tag (Tag) form by detecting synonymous trigger words.
2. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein step (1), selecting the CEC2.0 corpus as the construction basis, comprises:
(1-1) selecting CEC2.0 as the base corpus for construction;
(1-2) checking the accuracy of the event and event element labels against the CEC2.0 corpus labeling specification;
(1-3) supplementing the labels of texts whose labeling is incomplete, and correcting texts whose labeling is wrong.
3. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein step (4), preprocessing the CEC2.0 corpus text, comprises:
(4-1) the report time ReportTime is not numbered in the CEC2.0 corpus, yet it can serve as a time element and be used as the reference time in the annotation; therefore an identifier attribute tid with the attribute value t0 needs to be added to it during preprocessing;
(4-2) since the object elements in the CEC corpus are annotated at coarse granularity, the modifier components of the object elements need to be annotated at finer granularity, so that more abstract object elements can be made concrete.
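As a rough sketch of step (4-1), the following Python adds the `tid` attribute to `ReportTime` using the standard-library `xml.etree.ElementTree` module. The sample fragment and its surrounding tags (`Body`, `Title`, `Content`, `Time`) are assumptions made for the example; the actual CEC2.0 file layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real CEC2.0 layout may differ.
SAMPLE = (
    '<Body><Title>示例</Title>'
    '<ReportTime type="absTime">2008年5月12日</ReportTime>'
    '<Content><Time tid="t1">当天下午</Time></Content></Body>'
)


def add_report_time_id(root: ET.Element, value: str = "t0") -> int:
    """Set tid on every ReportTime element that lacks one; return the count."""
    changed = 0
    for node in root.iter("ReportTime"):
        if "tid" not in node.attrib:
            node.set("tid", value)
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
count = add_report_time_id(root)
```

Running the function a second time changes nothing, so the preprocessing can be repeated safely on a corpus that has already been processed.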
4. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein step (6), further optimizing the annotation result by manual annotation, comprises:
(6-1) assigning two annotators to correct the reference chains generated in the automatic annotation stage and to manually complete the references that could not be identified automatically, the two annotators working independently of each other;
(6-2) where the two independent annotations differ, having a third person arbitrate; the arbitrator may resolve the disagreement according to the annotation specification or by introducing external knowledge, and determines the final annotation result.
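One possible way to find the disagreements that go to the arbitrator in step (6-2) is to compare the two annotators' reference chains as sets. The sketch below is only an assumption about how this could be organized; the function name and the representation of a chain as a set of event identifiers are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Chain = FrozenSet[str]


def split_for_arbitration(
    a: Iterable[Iterable[str]], b: Iterable[Iterable[str]]
) -> Tuple[Set[Chain], Set[Chain], Set[Chain]]:
    """Return (chains both annotators agree on, chains only in a, chains only in b)."""
    chains_a = {frozenset(c) for c in a}
    chains_b = {frozenset(c) for c in b}
    agreed = chains_a & chains_b
    return agreed, chains_a - agreed, chains_b - agreed


annotator_a = [["e1", "e4"], ["e2", "e3"]]
annotator_b = [["e4", "e1"], ["e2"], ["e3"]]
agreed, only_a, only_b = split_for_arbitration(annotator_a, annotator_b)
```

Chains in `agreed` can be accepted directly, while `only_a` and `only_b` list the chains the third person has to decide on.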
5. The method for constructing an event-based Chinese reference corpus according to claim 1, wherein step (7), a consistency check to ensure the quality of corpus annotation, comprises:
(7-1) in order to ensure the quality of corpus annotation, performing consistency detection on the annotation results of the two independent annotators;
(7-2) adopting the reliability calculation method for reference annotation proposed by Passonneau, wherein the method represents the similarity between reference chains through a distance metric that follows these rules:
A. when the two reference chains are identical, the distance is 0;
B. when one reference chain is a subset of the other, the distance is 0.33;
C. when neither reference chain contains the other and the two have a common non-empty subset, the distance is 0.67;
D. when the intersection of the two reference chains is the empty set, the distance is 1;
(7-3) computing Krippendorff's alpha coefficient from the distances between the reference chains to check the consistency between the different annotators; if the alpha coefficient is lower than the threshold of 67 percent, the annotation result is unreliable, the method returns to the automatic annotation of step (5), and annotation is repeated until the consistency is higher than the threshold of 67 percent.
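The distance rules A to D and the threshold test of (7-3) can be sketched as below. The distance function follows the four rules as stated. The alpha computation uses the standard form of Krippendorff's coefficient, alpha = 1 - D_o / D_e, for two annotators with no missing values, applied to the distance directly; this is an assumption, since the claim does not spell out the formula. Each item is taken to be a pair of reference chains, one from each annotator, for the same mention, and all function names are hypothetical.

```python
from itertools import permutations
from typing import FrozenSet, List, Tuple

Chain = FrozenSet[str]
THRESHOLD = 0.67  # the 67 percent threshold from step (7-3)


def chain_distance(a: Chain, b: Chain) -> float:
    """Distance between two reference chains, following rules A to D."""
    if a == b:
        return 0.0          # A: identical chains
    if a < b or b < a:
        return 0.33         # B: one chain is a proper subset of the other
    if a & b:
        return 0.67         # C: overlapping, neither contains the other
    return 1.0              # D: empty intersection


def krippendorff_alpha(pairs: List[Tuple[Chain, Chain]]) -> float:
    """Alpha for two annotators without missing values: 1 - D_o / D_e."""
    # Observed disagreement: mean distance within each annotated item.
    observed = sum(chain_distance(a, b) for a, b in pairs) / len(pairs)
    # Expected disagreement: mean distance over all ordered pairs of pooled values.
    pooled = [c for pair in pairs for c in pair]
    n = len(pooled)
    expected = sum(
        chain_distance(x, y) for x, y in permutations(pooled, 2)
    ) / (n * (n - 1))
    return 1.0 if expected == 0 else 1.0 - observed / expected


def is_reliable(pairs: List[Tuple[Chain, Chain]]) -> bool:
    """True when alpha is not lower than the threshold."""
    return krippendorff_alpha(pairs) >= THRESHOLD
```

When `is_reliable` returns `False`, the procedure in (7-3) goes back to step (5) and the annotation round is repeated.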
CN201710020573.4A 2017-01-12 2017-01-12 Chinese reference corpus construction method based on events Active CN106897364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710020573.4A CN106897364B (en) 2017-01-12 2017-01-12 Chinese reference corpus construction method based on events

Publications (2)

Publication Number Publication Date
CN106897364A CN106897364A (en) 2017-06-27
CN106897364B true CN106897364B (en) 2021-02-23

Family

ID=59197848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710020573.4A Active CN106897364B (en) 2017-01-12 2017-01-12 Chinese reference corpus construction method based on events

Country Status (1)

Country Link
CN (1) CN106897364B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832419A (en) * 2017-11-10 2018-03-23 中国人民解放军陆军工程大学 Military information corpus construction method and system
CN110059169B (en) * 2019-01-25 2023-12-01 邵勃 Intelligent robot chat context implementation method and system based on corpus labeling
CN110852109A (en) * 2019-11-11 2020-02-28 腾讯科技(深圳)有限公司 Corpus generating method, corpus generating device, and storage medium
CN113111661B (en) * 2020-01-09 2024-09-10 图灵人工智能研究院(南京)有限公司 Text information classification method, system, equipment and readable storage medium
CN111859903B (en) * 2020-07-30 2024-01-12 思必驰科技股份有限公司 Event same-index model training method and event same-index resolution method
CN115186820B (en) * 2022-09-07 2023-01-10 粤港澳大湾区数字经济研究院(福田) Event coreference resolution method, device, terminal and computer readable storage medium
CN117391088A (en) * 2023-08-29 2024-01-12 泰瑞数创科技(北京)股份有限公司 Semantic consistency checking method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101782897A (en) * 2010-03-17 2010-07-21 上海大学 Chinese corpus labeling method based on events
CN102207948A (en) * 2010-07-13 2011-10-05 天津海量信息技术有限公司 Method for generating incident statement sentence material base
CN103268311A (en) * 2012-11-07 2013-08-28 上海大学 Event-structure-based Chinese statement analysis method
CN103678281A (en) * 2013-12-31 2014-03-26 北京百度网讯科技有限公司 Method and device for automatically labeling text
CN105302794A (en) * 2015-10-30 2016-02-03 苏州大学 Chinese homodigital event recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9245015B2 (en) * 2013-03-08 2016-01-26 Accenture Global Services Limited Entity disambiguation in natural language text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
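A Chinese cross-document reference corpus oriented to information extraction (一个面向信息抽取的中文跨文本指代语料库); 赵知纬; Journal of Chinese Information Processing (《中文信息学报》); 20150131; full text *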

Also Published As

Publication number Publication date
CN106897364A (en) 2017-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant