CN110362682A - An entity coreference resolution method based on a statistical machine learning algorithm

Info

Publication number
CN110362682A
Authority
CN
China
Prior art keywords
entity
coreference resolution
statement
machine learning
learning algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910542364.5A
Other languages
Chinese (zh)
Inventor
肖清林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central Mdt Infotech Ltd Of United States Of Xiamen
Original Assignee
Central Mdt Infotech Ltd Of United States Of Xiamen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

An entity coreference resolution method based on a statistical machine learning algorithm, comprising the steps of: building a database and dividing it into a training group and a test group; determining the features for entity coreference resolution; performing mention detection on the training group; constructing a set of entity coreference mention pairs; constructing a classification model; training the classification model; correcting the classification model; and inputting the test group data into the classification model and generating a result. The invention first determines the features for entity coreference resolution, then performs mention detection and builds a classification model, and, through repeated training and correction of the classification model, achieves entity coreference resolution on the basis of a statistical machine learning algorithm. The resulting high accuracy of entity coreference resolution helps work in fields such as machine translation, information extraction, and question answering proceed smoothly, and facilitates the wider adoption and execution of such work.

Description

An entity coreference resolution method based on a statistical machine learning algorithm
Technical field
The present invention relates to the field of entity coreference resolution, and in particular to an entity coreference resolution method based on a statistical machine learning algorithm.
Background
Reference is a common linguistic phenomenon that pervades natural-language expression. In general, reference falls into two types: anaphora (also called indicative reference) and coreference (also called identical reference). Anaphora means that the current anaphor has a close semantic association with a word, phrase, or sentence (sentence group) that appeared earlier in the text; the reference depends on the semantics of the context, may denote different entities in different linguistic contexts, and is asymmetric and non-transitive. Coreference mainly means that two nouns (including pronouns and noun phrases) point to the same referent in the real world; this reference still holds when detached from the context. Entity coreference resolution identifies the different identifiers of the same entity across different linked data sources, and mainly resolves conflicts between subjects in triples.
At present, entity coreference is widespread and often interferes with fields such as machine translation, information extraction, and question answering, reducing the efficiency and accuracy with which work in these fields is adopted and carried out.
To solve the above problems, the present application proposes an entity coreference resolution method based on a statistical machine learning algorithm.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background, the present invention proposes an entity coreference resolution method based on a statistical machine learning algorithm. The invention first determines the features for entity coreference resolution, then performs mention detection and builds a classification model, and, through repeated training and correction of the classification model, achieves entity coreference resolution on the basis of a statistical machine learning algorithm. The resulting high accuracy of entity coreference resolution helps work in fields such as machine translation, information extraction, and question answering proceed smoothly, and facilitates the wider adoption and execution of such work.
(2) Technical solution
To solve the above problems, the present invention provides an entity coreference resolution method based on a statistical machine learning algorithm, the method comprising the steps of:
S1, building a database, and randomly dividing its contents into a training group and a test group;
S2, determining the features for entity coreference resolution;
S3, performing mention detection on the training group according to the features for entity coreference resolution, the mention detection identifying all candidate mentions in the training group that may be involved in entity coreference;
S4, constructing a set of entity coreference mention pairs from the detection results;
S5, constructing a classification model, the classification model comprising an input module, a classifier module, and an output module;
S6, inputting the set of entity coreference mention pairs into the classification model in sequence for training;
S7, correcting the classification model according to the training results, and removing mismatched mention pairs;
S8, inputting the test group data into the classification model and generating a result.
Preferably, in S1, the training group and the test group each comprise a plurality of triples.
Preferably, each triple comprises a subject, a predicate, and an object.
Preferably, in S2, the features for entity coreference resolution include lexical features, grammatical features, distance and position features, and semantic features, wherein the distance between an entity and a mention is the core of the method.
Preferably, in S3, the mention pairs of the candidate mentions are a subset of the subjects in all the triples.
Preferably, in S4, a mention pair comprises a mention m at any position in the text together with each of the mentions m preceding it.
Preferably, in S5, the set of mention pairs is input into the classification model through the input module.
Preferably, in S5, the classifier module, which screens and classifies the set of mention pairs, comprises a binary classifier.
Preferably, in S5, the output module feeds back the classification results.
Preferably, in S8, the result is generated by any one of best-first, closest-first, or transitivity constraint.
The above technical solution of the present invention has the following beneficial technical effects:
In the present invention, the features for entity coreference resolution are determined first, mention detection is then performed and a classification model is built, and, through repeated training and correction of the classification model, entity coreference resolution is achieved on the basis of a statistical machine learning algorithm. The resulting high accuracy of entity coreference resolution helps work in fields such as machine translation, information extraction, and question answering proceed smoothly, and facilitates the wider adoption and execution of such work.
Brief description of the drawings
Fig. 1 is a flow chart of the entity coreference resolution method based on a statistical machine learning algorithm proposed by the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawing. It should be understood that these descriptions are merely illustrative and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.
As shown in Fig. 1, the entity coreference resolution method based on a statistical machine learning algorithm proposed by the present invention comprises the steps of:
S1, building a database, and randomly dividing its contents into a training group and a test group;
S2, determining the features for entity coreference resolution;
S3, performing mention detection on the training group according to the features for entity coreference resolution, the mention detection identifying all candidate mentions in the training group that may be involved in entity coreference;
S4, constructing a set of entity coreference mention pairs from the detection results;
S5, constructing a classification model, the classification model comprising an input module, a classifier module, and an output module;
S6, inputting the set of entity coreference mention pairs into the classification model in sequence for training;
S7, correcting the classification model according to the training results, and removing mismatched mention pairs;
S8, inputting the test group data into the classification model and generating a result.
In an alternative embodiment, in S1, the training group and the test group each comprise a plurality of triples.
In an alternative embodiment, each triple comprises a subject, a predicate, and an object.
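As an illustration only, the triple structure and the random division of S1 might be represented as follows. The `Triple` type, the `split_corpus` helper, the 80/20 ratio, the fixed seed, and the sample data are all assumptions made for this sketch; the application does not specify them.

```python
import random
from typing import NamedTuple


class Triple(NamedTuple):
    """A (subject, predicate, object) triple; field names are hypothetical."""
    subject: str
    predicate: str
    obj: str


def split_corpus(triples, train_fraction=0.8, seed=0):
    """Randomly divide the triples into a training group and a test group.

    The ratio and the seed are assumptions; the source only says "random".
    """
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up sample data, for illustration only.
corpus = [
    Triple("Barack Obama", "born_in", "Honolulu"),
    Triple("Obama", "served_as", "US President"),
    Triple("B. Obama", "spouse", "Michelle Obama"),
    Triple("Honolulu", "located_in", "Hawaii"),
    Triple("Xiamen", "located_in", "Fujian"),
]
train_group, test_group = split_corpus(corpus)
```

In this toy data the subjects "Barack Obama", "Obama", and "B. Obama" are different identifiers of one entity, which is the kind of subject conflict the method is meant to resolve.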
In an alternative embodiment, in S2, the features for entity coreference resolution include lexical features, grammatical features, distance and position features, and semantic features, wherein the distance between an entity and a mention is the core of the method.
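The four feature families could be computed for a candidate pair roughly as sketched below. The application names the families but not the individual features, so every concrete feature here (string match, pronoun flag, sentence and mention distance, semantic-type agreement), the `Mention` fields, and the `PRONOUNS` list are assumptions chosen for illustration.

```python
from typing import NamedTuple

# Hypothetical, deliberately tiny pronoun list.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


class Mention(NamedTuple):
    text: str
    sentence: int  # index of the sentence containing the mention
    index: int     # order of the mention within the document
    sem_type: str  # coarse semantic class, e.g. "PERSON"


def pair_features(antecedent, mention):
    """Return one illustrative feature per family for an (antecedent, mention) pair."""
    a, m = antecedent.text.lower(), mention.text.lower()
    return {
        # Lexical features
        "exact_match": float(a == m),
        "substring": float(a in m or m in a),
        # Grammatical feature
        "mention_is_pronoun": float(m in PRONOUNS),
        # Distance and position features
        "sentence_distance": float(mention.sentence - antecedent.sentence),
        "mention_distance": float(mention.index - antecedent.index),
        # Semantic feature
        "same_sem_type": float(antecedent.sem_type == mention.sem_type),
    }


obama = Mention("Barack Obama", sentence=0, index=0, sem_type="PERSON")
he = Mention("he", sentence=1, index=2, sem_type="PERSON")
features = pair_features(obama, he)
```

The distance features are kept as raw counts here; a real system would likely bucket or normalize them before classification.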
In an alternative embodiment, in S3, the mention pairs of the candidate mentions are a subset of the subjects in all the triples.
In an alternative embodiment, in S4, a mention pair comprises a mention m at any position in the text together with each of the mentions m preceding it.
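The pair construction of S4 can be sketched as a double loop that pairs every mention with each mention preceding it. The function name `build_mention_pairs` and the sample mentions are hypothetical.

```python
def build_mention_pairs(mentions):
    """Pair each mention with every mention that precedes it in the text.

    `mentions` must be in document order; each result is (antecedent, mention).
    """
    pairs = []
    for j, mention in enumerate(mentions):
        for antecedent in mentions[:j]:
            pairs.append((antecedent, mention))
    return pairs


# Made-up mentions in document order.
mentions = ["Barack Obama", "the president", "he", "Honolulu"]
pairs = build_mention_pairs(mentions)
```

For n mentions this yields n(n-1)/2 pairs, which is why practical systems often cap the antecedent window; no such cap is described in the application.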
In an alternative embodiment, in S5, the set of mention pairs is input into the classification model through the input module.
In an alternative embodiment, in S5, the classifier module, which screens and classifies the set of mention pairs, comprises a binary classifier.
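The application does not say which binary classifier is used. As one possible stand-in, the sketch below trains a small logistic-regression classifier by stochastic gradient descent on mention-pair feature vectors. The class name, hyperparameters, three-feature layout, and toy training data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class BinaryPairClassifier:
    """Logistic regression over pair features (an assumed choice of model)."""

    def __init__(self, n_features, learning_rate=0.5, epochs=200):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.epochs = epochs

    def score(self, x):
        """Estimated probability that the pair is coreferent."""
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return sigmoid(z)

    def fit(self, xs, ys):
        """One SGD update per training pair, repeated for `epochs` passes."""
        for _ in range(self.epochs):
            for x, y in zip(xs, ys):
                error = self.score(x) - y
                self.weights = [
                    w - self.learning_rate * error * v
                    for w, v in zip(self.weights, x)
                ]
                self.bias -= self.learning_rate * error

    def predict(self, x, threshold=0.5):
        return self.score(x) >= threshold


# Hypothetical features: [exact_match, same_sem_type, normalized_distance]
train_x = [[1, 1, 0.1], [0, 1, 0.1], [1, 1, 0.3],
           [0, 0, 0.9], [0, 0, 0.5], [0, 1, 0.9]]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = coreferent, 0 = not coreferent

clf = BinaryPairClassifier(n_features=3)
clf.fit(train_x, train_y)
```

Any other binary classifier (decision tree, maximum entropy, SVM) could be substituted behind the same `fit`/`score`/`predict` interface.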
In an alternative embodiment, in S5, the output module feeds back the classification results.
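One way to read the three-module structure of S5, together with the removal of mismatched pairs in S7, is sketched below. The class and function names are hypothetical, and the token-overlap scorer is only a placeholder for a trained classifier so that the example stays self-contained.

```python
class PairClassificationModel:
    """Input module -> classifier module -> output module (assumed layout)."""

    def __init__(self, featurize, classify):
        self.featurize = featurize  # maps a pair to its features
        self.classify = classify    # maps features to a score in [0, 1]

    def input_module(self, pairs):
        return [(pair, self.featurize(pair)) for pair in pairs]

    def classifier_module(self, featurized):
        return [(pair, self.classify(x)) for pair, x in featurized]

    def output_module(self, scored, threshold=0.5):
        """Feed back one classification result per pair."""
        return [
            {"pair": pair, "score": s, "coreferent": s >= threshold}
            for pair, s in scored
        ]

    def run(self, pairs, threshold=0.5):
        scored = self.classifier_module(self.input_module(pairs))
        return self.output_module(scored, threshold)


def remove_unmatched(results):
    """Keep only the pairs classified as coreferent (one reading of S7)."""
    return [r["pair"] for r in results if r["coreferent"]]


def token_overlap(pair):
    """Placeholder feature: Jaccard overlap of lowercased tokens."""
    a, b = (set(text.lower().split()) for text in pair)
    return len(a & b) / len(a | b)


# The identity "classifier" simply reuses the overlap as the score.
model = PairClassificationModel(featurize=token_overlap, classify=lambda x: x)
results = model.run([("Barack Obama", "Obama"), ("Barack Obama", "Honolulu")])
kept = remove_unmatched(results)
```

Keeping the modules as separate methods makes it straightforward to swap the placeholder scorer for a trained binary classifier without touching the input or output handling.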
In an alternative embodiment, in S8, the result is generated by any one of best-first, closest-first, or transitivity constraint.
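The three result-generation strategies can be sketched as follows, under stated assumptions. Closest-first links a mention to the nearest preceding mention that passes a threshold; best-first links it to the highest-scoring one. The application gives no detail on the transitivity constraint, so it is approximated here by transitive closure over accepted links using union-find, which is an assumption rather than the applicant's stated method. All function names and scores are hypothetical.

```python
def closest_first(mention, antecedents, score, threshold=0.5):
    """Link to the nearest preceding mention whose score passes the threshold."""
    for antecedent in reversed(antecedents):  # nearest antecedent comes last
        if score(antecedent, mention) >= threshold:
            return antecedent
    return None


def best_first(mention, antecedents, score, threshold=0.5):
    """Link to the highest-scoring preceding mention, if it passes the threshold."""
    best = max(antecedents, key=lambda a: score(a, mention), default=None)
    if best is not None and score(best, mention) >= threshold:
        return best
    return None


def transitive_clusters(mentions, links):
    """Merge linked mentions into entities: if A~B and B~C, then A~C."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


# Made-up classifier scores for illustration.
toy_scores = {("Barack Obama", "he"): 0.9, ("the senator", "he"): 0.6}


def toy_score(antecedent, mention):
    return toy_scores.get((antecedent, mention), 0.0)


candidates = ["Barack Obama", "the senator"]  # in document order
nearest = closest_first("he", candidates, toy_score)
best = best_first("he", candidates, toy_score)
entities = transitive_clusters(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

With these toy scores the two linking strategies disagree: closest-first picks "the senator" because it is nearer, while best-first picks "Barack Obama" because it scores higher.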
In the present invention, the features for entity coreference resolution are determined first, mention detection is then performed and a classification model is built, and, through repeated training and correction of the classification model, entity coreference resolution is achieved on the basis of a statistical machine learning algorithm. The resulting high accuracy of entity coreference resolution helps work in fields such as machine translation, information extraction, and question answering proceed smoothly, and facilitates the wider adoption and execution of such work.
It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, or the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or within equivalents of such scope and boundaries.

Claims (10)

1. An entity coreference resolution method based on a statistical machine learning algorithm, characterized in that the method comprises the steps of:
S1, building a database, and randomly dividing its contents into a training group and a test group;
S2, determining the features for entity coreference resolution;
S3, performing mention detection on the training group according to the features for entity coreference resolution, the mention detection identifying all candidate mentions in the training group that may be involved in entity coreference;
S4, constructing a set of entity coreference mention pairs from the detection results;
S5, constructing a classification model, the classification model comprising an input module, a classifier module, and an output module;
S6, inputting the set of entity coreference mention pairs into the classification model in sequence for training;
S7, correcting the classification model according to the training results, and removing mismatched mention pairs;
S8, inputting the test group data into the classification model and generating a result.
2. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S1, the training group and the test group each comprise a plurality of triples.
3. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 2, characterized in that each triple comprises a subject, a predicate, and an object.
4. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S2, the features for entity coreference resolution include lexical features, grammatical features, distance and position features, and semantic features, wherein the distance between an entity and a mention is the core of the method.
5. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S3, the mention pairs of the candidate mentions are a subset of the subjects in all the triples.
6. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S4, a mention pair comprises a mention m at any position in the text together with each of the mentions m preceding it.
7. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S5, the set of mention pairs is input into the classification model through the input module.
8. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S5, the classifier module, which screens and classifies the set of mention pairs, comprises a binary classifier.
9. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S5, the output module feeds back the classification results.
10. The entity coreference resolution method based on a statistical machine learning algorithm according to claim 1, characterized in that, in S8, the result is generated by any one of best-first, closest-first, or transitivity constraint.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910542364.5A CN110362682A (en) 2019-06-21 2019-06-21 A kind of entity coreference resolution method based on statistical machine learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910542364.5A CN110362682A (en) 2019-06-21 2019-06-21 A kind of entity coreference resolution method based on statistical machine learning algorithm

Publications (1)

Publication Number Publication Date
CN110362682A true CN110362682A (en) 2019-10-22

Family

ID=68217473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910542364.5A Pending CN110362682A (en) 2019-06-21 2019-06-21 A kind of entity coreference resolution method based on statistical machine learning algorithm

Country Status (1)

Country Link
CN (1) CN110362682A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770453A (en) * 2008-12-31 2010-07-07 华建机器翻译有限公司 Chinese text coreference resolution method based on domain ontology through being combined with machine learning model
CN101901213A (en) * 2010-07-29 2010-12-01 哈尔滨工业大学 Instance-based dynamic generalization coreference resolution method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郎君等: "集成多种背景语义知识的共指消解", 《中文信息学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950281A (en) * 2020-07-02 2020-11-17 中国科学院软件研究所 Demand entity co-reference detection method and device based on deep learning and context semantics
CN116738974A (en) * 2023-05-10 2023-09-12 济南云微软件科技有限公司 Language model generation method, device and medium based on generalization causal network
CN116738974B (en) * 2023-05-10 2024-01-23 济南云微软件科技有限公司 Language model generation method, device and medium based on generalization causal network

Similar Documents

Publication Publication Date Title
CN106096664B (en) A kind of sentiment analysis method based on social network data
US11023684B1 (en) Systems and methods for automatic generation of questions from text
CN105512105B (en) Semantic analysis method and device
KR20150036041A (en) Phrase-based dictionary extraction and translation quality evaluation
Al-Taani et al. A top-down chart parser for analyzing arabic sentences.
Humayoun et al. Urdu summary corpus
KR20210090906A (en) Method and apparatus of generating training data for sentiment analysis
CN111581953A (en) Method for automatically analyzing grammar phenomenon of English text
CN110362682A (en) A kind of entity coreference resolution method based on statistical machine learning algorithm
Rahman et al. Learning the information status of noun phrases in spoken dialogues
CN106055633A (en) Chinese microblog subjective and objective sentence classification method
CN107526742A (en) Method and apparatus for handling multi-language text
Lee et al. An analysis of grammatical errors in non-native speech in English
JPWO2008146583A1 (en) Dictionary registration system, dictionary registration method, and dictionary registration program
Weissweiler et al. Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Oyama et al. Towards automatic error type classification of Japanese language learners’ writings
Kapočiūtė-Dzikienė et al. Character-based machine learning vs. language modeling for diacritics restoration
Mohamed Morphological segmentation and part-of-speech tagging for the arabic heritage
Mahafdah et al. Arabic Part of speech Tagging using k-Nearest Neighbour and Naive Bayes Classifiers Combination.
Deksne Bidirectional lstm tagger for latvian grammatical error detection
Li et al. Data augmentation of incorporating real error patterns and linguistic knowledge for grammatical error correction
JP2014215920A (en) Case analysis model parameter learning apparatus, case analyzer, method and program
CN102955842A (en) Multi-feature-fused controlling method for recognizing Chinese organization name
Alfaidi et al. Exploring the performance of farasa and CAMeL taggers for arabic dialect tweets.
Ţucudean et al. The Use of Data Augmentation as a Technique for Improving Fake News Detection in the Romanian Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20191022