CN112579583B - Evidence and statement combined extraction method for fact detection - Google Patents


Info

Publication number
CN112579583B
CN112579583B (application CN202011467223.0A)
Authority
CN
China
Prior art keywords
evidence
constraint
statement
category
candidate
Prior art date
Legal status
Active
Application number
CN202011467223.0A
Other languages
Chinese (zh)
Other versions
CN112579583A (en)
Inventor
万海
陈海城
黄佳莉
曾娟
赵杭
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202011467223.0A priority Critical patent/CN112579583B/en
Publication of CN112579583A publication Critical patent/CN112579583A/en
Application granted granted Critical
Publication of CN112579583B publication Critical patent/CN112579583B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/23: Updating
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification
    • G06F40/30: Semantic analysis
    (within G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F16/00: Information retrieval and database structures; G06F40/00: Handling natural language data)


Abstract

The invention relates to a method for jointly extracting evidence and statements for fact detection, comprising the following steps. S1: specify a corpus for retrieval and a statement to be verified, clean the corpus, and perform entity extraction on the statement to obtain an entity set. S2: document retrieval: for the given statement, retrieve and construct a candidate document set from the cleaned corpus by entity linking according to the entity set, and take all sentences in that set as the candidate sentence set. S3: construct evidence by a greedy-strategy evidence search method, use the pre-trained language model BERT as the evaluation model of the evidence, and train and test the evaluation model to obtain the final target evidence and category. The invention effectively improves the accuracy of evidence search.

Description

Evidence and statement combined extraction method for fact detection
Technical Field
The invention relates to the field of automatic fact detection, and in particular to a method for jointly extracting evidence and statements for fact detection.
Background
The purpose of automatic fact detection is to let a computer automatically identify and filter false information on the Internet and so safeguard the truthfulness and reliability of information. With the successful application of deep learning to natural language processing in recent years, more and more research has tried to bring deep learning techniques into automatic fact detection, with good results. The fact detection task, one such task, determines the authenticity of a given statement and involves two objectives: (1) evidence mining, i.e. for a given statement, retrieving from Wikipedia the subset of sentences most relevant to the statement as evidence; (2) statement verification, i.e. classifying the statement according to the evidence. The task comprises the traditional three-stage pipelined subtasks: document retrieval, evidence construction, and statement verification. Its input is a statement and all Wikipedia documents; its output is the evidence and the statement's label. There are three labels, "Supported", "Refuted", and "Not Enough Info", indicating respectively that the evidence shows the statement to be true, shows it to be false, or does not allow its truth to be judged.
Since this task requires retrieving target evidence from about five million unstructured Wikipedia documents, the fact detection task divides "evidence mining" into two phases, "document retrieval" and "evidence construction", to narrow the search space: the document retrieval phase retrieves, from the five million documents, several candidate documents that may contain the target evidence; the evidence construction phase screens out, from these candidate documents, the set of sentences constituting the target evidence. The problem to be solved in the statement verification phase is then to classify the statement using the retrieved evidence.
Many works have already achieved good results on this task. For example, a work published at the AAAI-19 conference notes that the traditional approach to semantically matching statements and evidence projects them into a manually pre-designed feature space in which semantic matching is performed. Arguing that such hand-designed feature spaces are severely limited and cannot capture semantic information well, it proposes letting a deep model learn the feature space automatically for deep semantic matching, and accordingly introduces a homogeneous neural semantic matching network into each of document retrieval, evidence construction, and statement verification, improving the semantic matching precision of all three stages and achieving good results on the task. Another work, published at the ACL-19 conference, mainly improves the statement verification stage. It observes that traditional work in this stage simply concatenates all sentences in the evidence, or generates "statement-sentence" pairs as input, to predict the statement's category, ignoring the semantic links between different sentences; it therefore uses the pre-trained language model BERT to encode the semantic information of the different sentences and then builds a fully connected evidence graph network to perform message passing between sentences and capture their latent semantic links.
Most existing approaches follow this three-stage pipelined framework of document retrieval, evidence construction, and statement verification. However, current methods have a drawback:
in the evidence construction stage they adopt score ranking, i.e. each sentence is scored and the 5 highest-scoring sentences are taken as the evidence. Such methods cannot find precise evidence: many irrelevant sentences are introduced into the evidence, lowering its quality and making manual verification difficult.
Disclosure of Invention
To overcome the inability of the prior art to search evidence precisely during fact detection, the invention provides a method for jointly extracting evidence and statements for fact detection.
The method comprises the following steps:
S1: specify a corpus for retrieval and a statement to be verified; clean the corpus and perform entity extraction on the statement;
S2: document retrieval: for the statement to be verified, use entity linking to retrieve and construct a candidate document set from the corpus, and take all sentences in that set as the candidate sentence set;
S3: the evidence mining and statement verification stage. In this stage, evidence is constructed by a greedy-strategy evidence search method, with the pre-trained language model BERT as the evaluation model of the evidence.
Here the evidence is a subset of the candidate sentence set, i.e. the sentences of the evidence come from the candidate sentence set.
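To make the evaluation model concrete, the following is a minimal sketch of scoring a statement and an evidence set with BERT. The checkpoint name, the input packing (statement as the first segment, concatenated evidence sentences as the second), and the function names are illustrative assumptions, not details fixed by the patent.

```python
# Sketch of the evidence evaluation model V = BERT(c, E); names are illustrative.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_CATEGORIES = 3  # T ("statement is true"), F ("statement is false"), N ("cannot be established")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_CATEGORIES
)

def score_evidence(statement: str, evidence: list[str]) -> torch.Tensor:
    """Return the C-dimensional score vector for (statement, evidence)."""
    inputs = tokenizer(
        statement,
        " ".join(evidence),
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    return model(**inputs).logits.squeeze(0)  # shape: (NUM_CATEGORIES,)
```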
The training and testing processes of the evaluation model in this stage are as follows:
S3.1: training process. The greedy-strategy search scheme is converted into six equivalent constraints; so that the evaluation model can learn these six constraints, the method further converts them into six corresponding loss objective functions.
Training samples and test samples corresponding to the six constraints are constructed from the annotated evidence and candidate sentence sets already in the data set.
Each example in the training data must satisfy one or more of the constraints. A training sample is substituted into the objective functions of the constraints it satisfies to compute the corresponding loss values, and the parameters of the evaluation model are then updated by stochastic gradient descent based on these losses.
S3.2: prediction process. For a given test case, evidence is constructed iteratively with the greedy-strategy evidence search method. In each iterative search, the evidence and category with the highest score are taken as the predicted evidence and category of the current iteration; the candidate evidence of the next iteration consists of the predicted evidence of the current iteration plus one further candidate sentence. Iteration stops when the number of sentences contained in the predicted evidence reaches a given threshold. Each iteration thus yields a predicted evidence, a predicted category, and the highest score of that stage; the method selects the predicted evidence and category with the highest score overall as the final target evidence and category.
The training examples corresponding to the six constraints are constructed as follows:
Given a statement c to be verified in the training set, its annotated category y, its annotated evidence e = {s_e1, s_e2, …, s_eM}, and its candidate sentence set S = {s_1, s_2, …, s_N}, training samples are constructed in the following way:
For constraint one, if y = N, i.e. the annotated category of the statement is "the truth of the statement cannot be established", the training examples of the constraint are all singleton subsets of S, i.e. the training example set is T_1 = {{s_i} : s_i ∈ S}, where each {s_i} is one training example of the constraint;
for constraint two, if y = T or y = F, i.e. the annotated category of the statement is "statement is true" or "statement is false", the training examples of the constraint are all singleton subsets of e, i.e. the training example set is T_2 = {{s_ei} : s_ei ∈ e}, where each {s_ei} is one training example of the constraint;
for constraint three, if y = T or y = F, the training example of the constraint is e itself, i.e. the training example set is T_3 = {e};
for constraint four, if y = T or y = F, the training sample set of the constraint is T_4 = {{S_sub, S_vsub}}, where S_sub is any subset of e and S_vsub is any subset of S such that S_sub and S_vsub contain the same number of sentences and differ in exactly one sentence; each pair {S_sub, S_vsub} is one training example of the constraint;
for constraint five, if y = T or y = F, the training sample set of the constraint is T_5 = {{e, S_sub} : S_sub ⊊ e}, where S_sub is any proper subset of e; each pair {e, S_sub} is one training example of the constraint;
for constraint six, if y = T or y = F, the training sample set of the constraint is T_6 = {{e, S_sup}}, where S_sup is any subset of S such that e is a proper subset of S_sup and S_sup contains exactly one sentence more than e; each pair {e, S_sup} is one training example of the constraint.
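As an illustration of the construction just described, the sketch below enumerates the six training-example sets for one annotated sample. The data layout (sentences as strings, evidence as a frozenset) is an assumption made for the example.

```python
from itertools import combinations

def build_examples(y: str, e: frozenset, S: set):
    """Enumerate the training-example sets T1..T6 for one annotated sample.

    y: annotated category, one of "T", "F", "N"
    e: annotated evidence (a frozenset of sentences)
    S: candidate sentence set
    """
    T = {k: [] for k in range(1, 7)}
    if y == "N":
        T[1] = [frozenset({s}) for s in S]               # constraint one
        return T
    T[2] = [frozenset({s}) for s in e]                   # constraint two
    T[3] = [e]                                           # constraint three
    for k in range(1, len(e) + 1):                       # constraint four:
        for sub in map(frozenset, combinations(e, k)):   # a subset of e paired with
            for drop in sub:                             # a same-size set that
                for add in S - sub:                      # differs in one sentence
                    T[4].append((sub, (sub - {drop}) | {add}))
    for k in range(1, len(e)):                           # constraint five (non-empty proper subsets)
        T[5] += [(e, frozenset(c)) for c in combinations(e, k)]
    T[6] = [(e, e | {s}) for s in S - e]                 # constraint six
    return T
```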
Preferably, cleaning the corpus in S1 means performing text cleaning on all documents in the corpus, including removing stop words, low-frequency words, and special symbols.
Preferably, entity extraction on the statement means extracting all entities in the statement, including organization names, person names, and place names, using a method based on hidden Markov models.
Preferably, the entity linking process in S2 is as follows: for a given statement, the corresponding entity set is obtained from S1; all documents in the corpus are then traversed, and a document is added to the candidate document set if its title contains any entity of the statement.
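A minimal sketch of this retrieval stage follows, under stated assumptions: spaCy's pretrained tagger stands in for the HMM-based extractor named by the patent, and the corpus format (a mapping from document title to its list of sentences) is assumed for illustration.

```python
# Sketch of S1/S2: entity extraction plus title-based entity linking.
# spaCy's pretrained NER is a stand-in for the patent's HMM-based extractor.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(statement: str) -> set[str]:
    """Extract entity mentions (organizations, persons, places, ...)."""
    return {ent.text for ent in nlp(statement).ents}

def retrieve_candidates(statement: str, corpus: dict[str, list[str]]):
    """Return the candidate sentence set as (document title, sentence) pairs."""
    entities = extract_entities(statement)
    return [
        (title, sentence)
        for title, sentences in corpus.items()
        if any(ent in title.replace("_", " ") for ent in entities)
        for sentence in sentences
    ]
```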
Preferably, to avoid the drop in search efficiency caused by the large number of sentences in the candidate document set, the invention designs a greedy-strategy evidence search method that greatly reduces the search space. The concrete flow of the evidence search method in this step is as follows:
Step 1: initialize the currently searched evidence Ê = ∅, the current predicted category ŷ = ∅, the target evidence E* = ∅, and the target category y* = ∅; the set of all sentences contained in the candidate document set is S = {s_1, s_2, …, s_N}, where s_i denotes the i-th sentence, and the statement is c;
Step 2: construct the candidate evidence set {E_1, E_2, …}, where E_i = Ê ∪ {s_i} denotes the i-th candidate evidence, one for each s_i ∈ S;
Step 3: evaluate each candidate evidence in the set with the pre-trained language model BERT, i.e. V_i = BERT(c, E_i), where V_i ∈ R^C is a C-dimensional score vector and C denotes the number of categories;
Step 4: take the candidate evidence and category corresponding to the highest score as the current evidence and predicted category, i.e. (Ê, ŷ) = argmax over all E_i and categories y of V_i[y];
Step 5: if the current highest score is higher than the historical highest score, update the target evidence and target category, i.e. E* ← Ê, y* ← ŷ;
Step 6: remove the sentences already selected as evidence from the candidate sentence set, i.e. S ← S \ Ê;
Step 7: if the number of sentences contained in the currently searched evidence reaches the given threshold K, i.e. |Ê| ≥ K, stop the search and output E* and y*; otherwise repeat step 2 to step 6.
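The seven steps can be condensed into the sketch below; score_evidence is the assumed BERT scorer sketched earlier, and the default K = 5 matches the 5-sentence evidence cap mentioned for the task.

```python
def greedy_evidence_search(statement, sentences, score_evidence, K=5):
    """Greedy evidence search: grow the evidence one sentence per iteration and
    keep the (evidence, category) pair with the best score seen so far."""
    current = set()                        # Ê, the currently searched evidence
    best_evidence, best_category = set(), None
    best_score = float("-inf")
    remaining = set(sentences)             # S
    while len(current) < K and remaining:
        # steps 2-3: score every one-sentence extension of the current evidence
        scored = [
            (score_evidence(statement, sorted(current | {s})), s)
            for s in remaining
        ]
        # step 4: highest-scoring candidate evidence and its category
        V, s_best = max(scored, key=lambda vs: vs[0].max().item())
        current = current | {s_best}
        top_score = V.max().item()
        if top_score > best_score:         # step 5: update the running optimum
            best_evidence = set(current)
            best_category = int(V.argmax())
            best_score = top_score
        remaining -= current               # step 6
    return best_evidence, best_category    # step 7
```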
preferably, in S3.1, in order for the evaluation model to correctly identify the target evidence and the category, the present invention converts the proposed search scheme into the following six constraints, and converts these constraints into equivalent loss functions for updating the parameters of the evaluation model. For a given data set D ═ tone<c i ,S i ,E i ,y i >I is more than or equal to 1 and less than or equal to N, wherein c i ,S i ,E i And y _ i sequentially represents the ith statement, the candidate sentence set corresponding to the statement, the labeling evidence of the statement and the labeling category of the statement. For any sample in the dataset, it must satisfy one or more of the following constraints:
constraint one, if the labeling category y of a declaration is N, that is, "it is impossible to establish the authenticity of the declaration", then all candidate evidences corresponding to the declaration score higher on the N category than on other categories. The penalty function for this constraint is as follows:
Figure BDA0002834806230000051
wherein
Figure BDA0002834806230000052
Representing categories
Figure BDA0002834806230000053
Score of above, α 1 Distance super ginseng is more than or equal to 0;
constraint two, if the declared annotation category y is T or y is F, that is, "declare true" or "declare false", then the score of the singleton subset of the annotation evidence corresponding to the declaration on the N category is lower than the scores on the T and F categories. The penalty function for this constraint is as follows:
Figure BDA0002834806230000054
Wherein alpha is 2 Distance super parameter is more than or equal to 0;
constraint three, the score of the annotation evidence E on the annotation class y is higher than the score on the error class. The penalty function for this constraint is as follows:
Figure BDA0002834806230000055
Figure BDA0002834806230000056
wherein alpha is 3 Distance super ginseng is more than or equal to 0;
constraint four, for any subset of the annotated evidence E, it scores higher than the scores of the other sets, which are consistent with the size of the subset, and have one and only one element as the element of the subset. The penalty function for this constraint is as follows:
Figure BDA0002834806230000057
wherein alpha is 4 Distance super ginseng is more than or equal to 0;
constraint five, the score of the annotation evidence E on the annotation category y is higher than the scores of all proper subsets thereof. The penalty function for this constraint is as follows:
Figure BDA0002834806230000061
wherein alpha is 5 Distance super ginseng is more than or equal to 0;
and constraint six, the score of the marking evidence E on the marking category y is higher than that of the true superset of the marking evidence E. The penalty function for this constraint is as follows:
Figure BDA0002834806230000062
wherein alpha is 6 Distance greater than or equal to 0.
Preferably, the evaluation model is optimized by minimizing the following loss function as the optimization objective, using a stochastic gradient descent algorithm to perform back-propagation of the model:
L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6.
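As a worked illustration of one of these objectives, the sketch below computes the constraint-three margin loss; the hinge form max(0, α - correct + wrong) is the standard loss equivalent of a margin constraint, and the tensor layout is an assumption.

```python
import torch

def constraint3_loss(scores_e: torch.Tensor, y: int, alpha3: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing the annotated evidence e to score at least alpha3
    higher on the annotated category y than on every wrong category.

    scores_e: C-dimensional score vector f(e) produced by the evaluation model.
    """
    wrong = torch.ones_like(scores_e, dtype=torch.bool)
    wrong[y] = False                       # mask out the annotated category
    margins = alpha3 - scores_e[y] + scores_e[wrong]
    return torch.clamp(margins, min=0).sum()
```

The other five losses follow the same pattern over their respective example sets, and the total objective is simply their sum.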
compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the traditional fact detection task is a segment pipeline task consisting of three subtasks of document retrieval, evidence construction and declaration checking. The invention simplifies the three-stage pipeline type framework, combines the evidence construction and the statement check into one stage, combines a large amount of prior language knowledge contained in the pre-training language model, and obtains better effect in the aspect of accurate evidence search.
In the traditional fact verification method, in the stage of evidence construction, a score ranking method is adopted, namely each sentence is evaluated, and 5 sentences with the highest evaluation score are taken as the evidence, so that the problems that the sentences cannot find accurate evidence exist in the sentences, namely a plurality of irrelevant sentences are introduced into the evidence, the quality of the evidence is reduced, and the manual verification is difficult. The method adopts an evidence search method based on a greedy strategy, and converts the method into an equivalent loss function for optimizing the evaluation model. The method can effectively search the accurate evidence and obtain better effect on the accurate evidence search.
Pre-trained language models have been widely applied to solve natural language inference problems. The invention fully utilizes a large amount of language prior knowledge contained in the pre-training language model, can more effectively encode the semantic information of the sentence, and is beneficial to improving the understanding of the model on the semantic relation between the evidence and the statement.
Drawings
FIG. 1 is a flowchart of the method for jointly extracting evidence and statements for fact detection in embodiment 1.
FIG. 2 is a flow chart of the training phase.
FIG. 3 is a flowchart of an evidence search method based on a greedy strategy.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the present embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a method for jointly extracting evidence and statements for fact detection, which comprises the following steps:
S1: specify a corpus for retrieval and a statement to be verified; clean the corpus and perform entity extraction on the statement to obtain an entity set;
S2: document retrieval: for the statement to be verified, retrieve and construct a candidate document set from the cleaned corpus according to the entity set using entity linking, and take all sentences in that set as the candidate sentence set;
S3: construct evidence by the greedy-strategy evidence search method, use the pre-trained language model BERT as the evaluation model of the evidence, and train and test the evaluation model to obtain the final target evidence and target category;
where the evidence is a subset of the candidate sentence set, i.e. the sentences of the evidence come from the candidate sentence set.
Cleaning the corpus in S1 means performing text cleaning on all documents in the corpus, including removing stop words, low-frequency words, and special symbols.
Entity extraction on the statement means extracting all entities in the statement, including organization names, person names, and place names, using a method based on hidden Markov models.
Entity linking in S2 is specifically:
obtain the corresponding entity set according to step S1; traverse all documents in the corpus, and add a document to the candidate document set if its title contains any entity of the statement to be verified.
Training and testing the evaluation model in S3 comprises the following steps:
S3.1: convert the greedy-strategy search scheme into six equivalent constraints and, so that the evaluation model can learn the six constraints, convert them into six corresponding loss objective functions;
construct training samples and test samples corresponding to the six constraints from the annotated evidence and candidate sentence sets already in the data set;
each sample in the training data must satisfy at least one constraint;
substitute the training samples into the objective functions of the constraints they satisfy to compute the corresponding loss values, then update and optimize the parameters of the evaluation model by stochastic gradient descent based on these losses;
S3.2: for a given test sample, construct the evidence iteratively with the greedy-strategy evidence search method:
in each iterative search, based on the currently searched evidence (initialized to the empty set before iteration begins), use the pre-trained language model BERT to compute, for every candidate sentence in the candidate sentence set, its scores on all categories, then take the candidate sentence with the highest score together with its corresponding category;
update the candidate sentence set, i.e. delete the selected candidate sentence from the candidate sentence set;
update the currently searched evidence, i.e. add the selected candidate sentence to the currently searched evidence;
take the currently searched evidence and the corresponding category as the predicted evidence and predicted category of the current iterative search;
stop iterating when the number of sentences contained in the currently searched evidence reaches the preset threshold;
since each iteration yields a predicted evidence, a predicted category, and the highest score of that stage, the predicted evidence and category with the highest score overall are taken as the final target evidence and category.
In S3.1, the six constraints are respectively as follows (f_ỹ(X) denotes the evaluation model's score of a candidate evidence X on category ỹ, and {s} denotes a candidate evidence consisting of the single sentence s):
Constraint one: if the annotated category y of a statement is N, i.e. "the truth of the statement cannot be established", then every candidate evidence of the statement scores higher on category N than on the other categories. The loss function of this constraint is:
L_1 = Σ_{<c,S,e,y>∈D : y=N} Σ_{s∈S} Σ_{ỹ≠N} max(0, α_1 - f_N({s}) + f_ỹ({s})),
where α_1 ≥ 0 is a margin hyperparameter, and D = {<c_i, S_i, e_i, y_i> : 1 ≤ i ≤ N} is the given data set, in which c_i, S_i, e_i, y_i denote in turn the i-th statement, the candidate sentence set corresponding to the i-th statement, the annotated evidence of the i-th statement, and the annotated category of the i-th statement;
Constraint two: if the annotated category y of a statement is T or F, i.e. "statement is true" or "statement is false", then every singleton subset of the statement's annotated evidence scores lower on category N than on categories T and F. The loss function of this constraint is:
L_2 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{s∈e} Σ_{ỹ∈{T,F}} max(0, α_2 - f_ỹ({s}) + f_N({s})),
where α_2 ≥ 0 is a margin hyperparameter;
Constraint three: the annotated evidence e scores higher on the annotated category y than on the wrong categories. The loss function of this constraint is:
L_3 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{ỹ≠y} max(0, α_3 - f_y(e) + f_ỹ(e)),
where α_3 ≥ 0 is a margin hyperparameter;
Constraint four: any subset of the annotated evidence e scores higher than every other set that is of the same size as the subset and has one and only one element not an element of the subset. The loss function of this constraint is:
L_4 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{(S_sub, S_vsub)} max(0, α_4 - f_y(S_sub) + f_y(S_vsub)),
where S_sub ⊆ e, S_vsub ⊆ S is a set of the same size as S_sub differing from it in exactly one element, and α_4 ≥ 0 is a margin hyperparameter;
Constraint five: the annotated evidence e scores higher on the annotated category y than all of its proper subsets. The loss function of this constraint is:
L_5 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{S'_sub ⊊ e} max(0, α_5 - f_y(e) + f_y(S'_sub)),
where α_5 ≥ 0 is a margin hyperparameter;
Constraint six: the annotated evidence e scores higher on the annotated category y than its proper supersets. The loss function of this constraint is:
L_6 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{S_sup ⊋ e, S_sup ⊆ S} max(0, α_6 - f_y(e) + f_y(S_sup)),
where α_6 ≥ 0 is a margin hyperparameter.
The evaluation model is optimized by minimizing the following loss function as the optimization objective, using a stochastic gradient descent algorithm to perform back-propagation of the model:
L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6.
the evidence searching method based on the greedy strategy comprises the following steps:
step 1: set the evidence currently looked up as
Figure BDA0002834806230000102
The current prediction is of the class
Figure BDA0002834806230000103
Evidence of an object
Figure BDA0002834806230000104
Object classes
Figure BDA0002834806230000105
The candidate sentence set contained in the candidate document set is S ═ S 1 ,s 2 ,…,s N In which s is i The ith sentence is represented, and the statement is c;
step 2: constructing a set of candidate evidences
Figure BDA0002834806230000106
Wherein
Figure BDA0002834806230000107
Representing the ith candidate evidence;
and step 3: evaluating each evidence in the candidate evidence set using the pre-trained language model BERT, i.e.
Figure BDA0002834806230000108
Wherein V ∈ R C Is a C-dimensional vector, C representing the number of categories;
and 4, step 4: the candidate evidence and the category corresponding to the highest score are made For the current evidence and prediction categories, i.e.
Figure BDA0002834806230000109
And 5: if the current highest score is higher than the historical highest score, then the target evidence and target category are updated, i.e.
Figure BDA00028348062300001010
Step 6: removing sentences that have been selected as evidence from the set of candidate sentences, i.e.
Figure BDA00028348062300001011
And 7: if the number of sentences contained in the currently searched evidence reaches a preset threshold value K, that is to say
Figure BDA00028348062300001012
The search is stopped and output
Figure BDA00028348062300001013
Otherwise, repeating the step 2 to the step 6.
The training examples corresponding to the six constraints are constructed as follows:
Given a statement c to be verified in the training set, its annotated category y, its annotated evidence e = {s_e1, s_e2, …, s_eM}, and its candidate sentence set S = {s_1, s_2, …, s_N}, training samples are constructed in the following way:
For constraint one, if y = N, i.e. the annotated category of the statement is "the truth of the statement cannot be established", the training examples of the constraint are all singleton subsets of S, i.e. the training example set is T_1 = {{s_i} : s_i ∈ S}, where each {s_i} is one training example of the constraint;
for constraint two, if y = T or y = F, i.e. the annotated category of the statement is "statement is true" or "statement is false", the training examples of the constraint are all singleton subsets of e, i.e. the training example set is T_2 = {{s_ei} : s_ei ∈ e}, where each {s_ei} is one training example of the constraint;
for constraint three, if y = T or y = F, the training example of the constraint is e itself, i.e. the training example set is T_3 = {e};
for constraint four, if y = T or y = F, the training sample set of the constraint is T_4 = {{S_sub, S_vsub}}, where S_sub is any subset of e and S_vsub is any subset of S such that S_sub and S_vsub contain the same number of sentences and differ in exactly one sentence; each pair {S_sub, S_vsub} is one training example of the constraint;
for constraint five, if y = T or y = F, the training sample set of the constraint is T_5 = {{e, S_sub} : S_sub ⊊ e}, where S_sub is any proper subset of e; each pair {e, S_sub} is one training example of the constraint;
for constraint six, if y = T or y = F, the training sample set of the constraint is T_6 = {{e, S_sup}}, where S_sup is any subset of S such that e is a proper subset of S_sup and S_sup contains exactly one sentence more than e; each pair {e, S_sup} is one training example of the constraint.
This embodiment is described below with reference to a specific example.
Given an example: the statement c is "Giada at Home was only available on DVD.", with annotated category y = F and annotated evidence E = {s_e1, s_e2}, where s_e1 is "Giada at Home is a television show and first aired on October 18, 2008, on the Food Network." and s_e2 is "Food Network is an American basic cable and satellite television channel.".
In the data preprocessing stage, as shown in FIG. 1, entity labeling is performed on c to obtain the entity set {Giada at Home, DVD, Giada, Home}; entity linking is then used to retrieve a candidate document set from the corpus, with document titles {Giada_at_Home, DVD, Giada}. The text of document "Giada_at_Home" has 3 sentences, that of document "DVD" has 2 sentences, and that of document "Giada" has 4 sentences, so the candidate sentence set corresponding to c is S = {s_1, s_2, …, s_9}, where s_1 = (Giada_at_Home, 0) denotes the first sentence of document "Giada_at_Home" and the other s_i follow by analogy.
In the training phase, as shown in FIG. 2, the margin hyperparameter of each constraint is set to 1. Training data are constructed from the candidate sentence set S and the annotated evidence E as follows:
1. Construct the proper subsets of E, S_sub = {{s_e1}, {s_e2}}; these subsets must satisfy constraint two and constraint five, so they are substituted into the corresponding objective functions to compute the loss values:
L_2 = Σ_{s∈{s_e1, s_e2}} Σ_{ỹ∈{T,F}} max(0, 1 - f_ỹ({s}) + f_N({s})),
L_5 = Σ_{S'∈S_sub} max(0, 1 - f_F(E) + f_F(S')).
2. Construct the set S_vsub = {{s_e1, s_i} : s_i ∈ S ∧ s_i ≠ s_e1 ∧ s_i ≠ s_e2} ∪ {{s_e2, s_i} : s_i ∈ S ∧ s_i ≠ s_e1 ∧ s_i ≠ s_e2}, and pair each of its elements with the subset S_sub = E; these pairs must satisfy constraint four, so they are substituted into the corresponding objective function to compute the loss value:
L_4 = Σ_{S_v∈S_vsub} max(0, 1 - f_F(E) + f_F(S_v)).
3. Construct the proper supersets of E, S_sup = {{s_e1, s_e2, s_i} : s_i ∈ S ∧ s_i ≠ s_e1 ∧ s_i ≠ s_e2}; each of them together with E must satisfy constraint six, so they are substituted into the corresponding objective function to compute the loss value:
L_6 = Σ_{S'∈S_sup} max(0, 1 - f_F(E) + f_F(S')).
4. E itself must satisfy constraint three, so it is substituted into the corresponding objective function to compute the loss value:
L_3 = Σ_{ỹ≠F} max(0, 1 - f_F(E) + f_ỹ(E)).
From the loss values of the constraints this sample satisfies, the final target loss is calculated:
L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6,
where the loss of any constraint with no examples for this sample (here L_1) is zero. This loss is then used to perform stochastic gradient descent and update the parameters of the evaluation model.
In the prediction stage, prediction uses the greedy-strategy evidence search method; as shown in FIG. 3, the model prediction flow is:
Step 1: initialize the currently searched evidence Ê = ∅, the current predicted category ŷ = ∅, the target evidence E* = ∅, and the target category y* = ∅; the set of all sentences contained in the candidate document set is S = {s_1, s_2, …, s_9}, and the statement is c;
Step 2: construct the candidate evidence set {E_1, E_2, …}, where E_i = Ê ∪ {s_i} denotes the i-th candidate evidence, one for each s_i ∈ S;
Step 3: evaluate each candidate evidence in the set with the pre-trained language model BERT, i.e. V_i = BERT(c, E_i), where V_i ∈ R^C is a C-dimensional score vector and C denotes the number of categories;
Step 4: take the candidate evidence and category corresponding to the highest score as the current evidence and predicted category, i.e. (Ê, ŷ) = argmax over all E_i and categories y of V_i[y];
Step 5: if the current highest score is higher than the historical highest score, update the target evidence and target category, i.e. E* ← Ê, y* ← ŷ;
Step 6: remove the sentences already selected as evidence from the candidate sentence set, i.e. S ← S \ Ê;
Step 7: if the number of sentences contained in the currently searched evidence reaches the given threshold K, i.e. |Ê| ≥ K, stop the search and output E* and y*; otherwise repeat step 2 to step 6.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (6)

1. A method for jointly extracting evidence and statements for fact detection, characterized in that the method comprises the following steps:
S1: specifying a corpus for retrieval and a statement to be verified, cleaning the corpus, and performing entity extraction on the statement to obtain an entity set;
S2: document retrieval: for the statement to be verified, retrieving and constructing a candidate document set from the cleaned corpus according to the entity set using an entity linking method, and taking all sentences in the candidate document set as the candidate sentence set;
the entity linking being specifically: obtaining the corresponding entity set according to step S1; traversing all documents in the corpus, and adding a document to the candidate document set if its title contains any entity of the statement to be verified;
S3: constructing evidence by a greedy-strategy evidence search method, using the pre-trained language model BERT as the evaluation model of the evidence, and training and testing the evaluation model to obtain the final target evidence and target category;
wherein the evidence is a subset of the candidate sentence set;
training and testing the evaluation model comprising the following steps:
S3.1: converting the greedy-strategy search scheme into six equivalent constraints, and converting the six constraints into six corresponding loss objective functions;
constructing training samples and test samples corresponding to the six constraints from the annotated evidence and candidate sentence sets already in the data set;
each sample in the training data having to satisfy at least one constraint;
substituting the training samples into the objective functions of the constraints they satisfy to compute the corresponding loss values, and then updating and optimizing the parameters of the evaluation model by stochastic gradient descent based on these losses;
the six constraints being respectively as follows:
constraint one: if the annotated category y of a statement is N, i.e. "the truth of the statement cannot be established", then every candidate evidence of the statement scores higher on category N than on the other categories; the loss function of this constraint is:
L_1 = Σ_{<c,S,e,y>∈D : y=N} Σ_{s∈S} Σ_{ỹ≠N} max(0, α_1 - f_N({s}) + f_ỹ({s})),
where f_ỹ(·) denotes the score on category ỹ, f_N(·) denotes the score on category N, {s} denotes a candidate evidence consisting of a single candidate sentence, and α_1 ≥ 0 is a margin hyperparameter; D = {<c_i, S_i, e_i, y_i> : 1 ≤ i ≤ N} is the given data set, in which c_i, S_i, e_i, y_i denote in turn the i-th statement, the candidate sentence set corresponding to the i-th statement, the annotated evidence of the i-th statement, and the annotated category of the i-th statement;
constraint two: if the annotated category y of a statement is T or F, i.e. "statement is true" or "statement is false", then every singleton subset of the annotated evidence corresponding to the statement scores lower on category N than on categories T and F; the loss function of this constraint is:
L_2 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{s∈e} Σ_{ỹ∈{T,F}} max(0, α_2 - f_ỹ({s}) + f_N({s})),
where α_2 ≥ 0 is a margin hyperparameter;
constraint three: the annotated evidence e scores higher on the annotated category y than on the wrong categories; the loss function of this constraint is:
L_3 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{ỹ≠y} max(0, α_3 - f_y(e) + f_ỹ(e)),
where α_3 ≥ 0 is a margin hyperparameter;
constraint four: any subset of the annotated evidence e scores higher than every other set that is of the same size as the subset and has one and only one element different; the loss function of this constraint is:
L_4 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{(S_sub, S_vsub)} max(0, α_4 - f_y(S_sub) + f_y(S_vsub)),
where α_4 ≥ 0 is a margin hyperparameter, S_sub denotes any subset of the annotated evidence e, and S_vsub denotes a set of the same size as S_sub differing from it in one and only one element;
constraint five: the annotated evidence e scores higher on the annotated category y than all of its proper subsets; the loss function of this constraint is:
L_5 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{S'_sub ⊊ e} max(0, α_5 - f_y(e) + f_y(S'_sub)),
where α_5 ≥ 0 is a margin hyperparameter and S'_sub denotes any proper subset of the annotated evidence e;
constraint six: the annotated evidence e scores higher on the annotated category y than its proper supersets; the loss function of this constraint is:
L_6 = Σ_{<c,S,e,y>∈D : y∈{T,F}} Σ_{S_sup ⊋ e, S_sup ⊆ S} max(0, α_6 - f_y(e) + f_y(S_sup)),
where α_6 ≥ 0 is a margin hyperparameter;
S3.2: for a given test sample, constructing the evidence iteratively with the greedy-strategy evidence search method:
in each iterative search, based on the currently searched evidence, computing with the pre-trained language model BERT the scores of every candidate sentence in the candidate sentence set on all categories, then taking the candidate sentence with the highest score and the corresponding category;
updating the candidate sentence set, i.e. deleting the selected candidate sentence from the candidate sentence set;
updating the currently searched evidence, i.e. adding the selected candidate sentence to the currently searched evidence;
taking the currently searched evidence and the corresponding category as the predicted evidence and predicted category obtained by the current iterative search;
stopping the iteration when the number of sentences contained in the currently searched evidence reaches a preset threshold;
since each iteration yields a predicted evidence, a predicted category, and the highest score of that iteration stage, taking the predicted evidence and category with the highest score as the final target evidence and category;
the greedy-strategy evidence search method comprising the following steps:
step 1: initializing the currently searched evidence Ê = ∅, the current predicted category ŷ = ∅, the target evidence E* = ∅, and the target category y* = ∅; the candidate sentence set contained in the candidate document set being S = {s_1, s_2, …, s_N}, where s_i denotes the i-th sentence, and the statement being c;
step 2: constructing the candidate evidence set {E_1, E_2, …}, where E_i = Ê ∪ {s_i} denotes the i-th candidate evidence, one for each s_i ∈ S;
step 3: evaluating each candidate evidence in the set with the pre-trained language model BERT, i.e. V_i = BERT(c, E_i), where V_i ∈ R^C is a C-dimensional score vector and C denotes the number of categories;
step 4: taking the candidate evidence and category corresponding to the highest score as the current evidence and predicted category, i.e. (Ê, ŷ) = argmax over all E_i and categories y of V_i[y];
step 5: if the current highest score is higher than the historical highest score, updating the target evidence and target category, i.e. E* ← Ê, y* ← ŷ;
step 6: removing the sentences already selected as evidence from the candidate sentence set, i.e. S ← S \ Ê;
step 7: if the number of sentences contained in the currently searched evidence reaches the preset threshold K, i.e. |Ê| ≥ K, stopping the search and outputting E* and y*; otherwise repeating step 2 to step 6.
2. The method for jointly extracting evidence and statements for fact detection according to claim 1, characterized in that cleaning the corpus in S1 means performing text cleaning on all documents in the corpus, including removing stop words, low-frequency words, and special symbols.
3. The method for jointly extracting evidence and statements for fact detection according to claim 2, characterized in that entity extraction on the statement means extracting all entities in the statement, including organization names, person names, and place names, using a method based on hidden Markov models.
4. The method for jointly extracting evidence and statements for fact detection according to claim 1, characterized in that the evaluation model is optimized by minimizing a loss function as the optimization objective, using a stochastic gradient descent algorithm to perform back-propagation of the model.
5. The method for jointly extracting evidence and statements for fact detection according to claim 4, characterized in that the loss function is:
L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6.
6. The method for jointly extracting evidence and statements for fact detection according to claim 5, characterized in that the training examples corresponding to the six constraints are constructed as follows:
given a statement c to be verified in the training set, its annotated category y, its annotated evidence e = {s_e1, s_e2, …, s_eM}, and its candidate sentence set S = {s_1, s_2, …, s_N}, training samples are constructed in the following way:
for constraint one, if y = N, i.e. the annotated category of the statement is "the truth of the statement cannot be established", the training examples of the constraint are all singleton subsets of S, i.e. the training example set is T_1 = {{s_i} : s_i ∈ S}, where each {s_i} is one training example of the constraint;
for constraint two, if y = T or y = F, i.e. the annotated category of the statement is "statement is true" or "statement is false", the training examples of the constraint are all singleton subsets of e, i.e. the training example set is T_2 = {{s_ei} : s_ei ∈ e}, where each {s_ei} is one training example of the constraint;
for constraint three, if y = T or y = F, the training example of the constraint is e itself, i.e. the training example set is T_3 = {e};
for constraint four, if y = T or y = F, the training sample set of the constraint is T_4 = {{S_sub, S_vsub}}, where S_sub is any subset of e and S_vsub is any subset of S such that S_sub and S_vsub contain the same number of sentences and differ in exactly one sentence; each pair {S_sub, S_vsub} is one training example of the constraint;
for constraint five, if y = T or y = F, the training sample set of the constraint is T_5 = {{e, S'_sub} : S'_sub ⊊ e}, where S'_sub is any proper subset of e; each pair {e, S'_sub} is one training example of the constraint;
for constraint six, if y = T or y = F, the training sample set of the constraint is T_6 = {{e, S_sup}}, where S_sup is any subset of S such that e is a proper subset of S_sup and S_sup contains exactly one sentence more than e; each pair {e, S_sup} is one training example of the constraint.
CN202011467223.0A 2020-12-14 2020-12-14 Evidence and statement combined extraction method for fact detection Active CN112579583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011467223.0A CN112579583B (en) 2020-12-14 2020-12-14 Evidence and statement combined extraction method for fact detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011467223.0A CN112579583B (en) 2020-12-14 2020-12-14 Evidence and statement combined extraction method for fact detection

Publications (2)

Publication Number Publication Date
CN112579583A CN112579583A (en) 2021-03-30
CN112579583B true CN112579583B (en) 2022-07-29

Family

ID=75134819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011467223.0A Active CN112579583B (en) 2020-12-14 2020-12-14 Evidence and statement combined extraction method for fact detection

Country Status (1)

Country Link
CN (1) CN112579583B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048286B (en) * 2021-10-29 2024-06-07 南开大学 Automatic fact verification method integrating graph converter and common attention network
CN116383239B (en) * 2023-06-06 2023-08-15 中国人民解放军国防科技大学 Mixed evidence-based fact verification method, system and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5918240A (en) * 1995-06-28 1999-06-29 Xerox Corporation Automatic method of extracting summarization using feature probabilities
CN103488707A (en) * 2013-09-06 2014-01-01 中国人民解放军国防科学技术大学 Method of searching for candidate classes based on greedy strategy and heuristic algorithm
CN107533698A (en) * 2015-05-08 2018-01-02 汤森路透全球资源无限公司 The detection and checking of social media event
CN109766434A (en) * 2018-12-29 2019-05-17 北京百度网讯科技有限公司 Abstraction generating method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5918240A (en) * 1995-06-28 1999-06-29 Xerox Corporation Automatic method of extracting summarization using feature probabilities
CN103488707A (en) * 2013-09-06 2014-01-01 中国人民解放军国防科学技术大学 Method of searching for candidate classes based on greedy strategy and heuristic algorithm
CN107533698A (en) * 2015-05-08 2018-01-02 汤森路透全球资源无限公司 The detection and checking of social media event
CN109766434A (en) * 2018-12-29 2019-05-17 北京百度网讯科技有限公司 Abstraction generating method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Reasoning over semantic-level graph for fact checking; W. Zhong et al.; Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; 2020; 1-7 *
Research on text summarization technology based on a concept object model; Sun Xiusheng; China Master's Theses Full-text Database, Information Science and Technology (monthly); 2016, No. 8; I138-1486 *

Also Published As

Publication number Publication date
CN112579583A (en) 2021-03-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant