CN112257443A - MRC-based company entity disambiguation method combined with knowledge base - Google Patents

MRC-based company entity disambiguation method combined with knowledge base

Info

Publication number
CN112257443A
CN112257443A
Authority
CN
China
Prior art keywords
entity
mrc
task
loss
knowledge base
Prior art date
Legal status
Granted
Application number
CN202011070276.9A
Other languages
Chinese (zh)
Other versions
CN112257443B (en)
Inventor
张汝宸
朱德伟
朱峰
Current Assignee
Huatai Securities Co ltd
Original Assignee
Huatai Securities Co ltd
Priority date
Filing date
Publication date
Application filed by Huatai Securities Co ltd filed Critical Huatai Securities Co ltd
Priority to CN202011070276.9A
Publication of CN112257443A
Application granted
Publication of CN112257443B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 40/295 Named entity recognition
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06N 3/045 Combinations of networks (neural network architectures)
    • G06N 3/08 Learning methods (neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an MRC-based company entity disambiguation method combined with a knowledge base, which comprises the following steps: acquiring a sentence to be disambiguated; splicing the sentence to be disambiguated with a question sentence to obtain an MRC structure; acquiring, from an entity knowledge base, the different entity description sentences corresponding to the ambiguous abbreviation in the sentence to be disambiguated; splicing the different entity description sentences onto the end of the MRC structure; inputting the MRC structure spliced with the different entity description sentences into a Bert model; and the Bert model outputting the real entity corresponding to the ambiguous abbreviation, thereby disambiguating the sentence. The method effectively improves the accuracy of model prediction, while the generalization capability of the supervised model avoids the need for re-labeling and re-training when a new company entity is added.

Description

MRC-based company entity disambiguation method combined with knowledge base
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an MRC-based company entity disambiguation method combined with a knowledge base.
Background
Text is the main medium through which information about company entities spreads, and accurately locating the company entities involved in a news item (company association) directly determines how downstream financial work is carried out. In financial news, many company entities (out of tens of millions of company entities) appear as domain abbreviations, which very easily causes ambiguity. For example, 'Laobaixing' (literally 'the common people') may refer to a listed company or to the general public; 'Wuliangye' may refer to a listed company or to the liquor itself. The essence of entity disambiguation is that one word may have multiple meanings, and the exact meaning it expresses must be determined from the context combined with knowledge from a knowledge base. Resolving the ambiguity of company entities is important for subsequently understanding the content of financial news and accurately associating company entity information.
At present, the common methods for company entity disambiguation are: (1) regular expression matching: maintain positive- and negative-sample rules (unambiguous mentions are positive samples, ambiguous mentions are negative samples) for every company whose name may be ambiguous, and judge whether ambiguity exists by regular matching; (2) unsupervised sample clustering: mine positive- and negative-sample clusters by semantically clustering texts containing company abbreviations, and disambiguate accordingly; (3) supervised sample classification: label positive and negative samples for companies whose names may be ambiguous and train a binary classification model to disambiguate.
Among the above methods, the regular-expression-matching method has high precision but low recall, poor extensibility, and low efficiency, because the rule base must be continuously maintained by hand. The unsupervised-clustering method, on the one hand, has relatively low accuracy due to the lack of supervision information, and on the other hand requires new unsupervised corpora and re-clustering for each newly added company entity to be disambiguated. The supervised-classification method, on the one hand, cannot determine the specific ambiguity class of a negative (ambiguous) sample because it only performs binary classification of positive and negative samples, and on the other hand cannot effectively exploit the knowledge base's descriptions of entities because no entity knowledge base information is introduced.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an MRC-based company entity disambiguation method combined with a knowledge base, so as to solve the problem of relatively low accuracy in the prior art.
In order to solve the above technical problem, the invention adopts the following technical scheme:
An MRC-based company entity disambiguation method combined with a knowledge base, comprising the following steps:
acquiring a sentence to be disambiguated;
splicing the sentence to be disambiguated with a question sentence to obtain an MRC structure;
acquiring, from an entity knowledge base, the different entity description sentences corresponding to the ambiguous abbreviation in the sentence to be disambiguated;
splicing the different entity description sentences onto the end of the MRC structure;
inputting the MRC structure spliced with the different entity description sentences into a Bert model;
and the Bert model outputting the real entity corresponding to the ambiguous abbreviation, thereby disambiguating the sentence.
Furthermore, two loss functions are arranged at the output end of the Bert model; the loss functions comprise a first task loss function and a second task loss function.
Further, the first task loss function is a binary classification loss; the second task loss function is a multi-classification loss.
Further, the first task loss function is expressed by the following formulas:
output_1 = Sigmoid(W_1 × H_[CLS])
loss_1 = binary_crossentropy(output_1, label_1)
where output_1 denotes the model output of task one; Sigmoid() denotes the logistic function; W_1 denotes the weight matrix for computing the task-one output; H_[CLS] denotes the semantic vector at the sentence-start ([CLS]) position; loss_1 denotes the loss of task one; binary_crossentropy() denotes the binary cross-entropy loss function; and label_1 denotes the true label of task one.
Further, the second task loss function is expressed by the following formulas:
entity_output_i = Sigmoid(W_entity_i × H_entity_i)
loss_entity_i = binary_crossentropy(entity_output_i, label_entity_i)
loss_2 = Σ_{i=1}^{N} loss_entity_i
where entity_output_i denotes the model output at the i-th entity; W_entity_i denotes the weight matrix for computing the output at the i-th entity; H_entity_i denotes the semantic vector at the i-th entity position; loss_entity_i denotes the loss of the i-th entity; label_entity_i denotes the true label of the i-th entity; loss_2 denotes the loss of task two; and N denotes the number of entities that the ambiguous abbreviation may correspond to.
Further, the Bert model disambiguates the MRC structure spliced with the different entity description sentences through the first task loss function and the second task loss function; the specific disambiguation process is as follows:
judging, through the first task loss function, whether an ambiguous abbreviation exists in the sentence to be disambiguated;
and if so, determining, through the second task loss function, the real entity corresponding to the ambiguous abbreviation from among the different entity description sentences.
Further, the Bert model is formed by stacking 12 layers of the basic neural network structure.
Further, the Bert model is trained, and the MRC structure is input into the trained Bert model to disambiguate the sentence; the training method of the Bert model is as follows:
setting the parameters of the basic neural network structure in the Bert model;
randomly re-initializing, with equal probability, the parameters of the last 3 layers of the basic neural network structure;
and training the Bert model whose parameters have been randomly re-initialized, stopping once the loss function of the Bert model converges, to obtain the trained and optimized Bert model.
Compared with the prior art, the invention has the following beneficial effects:
the invention inputs more effective information into the model through the introduction of entity description in the entity knowledge base, improves the prediction capability of the model, simultaneously utilizes the input construction mode of MRC, conforms to the input characteristics of a pretraining stage of a Bert model, further improves the accuracy rate of entity disambiguation, and further finely distinguishes different types of ambiguities through the specific classification of entity reference content, and accelerates the convergence of the model and enhances the training stability through the use of multi-task learning and weight reinitialization.
Drawings
FIG. 1 is an example of a reading-comprehension-style input constructed by concatenation with a question sentence;
FIG. 2 is an example of effectively associating the sentence to be disambiguated with entity description sentences.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
An MRC-based company entity disambiguation method combined with a knowledge base, comprising the following steps:
(1) acquiring a sentence to be disambiguated;
(2) splicing the sentence to be disambiguated with a question sentence to obtain an MRC structure;
Firstly, the input is constructed in a reading-comprehension-like manner: the sentence to be disambiguated is spliced, as the reading text, with a question sentence to obtain the MRC structure, which serves as the input.
(3) acquiring, from an entity knowledge base, the different entity description sentences corresponding to the ambiguous abbreviation in the sentence to be disambiguated;
(4) splicing the different entity description sentences onto the end of the MRC structure;
Entity description sentences that share the same abbreviation but refer to different entities are spliced onto the end of the MRC-structure input by means of the entity knowledge base, providing more detailed disambiguation information.
(5) inputting the MRC structure spliced with the different entity description sentences into a Bert model;
(6) the Bert model outputting the real entity corresponding to the ambiguous abbreviation, thereby disambiguating the sentence.
At the output end of the Bert model, two tasks are designed and their losses are accumulated, exploiting the property that tasks in multi-task learning can promote one another: one is a binary classification loss that judges whether ambiguity exists, and the other is a multi-classification loss that determines the specific ambiguity category.
When training of the Bert model begins, only the model weights close to the input layer are retained, and the weights close to the output layer are randomly re-initialized; this accelerates model convergence, reduces loss jitter during training, and improves training stability.
The method comprises the following specific steps:
step 1-construct input of MRC mode
The method comprises the steps of constructing model input in an MRC mode, using a sentence needing disambiguation as a reading text, and constructing input similar to reading understanding through a mode of splicing with a question sentence, wherein an example is shown in FIG. 1, the sentence to be disambiguated in FIG. 1 is 'apple futures continuous pressure bearing, but fluctuation is gradually reduced', a reading understanding problem is 'indication of an ambiguous main body', and a question and the sentence to be disambiguated 'pay attention to each other' through a Self-attention (Self-attention) mechanism, so that the model learns to extract information favorable for disambiguation from the sentence according to the question.
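As a purely illustrative sketch (not code from the patent), the input construction of Step 1 can be reproduced with the HuggingFace transformers tokenizer; the question wording, checkpoint name, and variable names below are assumptions:

    # A minimal sketch of Step 1, assuming the HuggingFace transformers library
    # and the bert-base-chinese checkpoint; the question wording is a
    # hypothetical rendering of the question in FIG. 1.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    question = "歧义主体指代的是什么？"  # "what does the ambiguous mention refer to?"
    sentence = "苹果期货持续承压，但波动逐渐减小"  # sentence to be disambiguated (FIG. 1)

    # Bert's sentence-pair input: [CLS] question [SEP] sentence [SEP]
    encoded = tokenizer(question, sentence, return_tensors="pt")
    print(tokenizer.decode(encoded["input_ids"][0]))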
Step 2-Combine the input with the entity knowledge base
To make full use of the effective information in the structured knowledge base, a structure that fuses the model with knowledge base knowledge must be designed. The most effective information in the entity knowledge base comes from entity description sentences: all the entity descriptions that the ambiguous abbreviation may correspond to are spliced, in sequence, onto the end of the MRC-structure input, and the attention mechanism effectively associates the sentence to be disambiguated with the entity description sentences. As shown in FIG. 2, the ambiguous abbreviation in the sentence to be disambiguated is 'apple', and this keyword corresponds to multiple entities in the structured entity knowledge base, such as 'a plant of the genus Malus', i.e. the fruit we usually eat, and 'a high-tech company', i.e. Apple Inc. of the United States. This fusion effectively exploits the structured information in the entity knowledge base, helps the model understand the semantics of the sentence to be disambiguated, and improves disambiguation accuracy.
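Continuing the sketch above (and reusing its tokenizer, question, and sentence), Step 2 can be approximated by joining the candidate entity descriptions onto the tail of the pair input; the description texts and the [SEP]-joining convention are assumptions, since the text only specifies splicing the descriptions in sequence at the end:

    # A sketch of Step 2: append the candidate entity descriptions from the
    # knowledge base to the end of the MRC input. The descriptions below are
    # illustrative stand-ins for knowledge base entries.
    descriptions = [
        "苹果，蔷薇科苹果属植物的果实",  # the Malus-plant (fruit) sense
        "苹果公司，美国一家高科技公司",  # the Apple Inc. sense
    ]

    # [CLS] question [SEP] sentence [SEP] desc_1 [SEP] desc_2 [SEP]
    # BertTokenizer recognizes the literal "[SEP]" string as the special token.
    text_b = sentence + "[SEP]" + "[SEP]".join(descriptions)
    encoded = tokenizer(question, text_b, return_tensors="pt")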
Step 3-Multi-task learning
At the output end, the design of the loss function is also crucial. Exploiting the property that tasks in multi-task learning can promote one another, two tasks are designed and their losses are accumulated.
1) The first task is to distinguish whether the sentence to be disambiguated contains an ambiguous entity abbreviation;
output_1 = Sigmoid(W_1 × H_[CLS])
loss_1 = binary_crossentropy(output_1, label_1)
where output_1 denotes the model output of task one; Sigmoid() denotes the logistic function; W_1 denotes the weight matrix for computing the task-one output; H_[CLS] denotes the semantic vector at the sentence-start ([CLS]) position; loss_1 denotes the loss of task one; binary_crossentropy() denotes the binary cross-entropy loss function; and label_1 denotes the true label of task one.
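In PyTorch terms, the task-one head can be sketched as follows; only the formula structure comes from the text above, while the shapes, batch, and labels are illustrative assumptions:

    # A PyTorch sketch of the task-one (binary classification) head; tensor
    # names mirror the formulas above, all shapes are illustrative.
    import torch
    import torch.nn as nn

    hidden_size = 768                       # Bert-base hidden width
    W1 = nn.Linear(hidden_size, 1)          # weight matrix W_1
    bce = nn.BCELoss()                      # binary_crossentropy

    h_cls = torch.randn(2, hidden_size)     # H_[CLS]: stand-in for the Bert output at [CLS]
    label1 = torch.tensor([[1.0], [0.0]])   # 1 = the sentence contains an ambiguous abbreviation

    output1 = torch.sigmoid(W1(h_cls))      # output_1 = Sigmoid(W_1 × H_[CLS])
    loss1 = bce(output1, label1)            # loss_1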
2) The second task is to determine which entity the ambiguous entity abbreviation corresponds to:
entity_output_i = Sigmoid(W_entity_i × H_entity_i)
loss_entity_i = binary_crossentropy(entity_output_i, label_entity_i)
loss_2 = Σ_{i=1}^{N} loss_entity_i
where entity_output_i denotes the model output at the i-th entity; W_entity_i denotes the weight matrix for computing the output at the i-th entity; H_entity_i denotes the semantic vector at the i-th entity position; loss_entity_i denotes the loss of the i-th entity; label_entity_i denotes the true label of the i-th entity; loss_2 denotes the loss of task two; and N denotes the number of entities that the ambiguous abbreviation may correspond to.
Step 4-Weight re-initialization
Generally, a Bert-based model retains all 12 layers of pre-trained basic neural network structure parameters and is fine-tuned directly on the downstream task. However, that training process is unstable and converges slowly. The reason is that not all the weights of the 12 pre-trained layers benefit the downstream task: the layers close to the input learn general semantic information such as part of speech and syntax, while the layers close to the output learn knowledge strongly tied to the pre-training tasks. Since the tasks used during Bert pre-training differ from the disambiguation task in this scheme, the network weights close to the output layer of the pre-trained Bert negatively affect downstream training. To address this instability, a weight re-initialization method is proposed (a sketch follows the list below):
1) copy all 12 layers of basic neural network structure parameters of the pre-trained Bert into the model of this scheme;
2) replace the parameters of the model's last 3 network layers by equal-probability random initialization between 0 and 1;
3) train the model after the last 3 layers' parameters have been re-initialized, and stop training once the model's loss function converges, obtaining the trained and optimized model;
4) use the trained model to receive input and produce the disambiguation output.
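A hedged sketch of steps 1) and 2) with the HuggingFace BertModel layout; note that the uniform initialization between 0 and 1 follows step 2) literally, which would be unusual in practice:

    # A sketch of the weight re-initialization, assuming the HuggingFace
    # BertModel layer layout. The uniform [0, 1) init mirrors step 2) above;
    # practical re-initialization schemes usually differ.
    import torch
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-chinese")  # all 12 pre-trained layers (step 1)

    for layer in model.encoder.layer[-3:]:      # the 3 layers closest to the output (step 2)
        for param in layer.parameters():
            with torch.no_grad():
                param.uniform_(0.0, 1.0)        # equal-probability random init between 0 and 1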
Step 1-Construct a knowledge base of company entity information, which must contain the various entities that each potentially ambiguous abbreviation may correspond to, together with their description information.
Step 2-Label a certain amount of supervised corpora, including unambiguous corpora and ambiguous corpora; the ambiguous corpora must be labeled with the specific entity class they correspond to.
Step 3-Construct the input in MRC mode: splice the question sentence with the sentence to be disambiguated, and splice the description sentences of all entities that the abbreviation may correspond to, one by one, at the end of the sentence.
Step 4-Input the result into the Bert model, compute two losses at the output end and superimpose them: a binary classification loss judging whether the sentence is ambiguous with respect to a company entity description, and a multi-classification loss determining the specific ambiguity class.
Step 5-Start training, keeping only the Bert weights close to the input layer and re-initializing the weights close to the output layer.
Step 6-Finish training to obtain the complete company entity disambiguation model. At prediction time the input is consistent with Step 3, and there are two outputs: one judging whether the sentence contains an ambiguous company entity, the other determining the specific ambiguity category the entity corresponds to. A training-step sketch follows.
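Tying the steps together, one possible training step looks like the sketch below; the DisambiguationModel class, the shared task-two projection, and all placeholder tensors are assumptions for illustration, not the patent's implementation:

    # An end-to-end sketch of one training step (Steps 4-5): run the MRC input
    # through Bert, compute both losses, and optimize their sum.
    import torch
    import torch.nn as nn
    from transformers import BertModel

    class DisambiguationModel(nn.Module):
        def __init__(self, hidden_size: int = 768):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            self.task1_head = nn.Linear(hidden_size, 1)  # binary: is the mention ambiguous?
            self.task2_head = nn.Linear(hidden_size, 1)  # per entity: which referent?

        def forward(self, input_ids, attention_mask, entity_positions):
            h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            output1 = torch.sigmoid(self.task1_head(h[:, 0]))   # [CLS] vector -> task one
            batch_idx = torch.arange(h.size(0)).unsqueeze(1)
            h_entity = h[batch_idx, entity_positions]           # vectors at the entity positions
            output2 = torch.sigmoid(self.task2_head(h_entity)).squeeze(-1)
            return output1, output2

    model = DisambiguationModel()
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    bce = nn.BCELoss(reduction="sum")

    # Placeholder batch: token ids, mask, and the positions of 2 entity descriptions.
    input_ids = torch.randint(0, 21128, (1, 64))
    attention_mask = torch.ones_like(input_ids)
    entity_positions = torch.tensor([[30, 45]])
    label1 = torch.tensor([[1.0]])
    label_entity = torch.tensor([[0.0, 1.0]])

    optimizer.zero_grad()
    output1, output2 = model(input_ids, attention_mask, entity_positions)
    loss = bce(output1, label1) + bce(output2, label_entity)    # loss_1 + loss_2
    loss.backward()
    optimizer.step()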
The invention discloses a company Entity Disambiguation method based on MRC (Machine Reading Comprehension) combined with a knowledge base. Aiming at the ambiguity problem in company entity association, the technique is based on a Bert model: first a disambiguation question is constructed in MRC mode; then the model input is constructed by combining entity description information from the entity knowledge base; at the output end, a binary classification loss (whether an ambiguous entity is present) and a multi-classification loss (which specific entity it is) are accumulated through multi-task learning to construct the loss function; finally, Weight Re-initialization accelerates the convergence of model training and improves training stability. The invention effectively solves the ambiguity problem of company entities, which mostly appear as abbreviations in large volumes of financial news, avoids the semantic misunderstanding caused by company names with multiple meanings, improves the accuracy of company association, and provides important basic technical support for downstream financial analysis algorithms.
Compared with the regular-expression-matching method, the invention relies on an initially labeled corpus and the generalization capability of the model, effectively avoiding a large amount of subsequent manual rule maintenance. Compared with the unsupervised-sample-clustering method, the introduction of labeled data effectively improves the accuracy of model prediction, and the generalization capability of the supervised model also avoids re-labeling and re-training when new company entities are added.
The present invention is not limited to the above embodiments. Based on the technical solutions disclosed herein, those skilled in the art can make substitutions and modifications to some technical features without creative effort, and such substitutions and modifications all fall within the protection scope of the present invention.

Claims (8)

1. An MRC-based company entity disambiguation method combined with a knowledge base, comprising the following steps:
acquiring a sentence to be disambiguated;
splicing the sentence to be disambiguated with a question sentence to obtain an MRC structure;
acquiring, from an entity knowledge base, the different entity description sentences corresponding to the ambiguous abbreviation in the sentence to be disambiguated;
splicing the different entity description sentences onto the end of the MRC structure;
inputting the MRC structure spliced with the different entity description sentences into a Bert model;
and the Bert model outputting the real entity corresponding to the ambiguous abbreviation, thereby disambiguating the sentence.
2. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 1, characterised in that the output end of the Bert model is provided with two loss functions; the loss functions comprise a first task loss function and a second task loss function.
3. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 2, wherein said first task loss function is a binary classification loss and said second task loss function is a multi-classification loss.
4. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 2, wherein said first task loss function is represented by the following formulas:
output_1 = Sigmoid(W_1 × H_[CLS])
loss_1 = binary_crossentropy(output_1, label_1)
where output_1 denotes the model output of task one; Sigmoid() denotes the logistic function; W_1 denotes the weight matrix for computing the task-one output; H_[CLS] denotes the semantic vector at the sentence-start ([CLS]) position; loss_1 denotes the loss of task one; binary_crossentropy() denotes the binary cross-entropy loss function; and label_1 denotes the true label of task one.
5. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 2, wherein said second task loss function is represented by the following formulas:
entity_output_i = Sigmoid(W_entity_i × H_entity_i)
loss_entity_i = binary_crossentropy(entity_output_i, label_entity_i)
loss_2 = Σ_{i=1}^{N} loss_entity_i
where entity_output_i denotes the model output at the i-th entity; W_entity_i denotes the weight matrix for computing the output at the i-th entity; H_entity_i denotes the semantic vector at the i-th entity position; loss_entity_i denotes the loss of the i-th entity; label_entity_i denotes the true label of the i-th entity; loss_2 denotes the loss of task two; and N denotes the number of entities that the ambiguous abbreviation may correspond to.
6. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 2, wherein the Bert model disambiguates the MRC structure spliced with the different entity description sentences through the first task loss function and the second task loss function; the specific disambiguation process is as follows:
judging, through the first task loss function, whether an ambiguous abbreviation exists in the sentence to be disambiguated;
and if so, determining, through the second task loss function, the real entity corresponding to the ambiguous abbreviation from among the different entity description sentences.
7. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 1, wherein said Bert model is formed by stacking 12 layers of the basic neural network structure.
8. The MRC-based company entity disambiguation method combined with a knowledge base according to claim 7, wherein the Bert model is trained and the MRC structure is input into the trained Bert model to disambiguate the sentence; the training method of the Bert model comprises:
setting the parameters of the basic neural network structure in the Bert model;
randomly re-initializing, with equal probability, the parameters of the last 3 layers of the basic neural network structure;
and training the Bert model whose parameters have been randomly re-initialized, stopping once the loss function of the Bert model converges, to obtain the trained and optimized Bert model.
CN202011070276.9A 2020-09-30 2020-09-30 MRC-based company entity disambiguation method combined with knowledge base Active CN112257443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011070276.9A CN112257443B (en) 2020-09-30 2020-09-30 MRC-based company entity disambiguation method combined with knowledge base


Publications (2)

Publication Number Publication Date
CN112257443A (en) 2021-01-22
CN112257443B CN112257443B (en) 2024-04-02

Family

ID=74234991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011070276.9A Active CN112257443B (en) 2020-09-30 2020-09-30 MRC-based company entity disambiguation method combined with knowledge base

Country Status (1)

Country Link
CN (1) CN112257443B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN109101579A (en) * 2018-07-19 2018-12-28 深圳追科技有限公司 customer service robot knowledge base ambiguity detection method
CN110781670A (en) * 2019-10-28 2020-02-11 合肥工业大学 Chinese place name semantic disambiguation method based on encyclopedic knowledge base and word vector
CN111401049A (en) * 2020-03-12 2020-07-10 京东方科技集团股份有限公司 Entity linking method and device
CN111339778A (en) * 2020-03-13 2020-06-26 苏州跃盟信息科技有限公司 Text processing method, device, storage medium and processor
CN111523326A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium
CN111709243A (en) * 2020-06-19 2020-09-25 南京优慧信安科技有限公司 Knowledge extraction method and device based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOYA LI et al.: "A Unified MRC Framework for Named Entity Recognition", arXiv, pages 1-11 *
MU Lingling; CHENG Xiaoyu; ZAN Hongying; HAN Yingjie: "A neural network Chinese word sense disambiguation model incorporating linguistic knowledge", Journal of Zhengzhou University (Natural Science Edition), no. 03, pages 15-20 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065353A (en) * 2021-03-16 2021-07-02 北京金堤征信服务有限公司 Entity identification method and device
CN113065353B (en) * 2021-03-16 2024-04-02 北京金堤征信服务有限公司 Entity identification method and device
CN113051892A (en) * 2021-03-22 2021-06-29 哈尔滨理工大学 Chinese word sense disambiguation method based on transformer model
CN113128238A (en) * 2021-04-28 2021-07-16 安徽智侒信信息技术有限公司 Financial information semantic analysis method and system based on natural language processing technology
CN113128238B (en) * 2021-04-28 2023-06-20 安徽智侒信信息技术有限公司 Financial information semantic analysis method and system based on natural language processing technology
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113158687B (en) * 2021-04-29 2021-12-28 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113220900A (en) * 2021-05-10 2021-08-06 深圳价值在线信息科技股份有限公司 Modeling method of entity disambiguation model and entity disambiguation prediction method
CN113220900B (en) * 2021-05-10 2023-08-25 深圳价值在线信息科技股份有限公司 Modeling Method of Entity Disambiguation Model and Entity Disambiguation Prediction Method

Also Published As

Publication number Publication date
CN112257443B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
Liu et al. Attention-based BiGRU-CNN for Chinese question classification
Zhang et al. A text sentiment classification modeling method based on coordinated CNN‐LSTM‐attention model
CN112257443A (en) MRC-based company entity disambiguation method combined with knowledge base
CN109325231B (en) Method for generating word vector by multitasking model
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
Wang et al. Text categorization with improved deep learning methods
CN111191464A (en) Semantic similarity calculation method based on combined distance
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN112766359A (en) Word double-dimensional microblog rumor recognition method for food safety public sentiment
CN115392248A (en) Event extraction method based on context and drawing attention
CN114239828A (en) Supply chain affair map construction method based on causal relationship
Sarikaya et al. Shrinkage based features for slot tagging with conditional random fields.
CN113869040A (en) Voice recognition method for power grid dispatching
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN113239694A (en) Argument role identification method based on argument phrase
Neill et al. Meta-embedding as auxiliary task regularization
Cai et al. Multi-view and attention-based bi-lstm for weibo emotion recognition
CN110705277A (en) Chinese word sense disambiguation method based on cyclic neural network
Liao et al. The sg-cim entity linking method based on bert and entity name embeddings
CN115329075A (en) Text classification method based on distributed machine learning
Shi Using domain knowledge for low resource named entity recognition
Wang et al. BiLSTM-ATT Chinese sentiment classification model based on pre-training word vectors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant