CN112507126B - Entity linking device and method based on recurrent neural network - Google Patents


Info

Publication number
CN112507126B
CN112507126B (application CN202011416594.6A)
Authority
CN
China
Prior art keywords
entity
link
candidate
result
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011416594.6A
Other languages
Chinese (zh)
Other versions
CN112507126A (en)
Inventor
洪万福
钱智毅
赵青欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yuanting Information Technology Co ltd
Original Assignee
Xiamen Yuanting Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yuanting Information Technology Co ltd filed Critical Xiamen Yuanting Information Technology Co ltd
Priority to CN202011416594.6A priority Critical patent/CN112507126B/en
Publication of CN112507126A publication Critical patent/CN112507126A/en
Application granted granted Critical
Publication of CN112507126B publication Critical patent/CN112507126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an entity linking device and method based on a recurrent neural network. The device comprises: a text input unit; an entity recognition unit, which runs the inference process of an entity recognition model on the target text supplied by the text input unit and outputs candidate entities; a knowledge base matching unit, which performs database matching on the candidate entities and outputs the preselected link results corresponding to each candidate entity; a text vectorization unit, which vectorizes the target text, the candidate entities and their preselected link results and combines them into an embedded vector for output; a link model inference unit, which performs entity link inference on the embedded vector and outputs an inference result; and a link result output unit, which determines the entity link result of each candidate entity in the knowledge base from the inference result. This implementation can make full use of external knowledge and thereby improve the accuracy of entity linking.

Description

Entity linking device and method based on recurrent neural network
Technical Field
The invention relates to the field of artificial intelligence, in particular to an entity linking device and method based on a recurrent neural network.
Background
With the recent wave of artificial intelligence, deep learning techniques have been applied across many industries and fields. The knowledge graph is a very important research direction in deep learning. At present, the techniques for building knowledge graphs through entity-relation extraction are largely mature, but problems remain when knowledge graphs are put to large-scale use, mainly because natural language is complex, polysemous and ambiguous.
Entity linking is the task of linking entity mentions in a text to the corresponding entities in a knowledge base, thereby resolving the ambiguity between entities. Its potential applications include information extraction, information retrieval and knowledge base population, but the task is challenging because of name variation and entity ambiguity.
Entity ambiguity shows up in two ways. First, an entity may have multiple synonyms (which need linking): one entity can be referred to by several entity mentions. For example, "Massachusetts Institute of Technology" and "MIT" refer to the same entity in Massachusetts, USA. Second, a mention can be polysemous (and needs disambiguation): the same entity name can denote several entities. For example, "apple" may refer to the fruit or to Apple Inc. An entity linking algorithm must use the entity's mention and the textual information of its context to link the mention to the correct mapped entity in the target knowledge graph.
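The two phenomena above can be illustrated with a toy alias table, sketched below in Python. All aliases and entity names here are illustrative assumptions, not data from the patent's knowledge base.

```python
# Synonymy: several mentions map to one knowledge-base entity.
# Polysemy: one mention maps to several knowledge-base entities.
ALIAS_TABLE = {
    "MIT": ["Massachusetts Institute of Technology"],
    "Massachusetts Institute of Technology": ["Massachusetts Institute of Technology"],
    "apple": ["apple (fruit)", "Apple Inc."],
}

def candidate_entities(mention: str) -> list[str]:
    """Return the knowledge-base entities a mention could link to."""
    return ALIAS_TABLE.get(mention, [])

# "MIT" and the full name are synonyms (both need linking to the same entity);
# "apple" is polysemous (it needs disambiguation among several entities).
```

A real system would back this table with a knowledge base; the disambiguation among the returned candidates is what the linking model in the sections below performs.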
Disclosure of Invention
In view of the above defects of the prior art, the present invention aims to provide an entity linking apparatus and method that make full use of external knowledge, optimize the link-model inference process and improve the accuracy of entity linking.
In order to achieve the above object, the present invention provides an entity linking apparatus based on a recurrent neural network, including:
the text input unit is used for inputting text data, performing data processing on the text data and outputting a target text;
the entity recognition unit is used for executing a reasoning process of an entity recognition model on the input target text and outputting candidate entities;
the knowledge base matching unit is used for inputting the candidate entities of the entity identification unit, performing database matching according to the candidate entities and outputting a preselected link result corresponding to each candidate entity;
the text vectorization unit is used for vectorizing the input target text, the candidate entities and the preselected link results corresponding to the candidate entities, and combining them into an embedded vector for output;
the link model reasoning unit is used for inputting the embedded vector, carrying out entity link reasoning according to the embedded vector and outputting a reasoning result;
and the link result output unit is used for receiving the inference result and determining the entity link result of each candidate entity in the knowledge base, namely outputting the id, entity name, entity type and text information of each candidate entity in the knowledge base.
Further, the text input unit includes:
the file reading module is used for receiving input text data;
and the data processing module is used for converting the input text data into a specified structured text to form a target text.
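A minimal sketch of these two modules in Python, assuming a JSON-style dict serves as the "specified structured text" (the patent does not fix a concrete schema, so the field name is an assumption):

```python
import json

def read_file(raw: str) -> str:
    # File reading module: receive the input text data.
    return raw.strip()

def to_target_text(raw: str) -> dict:
    # Data processing module: convert the input into a structured target text.
    return {"text": read_file(raw)}

doc = to_target_text("  Apple released a new phone.  ")
serialized = json.dumps(doc, ensure_ascii=False)
```

In practice the data processing module would also branch on the input format (txt, excel, csv, json), as described in the detailed embodiment below.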
Further, the entity identification unit includes:
the data preprocessing module is configured to perform a data preprocessing process on input text data, wherein the data preprocessing process comprises data cleaning, screening and word segmentation;
a vectorization processing module configured to perform a vector encoding operation after data preprocessing, and output an embedded vector;
the entity recognition model storage module is used for storing the trained entity recognition model;
the entity recognition model loading module is used for loading an entity recognition model and determining all candidate entities in the target text;
and a candidate entity result output module for performing normalization processing and outputting the candidate entities.
Further, the knowledge base matching unit includes:
the knowledge base storage module is used for storing a pre-prepared knowledge base file;
and the knowledge base matching module is used for matching the input candidate entity with the knowledge base file and acquiring a preselected link result of the candidate entity in the knowledge base.
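A minimal sketch of the knowledge base matching unit, assuming the knowledge base file has already been loaded into a dict keyed by entity name; the entries shown are illustrative, not the patent's data:

```python
# Pre-prepared knowledge base (knowledge base storage module), keyed by name.
KNOWLEDGE_BASE = {
    "Tengwangge": [
        {"id": "kb_001", "name": "Tengwangge", "type": "writing"},
        {"id": "kb_002", "name": "Tengwangge", "type": "music"},
        {"id": "kb_003", "name": "Tengwangge", "type": "poetry"},
    ],
}

def match_knowledge_base(candidates):
    """Knowledge base matching module: map each candidate entity to its
    preselected link results in the knowledge base (empty list if no match)."""
    return {c: KNOWLEDGE_BASE.get(c, []) for c in candidates}
```

Each candidate thus carries forward a list of preselected link results, among which the link model later chooses.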
Further, the link model inference unit includes:
the entity link model storage module is used for storing the entity link model which is trained;
and the entity link model loading module is used for loading the entity link model and the embedded vector and executing model reasoning.
Further, the link result output unit comprises an entity link result output module, which, after model inference is finished, standardizes the obtained entity link results of all candidate entities and outputs them in a set output mode and output format.
The invention also provides an entity linking method based on the recurrent neural network, which comprises the following steps:
step S1: inputting text data, performing data processing on the text data, and outputting a target text;
step S2: executing a reasoning process of an entity recognition model on the target text, and outputting candidate entities;
and step S3: obtaining a preselected link result corresponding to each candidate entity through knowledge base matching;
and step S4: vectorizing the target text, the candidate entities and the preselected link results corresponding to the candidate entities, and combining the vectorized vectors into an embedded vector;
step S5: executing the reasoning process of the entity link model according to the embedded vector, and outputting a reasoning result;
step S6: and determining an entity link result of each candidate entity in the knowledge base according to the reasoning result.
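The six steps can be sketched as one pipeline skeleton. Each stage below is a stub standing in for the corresponding unit; the stub behaviour and the scoring interface are illustrative assumptions, not the patent's implementation:

```python
def link_entities(raw_text, recognize, match_kb, vectorize, infer):
    target = {"text": raw_text.strip()}                       # S1: data processing
    candidates = recognize(target["text"])                    # S2: entity recognition
    preselected = {c: match_kb(c) for c in candidates}        # S3: KB matching
    emb = vectorize(target["text"], candidates, preselected)  # S4: embedded vector
    scores = infer(emb)                                       # S5: link-model inference
    # S6: for each candidate, keep the preselected result the model scores highest.
    return {c: max(preselected[c], key=lambda e: scores.get(e["id"], 0.0))
            for c in candidates if preselected[c]}
```

A usage example with trivial stubs: `link_entities("apple pie", lambda t: ["apple"], ...)` returns, per candidate, the knowledge-base entry with the best inference score.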
Further, the vectorization processing in step S4 specifically comprises: processing the target text, the candidate entities and their preselected link results by splicing multiple semantic encodings, the multiple semantic encodings comprising: character encoding, word segmentation, and n-gram models.
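A minimal sketch of this spliced ("concatenated") multi-encoding: a character-level code, a word-segmentation code, and a character n-gram code are each hashed into a small fixed-size bag-of-tokens vector and concatenated into one embedded vector. The hashing trick and the dimensions are illustrative assumptions, not the patent's actual encoders:

```python
def hashed_bow(tokens, dim=8):
    # Hash each token into one of `dim` buckets and count occurrences.
    v = [0.0] * dim
    for t in tokens:
        v[hash(t) % dim] += 1.0
    return v

def char_ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def embed(text, words, dim=8):
    # Splice three semantic encodings: characters, segmented words, n-grams.
    return (hashed_bow(list(text), dim)
            + hashed_bow(words, dim)
            + hashed_bow(char_ngrams(text), dim))

vec = embed("apple pie", ["apple", "pie"])
```

Note that Python's built-in `hash` on strings is salted per process, so only the bucket counts, not the exact bucket positions, are stable across runs; a real encoder would use trained embedding tables instead.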
Further, step S5 specifically comprises: inputting the context semantics of the candidate entities and the preselected link result corresponding to each candidate into a trained entity link model, and outputting an inference result; the entity link model adopts a recurrent neural network whose framework is based on BiLSTM + CNN + CRF, wherein the BiLSTM acquires information over the whole sequence of the preselected link result; the CNN extracts local features of the current word; and the CRF performs sequence labeling to separate correlations at the output level.
Further, the entity link result in step S6 at least includes the specific id, entity name, entity type and text information of the entity in the knowledge base.
The invention realizes the following technical effects:
according to the entity linking method, the neural network models are arranged in the entity linking process and the link reasoning process to carry out model reasoning to obtain the candidate entities, and the entity linking result is obtained by carrying out model reasoning according to the context semantics of the candidate entities and the preselected link result corresponding to each candidate, so that the external knowledge can be fully utilized, the link model reasoning process is optimized, and the accuracy of entity linking is improved.
Drawings
FIG. 1 is a system framework and flow diagram of the entity linking device of the present invention;
FIG. 2 is a schematic diagram of the entity linking method of the present invention;
FIG. 3 is a flow chart of entity recognition model training of the present invention;
FIG. 4 is a flow chart of the training of the entity-link model of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. With these references, one of ordinary skill in the art will appreciate other possible embodiments and advantages of the present invention. Elements in the figures are not drawn to scale and like reference numerals are generally used to indicate like elements.
The invention will now be further described with reference to the accompanying drawings and detailed description.
Referring to fig. 1 to 4, the present invention discloses an entity linking apparatus based on a recurrent neural network, which is a set of application programs or a set of control components applied to a server side, and includes: the system comprises a text input unit, an entity recognition unit, a knowledge base matching unit, a text vectorization unit, a link model reasoning unit and a link result output unit. The following description will be made for each functional unit:
1. and the text input unit is used for inputting text data, performing data processing on the text data and outputting a target text. The system specifically comprises a file reading module and a data processing module, wherein the file reading module is used for receiving input text data. The file reading module is configured to receive a text format of a text, wherein the text format can be an unstructured text (such as txt), a semi-structured text, a structured text (such as excel, csv, json) and other text uploading modes; and the data processing module is configured to convert the input text data into a structured text according to different text formats to form a target text output.
2. The entity recognition unit runs the inference process of an entity recognition model on the input target text and outputs candidate entities. It specifically comprises: a data preprocessing module, a vectorization processing module, an entity recognition model storage module, an entity recognition model loading module and a candidate entity result output module. The data preprocessing module performs data preprocessing on the input text data, including data cleaning, filtering and word segmentation. The vectorization processing module performs vector encoding after preprocessing to provide an embedded vector for entity recognition. The entity recognition model storage module stores the trained entity recognition model. The entity recognition model loading module loads the entity recognition model and determines all candidate entities in the target text. The candidate entity result output module standardizes and outputs the candidate entities.
In the entity recognition unit, the entity recognition model is obtained by training. As shown in fig. 3, the training process comprises: inputting text data as training data and preprocessing it (data cleaning, filtering, word segmentation and similar operations); vectorizing the training data; feeding the resulting embedded vector into the entity recognition model framework for training; and monitoring the training effect and saving the trained entity recognition model.
3. And the knowledge base matching unit is used for inputting the candidate entities of the entity identification unit, performing database matching according to the candidate entities and outputting a preselected link result corresponding to each candidate entity. The method specifically comprises the following steps: the system comprises a knowledge base storage module and a knowledge base matching module. The knowledge base storage module is used for storing a pre-prepared knowledge base file; and the knowledge base matching module is used for matching the input candidate entity with the knowledge base file and acquiring a preselected link result of the candidate entity in the knowledge base.
4. The text vectorization unit vectorizes the input target text, the candidate entities and their preselected link results, and combines them into one embedded vector for output. Vectorization here means processing the target text and candidate entities by splicing multiple semantic encodings, comprising character encoding, word segmentation, and the multiple segmentation encodings of n-gram models (also referred to as N-grams).
5. The link model inference unit receives the embedded vector, performs entity link inference on it and outputs an inference result. It specifically comprises: an entity link model storage module for storing the trained entity link model, and an entity link model loading module for loading the entity link model and the embedded vector and executing model inference. The entity link model is a recurrent neural network whose framework is mainly BiLSTM + CNN + CRF. The BiLSTM can capture information over the whole sequence; in the entity linking task this lets the context of the input sequence be fully used to match an entity in the knowledge base unit more accurately. When processing sequence data, BiLSTM adds a backward pass to the unidirectional LSTM, so the information following each position can also be used; the forward and backward values are then output to the output layer together, giving the full information of the sequence from both directions. However, on longer sentences a BiLSTM may discard important information because of limited model capacity, so a CNN layer is added to the model to extract local features of the current word.
The CRF (conditional random field) serves as the sequence labeling module and separates the correlations of the output layer, so the correlation of context information can be fully considered when predicting an entity in the knowledge base. More importantly, the Viterbi algorithm that decodes the CRF uses dynamic programming to compute the maximum-probability path, which matches the goal of the entity linking task well, and it avoids illegal sequences such as a "B-LOC" tag followed by an "I-ORG" tag appearing in the result.
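The Viterbi decoding described above can be sketched as plain dynamic programming over per-token tag scores, with a transition constraint that forbids illegal sequences such as "B-LOC" followed by "I-ORG". The tag set and scores below are illustrative assumptions, not the patent's trained CRF:

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

def allowed(prev, cur):
    # An "I-X" tag may only follow "B-X" or "I-X" of the same entity type.
    if cur.startswith("I-"):
        return prev != "O" and prev.endswith(cur[2:])
    return True

def viterbi(emissions):
    """emissions: one {tag: score} dict per token; return the best legal path."""
    # A sentence cannot start with an inside ("I-") tag.
    paths = {t: ([t], NEG_INF if t.startswith("I-") else emissions[0].get(t, NEG_INF))
             for t in TAGS}
    for em in emissions[1:]:
        new = {}
        for cur in TAGS:
            new[cur] = max(
                ((path + [cur], score + em.get(cur, NEG_INF))
                 for path, score in paths.values() if allowed(path[-1], cur)),
                key=lambda x: x[1],
                default=([], NEG_INF))
        paths = new
    return max(paths.values(), key=lambda x: x[1])[0]
```

With emissions that greedily favour "I-ORG" after "B-LOC", the decoder rejects the illegal transition and picks the best legal path instead, which is exactly the behaviour the text attributes to the CRF layer.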
In the link model inference unit, the entity link model is obtained through training. Referring to fig. 4, the training process of the entity link model is similar to that of the entity recognition model and comprises: inputting text data as training data and preprocessing it; associating the training data with the knowledge base and verifying its correctness; vectorizing the training data; feeding the resulting embedded vector into the entity link model framework for training; and monitoring the training effect and saving the trained entity link model.
6. The link result output unit receives the inference result and determines the entity link result of each candidate entity in the knowledge base, i.e. it outputs the id, entity name, entity type and text information of each candidate entity in the knowledge base. It specifically comprises an entity link result output module, which, after model inference is finished, standardizes the obtained entity link results of all candidate entities and outputs them. Standardization means that the names and formats of the output fields follow a pre-agreed output mode, output format and per-field meaning, so that the system can correctly receive and process the output result.
The invention also discloses an entity linking method, which comprises the following steps:
step S1: inputting original text data, performing structural conversion processing on the text data, and outputting a structured target text.
Step S2: and executing the inference process of the entity recognition model on the target text, and outputting a candidate entity.
And step S3: and obtaining a preselected link result corresponding to each candidate entity through knowledge base matching.
And step S4: and vectorizing the target text, the candidate entities and the preselected link results corresponding to the candidate entities, and combining the vectorized vectors into an embedded vector.
More specifically, the target text and candidate entities are processed by splicing multiple semantic encodings, comprising character encoding, word segmentation, the multiple segmentation encodings of the n-gram mode, and the like.
Step S5: and performing entity link reasoning according to the embedded vector, and outputting a reasoning result.
The embedded vector is passed into the entity link model to execute the model inference process: the context semantics of the candidate entities and the preselected link result corresponding to each candidate are input into a trained entity link model built on a recurrent neural network, and an inference result is output.
Step S6: and determining an entity link result of each candidate entity in the knowledge base according to the reasoning result. I.e., the entity's specific id in the knowledge base, entity name, entity type, textual information, etc.
Example 2
To facilitate understanding of those skilled in the art, a specific implementation example of the entity linking method of the present invention is as follows:
Step S1: in this embodiment, data interaction is expressed in JSON format (JSON is a lightweight, language-independent data storage format and a standard data-format specification). An example of the format of the request data sent is as follows:
(The JSON request example is reproduced only as an image in the original publication.)
examples of return data are as follows:
(The JSON return-data example is reproduced only as images in the original publication.)
Step S2: identify candidate entities. After the target text is obtained, candidate entities in it are identified with entity recognition technology, namely: the target text is vectorized to generate an embedded vector, the embedded vector is passed into the trained entity recognition model, and the entity recognition result is obtained. Operationally, the client sends an entity recognition request to the server's service portal and the server returns the result to the client. The format of the request data sent by the client to the server is as follows:
(The JSON request example is reproduced only as an image in the original publication.)
wherein "component" with the value "entity_identification" indicates an entity recognition request, and "text" carries the target text, e.g. "Bank of Montreal precious-metal derivatives trader Tai Wong said that, given the high uncertainty of this year's stimulus program, gold still benefits against the US dollar."
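Since the original request JSON survives only as an image, the following is a hypothetical reconstruction of the payload from the prose above; the field names follow the description, but the exact schema is an assumption:

```python
import json

request = {
    "component": "entity_identification",  # assumed field/value pairing per the prose
    "text": ("Bank of Montreal precious-metal derivatives trader Tai Wong said "
             "that, given the high uncertainty of this year's stimulus program, "
             "gold still benefits against the US dollar."),
}
payload = json.dumps(request, ensure_ascii=False)
```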
An example of the returned result from the server to the client is as follows:
(The JSON return-result example is reproduced only as images in the original publication.)
The above is an example of entity linking in the financial field: through receiving the data, running model inference and finally outputting the inference result, the server outputs the names of the relevant entities mentioned in the text.
Briefly: the target entities in the financial field can be divided into products (subdivided into metal futures, agricultural futures, foreign exchange futures, etc.) and organization names (subdivided into listed companies, futures companies, other companies, etc.). The relevant entities in the text are therefore produced as one output in the example above.
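A sketch of that financial-domain taxonomy as a nested dict; the two top-level classes and their subdivisions follow the text, while the concrete data structure is an assumption:

```python
FINANCE_TAXONOMY = {
    "product": ["metal futures", "agricultural futures", "foreign exchange futures"],
    "organization": ["listed company", "futures company", "other company"],
}

def entity_types():
    """Flatten the taxonomy into the list of assignable entity types."""
    return [t for subtypes in FINANCE_TAXONOMY.values() for t in subtypes]
```

An entity recognized in the text would be tagged with one of these leaf types before knowledge base matching.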
Step 3: match the preselected link results of the candidate entities against the knowledge base. The client issues a knowledge base matching instruction; on receiving it, the server quickly queries the knowledge base for the preselected link results of the candidate entities. An example of the result sent by the server to the client is as follows:
(The JSON matching-result example is reproduced only as an image in the original publication.)
the result shows that in the knowledge base, the entity "tengwangge's preselected link result includes" thining "," music "," poery ", i.e. article, musical composition, poetry composition.
Step 4: the target text input by the user, together with the obtained candidate entities and their preselected link results, is fed into the text vectorization unit. This unit processes the target text and candidate entities by splicing multiple semantic encodings (character encoding, word segmentation, the multiple segmentation encodings of the n-gram mode, and the like), outputs an embedded vector to the server for the next vector application, and outputs a vector result to the client.
Step 5: the embedded vector obtained in step 4 is passed into the recurrent neural network model and the model inference process is executed, i.e. the context semantics of the candidate entities and the preselected link result corresponding to each candidate are input into the trained model built on the recurrent neural network. The inference process runs on the server side, and the client receives the inference completion progress.
Step 6: from the inference result obtained in step 5, the server obtains the entity link result corresponding to each candidate entity, i.e. the entity's specific id in the knowledge base, entity name, entity type, text information, etc. The inference result is also output to the client, which displays it on its interface. An example of the information sent by the client to the server is as follows:
(The JSON request example is reproduced only as an image in the original publication.)
examples of information sent by the server to the client are as follows:
(The JSON response example is reproduced only as images in the original publication.)
according to the reasoning result, the 'Tengwangge' is a musical composition.
According to the entity linking method, neural network models are deployed in both the entity recognition process and the link inference process: model inference yields the candidate entities, and the entity link result is obtained by running model inference on the context semantics of the candidate entities and the preselected link result corresponding to each candidate. External knowledge can therefore be fully used, the link-model inference process is optimized, and the accuracy of entity linking is improved.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (2)

1. An entity linking device based on a recurrent neural network, comprising:
a text input unit for receiving text data, performing structured conversion on the text data, and outputting a target text; the structured conversion produces JSON format;
an entity recognition unit for running the inference process of an entity recognition model on the input target text and outputting candidate entities;
a knowledge base matching unit for receiving the candidate entities from the entity recognition unit, performing database matching on each candidate entity, and outputting a preselected link result for each candidate entity;
a text vectorization unit for vectorizing the target text, the candidate entities, and the preselected link result of each candidate entity, combining the vectorized representations into a single embedded vector, and outputting the embedded vector;
a link model inference unit for receiving the embedded vector, running the inference process of the entity link model on the embedded vector, and outputting an inference result; the entity link model is based on a recurrent neural network;
a link result output unit for receiving the inference result and determining the entity link result of each candidate entity in the knowledge base, i.e. outputting the id, entity name, entity type, and text information of each candidate entity in the knowledge base;
the text input unit comprises: a file reading module for receiving the input text data; and a data processing module for converting the input text data into the specified structured text that forms the target text;
the entity recognition unit comprises: a data preprocessing module configured to preprocess the input text data; a vectorization processing module configured to perform vector encoding after preprocessing and output an embedded vector; an entity recognition model storage module for storing the trained entity recognition model; an entity recognition model loading module for loading the entity recognition model and determining all candidate entities in the target text; and a candidate entity result output module for normalizing and outputting the candidate entities; the vectorization processing module processes the target text, the candidate entities, and the preselected link result of each candidate entity by splicing several semantic encodings, the semantic encodings comprising: word encoding, word segmentation, and n-gram models;
the knowledge base matching unit comprises: a knowledge base storage module for storing a prepared knowledge base file; and a knowledge base matching module for matching the input candidate entities against the knowledge base file and obtaining the preselected link result of each candidate entity in the knowledge base;
the link model inference unit comprises: an entity link model storage module for storing the trained entity link model; and an entity link model loading module for loading the entity link model and the embedded vector and executing model inference; the entity link model uses a recurrent neural network whose architecture is based on BiLSTM + CNN + CRF, wherein the BiLSTM captures information over the whole sequence of the preselected link result, the CNN extracts local features of the current word, and the CRF performs sequence labeling to model label dependencies at the output layer;
and the link result output unit comprises an entity link result output module for, after model inference finishes, normalizing the obtained entity link results of all candidate entities and outputting them in the configured output mode and format.
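The vectorization described in claim 1 — splicing word encoding, word segmentation, and n-gram features of the target text, candidate entity, and preselected link result into one embedded vector — can be sketched as follows. The concrete feature extractors here (code-point encoding, whitespace segmentation, hashed bigrams) are illustrative assumptions, not the patented implementation:

```python
def char_codes(text):
    # Word/character encoding: map each character to its code point.
    return [ord(c) for c in text]

def segment_features(text):
    # Placeholder word segmentation: whitespace split (a real system
    # would use a Chinese segmenter); encode each token by its length.
    return [len(tok) for tok in text.split()]

def ngram_features(text, n=2):
    # Character n-gram features, hashed into small integer buckets.
    return [hash(text[i:i + n]) % 1000 for i in range(len(text) - n + 1)]

def embed(target_text, candidate, preselected_link):
    # Splice the three semantic encodings of all three inputs into one
    # flat embedded vector, as the claim describes.
    vec = []
    for piece in (target_text, candidate, preselected_link):
        vec.extend(char_codes(piece))
        vec.extend(segment_features(piece))
        vec.extend(ngram_features(piece))
    return vec
```

In practice the spliced vector would feed the BiLSTM + CNN + CRF link model; here it is only a flat integer list so the splicing step itself is visible.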
2. An entity linking method based on a recurrent neural network, characterized by comprising the following steps:
step S1: inputting text data, performing structured conversion on the text data, and outputting a target text; the structured conversion produces JSON format;
step S2: running the inference process of an entity recognition model on the target text and outputting candidate entities;
step S3: obtaining a preselected link result for each candidate entity through knowledge base matching;
step S4: vectorizing the target text, the candidate entities, and the preselected link result of each candidate entity, and combining the vectorized representations into an embedded vector;
step S5: running the inference process of the entity link model on the embedded vector and outputting an inference result;
step S6: determining the entity link result of each candidate entity in the knowledge base from the inference result;
the vectorization in step S4 specifically comprises: processing the target text, the candidate entities, and the preselected link result of each candidate entity by splicing several semantic encodings, the semantic encodings comprising: word encoding, word segmentation, and n-gram models;
step S5 specifically comprises: inputting the context semantics of the candidate entities and the preselected link result of each candidate entity into the trained entity link model and outputting an inference result; the entity link model uses a recurrent neural network whose architecture is based on BiLSTM + CNN + CRF, wherein the BiLSTM captures information over the whole sequence of the preselected link result, the CNN extracts local features of the current word, and the CRF performs sequence labeling to model label dependencies at the output layer;
the entity link result in step S6 comprises at least the entity's id, name, type, and text information in the knowledge base.
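Steps S1–S6 of the claimed method can be sketched end-to-end as follows. The toy knowledge base, the capitalization-based recognizer, and the first-hit link scorer are illustrative stand-ins for the trained entity recognition and entity link models, not the patented implementation:

```python
import json

def to_target_text(raw):
    # Step S1: structured conversion via JSON, yielding the target text.
    return json.loads(json.dumps({"text": raw}))

def recognize_entities(target):
    # Step S2: stand-in for entity-recognition inference; here we simply
    # treat capitalized tokens as candidate entities.
    return [t for t in target["text"].split() if t[0].isupper()]

KB = {  # Step S3: a toy knowledge base keyed by entity surface form.
    "Apple": [{"id": 1, "name": "Apple Inc.", "type": "ORG",
               "text": "Technology company"}],
}

def match_kb(candidates):
    # Step S3: preselected link results per candidate entity.
    return {c: KB.get(c, []) for c in candidates}

def link(candidates, preselected):
    # Steps S4-S6: in place of vectorization plus link-model inference,
    # a trivial scorer picks the first preselected result and emits the
    # id / name / type / text fields the claim requires.
    results = {}
    for c in candidates:
        hits = preselected.get(c, [])
        if hits:
            results[c] = hits[0]
    return results

def run(raw):
    target = to_target_text(raw)
    candidates = recognize_entities(target)
    return link(candidates, match_kb(candidates))
```

For example, `run("Apple released a phone")` links the candidate "Apple" to the toy knowledge-base entry, returning its id, name, type, and text fields.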
CN202011416594.6A 2020-12-07 2020-12-07 Entity linking device and method based on recurrent neural network Active CN112507126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011416594.6A CN112507126B (en) 2020-12-07 2020-12-07 Entity linking device and method based on recurrent neural network

Publications (2)

Publication Number Publication Date
CN112507126A CN112507126A (en) 2021-03-16
CN112507126B CN112507126B (en) 2022-11-15

Family

ID=74970716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011416594.6A Active CN112507126B (en) 2020-12-07 2020-12-07 Entity linking device and method based on recurrent neural network

Country Status (1)

Country Link
CN (1) CN112507126B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674317A (en) * 2019-09-30 2020-01-10 北京邮电大学 Entity linking method and device based on graph neural network
CN111563149A (en) * 2020-04-24 2020-08-21 西北工业大学 Entity linking method for Chinese knowledge map question-answering system
CN111639498A (en) * 2020-04-21 2020-09-08 平安国际智慧城市科技股份有限公司 Knowledge extraction method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295796B (en) * 2016-07-22 2018-12-25 浙江大学 entity link method based on deep learning
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
CN110110324B (en) * 2019-04-15 2022-12-02 大连理工大学 Biomedical entity linking method based on knowledge representation
EP3646245A4 (en) * 2019-04-25 2020-07-01 Alibaba Group Holding Limited Identifying entities in electronic medical records
CN110413756B (en) * 2019-07-29 2022-02-15 北京小米智能科技有限公司 Method, device and equipment for processing natural language
CN110928961B (en) * 2019-11-14 2023-04-28 出门问问(苏州)信息科技有限公司 Multi-mode entity linking method, equipment and computer readable storage medium
CN111428443B (en) * 2020-04-15 2022-09-13 中国电子科技网络信息安全有限公司 Entity linking method based on entity context semantic interaction
CN111783462B (en) * 2020-06-30 2023-07-04 大连民族大学 Chinese named entity recognition model and method based on double neural network fusion

Also Published As

Publication number Publication date
CN112507126A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN111737474B (en) Method and device for training business model and determining text classification category
CN110222188B (en) Company notice processing method for multi-task learning and server
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN110852110A (en) Target sentence extraction method, question generation method, and information processing apparatus
US20220044119A1 (en) A deep learning model for learning program embeddings
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN112328655B (en) Text label mining method, device, equipment and storage medium
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN116150367A (en) Emotion analysis method and system based on aspects
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN113239702A (en) Intention recognition method and device and electronic equipment
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN110275953B (en) Personality classification method and apparatus
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
CN112507126B (en) Entity linking device and method based on recurrent neural network
CN115718889A (en) Industry classification method and device for company profile
CN112487811B (en) Cascading information extraction system and method based on reinforcement learning
CN114595329A (en) Few-sample event extraction system and method for prototype network
CN113536790A (en) Model training method and device based on natural language processing
CN116502624A (en) Corpus expansion method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant