CN114861669A

CN114861669A - Chinese entity linking method integrating pinyin information

Info

Publication number: CN114861669A
Application number: CN202210614477.3A
Authority: CN
Inventors: 李永强; 范陈强; 冯远静; 赵永智; 姚辉; 李文伟; 林栋�; 吴毕亮; 叶衍统; 汤家睿; 薛志豪
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2022-08-05

Abstract

A Chinese entity linking method integrating pinyin information comprises the following steps: 1) constructing a local knowledge base comprising a plurality of entities, each entity having a unique identification id, an alias and a related descriptive text; 2) selecting training data and processing it into the required format; 3) named entity recognition; 4) candidate entity generation; 5) entity disambiguation, which is treated as binary classification and incorporates pinyin information: the candidate entities are ranked by predicted probability, and the candidate with the highest probability is taken as the correct entity. The beneficial effect of the invention is that it can effectively resolve the entity linking problem for Chinese entities that share a written form but differ in pronunciation.

Description

Chinese entity linking method integrating pinyin information

Technical Field

The invention relates to information extraction, information retrieval and deep learning, and provides a Chinese entity linking method integrating pinyin information.

Background

Entity linking deals with the noun phrases that appear in text, usually called mentions, and maps each mention to the corresponding entity in a knowledge base. For example, in the sentence "China is a good place", "China" is a mention, whereas the knowledge base generally stores the entity under its full name, "the People's Republic of China"; entity linking connects the two, thereby determining which real entity the mention denotes. However, owing to the characteristics of the Chinese language, a single word often has several meanings and several words often share one meaning, and this ambiguity has long constrained the development of entity linking technology. For example, "apple" can refer either to the company Apple or to the fruit. Current entity linking algorithms therefore focus mainly on improving disambiguation, so that the correct knowledge-base entity can be inferred from the context of the mention.

Entity linking for Chinese comprises two subtasks, entity recognition and entity linking, and currently faces the following problems: 1) Chinese contains many entities that share a written form but differ in pronunciation; for example, the character rendered as "station" in "sky station" denotes a place name under one reading and a part of a house under another; 2) short texts provide little context, which makes resolving entity mentions very challenging; 3) ambiguity remains a major difficulty for current entity linking techniques.

Disclosure of Invention

To solve the above technical problems, the invention provides a BERT-based entity linking method integrating pinyin information, which can effectively resolve the entity linking problem for Chinese entities that share a written form but differ in pronunciation.

To this end, the invention provides the following technical scheme:

An entity linking method integrating pinyin information comprises the following steps:

1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, an alias and a related descriptive text;

2) selecting training data and carrying out corresponding format processing;

3) named entity recognition;

4) candidate entity generation: an entity name mapping table is constructed, and candidate entities are then obtained from it by matching;

5) entity disambiguation: the task is treated as binary classification, the candidates are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity.
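Step 4) can be pictured as a dictionary lookup. The following is only a minimal sketch under stated assumptions: the rows, the ids and the names `build_name_table` and `generate_candidates` are hypothetical, and exact string matching (with lower-casing, which only matters for the Latin-script illustration) stands in for whatever matching means is actually used.

```python
from collections import defaultdict

# Hypothetical knowledge-base rows: (entity id, canonical name, aliases).
KB_ROWS = [
    ("1001", "Apple Inc.", ["Apple", "Apple Computer"]),
    ("1002", "apple (fruit)", ["apple"]),
]


def build_name_table(rows):
    """Map every lower-cased name or alias to the ids of entities that use it."""
    table = defaultdict(list)
    for entity_id, name, aliases in rows:
        for surface in [name, *aliases]:
            key = surface.lower()
            if entity_id not in table[key]:
                table[key].append(entity_id)
    return dict(table)


def generate_candidates(mention, table):
    """Return candidate entity ids for a mention (exact match is an assumption)."""
    return table.get(mention.lower(), [])


NAME_TABLE = build_name_table(KB_ROWS)
candidates = generate_candidates("Apple", NAME_TABLE)
```

In this sketch the mention "Apple" yields both hypothetical entities, which is exactly the ambiguity that step 5) then has to resolve.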

Further, step 2) comprises the following steps:

2.1) constructing a pinyin letter mapping table, in which the first tone is 1, the second tone is 2, and so on, the neutral tone is 0, and the absence of pinyin is 5; the digits and letters are stored as a list, such as [0, 1, 2, 3, 4, 5, a, b, …, x, y, z], corresponding to the ids [0, 1, 2, 3, …, 30, 31];

2.2) carrying out format processing on the training data, and processing special symbols and punctuation;

2.3) constructing an entity alias list;

2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry.
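The pinyin letter mapping table of step 2.1) can be sketched as follows. This is an illustration rather than the patented implementation: the names `PINYIN_SYMBOLS`, `PINYIN_TO_ID` and `encode_pinyin` are hypothetical, and the input is assumed to be a tone-numbered, ASCII-only syllable such as "zhong1".

```python
import string

# Symbols in the order given in step 2.1): tone digits 0-5, then a-z.
PINYIN_SYMBOLS = [str(d) for d in range(6)] + list(string.ascii_lowercase)
PINYIN_TO_ID = {symbol: index for index, symbol in enumerate(PINYIN_SYMBOLS)}

NO_PINYIN = "5"  # per step 2.1), 5 marks a token that has no pinyin


def encode_pinyin(syllable):
    """Convert a tone-numbered syllable such as 'zhong1' to a list of ids.

    Assumption: symbols outside 0-5 and a-z are not expected and raise KeyError.
    """
    if not syllable:
        return [PINYIN_TO_ID[NO_PINYIN]]
    return [PINYIN_TO_ID[ch] for ch in syllable.lower()]


zhong_ids = encode_pinyin("zhong1")
guo_ids = encode_pinyin("guo2")
```

With 6 digits followed by 26 letters, the table has 32 entries, so the letter "a" receives id 6 and "z" receives id 31, matching the range [0, …, 31] stated above.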

Still further, step 3) comprises the following steps:

3.1) performing the entity recognition task with a BERT pre-trained model;

3.2) applying forward maximum matching, with the previously constructed entity table serving as the entity library;

3.3) feeding the Chinese text into the BERT pre-trained model to obtain the output vectors of the text, passing these through a CRF layer to obtain the entities in the text, comparing them with the entities in the entity library, and selecting, according to the degree of character match, the longest matching entity in the entity library.
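The dictionary-matching part of step 3) can be illustrated with a plain forward maximum matching routine. The sketch below covers only that part and leaves out the BERT and CRF stages; the function name `forward_max_match` and the small entity library are hypothetical.

```python
def forward_max_match(text, entity_library, max_len=None):
    """Scan left to right, taking the longest library entry at each position.

    Returns a list of (start_offset, matched_string) pairs.
    """
    if max_len is None:
        max_len = max((len(e) for e in entity_library), default=0)
    matches = []
    i = 0
    while i < len(text):
        found = None
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in entity_library:
                found = piece
                break
        if found:
            matches.append((i, found))
            i += len(found)  # continue after the matched entity
        else:
            i += 1  # no entity starts here; move one character forward
    return matches


LIBRARY = {"中国", "中国人民银行", "人民"}  # hypothetical entity library
result = forward_max_match("中国人民银行在北京", LIBRARY)
```

Because the longest window is tried first, the six-character entry is preferred over the shorter entries it contains, which mirrors the selection of the longest matching entity described in step 3.3).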

Further, step 5) comprises the following steps:

5.1) obtaining an entity representation: the entity recognized in the Chinese text and the candidate entities from the entity alias mapping table are taken, and the corresponding entity information, including the related descriptive text, is looked up in the knowledge base. The text is first converted into the corresponding ids through the entity-id mapping table, the pinyin mapping ids are then obtained from the pinyin of each Chinese character, and the two representations are added. For example, for the word "China" (pinyin zhong1 guo2), suppose the ids of its two characters in the vocabulary are [705, 1475] and the corresponding pinyin is [zhong1, guo2]. The pinyin mapping table converts the pinyin into [31, 13, 20, 19, 12, 1] and [12, 26, 20, 2], which average pooling reduces to [16, 15]. Adding the pooled pinyin values to the character ids gives [721, 1490], which serves as the initial vector representation of the two characters. This representation is combined with the related descriptive information and fed into the BERT pre-trained model, and the output vector representation of each entity is obtained at [CLS];

5.2) entity disambiguation: following the binary classification approach, the matched entity is used as the positive example, and two negative examples are selected from the candidate entity set. The text and the description text of the entity to be disambiguated are concatenated as the input of the BERT model; the vector output at the [CLS] position is concatenated with the feature vectors at the start and end positions of the candidate entity; the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function; and the entity with the highest probability is selected as the correct entity.

The model is trained with a loss function in which n is the number of training samples, $\hat{y}_i$ is the result predicted by the model, and $y_i$ is the true result, both expressed as probabilities.
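The numeric example in step 5.1) can be checked with a short sketch. It follows the arithmetic exactly as stated, adding the pooled pinyin value to the character id; whether an actual implementation performs the addition on ids or on embedding vectors is not settled here, and the helper names `pooled_pinyin_id` and `fuse` are hypothetical.

```python
# Pinyin symbol ids from step 2.1): tone digits 0-5 take ids 0-5, a-z take 6-31.
PINYIN_TO_ID = {str(d): d for d in range(6)}
PINYIN_TO_ID.update({chr(ord("a") + k): 6 + k for k in range(26)})


def pooled_pinyin_id(syllable):
    """Average-pool the symbol ids of one tone-numbered syllable."""
    ids = [PINYIN_TO_ID[ch] for ch in syllable]
    return sum(ids) / len(ids)


def fuse(char_ids, syllables):
    """Add each character's pooled pinyin value to its vocabulary id."""
    return [cid + pooled_pinyin_id(s) for cid, s in zip(char_ids, syllables)]


char_ids = [705, 1475]          # vocabulary ids assumed in the text
syllables = ["zhong1", "guo2"]  # pinyin of the two characters
pooled = [pooled_pinyin_id(s) for s in syllables]
fused = fuse(char_ids, syllables)
```

Here "zhong1" averages to 96 / 6 = 16 and "guo2" to 60 / 4 = 15, so the fused values are 721 and 1490, in agreement with the figures given in the example.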

The beneficial effect of the invention is that it can effectively resolve the entity linking problem for Chinese entities that share a written form but differ in pronunciation.

Drawings

FIG. 1 is a named entity recognition flow diagram of the present invention.

FIG. 2 is an entity disambiguation flow diagram of the present invention.

FIG. 3 is an overall flow diagram of the present invention.

Detailed Description

The following detailed description of the embodiments of the present invention will be made with reference to the accompanying drawings and examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.

As shown in FIG. 1, FIG. 2 and FIG. 3, a Chinese entity linking method integrating pinyin information comprises the following steps:

1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, an alias and a related descriptive text. The knowledge base should contain the following information (including but not limited to; it may be extended as needed):

entity alias, entity id, entity name, entity type, and description information of the entity.

2) Selecting training data and performing corresponding format processing, wherein an example format is as follows:

{"text_id": "3", "text": "interpretation fifteen years of the universe", "information_data": [{"kb_id": "131751", "information": "fifteen years of the universe", "offset": "3"}]}, where "text_id" denotes the sequence number of the Chinese text in the dataset, "kb_id" denotes the id in the knowledge base of the entity mentioned in the text, and "offset" denotes the start position of the recognized mention in the text.
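A record in this format might be loaded and sanity-checked as in the sketch below. The sample itself is hypothetical, the key names follow the translated example with spacing normalized, and it is an assumption that the numeric fields are stored as strings and that "offset" is a zero-based character index.

```python
import json

# Hypothetical sample; key names follow the translated example format.
RAW = '''
{"text_id": "7", "text": "An apple a day", "information_data":
  [{"kb_id": "1002", "information": "apple", "offset": "3"}]}
'''


def load_sample(raw):
    """Parse one sample and convert the offset strings to integers."""
    sample = json.loads(raw)
    mentions = []
    for item in sample["information_data"]:
        mentions.append({
            "kb_id": item["kb_id"],
            "mention": item["information"],
            "offset": int(item["offset"]),
        })
    return {"text_id": sample["text_id"], "text": sample["text"], "mentions": mentions}


def offsets_are_consistent(sample):
    """Check that every mention really starts at its stated offset."""
    text = sample["text"]
    return all(
        text[m["offset"]:m["offset"] + len(m["mention"])] == m["mention"]
        for m in sample["mentions"]
    )


sample = load_sample(RAW)
ok = offsets_are_consistent(sample)
```

A check of this kind is a cheap way to catch samples whose offsets were shifted by the special-symbol and punctuation processing of step 2.2).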

Step 2) comprises the following steps:

2.3) constructing an entity alias list;

2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry;
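Steps 2.3) and 2.4) might look like the following sketch. The records are hypothetical, and keying `entry_to_id` on the canonical entity name is one possible interpretation of the dictionaries named above, not a confirmed detail.

```python
# Hypothetical knowledge-base records with the fields listed in step 1).
KB = [
    {"id": "1001", "name": "Apple Inc.", "alias": ["Apple"], "type": "Organization",
     "description": "A technology company."},
    {"id": "1002", "name": "apple", "alias": ["apple fruit"], "type": "Food",
     "description": "The edible fruit of the apple tree."},
]


def build_alias_list(kb):
    """Collect every distinct name and alias, preserving first-seen order."""
    seen = {}
    for record in kb:
        for surface in [record["name"], *record["alias"]]:
            seen.setdefault(surface, None)
    return list(seen)


def build_mappings(kb):
    """Return (entry_to_id, id_to_entry), keyed on the canonical name (assumption)."""
    entry_to_id = {record["name"]: record["id"] for record in kb}
    id_to_entry = {record["id"]: record["name"] for record in kb}
    return entry_to_id, id_to_entry


alias_list = build_alias_list(KB)
entry_to_id, id_to_entry = build_mappings(KB)
```

Keeping both directions of the mapping lets later steps move from a matched name to its id and from a predicted id back to a readable name without scanning the knowledge base.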

3) named entity recognition, comprising the following steps:

3.2) forward maximum matching is the main technique used, with the previously constructed entity table serving as the entity library;

5) entity disambiguation: the task is treated as binary classification, the candidates are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity;

Step 5) comprises the following steps:

5.2) entity disambiguation: following the binary classification approach, the matched entity is used as the positive example, and two negative examples are selected from the candidate entity set. The text and the description text of the entity to be disambiguated are concatenated as the input of the BERT model; the vector output at the [CLS] position is concatenated with the feature vectors at the start and end positions of the candidate entity; the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function; and the entity with the highest probability is selected as the correct entity;

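In the training loss, n is the number of training samples, $\hat{y}_i$ is the result predicted by the model, and $y_i$ is the true result, both expressed as probabilities.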
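The ranking and training logic of step 5.2) can be sketched without the BERT encoder. In the sketch below the logits are hypothetical stand-ins for the output of the fully connected layer, and the loss is assumed to be binary cross-entropy, a common pairing with a sigmoid output; the text only names n, the predictions and the true results, so this choice is an assumption.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pick_entity(candidate_logits):
    """Return (best_id, probabilities) from a {candidate_id: logit} mapping."""
    probs = {cid: sigmoid(z) for cid, z in candidate_logits.items()}
    best = max(probs, key=probs.get)
    return best, probs


def make_training_pairs(gold_id, candidate_ids, rng, num_negatives=2):
    """One positive (label 1) plus up to two sampled negatives (label 0)."""
    negatives = [cid for cid in candidate_ids if cid != gold_id]
    sampled = rng.sample(negatives, min(num_negatives, len(negatives)))
    return [(gold_id, 1)] + [(cid, 0) for cid in sampled]


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over n samples; an assumed loss, not confirmed by the text."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n


logits = {"1001": 2.0, "1002": -1.0, "1003": 0.0}  # hypothetical model outputs
best, probs = pick_entity(logits)
pairs = make_training_pairs("1001", list(logits), random.Random(0))
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

Because the sigmoid is monotonic, ranking by probability and ranking by logit select the same candidate; the probabilities matter mainly for training and for applying a confidence threshold.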

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the improvement principle of the present invention, and these modifications and variations should also be within the protection scope of the present invention.

Claims

1. A Chinese entity linking method integrating pinyin information, characterized by comprising the following steps:

2) selecting training data and carrying out corresponding format processing;

3) named entity recognition;

2. The method as claimed in claim 1, wherein the step 2) comprises the following steps:

2.1) constructing a pinyin letter mapping table, in which, for example, the first tone is 1, the second tone is 2, and so on, the neutral tone is 0, and the absence of pinyin is 5; the digits and letters are stored as a list, such as [0, 1, 2, 3, 4, 5, a, b, …, x, y, z], corresponding to the ids [0, 1, 2, 3, …, 30, 31];

2.3) constructing an entity alias list;

3. The method as claimed in claim 1 or 2, wherein the step 3) comprises the steps of:

4. The method as claimed in claim 1 or 2, wherein the step 5) comprises the steps of:

5.1) obtaining an entity representation: the entity recognized in the Chinese text and the candidate entities from the entity alias mapping table are taken, and the corresponding entity information, including the related descriptive text, is looked up in the knowledge base; the text is converted into the corresponding ids through the entity-id mapping table, the pinyin mapping ids are obtained from the pinyin of each Chinese character, and the two representations are added; the result is combined with the related descriptive information and fed into the BERT pre-trained model, and the output vector representation of each entity is obtained at [CLS];

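wherein, in the training loss, n is the number of training samples, $\hat{y}_i$ is the result predicted by the model, and $y_i$ is the true result, both expressed as probabilities.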