CN114861669A - Chinese entity linking method integrating pinyin information - Google Patents

Publication number
CN114861669A
Authority
CN
China
Prior art keywords
entity
entities
pinyin
text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210614477.3A
Other languages
Chinese (zh)
Inventor
李永强
范陈强
冯远静
赵永智
姚辉
李文伟
林栋�
吴毕亮
叶衍统
汤家睿
薛志豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202210614477.3A
Publication of CN114861669A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Abstract

A Chinese entity linking method integrating pinyin information comprises the following steps: 1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, aliases and related descriptive text; 2) selecting training data and carrying out corresponding format processing; 3) named entity recognition; 4) candidate entity generation; 5) entity disambiguation, which is treated mainly as binary classification: pinyin information is integrated in this step, the candidate entities are then ranked by probability, and the one with the highest probability is taken as the correct entity. The beneficial effect of the invention is that the problem of linking Chinese entities that share the same written form but are pronounced differently can be solved well.

Description

Chinese entity linking method integrating pinyin information
Technical Field
The invention relates to information extraction, information retrieval and deep learning, and provides a Chinese entity linking method that integrates pinyin information.
Background
Entity linking addresses the nominal words that appear in text, usually called mentions, and maps these mentions to entities in a knowledge base. For example, in the sentence "China is a good place", "China" is a mention in the text, whereas in a knowledge base "China" usually exists under its full name, "the People's Republic of China". Entity linking links the two together so as to determine which real entity the mention in the text represents. However, owing to the characteristics of the Chinese language, polysemy (one word with several meanings) cannot be avoided, and ambiguity has always constrained the development of entity linking technology; for example, "apple" can refer either to Apple Inc. or to the fruit. Current entity linking algorithms mainly focus on improving disambiguation ability, so that the correct knowledge-base entity for a mention can be inferred from the mention's context.
Entity linking for Chinese comprises two subtasks, entity recognition and entity linking, and currently faces the following problems: 1) Chinese contains many entities that share the same written form but are pronounced differently; for example, the character "台" in "天台" is read one way when the word is a place name and another way when it denotes part of a house; 2) the context of short texts is not rich enough, which makes resolving entity mentions very challenging; 3) ambiguity remains a significant difficulty for current entity linking techniques.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a BERT-based entity linking method that integrates pinyin information, which can handle well the linking of Chinese entities that share the same written form but are pronounced differently.
To this end, the invention provides the following technical scheme:
An entity linking method integrating pinyin information comprises the following steps:
1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, aliases and related descriptive text;
2) selecting training data and carrying out corresponding format processing;
3) named entity recognition;
4) candidate entity generation: constructing an entity name mapping table and then obtaining the candidate entities by matching against it;
5) entity disambiguation: the task is treated as binary classification, the candidates are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity.
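Step 4) can be pictured as a dictionary lookup. The following is a minimal Python sketch, not the patented implementation: the knowledge-base rows, the ids E1 to E3 and the helper names build_name_table and generate_candidates are hypothetical, and exact string matching is assumed as the matching means.

```python
from collections import defaultdict

# Hypothetical knowledge-base rows: (entity id, canonical name, aliases).
KB_ROWS = [
    ("E1", "Apple Inc.", ["Apple", "苹果", "苹果公司"]),
    ("E2", "apple (fruit)", ["apple", "苹果"]),
    ("E3", "People's Republic of China", ["China", "中国"]),
]


def build_name_table(rows):
    """Map every surface name (canonical name or alias) to the ids that use it."""
    table = defaultdict(list)
    for entity_id, name, aliases in rows:
        for surface in [name, *aliases]:
            if entity_id not in table[surface]:
                table[surface].append(entity_id)
    return dict(table)


def generate_candidates(mention, table):
    """Exact-match lookup; an unknown mention yields no candidates."""
    return table.get(mention, [])


NAME_TABLE = build_name_table(KB_ROWS)
print(generate_candidates("苹果", NAME_TABLE))  # ['E1', 'E2']
```

An ambiguous mention returns several ids, and it is this candidate set that the disambiguation step then has to rank.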
Further, step 2) comprises the following steps:
2.1) constructing a pinyin letter mapping table, in which the first tone is 1, the second tone is 2, and so on, the neutral tone is 0, and the absence of pinyin is 5; the symbols are stored as a list of digits followed by letters, such as [0,1,2,3,4,5,a,b,…,x,y,z], corresponding to the ids [0,1,2,3,…,30,31];
2.2) carrying out format processing on the training data, handling special symbols and punctuation;
2.3) constructing an entity alias list;
2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry.
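As an illustration of step 2.1), the mapping table can be built in a few lines of Python. This is a sketch under the assumption that pinyin is written as lowercase letters followed by a tone digit (for example zhong1); the names SYMBOLS, PINYIN_TO_ID and encode_syllable are hypothetical.

```python
import string

# Symbols in the order given in step 2.1: tone digits 0-5, then letters a-z.
SYMBOLS = [str(d) for d in range(6)] + list(string.ascii_lowercase)
PINYIN_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def encode_syllable(syllable):
    """Turn a tone-numbered syllable such as 'zhong1' into a list of ids."""
    return [PINYIN_TO_ID[ch] for ch in syllable]


print(len(SYMBOLS))               # 32
print(encode_syllable("zhong1"))  # [31, 13, 20, 19, 12, 1]
print(encode_syllable("guo2"))    # [12, 26, 20, 2]
```

Six digits plus 26 letters give 32 symbols, which matches the id range 0 to 31 stated in step 2.1).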
Still further, step 3) comprises the following steps:
3.1) performing the entity recognition task with a BERT pre-trained model;
3.2) using forward maximum matching, with the previously constructed entity table serving as the entity library;
3.3) feeding the Chinese text into the BERT pre-trained model to obtain the output vectors of the text, then passing the output vectors through a CRF layer to obtain the entities in the text; these entities are compared with the entities in the entity library, and the longest entity in the entity library is selected according to the degree of character matching.
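The dictionary-matching part of step 3) (longest match, scanning from the left) can be sketched as follows. The BERT and CRF stages are deliberately left out, and the function name forward_max_match, the example sentence and the library contents are hypothetical.

```python
def forward_max_match(text, entity_library, max_len=None):
    """Scan left to right, always taking the longest library entry that fits."""
    if max_len is None:
        max_len = max((len(name) for name in entity_library), default=0)
    matches = []
    i = 0
    while i < len(text):
        found = None
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in entity_library:
                found = piece
                break
        if found:
            matches.append((i, found))
            i += len(found)
        else:
            i += 1
    return matches


library = {"浙江", "浙江工业大学", "大学"}
print(forward_max_match("我在浙江工业大学读书", library))  # [(2, '浙江工业大学')]
```

Because the longest window is tried first, the shorter entries "浙江" and "大学" are not reported separately once the full name has matched.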
Further, step 5) comprises the following steps:
5.1) obtaining an entity representation: for the entity recognized in the Chinese text and each candidate entity taken from the entity alias mapping table, the corresponding entity information, including the related descriptive text, is looked up in the knowledge base; the text is first converted into the corresponding ids through the entity-id mapping table, the corresponding pinyin mapping ids are then obtained from the pinyin of each Chinese character, and the two representations are added. Take the word "中国" (China) as an example: assume the ids of its two characters in the vocabulary are [705, 1475] and the corresponding pinyin is [zhong1, guo2]; the pinyin mapping converts the pinyin into [31,13,20,19,12,1] and [12,26,20,2], and average pooling reduces these to [16,15]; adding the pooled pinyin values to the character ids gives [721,1490], which serves as the initial vector representation of the two characters of "中国". This representation is combined with the related descriptive information and fed into the BERT pre-trained model, and the output vector representation of each entity is obtained at [CLS];
5.2) entity disambiguation: following a binary classification approach, the matched entity is taken as the positive example and two negative examples are selected from the candidate entity set; the text and the descriptive text of the entity to be disambiguated are concatenated as the input of the BERT model; the vector output at the [CLS] position is concatenated with the feature vectors at the start and end positions of the candidate entity; the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function; and the entity with the highest probability is selected as the correct entity.
[Equation image]
where n is the number of training samples, [symbol image] is the result predicted by the model, and y_i is the true result, both expressed as probabilities.
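The formula itself is available here only as an image, so the following Python sketch rests on an assumption: that the training loss is ordinary binary cross-entropy, which is the usual companion of a sigmoid output but is not confirmed by this text. The raw scores, the ids E1 to E3 and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rank_candidates(logits):
    """logits: {entity_id: raw score}. Returns (entity_id, probability), best first."""
    scored = [(entity_id, sigmoid(score)) for entity_id, score in logits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over n samples (an assumed loss, see above)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical raw scores: one positive example and two negative examples.
logits = {"E1": 2.0, "E2": -1.0, "E3": -3.0}
ranking = rank_candidates(logits)
best_id = ranking[0][0]
loss = binary_cross_entropy([1, 0, 0], [sigmoid(v) for v in logits.values()])
print(best_id)  # E1
```

At inference time only the ranking matters: the candidate with the highest sigmoid probability is returned as the linked entity.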
The beneficial effect of the invention is that the problem of linking Chinese entities that share the same written form but are pronounced differently can be solved well.
Drawings
FIG. 1 is a named entity recognition flow diagram of the present invention.
Fig. 2 is a flow chart of entity disambiguation of the present invention.
Fig. 3 is an overall flowchart.
Detailed Description
The following detailed description of the embodiments of the present invention will be made with reference to the accompanying drawings and examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.
As shown in Fig. 1, Fig. 2 and Fig. 3, a Chinese entity linking method integrating pinyin information comprises the following steps:
1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, aliases and related descriptive text. The knowledge base should contain at least the following information (the list is not exhaustive and can be extended as needed):
entity alias, entity id, entity name, entity type, and descriptive information of the entity;
2) selecting training data and carrying out corresponding format processing, an example of the format being:
{"text_id": "3", "text": "interpretation fifteen years of the universe", "information_data": [{"kb_id": "131751", "information": "fifteen years of the universe", "offset": "3"}]}
wherein "text_id" denotes the sequence number of the Chinese text in the dataset, "kb_id" denotes the knowledge-base id of the entity in the text, and "offset" denotes the starting position of the recognized mention in the text.
Step 2) comprises the following steps:
2.1) constructing a pinyin letter mapping table, in which the first tone is 1, the second tone is 2, and so on, the neutral tone is 0, and the absence of pinyin is 5; the symbols are stored as a list of digits followed by letters, such as [0,1,2,3,4,5,a,b,…,x,y,z], corresponding to the ids [0,1,2,3,…,30,31];
2.2) carrying out format processing on the training data, handling special symbols and punctuation;
2.3) constructing an entity alias list;
2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry;
3) named entity recognition, comprising the following steps:
3.1) performing the entity recognition task with a BERT pre-trained model;
3.2) the main technique used is forward maximum matching, with the previously constructed entity table serving as the entity library;
3.3) feeding the Chinese text into the BERT pre-trained model to obtain the output vectors of the text, then passing the output vectors through a CRF layer to obtain the entities in the text; these entities are compared with the entities in the entity library, and the longest entity in the entity library is selected according to the degree of character matching;
4) candidate entity generation: constructing an entity name mapping table and then obtaining the candidate entities by matching against it;
5) entity disambiguation: the task is treated as binary classification, the candidates are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity;
Step 5) comprises the following steps:
5.1) obtaining an entity representation: for the entity recognized in the Chinese text and each candidate entity taken from the entity alias mapping table, the corresponding entity information, including the related descriptive text, is looked up in the knowledge base; the text is first converted into the corresponding ids through the entity-id mapping table, the corresponding pinyin mapping ids are then obtained from the pinyin of each Chinese character, and the two representations are added. Take the word "中国" (China) as an example: assume the ids of its two characters in the vocabulary are [705, 1475] and the corresponding pinyin is [zhong1, guo2]; the pinyin mapping converts the pinyin into [31,13,20,19,12,1] and [12,26,20,2], and average pooling reduces these to [16,15]; adding the pooled pinyin values to the character ids gives [721,1490], which serves as the initial vector representation of the two characters of "中国". This representation is combined with the related descriptive information and fed into the BERT pre-trained model, and the output vector representation of each entity is obtained at [CLS];
5.2) entity disambiguation: following a binary classification approach, the matched entity is taken as the positive example and two negative examples are selected from the candidate entity set; the text and the descriptive text of the entity to be disambiguated are concatenated as the input of the BERT model; the vector output at the [CLS] position is concatenated with the feature vectors at the start and end positions of the candidate entity; the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function; and the entity with the highest probability is selected as the correct entity;
[Equation image]
where n is the number of training samples, [symbol image] is the result predicted by the model, and y_i is the true result, both expressed as probabilities.
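The arithmetic of the worked example in step 5.1) can be checked with a short Python sketch. It follows the numbers as given in the text, where the pooled pinyin value is added directly to the character id; whether a real model would perform this addition on ids or on embedding vectors is not settled by this text, and the function names pooled_pinyin_id and fuse are hypothetical.

```python
import string

# Pinyin symbol table from step 2.1: tone digits 0-5, then letters a-z.
SYMBOLS = [str(d) for d in range(6)] + list(string.ascii_lowercase)
PINYIN_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def pooled_pinyin_id(syllable):
    """Average the symbol ids of one tone-numbered syllable."""
    ids = [PINYIN_TO_ID[ch] for ch in syllable]
    return sum(ids) / len(ids)


def fuse(char_ids, syllables):
    """Add each character's pooled pinyin value to its vocabulary id."""
    return [cid + pooled_pinyin_id(s) for cid, s in zip(char_ids, syllables)]


char_ids = [705, 1475]          # ids assumed in the text for the two characters
syllables = ["zhong1", "guo2"]  # pinyin given in the text

pooled = [pooled_pinyin_id(s) for s in syllables]
fused = fuse(char_ids, syllables)
print(pooled)  # [16.0, 15.0]
print(fused)   # [721.0, 1490.0]
```

The same character read with a different syllable or tone yields a different pooled value, so two readings of one written form start from different initial representations.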
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the improvement principle of the present invention, and these modifications and variations should also be within the protection scope of the present invention.

Claims (4)

1. A Chinese entity linking method integrating pinyin information, characterized by comprising the following steps:
1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, aliases and related descriptive text;
2) selecting training data and carrying out corresponding format processing;
3) named entity recognition;
4) candidate entity generation: constructing an entity name mapping table and then obtaining the candidate entities by matching against it;
5) entity disambiguation: the task is treated as binary classification, the candidates are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity.
2. The method as claimed in claim 1, wherein step 2) comprises the following steps:
2.1) constructing a pinyin letter mapping table, in which, for example, the first tone is 1, the second tone is 2, and so on, the neutral tone is 0, and the absence of pinyin is 5; the symbols are stored as a list of digits followed by letters, such as [0,1,2,3,4,5,a,b,…,x,y,z], corresponding to the ids [0,1,2,3,…,30,31];
2.2) carrying out format processing on the training data, handling special symbols and punctuation;
2.3) constructing an entity alias list;
2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry.
3. The method as claimed in claim 1 or 2, wherein step 3) comprises the following steps:
3.1) performing the entity recognition task with a BERT pre-trained model;
3.2) using forward maximum matching, with the previously constructed entity table serving as the entity library;
3.3) feeding the Chinese text into the BERT pre-trained model to obtain the output vectors of the text, then passing the output vectors through a CRF layer to obtain the entities in the text; these entities are compared with the entities in the entity library, and the longest entity in the entity library is selected according to the degree of character matching.
4. The method as claimed in claim 1 or 2, wherein step 5) comprises the following steps:
5.1) obtaining an entity representation: for the entity recognized in the Chinese text and each candidate entity in the entity alias mapping table, the corresponding entity information, including the related descriptive text, is looked up in the knowledge base; the text is converted into the corresponding ids through the entity-id mapping table, the corresponding pinyin mapping ids are obtained from the pinyin of each Chinese character, and the two representations are added; the result is combined with the related descriptive information and fed into the BERT pre-trained model, and the output vector representation of each entity is obtained at [CLS];
5.2) entity disambiguation: following a binary classification approach, the matched entity is taken as the positive example and two negative examples are selected from the candidate entity set; the text and the descriptive text of the entity to be disambiguated are concatenated as the input of the BERT model; the vector output at the [CLS] position is concatenated with the feature vectors at the start and end positions of the candidate entity; the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function; and the entity with the highest probability is selected as the correct entity;
[Equation image]
where n is the number of training samples, [symbol image] is the result predicted by the model, and y_i is the true result, both expressed as probabilities.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210614477.3A, CN114861669A (en) | 2022-05-26 | 2022-05-26 | Chinese entity linking method integrating pinyin information

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210614477.3A, CN114861669A (en) | 2022-05-26 | 2022-05-26 | Chinese entity linking method integrating pinyin information

Publications (1)

Publication Number | Publication Date
CN114861669A (en) | 2022-08-05

Family

ID=82640618

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210614477.3A, Pending, CN114861669A (en) | Chinese entity linking method integrating pinyin information | 2022-05-26 | 2022-05-26

Country Status (1)

Country | Link
CN (1) | CN114861669A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112560489A (en) * | 2020-12-08 | 2021-03-26 | 北京航天云路有限公司 | Entity linking method based on Bert


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination