CN114861669A - Chinese entity linking method integrating pinyin information - Google Patents
- Publication number
- CN114861669A (application CN202210614477.3A)
- Authority
- CN
- China
- Prior art keywords
- entity
- entities
- pinyin
- text
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Abstract
A Chinese entity linking method integrating pinyin information comprises the following steps: 1) constructing a local knowledge base, where the knowledge base contains a plurality of entities and each entity corresponds to a unique identifier (id), aliases and related descriptive text; 2) selecting training data and performing the corresponding format processing; 3) named entity recognition; 4) candidate entity generation; 5) entity disambiguation, which is mainly based on a binary-classification approach and is the step in which pinyin information is incorporated; the candidate entities are then ranked by predicted probability, and the one with the highest probability is taken as the correct entity. The beneficial effect of the invention is that it handles well the entity linking problem posed by Chinese entities that are written identically but pronounced differently (homographic heteronyms).
Description
Technical Field
The invention relates to information extraction, information retrieval and deep learning, and provides a Chinese entity linking method that integrates pinyin information.
Background
Entity linking is the process of taking name mentions that appear in text (commonly called mentions) and mapping them to the corresponding entities in a knowledge base. For example, in the sentence "China is a good place", "China" is a mention in the text, while the knowledge base usually stores the entity under its full name, "the People's Republic of China"; entity linking connects the two, determining which real-world entity the mention in the text actually refers to. However, owing to the characteristics of Chinese, polysemy of words and characters is unavoidable, and ambiguity has always constrained the development of entity linking technology: "apple", for instance, can refer to Apple Inc. or to the fruit. Current entity linking algorithms therefore focus primarily on improving disambiguation, so that the context of a mention can be used to infer the correct entity in the knowledge base.
Entity linking for Chinese comprises the two subtasks of entity recognition and entity linking, and the task currently faces the following problems: 1) owing to the characteristics of Chinese, many homographic heteronym entities exist, where the same written form is pronounced differently depending on what it denotes, for example a place name versus a common noun; 2) short texts provide little context, which makes resolving entity mentions very challenging; 3) ambiguity remains a major difficulty for current entity linking techniques.
Disclosure of Invention
To solve the above technical problems, the invention provides a Bert-based entity linking method that integrates pinyin information and handles well the entity linking problem of Chinese homographic heteronyms.
In order to solve the technical problems, the invention provides the following technical scheme:
an entity linking method for blending pinyin information comprises the following steps:
1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, an alias and a related descriptive text;
2) selecting training data and carrying out corresponding format processing;
3) named entity recognition;
4) candidate entity generation: an entity name mapping table is constructed, and candidate entities are then obtained by string matching;
5) entity disambiguation: the task is cast as binary classification; the candidate entities are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity.
Further, the step of step 2) is as follows:
2.1) constructing a pinyin-letter mapping table: the first tone is 1, the second tone is 2, and so on; the neutral tone is 0 and "no pinyin" is 5. Digits and letters are stored in a list, e.g. [0,1,2,3,4,5,a,b,…,x,y,z], corresponding to ids [0,1,2,3,…,30,31];
2.2) carrying out format processing on the training data, and processing special symbols and punctuations;
2.3) constructing an entity alias list;
2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry.
Still further, the step 3) comprises the following steps:
3.1) carrying out an entity recognition task by utilizing a bert pre-training model;
3.2) using maximum forward matching, with the previously constructed entity table as the entity library;
3.3) the Chinese text is fed into the BERT pre-training model to obtain output vectors for the text; these pass through a CRF layer to yield the entities in the text, which are then compared with the entities in the entity library, and, based on character-level matching, the longest matching entity in the library is selected.
Further, the step 5) comprises the following steps:
5.1) obtaining an entity representation: the entity recognized in the Chinese text and its candidate entities from the entity alias mapping table are retrieved, and the corresponding entity information, including the related descriptive text, is looked up in the knowledge base. An entity is first converted into the corresponding ids via the entity-id mapping table; then the corresponding pinyin mapping ids are obtained from the pinyin of each Chinese character, and the two representations are added. For example, for the word "China" (中国), suppose its ids in the vocabulary are [705, 1475] and its pinyin is [zhong1, guo2]; the pinyin converts to the id sequences [31,13,20,19,12,1] and [12,26,20,2], which after average pooling become [16,15]. Adding each character's pooled pinyin id to its character id gives [721,1490], which is merged into the word representation as the initial vector representation of the two characters of "China". Combined with the related descriptive information, this is fed into the Bert pre-training model, and the output vector at [CLS] is taken as the representation of each entity;
5.2) entity disambiguation: following the binary-classification approach, the matched entity serves as a positive example and two negative examples are drawn from the candidate entity set. The text and the textual description of the entity to be disambiguated are concatenated as the input of the Bert model; the vector output at the [CLS] position is concatenated with the feature vectors of the candidate entity's start and end positions, the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function, and the entity with the highest probability is selected as the correct entity.
The model is trained with the binary cross-entropy loss

L = -(1/n) · Σᵢ₌₁ⁿ [ yᵢ · log(ŷᵢ) + (1 - yᵢ) · log(1 - ŷᵢ) ]

where n is the number of training samples, ŷᵢ is the model's prediction and yᵢ is the ground truth, all expressed as probabilities.
The invention has the beneficial effects that: the entity link problem of homomorphism and heteronymy in Chinese entities can be well solved.
Drawings
FIG. 1 is a named entity recognition flow diagram of the present invention.
Fig. 2 is a flow chart of entity disambiguation of the present invention.
Fig. 3 is an overall flowchart.
Detailed Description
The following detailed description of the embodiments of the present invention will be made with reference to the accompanying drawings and examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.
As shown in fig. 1, fig. 2 and fig. 3, a Chinese entity linking method integrating pinyin information includes the following steps:
1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, aliases and related descriptive text. The knowledge base should contain the following information (including but not limited to this; it can be extended):
Entity alias, entity id, entity name, entity type, and description information of the entity
2) Selecting training data and performing corresponding format processing, wherein an example format is as follows:
{"text_id": "3", "text": "interpretation fifteen years of the universe", "information_data": [{"kb_id": "131751", "information": "fifteen years of the universe", "offset": "3"}]}, where "text_id" denotes the sequence number of the Chinese text in the dataset, "kb_id" denotes the id in the knowledge base of the entity mentioned in the text, and "offset" denotes the starting position of the recognized mention in the text.
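For illustration, a record in this format can be loaded with the Python standard library. The snippet below is only a sketch: the field names (`text_id`, `information_data`, `kb_id`, `offset`) simply follow the example record above rather than any fixed dataset schema.

```python
import json

# Example training record in the format described above; field names
# follow the example in the text and may differ in an actual dataset.
record_json = '''
{"text_id": "3",
 "text": "interpretation fifteen years of the universe",
 "information_data": [
     {"kb_id": "131751",
      "information": "fifteen years of the universe",
      "offset": "3"}]}
'''

def parse_record(raw: str) -> dict:
    """Parse one training record and normalize the mention offsets to int."""
    rec = json.loads(raw)
    for mention in rec["information_data"]:
        mention["offset"] = int(mention["offset"])  # start position of the mention
    return rec

rec = parse_record(record_json)
print(rec["text_id"], rec["information_data"][0]["offset"])
```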
The step 2) comprises the following steps:
2.1) constructing a pinyin-letter mapping table: the first tone is 1, the second tone is 2, and so on; the neutral tone is 0 and "no pinyin" is 5. Digits and letters are stored in a list, e.g. [0,1,2,3,4,5,a,b,…,x,y,z], corresponding to ids [0,1,2,3,…,30,31];
2.2) carrying out format processing on the training data, and processing special symbols and punctuations;
2.3) constructing an entity alias list;
2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry;
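The mapping table of step 2.1 can be sketched as follows. The exact layout is an assumption based on the example [0,1,2,3,4,5,a,b,…,x,y,z] → [0,…,31]: tone digits 0–5 keep their values, and the letters a–z map to 6–31.

```python
import string

# Assumed pinyin symbol table: tone digits 0-5, then letters a-z.
PINYIN_SYMBOLS = [str(d) for d in range(6)] + list(string.ascii_lowercase)
PINYIN_TO_ID = {sym: idx for idx, sym in enumerate(PINYIN_SYMBOLS)}

def pinyin_ids(syllable: str) -> list[int]:
    """Map a toned pinyin syllable, e.g. 'zhong1', to its id sequence."""
    return [PINYIN_TO_ID[ch] for ch in syllable.lower()]

print(pinyin_ids("zhong1"))  # [31, 13, 20, 19, 12, 1]
```

This reproduces the id sequences used later in step 5.1, e.g. zhong1 → [31,13,20,19,12,1] and guo2 → [12,26,20,2].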
3) named entity recognition, comprising the steps of:
3.1) carrying out an entity recognition task by utilizing a bert pre-training model;
3.2) the main technique used is maximum forward matching, with the previously constructed entity table serving as the entity library;
3.3) the Chinese text is fed into the BERT pre-training model to obtain output vectors for the text; these pass through a CRF layer to yield the entities in the text, which are then compared with the entities in the entity library, and, based on character-level matching, the longest matching entity in the library is selected.
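Steps 3.2–3.3 prefer the longest entity-library string that matches at each position. A minimal maximum-forward-matching sketch follows; the entity names and window size are illustrative and not taken from the patent.

```python
def max_forward_match(text: str, entity_lib: set[str], max_len: int = 15) -> list[str]:
    """Scan left to right, taking the longest library entry at each position."""
    matches, i = [], 0
    while i < len(text):
        # Try the longest candidate window first, shrinking toward length 1.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in entity_lib:
                matches.append(text[i:j])
                i = j          # consume the matched span
                break
        else:
            i += 1             # no entity starts here; advance one character
    return matches

lib = {"new york", "new york city", "hall"}
print(max_forward_match("new york city hall", lib))  # ['new york city', 'hall']
```

Note how "new york city" is preferred over the shorter "new york", matching the "entity with the maximum length" rule in step 3.3.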
4) Candidate entity generation: an entity name mapping table is constructed, and candidate entities are then obtained by string matching;
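Candidate generation in step 4 can be sketched as a dictionary lookup over an alias table built from the knowledge base. The entities, ids and field names below are hypothetical.

```python
from collections import defaultdict

def build_alias_table(kb: list[dict]) -> dict[str, list[str]]:
    """Map every entity name and alias (lowercased) to the matching entity ids."""
    table = defaultdict(list)
    for ent in kb:
        for name in [ent["name"], *ent.get("aliases", [])]:
            table[name.lower()].append(ent["id"])
    return table

# Hypothetical knowledge base illustrating the ambiguous "apple" example.
kb = [
    {"id": "Q1", "name": "Apple Inc.", "aliases": ["apple"]},
    {"id": "Q2", "name": "apple", "aliases": ["apple (fruit)"]},
]
table = build_alias_table(kb)
print(table["apple"])  # ['Q1', 'Q2'] -- both the company and the fruit
```

The mention "apple" yields two candidates, which is exactly the ambiguity that the disambiguation of step 5 must resolve.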
5) entity disambiguation: the task is cast as binary classification; the candidate entities are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity;
the step 5) comprises the following steps:
5.1) obtaining an entity representation: the entity recognized in the Chinese text and its candidate entities from the entity alias mapping table are retrieved, and the corresponding entity information, including the related descriptive text, is looked up in the knowledge base. An entity is first converted into the corresponding ids via the entity-id mapping table; then the corresponding pinyin mapping ids are obtained from the pinyin of each Chinese character, and the two representations are added. For example, for the word "China" (中国), suppose its ids in the vocabulary are [705, 1475] and its pinyin is [zhong1, guo2]; the pinyin converts to the id sequences [31,13,20,19,12,1] and [12,26,20,2], which after average pooling become [16,15]. Adding each character's pooled pinyin id to its character id gives [721,1490], which is merged into the word representation as the initial vector representation of the two characters of "China". Combined with the related descriptive information, this is fed into the Bert pre-training model, and the output vector at [CLS] is taken as the representation of each entity;
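The "China" example in step 5.1 can be reproduced numerically. Integer averaging is assumed here only because the pooled values in the text, [16,15], happen to be exact integers.

```python
import string

# Assumed pinyin symbol table: tone digits 0-5, then letters a-z mapped to 6-31.
PINYIN_TO_ID = {sym: idx for idx, sym in
                enumerate([str(d) for d in range(6)] + list(string.ascii_lowercase))}

def fuse_pinyin(char_ids: list[int], pinyins: list[str]) -> list[int]:
    """Average-pool each syllable's pinyin ids and add them to the character ids."""
    fused = []
    for cid, py in zip(char_ids, pinyins):
        ids = [PINYIN_TO_ID[ch] for ch in py]   # zhong1 -> [31,13,20,19,12,1]
        pooled = sum(ids) // len(ids)           # average pooling -> 16
        fused.append(cid + pooled)              # fuse into the character id
    return fused

print(fuse_pinyin([705, 1475], ["zhong1", "guo2"]))  # [721, 1490]
```

The result [721, 1490] matches the initial vector representation given for the two characters of "China".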
5.2) entity disambiguation: following the binary-classification approach, the matched entity serves as a positive example and two negative examples are drawn from the candidate entity set. The text and the textual description of the entity to be disambiguated are concatenated as the input of the Bert model; the vector output at the [CLS] position is concatenated with the feature vectors of the candidate entity's start and end positions, the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function, and the entity with the highest probability is selected as the correct entity;
The model is trained with the binary cross-entropy loss

L = -(1/n) · Σᵢ₌₁ⁿ [ yᵢ · log(ŷᵢ) + (1 - yᵢ) · log(1 - ŷᵢ) ]

where n is the number of training samples, ŷᵢ is the model's prediction and yᵢ is the ground truth, all expressed as probabilities.
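The variables just defined (n training samples, predictions ŷᵢ and labels yᵢ, all probabilities) correspond to the standard binary cross-entropy loss; a direct sketch:

```python
import math

def bce_loss(y_true: list[float], y_pred: list[float]) -> float:
    """Binary cross-entropy averaged over n samples."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

# One positive example scored 0.9 and one negative example scored 0.1:
loss = bce_loss([1.0, 0.0], [0.9, 0.1])
print(round(loss, 5))  # 0.10536
```

Both terms here reduce to -log(0.9), so the loss is small, as expected for confident, correct predictions.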
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the principle of the present invention, and these modifications and variations should also fall within the protection scope of the present invention.
Claims (4)
1. A Chinese entity linking method integrating pinyin information, characterized by comprising the following steps:
1) constructing a local knowledge base, wherein the knowledge base comprises a plurality of entities, and each entity corresponds to a unique identification id, an alias and a related descriptive text;
2) selecting training data and carrying out corresponding format processing;
3) named entity recognition;
4) candidate entity generation: an entity name mapping table is constructed, and candidate entities are then obtained by string matching;
5) entity disambiguation: the task is cast as binary classification; the candidate entities are ranked by predicted probability, and the entity with the highest probability is taken as the correct entity.
2. The method as claimed in claim 1, wherein the step 2) comprises the following steps:
2.1) constructing a pinyin-letter mapping table: for example, the first tone is 1, the second tone is 2, and so on; the neutral tone is 0 and "no pinyin" is 5. Digits and letters are stored in a list, e.g. [0,1,2,3,4,5,a,b,…,x,y,z], corresponding to ids [0,1,2,3,…,30,31];
2.2) carrying out format processing on the training data, and processing special symbols and punctuations;
2.3) constructing an entity alias list;
2.4) constructing the entity mapping dictionaries entry_to_id and id_to_entry.
3. The method as claimed in claim 1 or 2, wherein the step 3) comprises the steps of:
3.1) carrying out an entity recognition task by utilizing a bert pre-training model;
3.2) using maximum forward matching, with the previously constructed entity table as the entity library;
3.3) the Chinese text is fed into the BERT pre-training model to obtain output vectors for the text; these pass through a CRF layer to yield the entities in the text, which are then compared with the entities in the entity library, and, based on character-level matching, the longest matching entity in the library is selected.
4. The method as claimed in claim 1 or 2, wherein the step 5) comprises the steps of:
5.1) obtaining an entity representation: the entity recognized in the Chinese text and its candidate entities from the entity alias mapping table are retrieved, and the corresponding entity information, including the related descriptive text, is looked up in the knowledge base; the entities are first converted into the corresponding ids via the entity-id mapping table, then the corresponding pinyin mapping ids are obtained from the pinyin of each Chinese character, and the two representations are added; combined with the related descriptive information, the result is fed into the Bert pre-training model, and the output vector representation of each entity is obtained at [CLS];
5.2) entity disambiguation: following the binary-classification approach, the matched entity serves as a positive example and two negative examples are drawn from the candidate entity set; the text and the textual description of the entity to be disambiguated are concatenated as the input of the Bert model; the vector output at the [CLS] position is concatenated with the feature vectors of the candidate entity's start and end positions, the probability of the candidate entity is obtained through a fully connected layer and a sigmoid activation function, and the entity with the highest probability is selected as the correct entity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210614477.3A CN114861669A (en) | 2022-05-26 | 2022-05-26 | Chinese entity linking method integrating pinyin information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114861669A true CN114861669A (en) | 2022-08-05 |
Family
ID=82640618
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210614477.3A Pending CN114861669A (en) | 2022-05-26 | 2022-05-26 | Chinese entity linking method integrating pinyin information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114861669A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560489A (en) * | 2020-12-08 | 2021-03-26 | 北京航天云路有限公司 | Entity linking method based on Bert |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10176804B2 (en) | Analyzing textual data | |
CN101133411B (en) | Fault-tolerant romanized input method for non-roman characters | |
CN102982021B (en) | For eliminating the method for the ambiguity of the multiple pronunciations in language conversion | |
Zhou et al. | Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields | |
KR101259558B1 (en) | apparatus and method for detecting sentence boundaries | |
CN109960728B (en) | Method and system for identifying named entities of open domain conference information | |
WO2006030302A1 (en) | Optimization of text-based training set selection for language processing modules | |
CN111488468B (en) | Geographic information knowledge point extraction method and device, storage medium and computer equipment | |
JP7266683B2 (en) | Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction | |
CN108399157B (en) | Dynamic extraction method of entity and attribute relationship, server and readable storage medium | |
CN113268576B (en) | Deep learning-based department semantic information extraction method and device | |
Jain et al. | “UTTAM” An Efficient Spelling Correction System for Hindi Language Based on Supervised Learning | |
KR20230009564A (en) | Learning data correction method and apparatus thereof using ensemble score | |
CN113673228A (en) | Text error correction method, text error correction device, computer storage medium and computer program product | |
CN114065738B (en) | Chinese spelling error correction method based on multitask learning | |
CN115064154A (en) | Method and device for generating mixed language voice recognition model | |
US20190286702A1 (en) | Display control apparatus, display control method, and computer-readable recording medium | |
CN114861669A (en) | Chinese entity linking method integrating pinyin information | |
CN110619119B (en) | Intelligent text editing method and device and computer readable storage medium | |
CN112765977A (en) | Word segmentation method and device based on cross-language data enhancement | |
US7831549B2 (en) | Optimization of text-based training set selection for language processing modules | |
CN116720520A (en) | Text data-oriented alias entity rapid identification method and system | |
CN111159405A (en) | Irony detection method based on background knowledge | |
CN114996455A (en) | News title short text classification method based on double knowledge maps | |
CN114970541A (en) | Text semantic understanding method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||