CN108415902B

CN108415902B - Named entity linking method based on search engine

Info

Publication number: CN108415902B
Application number: CN201810138076.9A
Authority: CN
Inventors: 吴共庆; 何颖; 胡学钢; 胡东辉; 李磊; 吴信东
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2018-02-10
Filing date: 2018-02-10
Publication date: 2021-10-26
Anticipated expiration: 2038-02-10
Also published as: CN108415902A

Abstract

The invention discloses a named entity linking method based on a search engine, comprising the following steps: segmenting an input query text into words and identifying a group of named entities in the text that need to be linked to a knowledge base; for each identified named entity, searching a Chinese entity knowledge base by its entity designation to obtain a candidate entity list; expanding the context information of the entity designation by using a search engine; expanding the context information of each candidate entity in the candidate entity list by using a search engine; calculating the matching degree between the entity designation and each candidate entity; and selecting the entity with the maximum matching degree for linking. The method retrieves the entity designation and the candidate entities in the search engine and uses information from the search results to expand their context information, providing additional information that improves the accuracy of entity linking.

Description

Named entity linking method based on search engine

Technical Field

The invention relates to a network information processing method, in particular to a named entity linking method based on a search engine.

Background

With the continuing advance of informatization, people increasingly obtain information through the internet, and traditional media such as newspapers, magazines and periodicals are gradually being overtaken by portal websites, electronic libraries and search engines. Most information obtained through the internet is text spread through platforms such as news websites, microblogs and forum posts, and this text contains a large number of named entities. The information era thus offers an efficient way of reading while also bringing explosive growth in the quantity of information. How to link this large number of named entities efficiently, accurately and unambiguously to the corresponding entities in a knowledge base has become a problem to be solved in fields such as information fusion, natural language processing and information retrieval.

Named entity linking is mainly divided into two stages. The first stage searches a knowledge base according to the entity designation to obtain a set of potential candidate entities; the second stage performs entity disambiguation based on the entity designation and the candidate entity set. At present, work on named entity linking focuses mainly on optimizing the entity disambiguation algorithm, and the main disambiguation methods are those based on entity popularity ranking, on context similarity ranking, on classification, and on graphs.

The disambiguation method based on entity popularity ranking is the simplest and most direct: it returns the candidate entity with the highest popularity. For a person entity, for example, the better known the person, the more likely that person is to be linked; popularity is usually measured by the number of links that point to the entity. The disadvantage of this method is that the same result is returned regardless of whether the entity actually being queried is the popular one.

The disambiguation method based on context similarity ranking uses the similarity between the context of the entity designation and the context of each candidate entity to determine the entity that should be linked. This method usually represents the two contexts as bag-of-words vectors, or represents selected keywords in them (such as the named entities in the text) as vectors in a word space, and then calculates a similarity between the vectors, such as cosine similarity. Its disadvantage is that the similarity calculation depends on word co-occurrence information, so a large amount of entity context is needed, and accuracy is low when little context is available.

The classification-based disambiguation method treats linking as a classification process. In the training stage, entity designations are extracted from the knowledge base; <entity designation, candidate entity> pairs in which the designation links to the entity in the knowledge base are taken as positive examples, and pairs in which it does not are taken as negative examples. A classifier is then trained on the training set formed by the positive and negative examples; commonly used classification models include SVMs, decision trees and naive Bayes. In the linking stage, the trained classifier classifies the candidate entities of an entity designation. However, this approach can easily yield two or more linked entities, which then have to be re-ranked using context similarity or other features.

The graph-based disambiguation method takes entity designations and candidate entities as graph nodes and the relations between them as graph edges to build a graph structure. When computing the relation between an entity designation node and its candidate entity nodes, most methods use context similarity features. The graph-based method therefore also depends on word co-occurrence information to some extent and likewise suffers from low accuracy when little context is available.

To summarize, most named entity linking methods rely on the similarity between the context of the entity designation and the context of the candidate entity. Computing this feature reliably requires a large amount of context for both the entity designation and its candidate entities; otherwise the similarity depends too heavily on sparse word co-occurrence information. A method that can expand the context information of entity designations and candidate entities is therefore needed to improve the accuracy of the calculation.

Disclosure of Invention

The invention aims to provide a named entity linking method based on a search engine, which aims to solve the problems in the prior art.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

a named entity linking method based on a search engine is characterized in that: the method comprises the following steps:

step 1, segmenting an input query text, and identifying a group of named entities needing to be linked to a knowledge base in the text;

step 2, obtaining a candidate entity list for each named entity through entity designation;

step 3, expanding the context information of the entity designation by utilizing a search engine;

step 4, expanding the context information of each candidate entity in the candidate entity list by utilizing a search engine;

step 5, calculating the matching degree between the entity designation and each candidate entity;

step 6, ranking the candidate entities according to their matching degrees, and selecting the candidate entity with the maximum matching degree as the target entity for linking.

The named entity linking method based on the search engine is characterized in that: in the step 1:

the named entities are personal names, organization names, place names and other entities identified by a name, and the entities are objects or concepts in real life;

the query text is a text or a text set input by a user, and comprises an entity designation and its context text, wherein the text length is a fixed length;

the knowledge base is a set of entity entries, covering persons, organizations, place names and other entities identified by a name;

the entity entry is the specific information stored in the knowledge base for each entity, and comprises the context text, the entity attributes and the attribute values of the entity; the context information of the entity is text describing the entity in detail, and the entity designation is the noun phrase in the query text that refers to the entity to be linked.

The named entity linking method based on the search engine is characterized in that: in the step 2, the method for obtaining the candidate entity by the entity designation comprises the following steps:

sequentially taking out each entity entry in the knowledge base, traversing the name attribute value and the alias attribute values of the entry, and calculating the string similarity between the entity designation in the query text and each of those values; if the similarity between the entity designation and the name attribute value of an entity entry is greater than a preset threshold value, outputting that entity entry to a candidate entity list, and if the similarity between the entity designation and an alias attribute value of an entity entry is greater than the preset threshold value, outputting the entity entry to which the alias belongs to the candidate entity list;

an initial candidate entity list is obtained through the above method, and the candidate entity list is obtained by screening the initial candidate entity list as follows:

extracting part of the entities in the knowledge base as a training set, training an entity-type classifier on the set of labeled <entity, type> pairs, identifying the types of the entities in the initial candidate entity list with the entity-type classifier, retaining the entities of the same type as the entity to be linked in the query text, and deleting the entities of different types to obtain the final candidate entity list;

the type of an entity is its category attribute, including person, place, organization and time; an <entity, type> pair is a mapping between an entity and its type, wherein a pair with the correct mapping is a positive example for training the classifier and a pair with an incorrect mapping is a negative example; the entity-type classifier is a multi-class classification model trained on <entity, type> pairs, by which the type of an entity can be determined.
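The first stage of step 2 (string-similarity lookup over name and alias values) can be sketched as follows. This is only an illustration under stated assumptions: the source does not name a particular string similarity measure, so Python's `difflib.SequenceMatcher` ratio is used as a stand-in, and the entry layout (`name`, `aliases`) and the toy entries are invented for the example.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def initial_candidates(designation, knowledge_base, threshold):
    """Collect entries whose name or any alias exceeds the similarity threshold."""
    candidates = []
    for entry in knowledge_base:  # take out each entity entry in turn
        values = [entry["name"], *entry.get("aliases", [])]
        if any(string_similarity(designation, v) > threshold for v in values):
            candidates.append(entry)
    return candidates


# Invented toy entries; a real knowledge base would be far larger.
TOY_KB = [
    {"name": "Lina (tennis player)", "aliases": ["Lina", "Sister Na"]},
    {"name": "Lina (singer)", "aliases": ["Lina"]},
    {"name": "Nadal", "aliases": ["Rafael Nadal"]},
]

if __name__ == "__main__":
    for entry in initial_candidates("Lina", TOY_KB, threshold=0.7):
        print(entry["name"])
```

An entry is added once even if both its name and an alias pass the threshold, which keeps the initial list free of duplicates.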

The named entity linking method based on the search engine is characterized in that: in step 3, the method for expanding the context information referred by the entity by using the search engine comprises the following steps:

taking the entity designations of the group of named entities identified in the query text in step 1 as seeds and retrieving them in a search engine to obtain search result webpages related to the entity designation; extracting the title and summary information of the first m search results from the search result webpages to obtain the extended text of the entity designation; the query text and the extended text together form the expanded context information of the entity designation.

The named entity linking method based on the search engine is characterized in that: in step 4, the method for expanding the context information of each candidate entity in the candidate entity list by using the search engine comprises the following steps:

segmenting into words the context text, in the knowledge base, of each candidate entity in the candidate entity list, and performing named entity recognition on the segmented text to obtain a group of named entities related to the candidate entity; then taking this group of named entities as seeds and retrieving them in a search engine to obtain search result webpages related to the candidate entity; extracting the title and summary information of the first n search results from the search result webpages to obtain the extended text of the candidate entity; the context text of the candidate entity in the knowledge base and the extended text together form the expanded context information of the candidate entity.
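Steps 3 and 4 share one pattern: join a group of named entities into a query, take the titles and summaries of the first k results, and append them to the base text. A minimal sketch of that pattern follows. The `search` callable, the result keys `title` and `summary`, and `fake_search` are hypothetical stand-ins; no real search engine API is implied.

```python
from typing import Callable, Dict, List

SearchResult = Dict[str, str]  # assumed keys: "title" and "summary"
SearchFn = Callable[[str], List[SearchResult]]


def expand_context(base_text: str, seeds: List[str], search: SearchFn, top_k: int) -> str:
    """Return base_text plus the title and summary of the first top_k results."""
    query = " ".join(seeds)            # the group of named entities is the seed
    results = search(query)[:top_k]    # keep only the first top_k results
    extended = [f"{r['title']} {r['summary']}" for r in results]
    return "\n".join([base_text, *extended])


def fake_search(query: str) -> List[SearchResult]:
    """Stand-in for a real search engine client; returns canned results."""
    return [
        {"title": f"Result {i} for {query}", "summary": f"Summary {i}"}
        for i in range(1, 6)
    ]


if __name__ == "__main__":
    print(expand_context("query text", ["Lina", "Nadal"], fake_search, top_k=2))
```

For step 3 the base text would be the query text and `top_k` would be m; for step 4 the base text would be the candidate's context text from the knowledge base and `top_k` would be n.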

The named entity linking method based on the search engine is characterized in that: in step 5, the method for calculating the matching degree between the entity designation and each candidate entity comprises:

for each candidate entity in the candidate entity list, calculating the cosine similarity a between the expanded context information of the entity designation and the expanded context information of the candidate entity, calculating the popularity b of the candidate entity, and calculating the matching degree between the entity designation and the candidate entity as c = w_a × a + w_b × b, wherein w_a and w_b are preset weights with w_a + w_b = 1; the popularity of a candidate entity is the ratio of the number of times the entity designation, used as a hyperlink, links to the page of that candidate entity to the total number of times it links to all candidate entities.
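The three quantities in step 5 can be sketched as below. The source does not say how the expanded context is turned into vectors, so raw term counts over word tokens are assumed here; the function names and the link-count dictionary are likewise illustrative.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity a between two token lists (raw term counts assumed)."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def popularity(link_counts, candidate):
    """Popularity b: share of the designation's hyperlinks that point to candidate."""
    total = sum(link_counts.values())
    return link_counts.get(candidate, 0) / total if total else 0.0


def matching_degree(a, b, w_a, w_b):
    """Matching degree c = w_a * a + w_b * b, with w_a + w_b = 1."""
    if not math.isclose(w_a + w_b, 1.0):
        raise ValueError("weights must sum to 1")
    return w_a * a + w_b * b
```

With a = 0.53, b = 0.88, w_a = 0.6 and w_b = 0.4, `matching_degree` gives 0.318 + 0.352 = 0.67.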

Compared with the prior art, the invention has the beneficial effects that:

1. The present invention employs a two-stage method of generating the candidate entity list. First, string similarity is calculated to obtain an initial candidate entity list. Then, the initial list is screened by entity type classification to obtain the candidate entity list. The first stage can produce more candidate entities related to the entity designation, and the second stage removes some noisy candidates through entity type classification, which reduces the complexity of disambiguation.

2. The key contribution of the invention is a method for expanding the context text of the entity designation and of the candidate entities to address the named entity linking problem. The context text of an entity designation is limited in length, and the entity context contained in the knowledge base is also limited, so a similarity between the two contexts that depends on word co-occurrence information is not very accurate. By retrieving the entity designation and the candidate entities through a search engine, obtaining webpages related to them, and collating those webpages into expanded context information, more entity features can be extracted, providing additional information that improves the accuracy of entity linking.

Drawings

FIG. 1 is a flowchart of a named entity linking method based on a search engine according to the present invention.

Detailed Description

A named entity linking method based on a search engine comprises the following steps:

Step 1, performing word segmentation on an input query text, and identifying a group of named entities in the text that need to be linked to a knowledge base.

Step 2, obtaining a candidate entity list for each named entity through its entity designation.

Step 3, expanding the context information of the entity designation by utilizing a search engine.

Step 4, expanding the context information of each candidate entity in the candidate entity list by utilizing a search engine.

Step 5, calculating the matching degree between the entity designation and each candidate entity.

Step 6, ranking the candidate entities according to their matching degrees, and selecting the candidate entity with the maximum matching degree as the target entity for linking.

In step 1, the named entities are personal names, organization names, place names and other entities identified by a name, and the entities are objects or concepts in real life. The query text is a text or a text set input by a user; it comprises an entity designation and its context text, and the text length is a fixed length. The knowledge base is a set of entity entries, covering persons, organizations, place names and other entities identified by a name. The entity entry is the specific information stored in the knowledge base for each entity, and comprises the context text, the entity attributes and the attribute values of the entity. The context information of the entity is text describing the entity in detail, and the entity designation is the noun phrase in the query text that refers to the entity to be linked.

In step 2, the method for obtaining the candidate entities through the entity designation is as follows. Each entity entry in the knowledge base is taken out in turn, the name attribute value and the alias attribute values of the entry are traversed, and the string similarity between the entity designation in the query text and each of those values is calculated. If the similarity between the entity designation and the name attribute value of an entity entry is greater than a preset threshold value, that entity entry is output to the candidate entity list; if the similarity between the entity designation and an alias attribute value of an entity entry is greater than the preset threshold value, the entity entry to which the alias belongs is output to the candidate entity list. An initial candidate entity list is obtained in this way.

The initial candidate entity list is then screened to obtain the candidate entity list, as follows. Part of the entities in the knowledge base are extracted as a training set, and the entity-type classifier is trained on the labeled <entity, type> pairs. The entity-type classifier is then used to identify the types of the entities in the initial candidate entity list; entities of the same type as the entity to be linked in the query text are retained, and entities of different types are deleted to obtain the final candidate entity list. The type of an entity is its category attribute, including person, place, organization and time. An <entity, type> pair is a mapping between an entity and its type; a pair with the correct mapping is a positive example for training the classifier, and a pair with an incorrect mapping is a negative example. The entity-type classifier is a multi-class classification model trained on <entity, type> pairs: it assigns each entity to exactly one of several mutually exclusive classes, so that the type of an entity can be determined.

In step 3, the method for expanding the context information of the entity designation by using the search engine is as follows: the entity designations of the group of named entities identified in the query text in step 1 are used as seeds and retrieved in a search engine to obtain search result webpages related to the entity designation; the title and summary information of the first m search results is extracted from the search result webpages to obtain the extended text of the entity designation; the query text and the extended text together form the expanded context information of the entity designation.

In step 4, the method for expanding the context information of each candidate entity in the candidate entity list by using the search engine is as follows: the context text, in the knowledge base, of each candidate entity in the candidate entity list is segmented into words, and named entity recognition is performed on the segmented text to obtain a group of named entities related to the candidate entity; this group of named entities is then used as seeds and retrieved in a search engine to obtain search result webpages related to the candidate entity; the title and summary information of the first n search results is extracted from the search result webpages to obtain the extended text of the candidate entity; the context text of the candidate entity in the knowledge base and the extended text together form the expanded context information of the candidate entity.

In step 5, the method for calculating the matching degree between the entity designation and each candidate entity is as follows: for each candidate entity in the candidate entity list, the cosine similarity a between the expanded context information of the entity designation and the expanded context information of the candidate entity is calculated, the popularity b of the candidate entity is calculated, and the matching degree between the entity designation and the candidate entity is calculated as c = w_a × a + w_b × b, where w_a and w_b are preset weights with w_a + w_b = 1. The popularity of a candidate entity is the ratio of the number of times the entity designation, used as a hyperlink, links to the page of that candidate entity to the total number of times it links to all candidate entities.

The specific embodiment is as follows:

the embodiment provides a named entity linking method based on a search engine, and the following describes the steps of the named entity linking method based on the search engine in the embodiment with reference to fig. 1:

(1) As shown in S101 of fig. 1, the input query text is segmented into words, and a group of named entities in the text that need to be linked to the knowledge base is identified. A user enters a query text, for example a sentence from a news item: "Lina retirement ceremony, Nadal presents flowers; Sister Na bursts into tears as the whole stadium cheers." First, the entered query text is segmented into words, and then a group of named entities in the text is identified using a named entity recognition tool. The named entities include names of persons, organizations, places and other entities identified by a name; in a broader sense, entities also include numbers, dates, currencies and addresses.

Word segmentation of the input query text means dividing the query text into words according to a Chinese dictionary by using a word segmentation tool. The tool used in this embodiment is the NLPIR Chinese word segmentation tool (website: http://ictclas.nlpir.org/downloads) provided by the Institute of Computing Technology, Chinese Academy of Sciences. The set of words and their type labels obtained by word segmentation is: W = {Lina/nr retirement/vi ceremony/n Nadal/nr flower donation/vi ,/wd Sister Na/nr tear/n ben/v share/vg whole field/n cheer/v}, where "/nr", "/vi", "/n", "/wd", "/v" and "/vg" are the type labels of the words.

After the set of words is obtained by word segmentation, named entity recognition is used to identify the set of entities contained in the segmented text: E = {Lina, Nadal, Sister Na}. The named entity recognition method adopted in this embodiment extracts the words whose type labels are nr, ns, nt or nz from the segmented words and their type labels to form the named entity set. Here, "Lina", "Nadal" and "Sister Na" are the entity designations of the identified named entities. This embodiment describes the named entity linking process by taking the entity designation "Lina" as an example.
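The tag-based recognition rule described above (keep the words labeled nr, ns, nt or nz) can be sketched as follows. The `word/tag` whitespace-separated input format and the English sample string are assumptions made for illustration; multi-word English glosses are joined with an underscore so that each token stays a single word, as it would be in the Chinese original.

```python
ENTITY_TAGS = {"nr", "ns", "nt", "nz"}  # labels kept as named entities


def parse_tagged(tagged_text):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("/")  # split on the last slash
        if sep and word:
            pairs.append((word, tag))
    return pairs


def extract_entities(pairs):
    """Return the distinct words whose tag is an entity tag, in order of appearance."""
    seen, entities = set(), []
    for word, tag in pairs:
        if tag in ENTITY_TAGS and word not in seen:
            seen.add(word)
            entities.append(word)
    return entities


# Illustrative English rendering of the segmented query text.
SAMPLE = "Lina/nr retirement/vi ceremony/n Nadal/nr flower_donation/vi ,/wd Sister_Na/nr cheer/v"

if __name__ == "__main__":
    print(extract_entities(parse_tagged(SAMPLE)))
```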

(2) As shown in S102 of fig. 1, a candidate entity list is obtained for each named entity through its entity designation. Each entity entry in the knowledge base is taken out in turn, the name attribute value and the alias attribute values of the entry are traversed, and the string similarity between the entity designation in the query text and each of those values is calculated. If the similarity between the entity designation and the name attribute value of an entity entry is greater than a preset threshold value, that entity entry is output to the candidate entity list; if the similarity between the entity designation and an alias attribute value of an entity entry is greater than the preset threshold value, the entity entry to which the alias belongs is output to the candidate entity list. An initial candidate entity list is obtained in this way. The knowledge base used in this embodiment is the Chinese knowledge base released by Tsinghua University (website: http://keg.cs.tsinghua.edu.cn/project/ChineseKB), which contains 800,000 different entities and their attributes. The name attribute value is contained in the rdf:about attribute of the <rdf:Description rdf:about=""></rdf:Description> tag, and the alias attributes are contained in <ont:alias></ont:alias> tags. The threshold value is preset to 0.7 in this embodiment. For example, the initial candidate entity entries that satisfy the threshold when searching for the entity designation "Lina" are shown in table 1:

Table 1 Initial candidate entity list for the entity designation "Lina"

In table 1, the entity name is the name attribute value of the entity entry extracted from the knowledge base, which is contained in the rdf:about attribute of the <rdf:Description rdf:about=""></rdf:Description> tag; the entity context is the context attribute value of the entity entry, which is contained in the <ont:ABSTRACT></ont:ABSTRACT> tag; and the other attributes are the remaining attribute values of the entity entry, which are contained in tags other than the <rdf:Description rdf:about=""></rdf:Description> and <ont:ABSTRACT></ont:ABSTRACT> tags.
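Reading the name, alias and context fields from entries laid out in this way might look like the sketch below, which uses Python's standard `xml.etree.ElementTree`. The RDF namespace URI is the standard one; the `ont` namespace URI is a placeholder, since the source does not give it, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ONT_NS = "http://example.org/ont#"  # placeholder: the real URI is not given in the source

# Invented sample with the tag layout described in the text.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:ont="{ONT_NS}">
  <rdf:Description rdf:about="Lina (tennis player)">
    <ont:alias>Lina</ont:alias>
    <ont:ABSTRACT>Chinese women's tennis player.</ont:ABSTRACT>
  </rdf:Description>
</rdf:RDF>"""


def read_entries(xml_text):
    """Return one dict per rdf:Description with its name, aliases and context."""
    root = ET.fromstring(xml_text)
    entries = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        entries.append({
            "name": desc.get(f"{{{RDF_NS}}}about"),          # rdf:about attribute
            "aliases": [e.text for e in desc.findall(f"{{{ONT_NS}}}alias")],
            "context": desc.findtext(f"{{{ONT_NS}}}ABSTRACT", default=""),
        })
    return entries


if __name__ == "__main__":
    print(read_entries(SAMPLE))
```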

Then, the initial candidate entity list is screened to obtain the candidate entity list, as follows: part of the entities in the knowledge base are extracted as a training set, and an entity-type classifier is trained on the labeled <entity, type> pairs. In this embodiment, the training set is built by randomly extracting 200 entity entries from the set of entity entries of each type, and the classifier model is an SVM (Support Vector Machine). The entity-type classifier is then used to identify the types of the entities in the initial candidate entity list; entities of the same type as the entity to be linked in the query text are retained, and entities of different types are deleted to obtain the final candidate entity list. The entity type is the category attribute of the entity, including person, place, organization and time. The candidate entity list after screening is shown in table 2:

Table 2 Candidate entity list for the entity designation "Lina"
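The screening stage reduces to keeping the candidates whose predicted type equals the type of the entity to be linked. The sketch below shows only that filtering logic. To stay self-contained it replaces the trained SVM with a hypothetical keyword rule (`toy_classify`), and the candidate entries, including "Lina Village", are invented.

```python
def filter_by_type(candidates, target_type, classify):
    """Keep candidates whose predicted type matches the type of the entity to link."""
    return [c for c in candidates if classify(c) == target_type]


def toy_classify(entry):
    """Hypothetical stand-in for the trained entity-type classifier."""
    text = entry["context"].lower()
    if "village" in text or "city" in text:
        return "place"
    if "company" in text:
        return "organization"
    return "person"


# Invented candidates for illustration only.
CANDIDATES = [
    {"name": "Lina (tennis player)", "context": "Chinese women's tennis player"},
    {"name": "Lina (singer)", "context": "Chinese singer"},
    {"name": "Lina Village", "context": "a village"},
]

if __name__ == "__main__":
    for entry in filter_by_type(CANDIDATES, "person", toy_classify):
        print(entry["name"])
```

Passing the classifier in as a callable means a trained multi-class model can be substituted for the keyword rule without changing the filter.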

(3) As shown in S103 of fig. 1, the context information of the entity designation is expanded using the search engine. Taking the entity designation "Lina" as an example, the query text containing the entity designation is "Lina retirement ceremony, Nadal presents flowers; Sister Na bursts into tears as the whole stadium cheers." The entity designations "Lina, Nadal, Sister Na" of the group of named entities identified from the query text are used as seeds and retrieved in the search engine to obtain search result webpages related to the entity designation "Lina". The title and summary information of the first m search results is extracted from the search result webpages to obtain the extended text of the entity designation "Lina" shown in table 3, and the query text and the extended text together form the expanded context information of the entity designation "Lina". In this embodiment, the search engine is Bing, the title and summary information of the search results is extracted using a CSS selector, and m is 8.

Table 3 Extended text of the entity designation "Lina"

(4) As shown in S104 of fig. 1, the context information of each candidate entity in the candidate entity list is expanded using the search engine. First, the context text in the knowledge base of each candidate entity in the candidate entity list is segmented into words; in this embodiment, the context text of a candidate entity is contained in the <ont:ABSTRACT></ont:ABSTRACT> tag. Taking the candidate entity "Lina (tennis player)" as an example, the result of word segmentation on its context text is W* = {lina/nr (/wkz 1982/t 2/t 26/t —/wp )/wky ,/wd china/nr women/n tennis balls/n players/n ./wj 2011/t 6/t 4/t ,/wd is/p france/nr /ude1 paris/nr west/fmon/vi tex/dhoo/a land/n /ude1 roland/nrf · garland/bso/b si/b stadium/n in/f ,/wd lisa/nr acquisition/v french/n lady/n champion/n ./wj becomes/ude1 Asian/nr/n/v has been the first/m/q of/dl since the history/v is obtained/v big/a grand/n tennis match/n events/n championship/n ./wj}. Named entity recognition is then performed on the segmented text to obtain a group of named entities related to the candidate entity. The named entity recognition method adopted in this embodiment extracts the words whose type labels are nr, ns, nt or nz from the segmented words and their type labels to form the named entity set. The result of named entity recognition is E* = {Lina, China, France, Paris, Asia}.

Then, the group of named entities "Lina, China, France, Paris, Asia" is used as seeds and retrieved in the search engine to obtain a group of search result webpages related to the candidate entity "Lina (tennis player)". The title and summary information of the first n search results is extracted from the search result webpages to obtain the extended text of the candidate entity "Lina (tennis player)" shown in table 4, and the context text of the candidate entity in the knowledge base and the extended text together form the expanded context information of the candidate entity "Lina (tennis player)". In this embodiment, the search engine is Bing, the title and summary information of the search results is extracted using a CSS selector, and n is 10.

Table 4 Extended text of the candidate entity "Lina (tennis player)"

(5) As shown in S105 of fig. 1, the matching degree between the entity designation and each candidate entity is calculated. For each candidate entity in the candidate entity list, the cosine similarity a between the expanded context information of the entity designation and the expanded context information of the candidate entity is calculated, the popularity b of the candidate entity is calculated, and the matching degree between the entity designation and the candidate entity is calculated as c = w_a × a + w_b × b, where w_a and w_b are preset weights with w_a + w_b = 1. The popularity of a candidate entity is the ratio of the number of times the entity designation, used as a hyperlink, links to the page of that candidate entity to the total number of times it links to all candidate entities. In this embodiment, w_a = 0.6 and w_b = 0.4. Taking the entity designation "Lina" from step (3) and the candidate entity "Lina (tennis player)" as an example, the cosine similarity a between their expanded context information is 0.53 and the popularity b of the candidate entity is 0.88, so the matching degree is c = 0.6 × 0.53 + 0.4 × 0.88 = 0.67. Table 5 lists the matching degree between the entity designation "Lina" and each candidate entity.

Table 5 Degree of match between entity designation "Lina" and each candidate entity

Candidate entity	Cosine similarity	Popularity of candidate entity	Degree of matching
Lina (tennis player)	0.53	0.88	0.67
Lina (singer)	0.43	0.056	0.28
Lina (professor north)	0.39	0	0.23
Li Naying	0.19	0.31	0.24
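As a check on the arithmetic, the matching degrees in Table 5 can be recomputed from the cosine similarity and popularity columns with the weights of this embodiment. The function name `matching_degree` is hypothetical; the numeric inputs are taken from Table 5.

```python
W_A, W_B = 0.6, 0.4  # preset weights of this embodiment, w_a + w_b = 1


def matching_degree(a: float, b: float, w_a: float = W_A, w_b: float = W_B) -> float:
    """Return c = w_a * a + w_b * b for cosine similarity a and popularity b."""
    if abs(w_a + w_b - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w_a + w_b = 1")
    return w_a * a + w_b * b


# (cosine similarity, popularity) for each candidate entity, from Table 5.
TABLE_5 = {
    "Lina (tennis player)": (0.53, 0.88),
    "Lina (singer)": (0.43, 0.056),
    "Lina (professor north)": (0.39, 0.0),
    "Li Naying": (0.19, 0.31),
}

if __name__ == "__main__":
    for name, (a, b) in TABLE_5.items():
        print(f"{name}\t{matching_degree(a, b):.2f}")
```

Rounded to two decimal places, the results are 0.67, 0.28, 0.23 and 0.24, which agree with the last column of Table 5.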

(6) As shown in S106 of fig. 1, the candidate entities are ranked according to their matching degrees, and the candidate entity with the highest matching degree is selected as the target entity for linking. In this embodiment, Table 5 is sorted in descending order of the matching degree between the entity designation and the candidate entity to obtain Table 6. The first candidate entity in Table 6 has the highest matching degree, so the target entity to which the entity designation "Lina" is linked is "Lina (tennis player)".

Table 6 Candidate entities of entity designation "Lina" ranked by degree of matching in descending order

Candidate entity	Cosine similarity	Popularity of candidate entity	Degree of matching
Lina (tennis player)	0.53	0.88	0.67
Lina (singer)	0.43	0.056	0.28
Li Naying	0.19	0.31	0.24
Lina (professor north)	0.39	0	0.23
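The ranking and selection of step (6) can be sketched as follows, using the matching degrees listed above. The function names are hypothetical, and the handling of ties (the earlier candidate wins) is an assumption, since the source does not specify it.

```python
from typing import Dict, List, Optional, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidates by matching degree, largest first (stable for ties)."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_target(scores: Dict[str, float]) -> Optional[str]:
    """Return the candidate with the maximum matching degree, or None if empty."""
    if not scores:
        return None
    return max(scores, key=lambda name: scores[name])


# Matching degrees of the candidates for the entity designation "Lina".
SCORES = {
    "Lina (tennis player)": 0.67,
    "Lina (singer)": 0.28,
    "Lina (professor north)": 0.23,
    "Li Naying": 0.24,
}

if __name__ == "__main__":
    for name, score in rank_candidates(SCORES):
        print(f"{name}\t{score}")
    print("target entity:", select_target(SCORES))
```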

Claims

1. A named entity linking method based on a search engine, characterized in that the method comprises the following steps:

step 6, sorting the candidate entities according to the matching degrees of the candidate entities, and selecting the candidate entity with the maximum matching degree as a target entity for linking;

in the step 1:

the entity items are the specific information stored in the knowledge base for each entity, and comprise the context text, entity attributes and attribute values of the entity; the context text of an entity is text describing the entity in detail; and an entity designation is a noun mention, in the query text, of an entity to be linked;

in the step 2, the method for obtaining the candidate entity by the entity designation comprises the following steps:

the type of an entity is the category attribute of the entity, including person, place, organization and time; an <entity, type> pair is a mapping between an entity and its type, wherein an <entity, type> pair with a correct mapping is a positive example for training the classifier, and otherwise is a negative example; the entity-type classifier is a multi-class classification model trained on a set of <entity, type> pairs, by which the type of an entity can be determined;

in step 3, the method for expanding the context information of the entity designation by using the search engine comprises the following steps:

using the entity designations of the group of named entities identified in the query text in step 1 as seeds for retrieval in a search engine to obtain search result webpages related to the entity designation, and extracting title and summary information of the first m search results from the search result webpages to obtain the extended text of the entity designation, wherein the query text and the extended text together form the extended context information of the entity designation;

in step 4, the method for expanding the context information of each candidate entity in the candidate entity list by using the search engine comprises the following steps:

segmenting the context text, in the knowledge base, of each candidate entity in the candidate entity list, performing named entity recognition on the segmented text to obtain a group of named entities related to the candidate entity, then using the group of named entities as a seed for retrieval in a search engine to obtain search result webpages related to the candidate entity, and extracting title and summary information of the first n search results from the search result webpages to obtain the extended text of the candidate entity, wherein the context text of the candidate entity in the knowledge base and the extended text together form the extended context information of the candidate entity;

in step 5, the method for calculating the matching degree between the entity designation and each candidate entity comprises:

for each candidate entity in the candidate entity list, calculating the cosine similarity a between the expanded context information of the entity designation and the expanded context information of the candidate entity, calculating the popularity b of the candidate entity, and calculating the degree of match between the entity designation and the candidate entity as c = w_a × a + w_b × b, wherein w_a and w_b are preset weights and w_a + w_b = 1; the popularity of the candidate entity is the ratio of the number of times the entity designation, as a hyperlink, links to the page of the candidate entity to the total number of times it links to the pages of all candidate entities.