CN108415902B - Named entity linking method based on search engine - Google Patents

Named entity linking method based on search engine

Info

Publication number
CN108415902B
Authority
CN
China
Prior art keywords
entity
candidate
entities
text
named
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810138076.9A
Other languages
Chinese (zh)
Other versions
CN108415902A (en
Inventor
吴共庆
何颖
胡学钢
胡东辉
李磊
吴信东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201810138076.9A priority Critical patent/CN108415902B/en
Publication of CN108415902A publication Critical patent/CN108415902A/en
Application granted granted Critical
Publication of CN108415902B publication Critical patent/CN108415902B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a named entity linking method based on a search engine, comprising the following steps: segmenting an input query text into words and identifying a group of named entities in the text that need to be linked to a knowledge base; for each identified named entity, searching a Chinese entity knowledge base by the entity mention to obtain a candidate entity list; expanding the context information of the entity mention by using a search engine; expanding the context information of each candidate entity in the candidate entity list by using a search engine; calculating the matching degree between the entity mention and each candidate entity; and selecting the entity with the maximum matching degree for linking. The method submits the entity mention and the candidate entities to the search engine and uses the search results to expand the context information of both, which provides additional information for improving the accuracy of entity linking.

Description

Named entity linking method based on search engine
Technical Field
The invention relates to a network information processing method, in particular to a named entity linking method based on a search engine.
Background
With the continuous advance of informatization, people increasingly obtain information through the internet, and traditional media such as newspapers, magazines and periodicals are gradually being overtaken by portal websites, electronic libraries and search engines. Most information obtained through the internet is text spread through platforms such as news websites, microblogs and forum posts, and this text contains a large number of named entities. The information era thus provides an efficient way of reading, but it is accompanied by explosive growth in the amount of information. How to link this large number of named entities to the corresponding entities in a knowledge base efficiently, accurately and without ambiguity has become a problem to be solved in fields such as information fusion, natural language processing and information retrieval.
Named entity linking is mainly divided into two stages: the first stage searches the knowledge base according to the entity mention to obtain a set of potential candidate entities, and the second stage performs entity disambiguation over the entity mention and the candidate entity set. At present, work on named entity linking focuses on optimizing the entity disambiguation algorithm, and the main approaches are disambiguation based on entity popularity ranking, disambiguation based on context similarity ranking, classification-based disambiguation, graph-based disambiguation, and the like.
The disambiguation method based on entity popularity ranking is the simplest and most direct: it returns the candidate entity with the highest popularity. For a person entity, for example, the better known the person, the more likely the person is to be chosen as the link target; popularity is usually measured by the number of links related to the entity. The disadvantage of this method is that the same result is returned whether or not the entity actually being queried is the popular one.
The disambiguation method based on context similarity ranking uses the similarity between the context of the entity mention and the context of each candidate entity to determine the entity that should be linked. This method usually represents the two contexts as bag-of-words vectors, or represents selected keywords in them (such as the named entities in the text) as word-space vectors, and then calculates a similarity between the vectors, such as cosine similarity. Its disadvantage is that the similarity calculation depends on word co-occurrence information and needs a large amount of entity context information, so the accuracy of the result is low when little context information is available.
The classification-based disambiguation method regards linking as a classification process. In the training stage, entity mentions are extracted from the knowledge base; <entity mention, candidate entity> pairs that are linked in the knowledge base are taken as positive examples, and pairs that are not linked are taken as negative examples. A classifier is then trained on the training set formed by the positive and negative examples; commonly used classification models include SVM, decision trees and naive Bayes. In the linking stage, the trained classifier is used to classify the candidate entities of an entity mention. However, this approach easily yields two or more linked entities, which then have to be re-ranked using context similarity or similar features.
The graph-based disambiguation method takes the entity mentions and the candidate entities as graph nodes and the relations between them as graph edges to build a graph structure. When computing the relation between an entity mention node and its candidate entity nodes, most methods use context similarity features between the two. Graph-based disambiguation therefore also depends on word co-occurrence information to some extent, and its accuracy is likewise low when little context information is available.
To summarize the entity disambiguation methods described above, most named entity linking methods rely on the similarity between the context of the entity mention and the context of the candidate entity. Because this feature is computed from word co-occurrence information, it is only reliable when a large amount of context information is available for both the entity mention and its candidate entities. A method capable of expanding the context information of the entity mention and of the candidate entities is therefore needed to improve the accuracy of the calculation.
Disclosure of Invention
The invention aims to provide a named entity linking method based on a search engine, which aims to solve the problems in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
A named entity linking method based on a search engine, characterized in that the method comprises the following steps:
step 1, segmenting an input query text into words, and identifying a group of named entities in the text that need to be linked to a knowledge base;
step 2, for each named entity, obtaining a candidate entity list by means of the entity mention;
step 3, expanding the context information of the entity mention by using a search engine;
step 4, expanding the context information of each candidate entity in the candidate entity list by using a search engine;
step 5, calculating the matching degree between the entity mention and each candidate entity;
step 6, ranking the candidate entities by matching degree, and selecting the candidate entity with the maximum matching degree as the target entity for linking.
The named entity linking method based on a search engine is characterized in that, in step 1:
the named entities are person names, organization names, place names and all other entities identified by a name, the entities being objects or concepts in real life;
the query text is a text or a set of texts input by a user, comprising the entity mention and its context text, the text being of fixed length;
the knowledge base is a set of entity entries, covering persons, organizations, places and all other entities identified by a name;
an entity entry is the specific information stored in the knowledge base for an entity, comprising the context text, the entity attributes and the attribute values of the entity; the context information of an entity is the text that describes the entity in detail; and the entity mention is the noun in the query text that refers to the entity to be linked.
The named entity linking method based on a search engine is characterized in that, in step 2, the method for obtaining the candidate entities by means of the entity mention comprises:
taking each entity entry out of the knowledge base in turn, traversing its name attribute value and alias attribute values, and calculating the string similarity between the entity mention in the query text and the name attribute value and each alias attribute value of the entry; if the similarity between the entity mention and the name attribute value of the entry is greater than a preset threshold, outputting the current entity entry to the candidate entity list, and if the similarity between the entity mention and an alias attribute value of the entry is greater than the preset threshold, outputting the entity entry to which the alias belongs to the candidate entity list;
the list obtained in this way is the initial candidate entity list, and the candidate entity list is obtained by screening the initial candidate entity list as follows:
extracting part of the entities in the knowledge base as a training set, training an entity-type classifier on the set of labeled <entity, type> pairs, identifying the type of each entity in the initial candidate entity list with the entity-type classifier, retaining the entities whose type is the same as that of the entity to be linked in the query text, and deleting the entities of a different type, to obtain the final candidate entity list;
the type of an entity is its category attribute, including person, place, organization and time; an <entity, type> pair is a mapping from an entity to its type, wherein a pair with the correct mapping is a positive example for training the classifier and a pair with an incorrect mapping is a negative example; the entity-type classifier is a multi-class classification model trained on <entity, type> pairs, by which the type of an entity can be determined.
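The two-stage candidate generation of step 2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the text does not name a string similarity measure, so `difflib.SequenceMatcher` is an assumed stand-in; the toy knowledge base is hypothetical; and `predict_type` is a hypothetical callable standing in for the trained entity-type classifier.

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in metric."""
    return SequenceMatcher(None, a, b).ratio()

def initial_candidates(mention, knowledge_base, threshold=0.7):
    """Stage 1: keep entries whose name or any alias is similar enough to the mention."""
    candidates = []
    for entry in knowledge_base:
        surface_forms = [entry["name"], *entry.get("aliases", [])]
        if any(string_similarity(mention, form) > threshold for form in surface_forms):
            candidates.append(entry)
    return candidates

def filter_by_type(candidates, mention_type, predict_type):
    """Stage 2: drop candidates whose predicted type differs from the mention's type."""
    return [c for c in candidates if predict_type(c) == mention_type]

# Hypothetical toy knowledge base; names, aliases and types are illustrative only.
kb = [
    {"name": "Li Na (tennis player)", "aliases": ["Li Na"], "type": "person"},
    {"name": "Li Na (singer)", "aliases": ["Li Na"], "type": "person"},
    {"name": "Lina (town)", "aliases": ["Li Na"], "type": "place"},
    {"name": "Nadal", "aliases": [], "type": "person"},
]

initial = initial_candidates("Li Na", kb)
# The lambda reads a stored label; a real system would call the trained classifier.
final = filter_by_type(initial, "person", predict_type=lambda entry: entry["type"])
print([c["name"] for c in final])
```

The first stage favors recall (any name or alias above the threshold qualifies), while the type filter removes candidates that cannot be correct, which keeps the later disambiguation step smaller.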
The named entity linking method based on a search engine is characterized in that, in step 3, the method for expanding the context information of the entity mention by using a search engine comprises:
taking the entity mentions of the group of named entities identified in the query text in step 1 as seeds for retrieval in a search engine to obtain search result web pages related to the entity mention, and extracting the title and summary information of the first m search results from the search result web pages to obtain the extended text of the entity mention, the query text and the extended text together forming the expanded context information of the entity mention.
The named entity linking method based on a search engine is characterized in that, in step 4, the method for expanding the context information of each candidate entity in the candidate entity list by using a search engine comprises:
segmenting into words the context text in the knowledge base of each candidate entity in the candidate entity list, performing named entity recognition on the segmented text to obtain a group of named entities related to the candidate entity, taking this group of named entities as seeds for retrieval in a search engine to obtain search result web pages related to the candidate entity, and extracting the title and summary information of the first n search results from the search result web pages to obtain the extended text of the candidate entity, the context text of the candidate entity in the knowledge base and the extended text together forming the expanded context information of the candidate entity.
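Steps 3 and 4 share the same shape: a group of seed entities is sent to the search engine, and the titles and summaries of the first few results are appended to a base text. A minimal sketch of that shared step follows. The `SearchFn` interface, the `fake_search` stub and the choice to join the seeds with spaces into one query are assumptions made for illustration; no real search engine API is called.

```python
from typing import Callable, List, Sequence, Tuple

# A search function maps a query string to ranked (title, summary) pairs.
SearchFn = Callable[[str], List[Tuple[str, str]]]

def expand_context(base_text: str, seeds: Sequence[str], search: SearchFn, k: int) -> str:
    """Append the titles and summaries of the first k results to base_text."""
    results = search(" ".join(seeds))[:k]
    extended_text = " ".join(f"{title} {summary}" for title, summary in results)
    return f"{base_text} {extended_text}".strip()

def fake_search(query: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a real search engine call."""
    return [(f"title {i} for {query}", f"summary {i}") for i in range(1, 21)]

# Step 3: seeds are the entity mentions found in the query text (first m results).
mention_context = expand_context(
    "query text", ["Li Na", "Nadal", "Sister Na"], fake_search, k=8
)
# Step 4: seeds are the named entities found in the candidate's
# knowledge-base context text (first n results).
candidate_context = expand_context(
    "kb context", ["Li Na", "China", "France", "Paris", "Asia"], fake_search, k=10
)
```

Passing the search function in as a parameter keeps the expansion logic testable without network access and independent of any particular engine.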
The named entity linking method based on a search engine is characterized in that, in step 5, the method for calculating the matching degree between the entity mention and each candidate entity comprises:
for each candidate entity in the candidate entity list, calculating the cosine similarity a between the expanded context information of the entity mention and the expanded context information of the candidate entity, calculating the popularity b of the candidate entity, and calculating the matching degree between the entity mention and the candidate entity as $c = w_a \times a + w_b \times b$, wherein $w_a$ and $w_b$ are preset weights and $w_a + w_b = 1$; the popularity of a candidate entity is the ratio of the number of times the entity mention, appearing as a hyperlink, links to the page of that candidate entity to the total number of times it links to the pages of all candidate entities.
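The three quantities in step 5 can be sketched as follows. The sketch assumes that both expanded contexts have already been tokenized into word lists and uses raw term counts as the vector weights; the text fixes neither choice. The link counts in the usage lines are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b) -> float:
    """Cosine similarity a between two bag-of-words term-count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def popularity(link_counts, candidate) -> float:
    """Popularity b: share of the mention's hyperlinks that point to this candidate."""
    total = sum(link_counts.values())
    return link_counts.get(candidate, 0) / total if total else 0.0

def matching_degree(a: float, b: float, w_a: float, w_b: float) -> float:
    """Matching degree c = w_a * a + w_b * b, with w_a + w_b = 1."""
    return w_a * a + w_b * b

# Hypothetical link counts for one mention across its candidates.
counts = {"candidate A": 88, "candidate B": 12}
a = cosine_similarity(["li", "na", "tennis"], ["li", "na", "singer"])
b = popularity(counts, "candidate A")
c = matching_degree(a, b, w_a=0.6, w_b=0.4)
```

Because b does not depend on the context at all, the weight w_b controls how strongly a well-known candidate can win over one whose context matches better.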
Compared with the prior art, the invention has the beneficial effects that:
1. The invention employs a two-stage method of generating the candidate entity list. First, string similarity is calculated to obtain an initial candidate entity list. Then, the initial candidate entity list is screened by entity type classification to obtain the candidate entity list. This two-stage method can generate more candidate entities related to the entity mention, and it also reduces the complexity of disambiguation by removing some of the noisy candidate entities through entity type classification.
2. The key contribution of the invention is a method for expanding the context text information of the entity mention and of the candidate entities to solve the named entity linking problem. The context text of an entity mention is limited in length, and the entity context information contained in the knowledge base is also limited, so a similarity between the two contexts that depends on word co-occurrence information is not very accurate. The entity mention and the candidate entities are retrieved through a search engine to obtain the web pages related to them, and the expanded context information is obtained after these pages are collated, so that more entity features can be extracted and additional information is provided for improving the accuracy of entity linking.
Drawings
FIG. 1 is a flowchart of a named entity linking method based on a search engine according to the present invention.
Detailed Description
A named entity linking method based on a search engine comprises the following steps:
Step 1, segmenting an input query text into words, and identifying a group of named entities in the text that need to be linked to a knowledge base.
Step 2, for each named entity, obtaining a candidate entity list by means of the entity mention.
Step 3, expanding the context information of the entity mention by using a search engine.
Step 4, expanding the context information of each candidate entity in the candidate entity list by using a search engine.
Step 5, calculating the matching degree between the entity mention and each candidate entity.
Step 6, ranking the candidate entities by matching degree, and selecting the candidate entity with the maximum matching degree as the target entity for linking.
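Once steps 1 to 4 have produced a similarity and a popularity for every candidate, steps 5 and 6 reduce to a weighted score followed by a sort. The sketch below illustrates only that final part. The weights 0.6 and 0.4 and the values 0.53 and 0.88 for "Li Na (tennis player)" are the ones used in the embodiment; the second candidate and its values are hypothetical, and returning `None` for an empty candidate list is an assumption, since the text does not say how that case is handled.

```python
def rank_candidates(features, w_a=0.6, w_b=0.4):
    """Step 5 and the ranking part of step 6.

    features maps candidate name -> (cosine similarity a, popularity b).
    Returns (matching degree, name) pairs, best first.
    """
    scored = [(w_a * a + w_b * b, name) for name, (a, b) in features.items()]
    return sorted(scored, reverse=True)

def select_target(features, w_a=0.6, w_b=0.4):
    """Selection part of step 6: the candidate with the maximum matching degree."""
    ranked = rank_candidates(features, w_a, w_b)
    return ranked[0][1] if ranked else None  # None for no candidates is an assumption

features = {
    "Li Na (tennis player)": (0.53, 0.88),  # values from the embodiment
    "Li Na (singer)": (0.20, 0.10),         # hypothetical values
}
target = select_target(features)
```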
In step 1, the named entities are person names, organization names, place names and all other entities identified by a name, the entities being objects or concepts in real life. The query text is a text or a set of texts input by a user, comprising the entity mention and its context text, the text being of fixed length. The knowledge base is a set of entity entries, covering persons, organizations, places and all other entities identified by a name. An entity entry is the specific information stored in the knowledge base for an entity, comprising the context text, the entity attributes and the attribute values of the entity. The context information of an entity is the text that describes the entity in detail, and the entity mention is the noun in the query text that refers to the entity to be linked.
In step 2, the method for obtaining the candidate entities by means of the entity mention is as follows. Each entity entry in the knowledge base is taken out in turn, its name attribute value and alias attribute values are traversed, and the string similarity between the entity mention in the query text and the name attribute value and each alias attribute value of the entry is calculated. If the similarity between the entity mention and the name attribute value of the entry is greater than a preset threshold, the current entity entry is output to the candidate entity list, and if the similarity between the entity mention and an alias attribute value of the entry is greater than the preset threshold, the entity entry to which the alias belongs is output to the candidate entity list. The list obtained in this way is the initial candidate entity list, which is then screened to obtain the candidate entity list. The screening method is as follows: part of the entities in the knowledge base are extracted as a training set, and an entity-type classifier is trained on the labeled <entity, type> pairs. The entity-type classifier is then used to identify the type of each entity in the initial candidate entity list; the entities whose type is the same as that of the entity to be linked in the query text are retained, and the entities of a different type are deleted, to obtain the final candidate entity list. The type of an entity is its category attribute, including person, place, organization and time. An <entity, type> pair is a mapping from an entity to its type, wherein a pair with the correct mapping is a positive example for training the classifier and a pair with an incorrect mapping is a negative example. The entity-type classifier is a multi-class classification model trained on <entity, type> pairs; it assigns each entity to exactly one of several mutually exclusive classes, so that the type of an entity can be determined.
In step 3, the method for expanding the context information of the entity mention by using a search engine is as follows: the entity mentions of the group of named entities identified in the query text in step 1 are taken as seeds for retrieval in a search engine to obtain search result web pages related to the entity mention; the title and summary information of the first m search results is extracted from the search result web pages to obtain the extended text of the entity mention; and the query text and the extended text together form the expanded context information of the entity mention.
In step 4, the method for expanding the context information of each candidate entity in the candidate entity list by using a search engine is as follows: the context text in the knowledge base of each candidate entity in the candidate entity list is segmented into words; named entity recognition is performed on the segmented text to obtain a group of named entities related to the candidate entity; this group of named entities is taken as seeds for retrieval in a search engine to obtain search result web pages related to the candidate entity; the title and summary information of the first n search results is extracted from the search result web pages to obtain the extended text of the candidate entity; and the context text of the candidate entity in the knowledge base and the extended text together form the expanded context information of the candidate entity.
In step 5, the method for calculating the matching degree between the entity mention and each candidate entity is as follows: for each candidate entity in the candidate entity list, the cosine similarity a between the expanded context information of the entity mention and the expanded context information of the candidate entity is calculated, the popularity b of the candidate entity is calculated, and the matching degree between the entity mention and the candidate entity is calculated as $c = w_a \times a + w_b \times b$, where $w_a$ and $w_b$ are preset weights and $w_a + w_b = 1$. The popularity of a candidate entity is the ratio of the number of times the entity mention, appearing as a hyperlink, links to the page of that candidate entity to the total number of times it links to the pages of all candidate entities.
The specific embodiment is as follows:
This embodiment provides a named entity linking method based on a search engine. The steps of the method in this embodiment are described below with reference to FIG. 1:
(1) As shown in S101 of FIG. 1, the input query text is segmented into words and a group of named entities in the text that need to be linked to the knowledge base is identified. A user enters a query text, for example a sentence from a news item: "At Li Na's retirement ceremony Nadal presented flowers, and Sister Na burst into tears as the whole stadium cheered." The query text is first segmented into words, and a named entity recognition tool is then used to identify a group of named entities in it. Named entities include person names, organization names, place names and all other entities identified by a name; in a broader sense they also include numbers, dates, currencies and addresses.
Word segmentation splits the query text into words according to a Chinese dictionary by means of a word segmentation tool. The tool used in this embodiment is the NLPIR Chinese word segmentation tool (website: http://ictclas.nlpir.org/downloads) provided by the Institute of Computing Technology, Chinese Academy of Sciences. The set of words and their type labels obtained by segmentation is: W = {Li Na/nr retirement/vi ceremony/n Nadal/nr flower donation/vi ,/wd Sister Na/nr tear/n ben/v share/vg whole field/n cheer/v}, where "/nr", "/vi", "/n", "/wd", "/v" and "/vg" are the type labels of the words.
After segmentation, the set of entities contained in the segmented text is identified by named entity recognition: E = {Li Na, Nadal, Sister Na}. The named entity recognition method adopted in this embodiment extracts the words whose type label is nr, ns, nt or nz, based on the segmented words and their type labels, to form the named entity set. Here "Li Na", "Nadal" and "Sister Na" are the entity mentions of the identified named entities. This embodiment describes the named entity linking process by taking the linking of the entity mention "Li Na" as an example.
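The tag-based entity extraction described above is simple enough to sketch directly. The snippet assumes the segmenter output is available as a list of "word/tag" strings; the English glosses below stand in for the original Chinese tokens, and the function names are hypothetical.

```python
ENTITY_TAGS = {"nr", "ns", "nt", "nz"}  # the labels extracted in this embodiment

def parse_tagged(tokens):
    """Split each 'word/tag' token at its last slash into a (word, tag) pair."""
    pairs = []
    for token in tokens:
        word, sep, tag = token.rpartition("/")
        if sep and word:
            pairs.append((word, tag))
    return pairs

def extract_entities(pairs):
    """Keep words whose tag is an entity tag, without duplicates, in order."""
    seen, entities = set(), []
    for word, tag in pairs:
        if tag in ENTITY_TAGS and word not in seen:
            seen.add(word)
            entities.append(word)
    return entities

# English glosses of the segmented query text W.
W = ["Li Na/nr", "retirement/vi", "ceremony/n", "Nadal/nr", "flower donation/vi",
     ",/wd", "Sister Na/nr", "tear/n", "ben/v", "share/vg", "whole field/n", "cheer/v"]
E = extract_entities(parse_tagged(W))
print(E)
```

The match on the tag is exact, so a word tagged with a longer label such as nrf is not extracted.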
(2) As shown in S102 of FIG. 1, a candidate entity list is obtained for each named entity by means of the entity mention. Each entity entry in the knowledge base is taken out in turn, its name attribute value and alias attribute values are traversed, and the string similarity between the entity mention in the query text and the name attribute value and each alias attribute value of the entry is calculated. If the similarity between the entity mention and the name attribute value of the entry is greater than a preset threshold, the current entity entry is output to the candidate entity list, and if the similarity between the entity mention and an alias attribute value of the entry is greater than the preset threshold, the entity entry to which the alias belongs is output to the candidate entity list. The list obtained in this way is the initial candidate entity list. The knowledge base used in this embodiment is the Chinese knowledge base released by Tsinghua University (website: http://keg.cs.tsinghua.edu.cn/project/ChineseKB), which contains 800,000 different entities and their attributes. The name attribute value is contained in the rdf:about attribute of the <rdf:Description rdf:about=""></rdf:Description> tag, and the alias attributes are contained in <ont:alias></ont:alias> tags. The threshold is preset to 0.7 in this embodiment. For example, the initial candidate entity entries that satisfy the threshold when searching for the entity mention "Li Na" are shown in Table 1:
Table 1. Initial candidate entity list for the entity mention "Li Na" (table provided as an image in the original publication)
In Table 1, the entity name is the name attribute value of the entity entry extracted from the knowledge base, contained in the rdf:about attribute of the <rdf:Description rdf:about=""></rdf:Description> tag; the entity context is the context attribute value of the entry, contained in the <ont:ABSTRACT></ont:ABSTRACT> tag; and the other attributes are the remaining attribute values of the entry, contained in tags other than <rdf:Description rdf:about=""></rdf:Description> and <ont:ABSTRACT></ont:ABSTRACT>.
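Reading the name, aliases and context of each entry out of such RDF/XML could look roughly like the sketch below, which uses only the Python standard library. The `rdf` namespace URI is the standard one; the `ont` namespace URI and the sample entry are hypothetical, since the knowledge base's actual namespace declaration and data are not reproduced here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ONT_NS = "http://example.org/ont#"  # hypothetical; the real URI is not given in the text

# Hypothetical sample entry in the tag layout described above.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:ont="{ONT_NS}">
  <rdf:Description rdf:about="Li Na (tennis player)">
    <ont:alias>Li Na</ont:alias>
    <ont:alias>Sister Na</ont:alias>
    <ont:ABSTRACT>Chinese women's tennis player.</ont:ABSTRACT>
  </rdf:Description>
</rdf:RDF>"""

def load_entries(xml_text):
    """Return one dict per rdf:Description with its name, aliases and context text."""
    root = ET.fromstring(xml_text)
    entries = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        entries.append({
            "name": desc.get(f"{{{RDF_NS}}}about", ""),
            "aliases": [e.text or "" for e in desc.findall(f"{{{ONT_NS}}}alias")],
            "context": desc.findtext(f"{{{ONT_NS}}}ABSTRACT", default=""),
        })
    return entries

entries = load_entries(SAMPLE)
```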
The initial candidate entity list is then screened to obtain the candidate entity list, as follows. Part of the entities in the knowledge base are extracted as a training set, and an entity-type classifier is trained on the labeled <entity, type> pairs. In this embodiment, the training set is formed by randomly extracting 200 entity entries from the set of entity entries of each type, and the classifier model adopted is an SVM (Support Vector Machine). The entity-type classifier is then used to identify the type of each entity in the initial candidate entity list; the entities whose type is the same as that of the entity to be linked in the query text are retained, and the entities of a different type are deleted, to obtain the final candidate entity list. The type of an entity is its category attribute, including person, place, organization and time. The candidate entity list after screening is shown in Table 2:
Table 2. Candidate entity list for the entity mention "Li Na" (table provided as an image in the original publication)
(3) As shown in S103 of FIG. 1, the context information of the entity mention is expanded by using the search engine. Taking the entity mention "Li Na" as an example, the query text containing the mention is "At Li Na's retirement ceremony Nadal presented flowers, and Sister Na burst into tears as the whole stadium cheered." The entity mentions "Li Na", "Nadal" and "Sister Na" of the group of named entities identified in the query text are used as seeds for retrieval in the search engine to obtain search result web pages related to the entity mention "Li Na". The title and summary information of the first m search results is extracted from the search result web pages to obtain the extended text of the entity mention "Li Na" shown in Table 3, and the query text and the extended text together form the expanded context information of the entity mention "Li Na". In this embodiment, the search engine is Bing, the title and summary information of the search results is extracted with CSS selectors, and m = 8.
Table 3. Extended text of the entity mention "Li Na" (table provided as an image in the original publication)
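Extracting the title and summary of the first m results from a result page can be sketched with the standard library alone. The embodiment uses CSS selectors; the class-and-tag matching below is a simplified stand-in for them, and the markup it expects (`<li class="result">` containing an `<h2>` title and a `<p>` summary) is hypothetical rather than the real layout of any search engine.

```python
from html.parser import HTMLParser

class ResultExtractor(HTMLParser):
    """Collect [title, summary] for every <li class="result"> block (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.results = []        # one [title, summary] list per result block
        self._in_result = False  # True while inside a result block
        self._field = None       # 0 while reading a title, 1 while reading a summary

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "result" in classes:
            self._in_result = True
            self.results.append(["", ""])
        elif self._in_result and tag == "h2":
            self._field = 0
        elif self._in_result and tag == "p":
            self._field = 1

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self._field = None
        elif tag == "li":
            self._in_result = False

    def handle_data(self, data):
        if self._in_result and self._field is not None:
            self.results[-1][self._field] += data

def top_results(html: str, m: int = 8):
    """Return (title, summary) pairs for the first m result blocks."""
    parser = ResultExtractor()
    parser.feed(html)
    return [(title.strip(), summary.strip()) for title, summary in parser.results[:m]]

page = (
    '<ul><li class="result"><h2><a href="#">Li Na retires</a></h2>'
    "<p>Nadal presents flowers.</p></li>"
    '<li class="ad"><h2>Ad</h2><p>ignored</p></li>'
    '<li class="result"><h2>Second</h2><p>Summary two</p></li></ul>'
)
pairs = top_results(page)
```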
(4) As shown in S104 of fig. 1, the context information of each candidate entity in the candidate entity list is expanded using the search engine. Firstly, the context text of each candidate entity in the candidate entity list in the knowledge base is participled, in this embodiment, the context information of the candidate entity is included in the context text<ont:ABSTRACT></ont:ABSTRACT>In the label. Taking the candidate entity "Lina (tennis player)" as an example, the result of word segmentation on its context is W*= lina/nr (/ wkz 1982/t 2/t 26/t —/wp)/wky,/wd china/nr women/n tennis balls/n players/n. The term "the" term "refers to"/wj 2011/t 6/t 4/t,/wd is/p france/nr/ude 1 paris/nr west/fmon/vi tex/dhoo/a land/n/ude 1 roland/nrf · garland/bso/b si/b stadium/n in/f,/wd lisa/nr acquisition/v french/n lady/n champion/n. The/wj becomes/ude 1 Asian/nr/n/v has been the first/m/q of/dl since the history/v is obtained/v big/a grand/n tennis match/n events/n championship/n. And/wj, and performing named entity identification on the text after word segmentation to obtain a group of named entities related to the candidate entities. The named entity recognition method adopted in this embodiment is to extract a word with the type labeled nr, ns, nt, and nz to form a named entity set according to the word obtained by word segmentation and the type label thereof. The result of named entity recognition is E*Langa, china, france, paris, asia. 
Then, the group of named entities "Lina, China, France, Paris, Asia" is used as seeds for a search in the search engine, which returns a search result web page related to the candidate entity "Lina (tennis player)". The title and summary information of the first n search results is extracted from that page to obtain the extended text of the candidate entity "Lina (tennis player)" shown in Table 4. The context text of the candidate entity "Lina (tennis player)" in the knowledge base and the extended text together form the extended context information of the candidate entity "Lina (tennis player)". In this embodiment, the search engine is Bing, the title and summary information of the search results is extracted using CSS selectors, and n is 10.
Table 4. Extended text of the candidate entity "Lina (tennis player)"
[Table 4 body: images BDA0001576847430000111 and BDA0001576847430000121 in the original publication]
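The tag-based named entity recognition used in step (4) can be sketched as follows. The sketch assumes the segmenter emits whitespace-separated `word/tag` tokens and that tags are matched exactly, which is consistent with the example above where the token tagged nrf does not appear in E*. The sample string is a shortened illustrative gloss, not the full segmentation result, and the function name is an assumption.

```python
from typing import List

# Type labels that the embodiment treats as named entities.
ENTITY_TAGS = {"nr", "ns", "nt", "nz"}


def extract_named_entities(segmented: str) -> List[str]:
    """Collect words whose tag is in ENTITY_TAGS, keeping first-seen order."""
    entities: List[str] = []
    for token in segmented.split():
        # Split on the last "/" so that the tag is always the final component.
        word, sep, tag = token.rpartition("/")
        if sep and word and tag in ENTITY_TAGS and word not in entities:
            entities.append(word)
    return entities


# Shortened, illustrative sample in word/tag form (assumption).
sample = (
    "Lina/nr ,/wd China/nr women/n tennis/n player/n ./wj "
    "France/nr Paris/nr Roland/nrf stadium/n Lina/nr Asia/nr"
)
entities = extract_named_entities(sample)
```

Repeated entities are kept only once, so the resulting list can be used directly as the seed set for the search.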
(5) As shown in S105 of Fig. 1, the matching degree between the entity designation and each candidate entity is calculated. For each candidate entity in the candidate entity list, the cosine similarity a between the extended context information of the entity designation and the extended context information of the candidate entity is calculated, the popularity b of the candidate entity is calculated, and the matching degree between the entity designation and the candidate entity is calculated as $c = w_a \times a + w_b \times b$, where $w_a$ and $w_b$ are preset weights with $w_a + w_b = 1$. The popularity of a candidate entity is the ratio of the number of times the entity designation, used as a hyperlink, links to the page of that candidate entity to the total number of times it links to the pages of all candidate entities. In this embodiment, $w_a = 0.6$ and $w_b = 0.4$. Taking the matching degree between the entity designation "Lina" of step (3) and the candidate entity "Lina (tennis player)" as an example, the cosine similarity a between the extended context information of the entity designation "Lina" and that of the candidate entity "Lina (tennis player)" is 0.53, and the popularity b of the candidate entity "Lina (tennis player)" is 0.88, so the matching degree is $c = 0.6 \times 0.53 + 0.4 \times 0.88 = 0.67$. Table 5 lists the matching degree between the entity designation "Lina" and each candidate entity.
Table 5. Matching degree between the entity designation "Lina" and each candidate entity
Candidate entity | Cosine similarity | Candidate entity popularity | Matching degree
Lina (tennis player) | 0.53 | 0.88 | 0.67
Lina (singer) | 0.43 | 0.056 | 0.28
Lina (professor north) | 0.39 | 0 | 0.23
Li Naying | 0.19 | 0.31 | 0.24
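The computation in step (5) can be sketched as follows. The source does not state how the extended context information is vectorized, so the sketch assumes term-frequency vectors over whitespace-separated tokens. The function names and the link-count dictionary passed to `popularity` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of term-frequency vectors (vectorization is an assumption)."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def popularity(link_counts: Dict[str, int], candidate: str) -> float:
    """Share of the designation's hyperlinks that point to the candidate's page."""
    total = sum(link_counts.values())
    return link_counts.get(candidate, 0) / total if total else 0.0


def matching_degree(a: float, b: float, w_a: float = 0.6, w_b: float = 0.4) -> float:
    """c = w_a * a + w_b * b, with the weights of the embodiment as defaults."""
    return w_a * a + w_b * b


# Worked example from the text: a = 0.53 and b = 0.88 for "Lina (tennis player)".
c = matching_degree(0.53, 0.88)
```

With a = 0.53 and b = 0.88, `c` rounds to 0.67, which agrees with the first row of Table 5.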
(6) As shown in S106 of Fig. 1, the candidate entities are ranked by matching degree, and the candidate entity with the highest matching degree is selected as the target entity for linking. In this embodiment, Table 5 is sorted in descending order of the matching degree between the entity designation and each candidate entity to obtain Table 6. The first candidate entity in Table 6 has the highest matching degree, so the target entity linked to the entity designation "Lina" is "Lina (tennis player)".
Table 6. Candidate entities for the entity designation "Lina", ranked by matching degree in descending order
Candidate entity | Cosine similarity | Candidate entity popularity | Matching degree
Lina (tennis player) | 0.53 | 0.88 | 0.67
Lina (singer) | 0.43 | 0.056 | 0.28
Li Naying | 0.19 | 0.31 | 0.24
Lina (professor north) | 0.39 | 0 | 0.23
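The ranking of step (6) amounts to a descending sort on the matching degree. A minimal sketch using the matching degrees listed in Table 5 follows; the variable names are illustrative.

```python
# (candidate entity, matching degree) pairs in the order of Table 5.
candidates = [
    ("Lina (tennis player)", 0.67),
    ("Lina (singer)", 0.28),
    ("Lina (professor north)", 0.23),
    ("Li Naying", 0.24),
]

# Sort by matching degree, largest first, to reproduce the order of Table 6.
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)

# The first entry is the target entity that the designation is linked to.
target_entity = ranked[0][0]
```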

Claims (1)

1. A named entity linking method based on a search engine, characterized in that the method comprises the following steps:
step 1, performing word segmentation on an input query text, and identifying a group of named entities in the text that need to be linked to a knowledge base;
step 2, obtaining a candidate entity list for each named entity from its entity designation;
step 3, expanding the context information of the entity designation by utilizing a search engine;
step 4, expanding the context information of each candidate entity in the candidate entity list by utilizing a search engine;
step 5, calculating the matching degree between the entity designation and each candidate entity;
step 6, sorting the candidate entities according to their matching degrees, and selecting the candidate entity with the maximum matching degree as the target entity for linking;
in step 1:
the named entities are person names, organization names, place names and all other entities identified by a name, an entity being an object or concept in real life;
the query text is a text or a text set input by a user, and comprises an entity designation and its context text, wherein the text length is a fixed length;
the knowledge base is a set of entity entries, covering persons, organizations, places and all other entities identified by a name;
an entity entry is the specific information stored in the knowledge base for an entity, and comprises the context text, the entity attributes and the attribute values of the entity; the context information of an entity is text that describes the entity in detail; and the entity designation is the noun phrase in the query text that refers to the entity to be linked;
in step 2, the method for obtaining the candidate entities from the entity designation comprises the following steps:
taking out each entity entry in the knowledge base in turn, traversing the name attribute value and the alias attribute values of the entity entry, and calculating the string similarity between the entity designation in the query text and the name attribute value and alias attribute values of the entity entry; if the similarity between the entity designation in the query text and the name attribute value of the entity entry is greater than a preset threshold, outputting the current entity entry to a candidate entity list, and if the similarity between the entity designation in the query text and an alias attribute value of the entity entry is greater than the preset threshold, outputting the entity entry corresponding to that alias attribute to the candidate entity list;
an initial candidate entity list is obtained through the above method, and the candidate entity list is obtained by screening the initial candidate entity list, as follows:
extracting part of the entities in the knowledge base as a training set, training an entity-type classifier on a set of labeled <entity, type> pairs, identifying the type of each entity in the candidate entity list with the entity-type classifier, retaining the entities of the same type as the entity to be linked in the query text, and deleting the entities of a different type, to obtain the final candidate entity list;
the type of an entity is its category attribute, including person, place, organization and time; an <entity, type> pair is a mapping between an entity and its type, wherein a correctly mapped <entity, type> pair is a positive example for training the classifier and an incorrectly mapped pair is a negative example; the entity-type classifier is a multi-class classification model trained on the set of <entity, type> pairs, with which the type of an entity can be determined;
in step 3, the method for expanding the context information of the entity designation by using the search engine comprises the following steps:
using the entity designations of the group of named entities identified in the query text in step 1 as seeds for a search in a search engine to obtain a search result web page related to the entity designation, and extracting the title and summary information of the first m search results from the search result web page to obtain the extended text of the entity designation, wherein the query text and the extended text together form the extended context information of the entity designation;
in step 4, the method for expanding the context information of each candidate entity in the candidate entity list by using the search engine comprises the following steps:
performing word segmentation on the context text, in the knowledge base, of each candidate entity in the candidate entity list, performing named entity recognition on the segmented text to obtain a group of named entities related to the candidate entity, then using the group of named entities as seeds for a search in a search engine to obtain a search result web page related to the candidate entity, and extracting the title and summary information of the first n search results from the search result web page to obtain the extended text of the candidate entity, wherein the context text of the candidate entity in the knowledge base and the extended text together form the extended context information of the candidate entity;
in step 5, the method for calculating the matching degree between the entity designation and each candidate entity comprises:
for each candidate entity in the candidate entity list, calculating the cosine similarity a between the extended context information of the entity designation and the extended context information of the candidate entity, calculating the popularity b of the candidate entity, and calculating the matching degree between the entity designation and the candidate entity as $c = w_a \times a + w_b \times b$, wherein $w_a$ and $w_b$ are preset weights and $w_a + w_b = 1$; the popularity of the candidate entity is the ratio of the number of times the entity designation, used as a hyperlink, links to the page of the candidate entity to the total number of times it links to the pages of all candidate entities.
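The candidate generation recited in step 2 can be sketched as follows. The claim does not name a specific string similarity measure, so the sketch substitutes the ratio from Python's `difflib.SequenceMatcher` as a stand-in. The knowledge base layout (dictionaries with `name` and `aliases` keys), the toy entries and the threshold of 0.8 are all assumptions made for illustration; the type-based screening of the initial list is not shown.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def string_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the source does not fix a measure."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_entities(designation: str, knowledge_base: List[Dict], threshold: float) -> List[str]:
    """Return entries whose name or any alias is similar enough to the designation."""
    candidates: List[str] = []
    for entry in knowledge_base:
        # Traverse the name attribute value and all alias attribute values.
        values = [entry["name"]] + entry.get("aliases", [])
        if any(string_similarity(designation, v) > threshold for v in values):
            candidates.append(entry["name"])
    return candidates


# Toy knowledge base with hypothetical alias values.
knowledge_base = [
    {"name": "Lina (tennis player)", "aliases": ["Lina", "Sister Na"]},
    {"name": "Lina (singer)", "aliases": ["Lina"]},
    {"name": "Nadal", "aliases": ["Rafael Nadal"]},
]
initial_candidates = candidate_entities("Lina", knowledge_base, threshold=0.8)
```

Each entry is added at most once even when both its name and an alias exceed the threshold, which keeps the initial candidate list free of duplicates.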
CN201810138076.9A 2018-02-10 2018-02-10 Named entity linking method based on search engine Active CN108415902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810138076.9A CN108415902B (en) 2018-02-10 2018-02-10 Named entity linking method based on search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810138076.9A CN108415902B (en) 2018-02-10 2018-02-10 Named entity linking method based on search engine

Publications (2)

Publication Number Publication Date
CN108415902A CN108415902A (en) 2018-08-17
CN108415902B true CN108415902B (en) 2021-10-26

Family

ID=63127161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810138076.9A Active CN108415902B (en) 2018-02-10 2018-02-10 Named entity linking method based on search engine

Country Status (1)

Country Link
CN (1) CN108415902B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408802A (en) * 2018-08-28 2019-03-01 厦门快商通信息技术有限公司 A kind of method, system and storage medium promoting sentence vector semanteme
CN109783797A (en) * 2018-12-11 2019-05-21 北京百度网讯科技有限公司 Abstracting method, device, equipment and the storage medium of semantic relation
CN109635297B (en) * 2018-12-11 2022-01-04 湖南星汉数智科技有限公司 Entity disambiguation method and device, computer device and computer storage medium
CN110110046B (en) * 2019-04-30 2021-10-01 北京搜狗科技发展有限公司 Method and device for recommending entities with same name
CN110222156B (en) * 2019-06-14 2021-11-16 北京百度网讯科技有限公司 Method and device for discovering entity, electronic equipment and computer readable medium
CN110298039B (en) * 2019-06-20 2023-05-30 北京百度网讯科技有限公司 Event place identification method, system, equipment and computer readable storage medium
CN110245357B (en) * 2019-06-26 2023-05-02 北京百度网讯科技有限公司 Main entity identification method and device
CN110990524A (en) * 2019-10-24 2020-04-10 清华大学 Academic achievement organization naming disambiguation method and device based on reliable information base
CN112825112A (en) * 2019-11-20 2021-05-21 阿里巴巴集团控股有限公司 Data processing method and device and computer terminal
CN111104520B (en) * 2019-11-21 2023-06-30 新华智云科技有限公司 Personage entity linking method based on personage identity
CN111737430B (en) * 2020-06-16 2024-04-05 北京百度网讯科技有限公司 Entity linking method, device, equipment and storage medium
CN111737422B (en) * 2020-08-14 2020-12-29 科大讯飞(苏州)科技有限公司 Entity linking method and device, electronic equipment and storage medium
CN112560485B (en) * 2020-11-24 2023-04-14 北京三快在线科技有限公司 Entity linking method and device, electronic equipment and storage medium
CN112989235B (en) * 2021-03-09 2023-08-01 北京百度网讯科技有限公司 Knowledge base-based inner link construction method, device, equipment and storage medium
CN113535883B (en) * 2021-07-16 2023-10-31 携程旅游信息技术(上海)有限公司 Commercial venue entity linking method, system, electronic equipment and storage medium
CN116258138B (en) * 2023-03-15 2024-01-02 北京百度网讯科技有限公司 Knowledge base construction method, entity linking method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224648A (en) * 2015-09-29 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of entity link method and system
CN105912625A (en) * 2016-04-07 2016-08-31 北京大学 Linked data oriented entity classification method and system
CN107145545A (en) * 2017-04-18 2017-09-08 东北大学 Top k zone users text data recommends method in a kind of location-based social networks
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460327B2 (en) * 2006-07-28 2019-10-29 Palo Alto Research Center Incorporated Systems and methods for persistent context-aware guides
EP2765783A1 (en) * 2013-02-11 2014-08-13 Thomson Licensing Method and device for enriching a multimedia content defined by a timeline and a chronological text description

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"English-Korean Cross-lingual Link Discovery Using Link Probability and Named Entity Recognition"; 강신재; Journal of Korean Institute of Intelligent Systems; 2013-06-01; Vol. 23, No. 3; pp. 103-107 *
"实体识别与链接的研究与实现"; 郑迪; 中文信息学报; 2017-03-15; Vol. 31, No. 2; pp. 36-41 *

Also Published As

Publication number Publication date
CN108415902A (en) 2018-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant