CN110008352B - Entity discovery method and device - Google Patents

Info

Publication number
CN110008352B
Authority
CN
China
Prior art keywords
entity
candidate
entities
designated
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910242996.XA
Other languages
Chinese (zh)
Other versions
CN110008352A (en
Inventor
徐程程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910242996.XA priority Critical patent/CN110008352B/en
Publication of CN110008352A publication Critical patent/CN110008352A/en
Application granted granted Critical
Publication of CN110008352B publication Critical patent/CN110008352B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a method and a device for discovering an entity, wherein the method comprises the following steps: acquiring entity candidate data of at least one data source; selecting candidate entities from each entity according to entity parameters of each entity included in the entity candidate data; if the candidate entity is contained in the designated entity set, extracting entity characteristics of at least one designated entity including the candidate entity from the designated entity set; determining a target entity from the at least one designated entity according to the entity characteristics of the at least one designated entity, and determining at least one associated entity of the target entity from the designated entity set based on the association relationship between the target entity and other designated entities in the designated entity set; and generating a target entity set according to the target entity and at least one associated entity of the target entity. By adopting the embodiment of the application, the hot entity can be found in time, the recall rate and the recall efficiency of the hot entity are improved, and the applicability is high.

Description

Entity discovery method and device
Technical Field
The present application relates to the field of data processing, and in particular, to a method and an apparatus for discovering an entity.
Background
Knowledge graphs need to keep their knowledge comprehensive and up to date. Once the full knowledge graph construction pipeline is in place, the automatic discovery and downloading of entities becomes an important entry point for keeping the knowledge automatically updated. A website typically has many new entities appearing every day, but the prior art can only find the entities shown on the homepage, so the recall of hot entities is insufficient. Meanwhile, many existing but important entities in the knowledge graph need to be downloaded regularly for updating. Their crawling rules cannot be found effectively through configuration or manual operation, and updating all entities would occupy too many resources to be practical, so much of the knowledge is not timely.
Disclosure of Invention
The embodiment of the application provides an entity discovery method and device, which can discover hot entities in time, improve the recall rate and recall efficiency of the hot entities, and have high applicability.
In a first aspect, an embodiment of the present application provides a method for entity discovery, where the method includes:
acquiring entity candidate data of at least one data source;
selecting candidate entities from each entity according to entity parameters of each entity included in the entity candidate data;
if the candidate entity is contained in a designated entity set, extracting entity characteristics of at least one designated entity including the candidate entity from the designated entity set;
determining a target entity from the at least one designated entity according to the entity characteristics of the at least one designated entity, and determining at least one associated entity of the target entity from the designated entity set based on the association relationship between the target entity and other designated entities in the designated entity set;
and generating a target entity set according to the target entity and the at least one associated entity of the target entity.
The embodiment of the application can find the popular entities in time, can improve the recall rate and recall efficiency of the popular entities by determining the target entities and the associated entities of the target entities, and has high applicability.
With reference to the first aspect, in one possible implementation, the method further includes:
and if the candidate entity is not contained in the designated entity set, generating a target entity set according to the candidate entity and each designated entity contained in the designated entity set.
The embodiment of the application can find, in time, entities that are not contained in the designated entity set, improves the recall rate and recall efficiency of entities, and has strong applicability.
With reference to the first aspect, in one possible implementation manner, the data source includes at least one of a news channel, a search log, and a social platform; the obtaining of the entity candidate data of at least one data source includes:
acquiring one or more items of data among news headlines, news abstracts and news texts in a news channel, and determining the acquired data as entity candidate data; and/or
acquiring a search record in a search log, and determining the acquired search record as entity candidate data; and/or
acquiring discussion topics in the social platform, and determining the acquired discussion topics as entity candidate data.
The embodiment of the application can find the entity in time, increases the diversity of data sources, further can improve the recall rate of the hot entity, and has high flexibility and strong applicability.
With reference to the first aspect, in one possible implementation, the method further includes:
identifying and extracting each entity included in the entity candidate data based on a named entity identification algorithm;
and determining entity parameters respectively corresponding to the entities from the entity candidate data.
The embodiment of the application can improve the accuracy of entity identification, further increase the recall rate and accuracy of hot entities, and has strong applicability.
With reference to the first aspect, in a possible implementation manner, the entity parameter includes any one of an entity occurrence number, an entity update number, and an entity browsing number; the selecting a candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data includes:
if the entity candidate data comprises one or more first entities from a single data source, determining a first entity with an entity parameter greater than or equal to a first preset entity parameter threshold value in the one or more first entities as a candidate entity;
if the entity candidate data comprises one or more second entities from at least two data sources, the entity parameters of any second entity in each data source are summed, and the second entity with the sum of the entity parameters being greater than or equal to a second preset entity parameter threshold is determined as the candidate entity.
The embodiment of the application can increase the recall rate of the hot entity, and has high flexibility and wide application range.
With reference to the first aspect, in a possible implementation manner, the entity parameter includes an entity data source number; the selecting a candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data includes:
and determining one or more third entities from at least two data sources from the entity candidate data, and determining the third entities of which the entity data source quantity is not less than a preset data source quantity threshold value from the one or more third entities as candidate entities.
The embodiment of the application can increase the recall rate of the hot entity, and has high flexibility and wide application range.
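As an illustrative, non-limiting sketch, the selection by entity data source number described above might look as follows. The function name `select_by_source_count`, the input shape (a mapping from a data source name to the set of entities found in it), the source names, and the threshold value are assumptions made for illustration and are not taken from the application.

```python
from collections import defaultdict


def select_by_source_count(entities_by_source, min_sources):
    """Return the entities that appear in at least `min_sources` data sources."""
    sources_per_entity = defaultdict(set)
    for source, entities in entities_by_source.items():
        for entity in entities:
            sources_per_entity[entity].add(source)
    # "Not less than the preset data source number threshold" maps to >=.
    return {
        entity
        for entity, sources in sources_per_entity.items()
        if len(sources) >= min_sources
    }


# Hypothetical sample data: three data sources and the entities seen in each.
entities_by_source = {
    "news:entertainment": {"A", "B"},
    "search_log:browser_a": {"A", "C"},
    "social:topics": {"A", "B"},
}
candidates = select_by_source_count(entities_by_source, min_sources=2)
```

With this sample data, an entity seen in only one data source is dropped, while entities seen in two or more data sources are kept as candidate entities.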
With reference to the first aspect, in a possible implementation manner, the entity characteristics include at least two of an entity importance value, an entity data source number, an entity attribute number, an entity occurrence number, an entity update number, and an entity browsing number; the determining a target entity from the at least one designated entity according to the entity characteristics of the at least one designated entity includes:
respectively carrying out normalization processing on each entity characteristic of any one designated entity in the at least one designated entity to obtain the normalized entity characteristic corresponding to each designated entity;
inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting a target entity included in the at least one designated entity based on the entity classification model;
the entity classification model is obtained by training a linear model and/or a nonlinear model, and is able to identify entities whose heat (popularity) is greater than or equal to a preset heat threshold.
The embodiment of the application can improve the recall rate and the accuracy of the hot entity, is not easy to make mistakes, is simple and convenient to operate and has strong applicability.
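The normalization and classification steps above may be sketched as follows, purely as an illustration. Min-max scaling is one possible normalization and a logistic score is one possible linear model; the application does not specify either. The function names, the feature order, the weights, the bias, and the heat threshold are all hypothetical, and the weights are assumed to have been trained beforehand.

```python
import math


def min_max_normalize(rows):
    """Scale each feature column to [0, 1] across the given designated entities."""
    names = list(rows)
    n_features = len(next(iter(rows.values())))
    lows = [min(rows[n][j] for n in names) for j in range(n_features)]
    highs = [max(rows[n][j] for n in names) for j in range(n_features)]
    normalized = {}
    for n in names:
        normalized[n] = [
            # A constant column carries no information; map it to 0.0.
            0.0 if highs[j] == lows[j]
            else (rows[n][j] - lows[j]) / (highs[j] - lows[j])
            for j in range(n_features)
        ]
    return normalized


def heat_score(features, weights, bias):
    """Logistic score in (0, 1); the weights are assumed to be pre-trained."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def pick_targets(rows, weights, bias, heat_threshold):
    """Return the designated entities whose score reaches the heat threshold."""
    normalized = min_max_normalize(rows)
    return [
        name
        for name, features in normalized.items()
        if heat_score(features, weights, bias) >= heat_threshold
    ]


# Hypothetical features: (importance value, data source number, occurrence number).
rows = {
    "entity_a": [0.9, 3, 120],
    "entity_b": [0.2, 1, 4],
    "entity_c": [0.6, 2, 60],
}
targets = pick_targets(rows, weights=[2.0, 1.0, 2.0], bias=-2.5, heat_threshold=0.5)
```

Normalizing first keeps a large-valued feature such as the occurrence number from dominating a small-valued one such as the importance value when the features are combined.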
With reference to the first aspect, in a possible implementation manner, the determining, from the set of specified entities, at least one associated entity of the target entity based on an association relationship between the target entity and other specified entities in the set of specified entities includes:
acquiring a target entity type of the target entity and determining a related entity type set of the target entity type;
determining one or more designated entities of which the entity types are contained in the associated entity type set from each designated entity which is included in the designated entity set and has the associated relationship with the target entity;
and determining the determined one or more designated entities as the associated entities of the target entity.
The embodiment of the application can increase the recall rate of the hot entity, improve the recall efficiency of the hot entity, and has the advantages of simple and convenient operation, high flexibility and strong applicability.
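A minimal sketch of the associated-entity determination above is given below, assuming the designated entity set is available as a list of related entity pairs plus a type label per entity. The function name `associated_entities`, the data shapes, the sample entities, and the choice to treat relations as undirected are assumptions for illustration only.

```python
def associated_entities(target, entity_types, relations, associated_types):
    """Return neighbors of `target` whose type is in the associated type set."""
    # Associated entity type set for the target's own type.
    allowed = associated_types.get(entity_types[target], set())
    neighbors = set()
    for head, tail in relations:
        # Relations are treated as undirected here (an assumption).
        if head == target:
            neighbors.add(tail)
        elif tail == target:
            neighbors.add(head)
    return {entity for entity in neighbors if entity_types[entity] in allowed}


# Hypothetical designated entity set.
entity_types = {
    "film_x": "film",
    "director_y": "person",
    "actor_z": "person",
    "city_w": "place",
}
relations = [
    ("film_x", "director_y"),
    ("actor_z", "film_x"),
    ("film_x", "city_w"),
    ("director_y", "city_w"),
]
# For a film, only persons count as associated entity types in this example.
associated_types = {"film": {"person"}}

related = associated_entities("film_x", entity_types, relations, associated_types)
```

Here the place entity is related to the film but is excluded because its type is not in the associated entity type set of the film type.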
In a second aspect, an embodiment of the present application provides an apparatus for entity discovery, where the apparatus includes:
the candidate data acquisition module is used for acquiring entity candidate data of at least one data source;
a candidate entity determining module, configured to select a candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data determined by the candidate data obtaining module;
an entity feature extraction module, configured to extract, if the candidate entity determined by the candidate entity determination module is included in a designated entity set, an entity feature of at least one designated entity including the candidate entity from the designated entity set;
a target entity determining module, configured to determine a target entity from the at least one designated entity according to the entity feature of the at least one designated entity determined by the entity feature extracting module, and determine at least one associated entity of the target entity from the designated entity set based on an association relationship between the target entity and other designated entities in the designated entity set;
a first entity set generating module, configured to generate a target entity set according to the target entity determined by the target entity determining module and the at least one associated entity of the target entity.
With reference to the second aspect, in a possible implementation manner, the apparatus further includes:
a second entity set generating module, configured to generate a target entity set according to the candidate entity and each designated entity included in the designated entity set if the candidate entity determined by the candidate entity determining module is not included in the designated entity set.
With reference to the second aspect, in one possible implementation manner, the data source includes at least one of a news channel, a search log, and a social platform; the candidate data acquisition module is specifically configured to:
acquiring one or more items of data among news headlines, news abstracts and news texts in a news channel, and determining the acquired data as entity candidate data; and/or
acquiring a search record in a search log, and determining the acquired search record as entity candidate data; and/or
acquiring discussion topics in the social platform, and determining the acquired discussion topics as entity candidate data.
With reference to the second aspect, in a possible implementation manner, the apparatus further includes:
and the entity identification module is used for identifying and extracting each entity and the entity parameter of each entity in the entity candidate data based on a named entity identification algorithm.
With reference to the second aspect, in a possible implementation manner, the entity parameter includes any one of an entity occurrence number, an entity update number, and an entity browsing number; the candidate entity determining module is specifically configured to:
if the entity candidate data comprises one or more first entities from a single data source, determining a first entity of the one or more first entities with an entity parameter greater than or equal to a first preset entity parameter threshold value as a candidate entity;
if the entity candidate data comprises one or more second entities from at least two data sources, the entity parameters of any second entity in each data source are summed, and the second entity with the entity parameter sum larger than or equal to a second preset entity parameter threshold is determined as the candidate entity.
With reference to the second aspect, in a possible implementation manner, the entity parameter includes an entity data source number; the candidate entity determining module is specifically configured to:
and determining one or more third entities from at least two data sources from the entity candidate data, and determining the third entities of which the entity data source quantity is not less than a preset data source quantity threshold value from the one or more third entities as candidate entities.
With reference to the second aspect, in a possible implementation manner, the entity characteristics include at least two of an entity importance value, an entity data source number, an entity attribute number, an entity occurrence number, an entity update number, and an entity browsing number; the target entity determining module includes:
the target entity discovering unit is used for normalizing each entity characteristic of each designated entity in the at least one designated entity, so as to obtain the normalized entity characteristics corresponding to each designated entity;
inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting a target entity included in the at least one designated entity based on the entity classification model;
the entity classification model is obtained by training a linear model and/or a nonlinear model, and is able to identify entities whose heat (popularity) is greater than or equal to a preset heat threshold.
With reference to the second aspect, in a possible implementation manner, the target entity determining module includes:
the associated entity discovery unit is used for acquiring the target entity type of the target entity and determining an associated entity type set of the target entity type;
determining one or more designated entities of which the entity types are contained in the associated entity type set from each designated entity which is included in the designated entity set and has the associated relationship with the target entity;
and determining the determined one or more designated entities as the associated entities of the target entity.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor and a memory, and the processor and the memory are connected to each other. The memory is configured to store a computer program that supports the terminal device to execute the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect, where the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect.
The embodiment of the application has the following beneficial effects:
based on the obtained entity candidate data of at least one data source, a candidate entity can be determined according to entity parameters of each entity included in the entity candidate data, if the candidate entity is included in the designated entity set, entity features of at least one designated entity including the candidate entity can be extracted from the designated entity set, and a target entity can be determined according to the entity features. At least one associated entity of the target entity can be determined by utilizing the association relationship between the target entity and other specified entities in the specified entity set, and the target entity set is finally generated, so that the entities can be found in time, the recall rate and the recall efficiency of the entities can be improved, and the applicability is high.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of an entity discovery method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of data sources provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of entity features provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of first-degree relationship diffusion provided by an embodiment of the present application;
FIG. 5 is a schematic view of an application scenario of first-degree relationship diffusion according to an embodiment of the present application;
FIG. 6 is a schematic diagram of second-degree relationship diffusion provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of third-degree relationship diffusion provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an entity discovery apparatus provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The entity discovery method provided in the embodiments of the present application may be widely applied to hot (trending) entity update, hot entity recall, or hot entity discovery in various knowledge graphs (Knowledge Graph) or entity-relationship models (ERDs). For convenience of description, hot entity update, hot entity recall, or hot entity discovery in a knowledge graph is used as the example below. The knowledge graph is a concept proposed by Google in 2012 and is essentially a semantic network; for ease of understanding, it can be viewed as a multi-relational graph (Multi-relational Graph). In data structures, a graph (Graph) consists of nodes (Vertex) and edges (Edge), but an ordinary graph usually contains only one type of node and one type of edge, whereas a multi-relational graph usually contains multiple types of nodes and multiple types of edges. In a knowledge graph, each node represents an entity (Entity) and each edge represents a relation (Relation) between entities. An entity refers to a thing in the real world, such as a person name, a place name, an organization name, a concept, or a proper noun. A relation expresses a certain connection between different entities, for example: a person "lives in" Beijing; Zhang San and Li Si are "friends"; logistic regression is "prerequisite knowledge" for deep learning. The hot entities referred to here generally fall into two types: one is an entity that has been mentioned frequently in a recent period, such as a movie star or a popular television show; the other is a relatively important entity whose knowledge is updated frequently, such as a movie star or a variety show.
The method provided by the embodiment of the present application may be performed by a terminal device or a system for performing trending entity update, trending entity recall, or trending entity discovery in a knowledge graph, where the terminal device includes, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like, and is not limited herein. For convenience of description, the following description will be given taking a terminal device as an example.
The method provided by the embodiment of the application can determine the candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data based on the acquired entity candidate data of at least one data source, and if the candidate entity is included in a specified entity set (for example, a certain knowledge graph), can extract the entity feature of at least one specified entity including the candidate entity from the specified entity set, and can determine the target entity (for example, the popular entity) according to the entity feature. And determining at least one associated entity of the target entity by using the association relationship between the target entity and other specified entities in the specified entity set, and finally generating the target entity set (such as a hot entity set). By adopting the method provided by the embodiment of the application, the hot entity can be found in time, the recall rate and the recall efficiency of the hot entity can be improved, and the applicability is high.
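The overall flow described above can be outlined in a few lines, as a simplified sketch only. The function name `build_target_set` and the callables `is_target` and `find_associated` are hypothetical stand-ins for the entity classification step and the associated-entity step. Adding a candidate entity that is absent from the designated (specified) entity set directly to the result is a simplifying assumption; the application states only that the target entity set is generated from that candidate entity and the designated entities.

```python
def build_target_set(candidates, designated, is_target, find_associated):
    """Assemble a target entity set from candidate entities.

    candidates:      iterable of candidate entity names
    designated:      set of entities already in the designated entity set
    is_target:       callable(entity) -> bool, stand-in for the classifier
    find_associated: callable(entity) -> set, stand-in for relation lookup
    """
    target_set = set()
    for candidate in candidates:
        if candidate in designated:
            # Known entity: keep it only if it is classified as a target
            # entity, then pull in its associated entities.
            if is_target(candidate):
                target_set.add(candidate)
                target_set |= find_associated(candidate)
        else:
            # Entity not yet in the designated set (simplified handling).
            target_set.add(candidate)
    return target_set


# Hypothetical inputs.
designated = {"A", "B", "C"}
candidates = ["A", "B", "N"]
result = build_target_set(
    candidates,
    designated,
    is_target=lambda entity: entity == "A",
    find_associated=lambda entity: {"C"},
)
```

In this example, "A" is kept together with its associated entity "C", "B" is dropped because it is not classified as a target entity, and "N" is kept because it is new to the designated entity set.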
The method and the related apparatus provided by the embodiments of the present application are described in detail below with reference to FIG. 1 to FIG. 9. The method may include the following data processing stages: acquiring entity candidate data; determining candidate entities based on the entity parameters in the entity candidate data; determining a target entity based on the entity features of designated entities extracted from a designated entity set; determining the associated entities of the target entity based on the association relationships between entities; and generating the target entity set. For the implementation of each data processing stage, refer to the implementation shown in FIG. 1 below.
Referring to fig. 1, fig. 1 is a schematic flowchart of an entity discovery method according to an embodiment of the present application. The method provided by the embodiment of the application can comprise the following steps 101 to 104:
101. Acquire entity candidate data of at least one data source, and select candidate entities from the entities according to the entity parameter of each entity included in the entity candidate data.
In some possible implementations, an entity generally does not exist independently of text; in other words, entities are usually contained in text. Therefore, to increase the recall rate of hot entities and to ensure comprehensive knowledge updates and diversity among the recalled entities, data in web page, log, text, and/or table form can be selected from multiple data sources as entity candidate data. The data sources include, but are not limited to, one or more of a news channel, a search log, and a social platform. The news channel is the preferred data source: news is timely, authentic, and accurate, so using data acquired from news channels as entity candidate data improves the timeliness and validity of that data and gives higher applicability. Referring to FIG. 2, FIG. 2 is a schematic diagram of data sources provided by an embodiment of the present application. The news channels include entertainment channels, technology channels, military channels, sports channels, and the like. The search logs include search logs of the QQ browser, search logs of the TT browser, or search logs of any other browser or search engine. The social platforms may include microblogs, forums, discussion groups, and the like, as determined by the actual application scenario, which is not limited herein. Specifically, by acquiring one or more of the news titles, news abstracts, and news texts in a news channel, the acquired news titles, news abstracts, and/or news texts can be determined as entity candidate data. By acquiring the search records and the returned search results in a search log, the acquired search records and returned search results can be determined as entity candidate data.
For example, if the search log contains the search record "who directed XX" and the returned search result is "Guo X", then the search record "who directed XX" and its returned search result "Guo X" may be used as entity candidate data. By acquiring the discussion topics of users on a social platform, the acquired discussion topics can be determined as entity candidate data. A discussion topic may be a popular topic whose number of discussions exceeds a preset discussion count threshold, whose number of reads exceeds a preset read count threshold, or which is pinned at the top of the topic board. By acquiring various data from different data sources as entity candidate data and extracting entities from it, the data sources of the entity candidate data become more diverse and its content richer, so the recall rate of hot entities can be increased and the knowledge graph made more complete.
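The three conditions for a popular discussion topic mentioned above can be sketched as a simple filter. The `Topic` record, its field names, the sample topics, and the threshold values are hypothetical and serve only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    title: str
    discussions: int      # number of times the topic was discussed
    reads: int            # number of times the topic was read
    pinned: bool = False  # whether the topic sits at the top of the board


def popular_topics(topics, min_discussions, min_reads):
    """Keep topics that exceed either threshold or are pinned at the top."""
    return [
        topic.title
        for topic in topics
        if topic.discussions > min_discussions
        or topic.reads > min_reads
        or topic.pinned
    ]


# Hypothetical topics from a social platform.
topics = [
    Topic("t1", discussions=500, reads=1000),
    Topic("t2", discussions=10, reads=90000),
    Topic("t3", discussions=5, reads=20, pinned=True),
    Topic("t4", discussions=5, reads=20),
]
selected = popular_topics(topics, min_discussions=100, min_reads=50000)
```

The titles returned by such a filter would then be treated as entity candidate data alongside the news and search log data.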
In some possible embodiments, besides entities, the entity candidate data also contains words of parts of speech that carry no entity meaning, such as verbs, adjectives, quantifiers, auxiliary words, and interjections. Each entity included in the entity candidate data may therefore be identified based on a Named Entity Recognition (NER) algorithm, where the identified entities include person names, place names, organization names, proper nouns, and the like, as determined by the actual application scenario, which is not limited herein. Optionally, because using the NER algorithm alone may sometimes miss some entities, a word segmentation technique and/or a relation extraction technique may additionally be used to identify the entities included in the entity candidate data, which improves the recall rate and accuracy of the entities so that all the entities included in the entity candidate data can be obtained. It is understood that, with the development of the mobile internet and the continuous upgrading of business requirements, the data generated by information flows is growing explosively, so the number of entities extracted from the entity candidate data is huge. Therefore, to reduce the workload of subsequently classifying hot entities, all the acquired entities can first be coarsely screened (preliminarily filtered) based on entity parameters.
Generally, the more often an entity is mentioned, searched, viewed, or updated, the more popular it is and the more likely it is to be a hot entity. Therefore, the counted number of entity occurrences, entity updates, or entity views can be used as the entity parameter, and the entity parameter of each entity can be compared with a preset entity parameter threshold to select candidate hot entities, referred to as candidate entities for convenience of description. By setting the entity parameter threshold, all entities whose entity parameter is smaller than the threshold can be filtered out; the operation is simple and not error-prone.
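Counting entity occurrences and applying the coarse filter may be sketched as follows. The function `recognize_entities` is a deliberately naive stand-in (exact string matching against a hypothetical list of known names) used in place of a real NER, word segmentation, or relation extraction component; the sample texts, names, and threshold are likewise assumptions.

```python
from collections import Counter


def recognize_entities(text, known_names):
    """Naive stand-in for NER: report a known name once per occurrence."""
    found = []
    for name in known_names:
        found.extend([name] * text.count(name))
    return found


def count_occurrences(texts, known_names):
    """Count how often each recognized entity occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(recognize_entities(text, known_names))
    return counts


def coarse_filter(counts, threshold):
    """Drop entities whose occurrence count is below the threshold."""
    return {entity for entity, n in counts.items() if n >= threshold}


# Hypothetical entity candidate data and entity names.
texts = [
    "Alice Chen stars in Star Voyage",
    "Star Voyage tops the box office",
    "Who directed Star Voyage",
]
known_names = ["Alice Chen", "Star Voyage", "Bob Li"]

counts = count_occurrences(texts, known_names)
kept = coarse_filter(counts, threshold=2)
```

The same pattern applies if the entity parameter is the number of updates or the number of views instead of the number of occurrences; only the counting step changes.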
It is understood that all entities extracted based on the NER algorithm and/or the word segmentation technique and/or the relation extraction technique come from the entity candidate data of the respective data sources. Therefore, by counting the entity parameter of each extracted entity and storing it in the entity candidate data, candidate entities can be selected from the entities by comparing the entity parameter of each entity included in the entity candidate data with the preset entity parameter threshold, where the entity parameter includes, but is not limited to, any one of the number of entity occurrences, the number of entity updates, and the number of entity views. Specifically, if the entity candidate data includes one or more entities from a single data source (for convenience of description, referred to as first entities), the first entities whose entity parameter is greater than or equal to a first preset entity parameter threshold may be determined as candidate entities. Here, the first preset entity parameter threshold may be set based on an empirical value, or may be determined based on the number of entities appearing in the entity candidate data and the magnitude of each entity's parameter, so as to screen out, from the large number of entities included in the entity candidate data, the candidate entities of greater interest to users. The single data source means a single data source of a given type. For example, assuming that the news channels include an entertainment channel, a technology channel, a military channel, and a sports channel, the first entity may be any entity in news data that comes only from the entertainment channel, or any entity in news data that comes only from the sports channel.
Optionally, if the entity candidate data includes one or more entities from at least two data sources (for convenience of description, the entities from at least two data sources may be represented by second entities), the entity parameters of any second entity in each data source may be summed, and a second entity whose sum of the entity parameters is greater than or equal to a second preset entity parameter threshold may be determined as the candidate entity. It will be appreciated that a hot entity is typically present in more than one data source, and thus the hot nature of the entity is more apparent by summing the entity parameters in each data source. Here, the second predetermined entity parameter threshold may be set based on actual conditions, and in general, since the second entity is an entity from a plurality of data sources, the entity parameter of the second entity integrates the entity parameter of the same second entity in the plurality of data sources, and the entity parameter is generally larger, the set second predetermined entity parameter threshold may be larger than the first predetermined entity parameter threshold. The at least two data sources include different data sources of the same type or different data sources of different types. For example, assuming that the news channels include an entertainment channel, a science and technology channel, a military channel, and a sports channel, and the browser includes an a browser, a B browser, and a C browser, if the second entity is from different data sources of the same type, the second entity may be any entity in the news data from the entertainment channel and the sports channel at the same time, and if the second entity is from different data sources of different types, the second entity may be any entity in the news data from the entertainment channel and the search log of the a browser at the same time. 
It is understood that, since the entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity, and the number of views of the entity, if the entity parameter includes the number of occurrences of the entity, the first preset number of occurrences of the entity (i.e., the first preset entity parameter threshold) may be smaller than the second preset number of occurrences of the entity (i.e., the second preset entity parameter threshold), and if the entity parameter includes the number of updates of the entity, the first preset number of updates of the entity (i.e., the first preset entity parameter threshold) may be smaller than the second preset number of updates of the entity (i.e., the second preset entity parameter threshold).
Optionally, if the entity candidate data includes one or more entities from at least two data sources (for convenience of description, the entities from at least two data sources may be represented by the second entity), the maximum entity parameter of the entity parameters of any second entity may be obtained, where the maximum entity parameter may also be used to measure the heat degree of any second entity, and then the second entity with the maximum entity parameter greater than or equal to the first preset entity parameter threshold may be determined as the candidate entity.
For example, assume that the entity parameters of each entity in the entity candidate data include the number of entity occurrences, and that candidate entities are determined based on the number of entity occurrences and a first preset entity occurrence threshold. When candidate entities are selected based on the entity parameters of one or more first entities from a single data source in the entity candidate data, a first entity whose number of entity occurrences is greater than or equal to the first preset entity occurrence threshold may be determined as a candidate entity. For example, assume that the first preset entity occurrence threshold is 300, that entity 1 is from news data of an entertainment channel and occurs 500 times, that entity 2 is from news data of a sports channel and occurs 203 times, and that entity 3 is from a search log of browser A and occurs 150 times. Entity 1, whose number of entity occurrences (i.e., 500) is greater than the first preset entity occurrence threshold (i.e., 300), may then be determined as a candidate entity.
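The single-data-source screening in the example above can be sketched as follows. This is only an illustrative sketch: the function name `select_single_source_candidates` and the dictionary layout are assumptions made for the illustration and are not part of the embodiment.

```python
def select_single_source_candidates(occurrences, first_threshold):
    """Keep entities whose entity parameter is >= the first preset threshold.

    `occurrences` maps an entity name to its number of occurrences in a
    single data source (hypothetical layout chosen for this sketch).
    """
    return [name for name, count in occurrences.items() if count >= first_threshold]


# Values taken from the worked example: the threshold is 300.
occurrences = {"entity 1": 500, "entity 2": 203, "entity 3": 150}
candidates = select_single_source_candidates(occurrences, first_threshold=300)
print(candidates)  # ['entity 1']
```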
For another example, assume that the entity parameters of each entity in the entity candidate data include the number of entity occurrences, and that candidate entities are determined based on the number of entity occurrences and a second preset entity occurrence threshold. When candidate entities are selected based on the entity parameters of one or more second entities from at least two data sources in the entity candidate data, the numbers of occurrences of any second entity in each data source may be summed, and a second entity whose summed number of occurrences is greater than or equal to the second preset entity occurrence threshold is determined as a candidate entity. For example, assume that the second preset entity occurrence threshold is 1000. Entity 4 is from news data of the entertainment channel and the sports channel, where it occurs 800 and 700 times, respectively, so the sum of the occurrences of entity 4 is 1500. Entity 5 is from news data of the entertainment channel and the search log of browser A, where it occurs 630 and 270 times, respectively, so the sum of the occurrences of entity 5 is 900. Thus entity 4, whose summed number of occurrences (i.e., 1500) is greater than or equal to the second preset entity occurrence threshold (i.e., 1000), may be determined as a candidate entity.
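The multi-source case, in which the entity parameters of one entity are summed over its data sources before comparison with the second preset threshold, might be sketched as below. The function name `select_by_summed_parameter` and the nested-dictionary layout are hypothetical and serve only this illustration.

```python
def select_by_summed_parameter(per_source_counts, second_threshold):
    """Sum each entity's parameter over its data sources and keep the
    entities whose sum is >= the second preset threshold.

    `per_source_counts` maps entity name -> {data source name: count}
    (an assumed layout for this sketch).
    """
    totals = {name: sum(by_source.values()) for name, by_source in per_source_counts.items()}
    selected = [name for name, total in totals.items() if total >= second_threshold]
    return selected, totals


# Values taken from the worked example: the threshold is 1000.
per_source_counts = {
    "entity 4": {"entertainment channel": 800, "sports channel": 700},
    "entity 5": {"entertainment channel": 630, "browser A search log": 270},
}
selected, totals = select_by_summed_parameter(per_source_counts, second_threshold=1000)
print(totals)    # {'entity 4': 1500, 'entity 5': 900}
print(selected)  # ['entity 4']
```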
Optionally, in addition to the number of entity occurrences, entity updates, or entity views, the heat of an entity may also be measured by the number of entity data sources. Generally, the more data sources an entity comes from, the more popular the entity is, so the entity parameter may also include the number of entity data sources. Specifically, one or more entities from at least two data sources may be determined from the entity candidate data (for convenience of description, such entities may be represented by third entities), and a third entity whose number of data sources is not less than a preset data source number threshold is determined as a candidate entity, where the data source number threshold may be set based on an empirical value or the actual situation, and is not limited herein. For example, assume that the preset data source number threshold is 3. Entity 6 is from news data of the entertainment channel and the sports channel, i.e., the number of entity data sources of entity 6 is 2. Entity 7 is from news data of the entertainment channel and the sports channel and also from the search log of browser A, i.e., the number of entity data sources of entity 7 is 3. Accordingly, entity 7, whose number of entity data sources (i.e., 3) is greater than or equal to the preset data source number threshold (i.e., 3), may be determined as a candidate entity.
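Screening by the number of entity data sources can be illustrated with a short sketch. The name `select_by_source_count` and the representation of each entity's data sources as a Python set are assumptions made for the illustration.

```python
def select_by_source_count(entity_sources, source_count_threshold):
    """Keep entities that appear in at least `source_count_threshold`
    distinct data sources.

    `entity_sources` maps entity name -> set of data source names
    (an assumed layout for this sketch).
    """
    return [
        name
        for name, sources in entity_sources.items()
        if len(sources) >= source_count_threshold
    ]


# Values taken from the worked example: the threshold is 3.
entity_sources = {
    "entity 6": {"entertainment channel", "sports channel"},
    "entity 7": {"entertainment channel", "sports channel", "browser A search log"},
}
print(select_by_source_count(entity_sources, source_count_threshold=3))  # ['entity 7']
```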
Optionally, in order to increase the recall rate and accuracy of the trending entities, multi-level screening of the entities may also be set. For example, two-level filtering may be provided, where the condition of the two-level filtering may be that when the sum of the entity parameters of any second entity is less than a second preset entity parameter threshold, a maximum entity parameter of the entity parameters of any second entity is obtained, and if the maximum entity parameter is not less than the first preset entity parameter threshold, the second entity may be determined as a candidate entity. The screening mode can select the entities from a plurality of data sources, wherein the sum of the entity parameters is less than a second preset entity parameter threshold value, and the entities are used as candidate entities, so that the preliminary filtering of non-hot entities is realized, and the entities which are potential or possibly hot entities can be reserved. For another example, similarly, two-level filtering is also provided, the filtering condition may be set to obtain the number of the entity data sources of any second entity when the sum of the entity parameters of the any second entity is smaller than a second preset entity parameter threshold, and if the number of the entity data sources is not smaller than the preset data source number threshold, the second entity may be determined as a candidate entity. The screening method can also select entities from a plurality of data sources, wherein the sum of the entity parameters is smaller than a second preset entity parameter threshold value, and the entities are used as candidate entities, so that preliminary filtering of non-trending entities is realized, and entities which are potential or possibly trending entities can be reserved.
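The first two-level screening variant described above (the sum of the entity parameters first, then the maximum entity parameter as a fallback) could be sketched as follows. The function name `two_level_filter`, the data layout, and the entry "entity X" are hypothetical; the thresholds 300 and 1000 and the counts for entities 4 and 5 are reused from the earlier examples purely for illustration.

```python
def two_level_filter(per_source_counts, first_threshold, second_threshold):
    """Two-level screening sketch.

    Level 1: keep an entity if the sum of its per-source parameters is
             >= second_threshold.
    Level 2: otherwise keep it if its largest per-source parameter is
             >= first_threshold.
    """
    candidates = []
    for name, by_source in per_source_counts.items():
        values = list(by_source.values())
        if sum(values) >= second_threshold:
            candidates.append(name)   # passes level 1
        elif max(values) >= first_threshold:
            candidates.append(name)   # retained by level 2
    return candidates


per_source_counts = {
    "entity 4": {"entertainment channel": 800, "sports channel": 700},
    "entity 5": {"entertainment channel": 630, "browser A search log": 270},
    "entity X": {"entertainment channel": 120, "sports channel": 90},  # hypothetical
}
print(two_level_filter(per_source_counts, first_threshold=300, second_threshold=1000))
# ['entity 4', 'entity 5']
```

In this sketch entity 5 fails the sum test (900 < 1000) but is retained because its maximum per-source count (630) is not less than the first threshold (300), while the hypothetical entity X is filtered out at both levels.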
102. If the candidate entity is contained in the designated entity set, extracting entity features of at least one designated entity including the candidate entity from the designated entity set.
In some possible embodiments, as information is updated ever faster, new entities may appear, where a new entity is an entity appearing for the first time. In the embodiment of the present application, in order to ensure comprehensiveness of knowledge, if the candidate entity is a new entity, the new entity may be determined as a target entity (a hot entity). Specifically, the candidate entity is matched one by one against each designated entity included in the designated entity set to determine whether the candidate entity is included in the designated entity set, where the designated entity set may be a knowledge graph, an ERD, or the like, which is not limited herein. If no designated entity identical to the candidate entity is matched in the designated entity set, the candidate entity is a new entity, and a target entity set (hot entity set) can therefore be generated from the candidate entity and each designated entity included in the designated entity set. If a designated entity identical to the candidate entity is matched in the designated entity set, the entity features of the candidate entity are extracted; in this case the candidate entity is a designated entity that comes from one of the data sources and is included in the designated entity set.
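The matching step can be illustrated with a minimal sketch that separates candidates already present in the designated entity set from new entities. Exact matching by name, the function name `match_candidates`, and the example entity names are all assumptions of this sketch; the embodiment does not specify how the one-by-one matching is implemented.

```python
def match_candidates(candidates, designated_entities):
    """Split candidates into those found in the designated entity set
    (whose entity features are extracted next) and new entities
    (treated directly as hot entities).

    Matching is done by exact name equality, which is an assumption.
    """
    matched = [c for c in candidates if c in designated_entities]
    new_entities = [c for c in candidates if c not in designated_entities]
    return matched, new_entities


designated_entities = {"entity 1", "entity 4"}      # hypothetical designated entity set
candidates = ["entity 1", "entity 9"]               # "entity 9" is hypothetical
matched, new_entities = match_candidates(candidates, designated_entities)
print(matched)       # ['entity 1']
print(new_entities)  # ['entity 9']
```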
Referring to fig. 3, fig. 3 is a schematic diagram of entity features provided in the embodiment of the present application. The entity features of each designated entity may include at least two of an entity importance degree distinction value, an entity data source number, an entity attribute number, an entity occurrence number, an entity update number, and an entity browsing number, and each entity feature may be counted in advance and stored in the designated entity set. Here, the entity importance degree distinction value is an index for measuring the importance of an entity, in particular for entities that have the same name but different meanings and therefore different importance. For example, "Ma Yun" may refer to one person named Ma Yun or to a different person, a singer, with the same name, so the importance of the two entities can be distinguished by the entity importance degree distinction value. In general, the entity importance degree distinction value may be an integer in the range from 0 to 1000, and a higher value indicates that the entity is more important. The number of entity data sources refers to the number of original websites from which the entity and its knowledge can be extracted. For example, a corresponding introduction page for the entity "Zhang Mou yi" can be found on websites of an encyclopedia, a clan, a post and the like, and the number of entity data sources of the entity "Zhang Mou yi" is the number of such linked websites. It will be understood that the number of entity data sources also indirectly reflects the importance or heat of the entity; generally, the greater the number of entity data sources, the greater the importance or heat of the entity. The number of entity attributes refers to the number of other entities having an association relationship with the entity; generally, the greater the number of entity attributes, the higher the importance or heat of the entity.
The number of entity occurrences refers to the number of occurrences, or the sum of the occurrences, of the entity in entity candidate data from one or more data sources. The number of entity updates refers to the total number of times the entity has been updated; it is understood that the more often an entity has been updated in the past, the more likely it is to continue to be updated in the future. The number of entity views refers to the number of times the entity has been viewed; it is understood that the more often an entity is viewed, the more popular it is. For convenience of description, the entity features mentioned below all include the 6 feature values of entity importance degree distinction value, entity data source number, entity attribute number, entity occurrence number, entity update number, and entity browsing number.
Optionally, in some possible embodiments, the designated entity set also contains some designated entities that do not come from any data source in the current operation but exist in the designated entity set and are marked, so that such designated entities (marked designated entities, such as entities of certain specific categories) may also be used as objects of entity feature extraction. This has the advantage that the recall rate of entities can be increased and the properties of the entities themselves are taken into account. It is understood that only some designated entities, rather than all of them, are marked in the designated entity set because the knowledge information of designated entities such as the "term class" or "word class" is usually not updated, whereas the knowledge updating of designated entities in specific categories, such as the "software class", "product class", "person class", "movie and TV series class" or "novel class", is often complicated. Such designated entities can therefore be marked or labeled as important entities, so that the entity features of the designated entities marked in advance can be extracted later.
103. Determining a target entity from the at least one designated entity according to the entity features of the at least one designated entity, and determining at least one associated entity of the target entity from the designated entity set based on the association relationship between the target entity and other designated entities in the designated entity set.
In some possible embodiments, obtaining the entity features of at least one designated entity yields the entity importance degree distinction value, the entity data source number, the entity attribute number, the entity occurrence number, the entity update number, the entity browsing number, and the like included in the entity features of each designated entity. It is understood that the 6 feature values extracted by the above steps represent the importance or heat of the designated entity from different aspects, but in practical applications an entity does not need all 6 feature values to be relatively large in order to be hot. In other words, some hot entities may have some feature values that are large while others are small. Therefore, to improve the accuracy of the determination while making the determination process more operable and its result more reliable, the 6 feature values may be input into an entity classification model, and the target entity (i.e., the hot entity) included in the at least one designated entity may be output based on the entity classification model. It is understood that determining whether a designated entity is the target entity is a binary classification problem, and that the construction of the entity classification model may include data processing stages such as acquisition of modeling data, training of the entity classification model, and testing of the entity classification model. The modeling data of the entity classification model may be entity features from a knowledge graph or ERD, where the classification result may be set to 1 or 0.
When the entity classification model is trained, information feature pairs each composed of entity features and a classification result may be input into an initial network model of the entity classification model. The initial network model may be a linear model, such as logistic regression or a Support Vector Machine (SVM), or a non-linear model, such as a tree-based model, a Gradient Boosting Decision Tree (GBDT), a random forest, or the like, which may be determined according to the actual application scenario and is not limited herein. The initial network model learns the entity features and classification results included in the input information feature pairs, so as to construct an entity classification model capable of outputting the corresponding classification result when the entity features of any designated entity are input. After the entity classification model is constructed, several groups of entity features with known classification results can be collected as test data of the entity classification model. Each group of test data is input into the constructed entity classification model, and the classification result output by the model is compared with the actual classification result of the designated entity. If, over the groups of test data, the proportion of cases in which the classification result output by the entity classification model is the same as the actual classification result of the designated entity is greater than or equal to a preset precision, the constructed entity classification model meets the construction requirement; otherwise it does not, and training of the entity classification model continues until the constructed entity classification model meets the requirement.
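The acceptance test described above, in which model outputs on the test data are compared with the known classification results against a preset precision, can be sketched as follows. The function name `meets_preset_precision` and the sample labels are hypothetical; the classifier itself (logistic regression, SVM, GBDT, random forest, etc.) is left out of the sketch and only its outputs are assumed to be available.

```python
def meets_preset_precision(predicted, actual, preset_precision):
    """Return True if the share of test samples whose predicted class
    (1 = hot entity, 0 = not hot) equals the known class is at least
    `preset_precision`."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual) >= preset_precision


# Hypothetical outputs of an already trained model on four test samples.
predicted = [1, 0, 1, 1]
actual = [1, 0, 0, 1]
print(meets_preset_precision(predicted, actual, preset_precision=0.7))  # True  (3/4 = 0.75)
print(meets_preset_precision(predicted, actual, preset_precision=0.8))  # False
```

If the check fails, training would continue, as the paragraph above describes, until the model meets the requirement.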
Optionally, in the field of machine learning, different evaluation indexes (that is, different feature values in the entity features are the different evaluation indexes) often have different dimensions and dimension units, which may affect the result of data analysis, for example, the number of entity data sources of a given entity is generally dozens, the number of entity occurrences is generally thousands, and in order to eliminate the dimension influence among the feature values, data standardization processing is required to solve the comparability among the data indexes. In other words, the data standardization processing is carried out on the original data, so that each index is in the same order of magnitude, and comprehensive comparison and evaluation can be carried out subsequently. The most typical data normalization processing method is data normalization processing, and the normalized data can be limited to a certain range (such as [0,1] or [ -1,1 ]). In this embodiment, a range transform method or a 0-mean normalization method may be used to normalize each of the feature values (e.g., 6 feature values) included in the entity features of any specified entity, such as the entity importance differentiation value, the entity data source number, the entity attribute number, the entity occurrence number, the entity update number, and the entity browsing number, to obtain the normalized entity features corresponding to each specified entity. The target entities (trending entities) included in the at least one designated entity may be output based on the entity classification model by inputting the normalized 6-item feature values into the entity classification model. 
Here, when the entity classification model is constructed, the collected modeling data should also be data after normalization processing, and the specific modeling process can be described in the previous paragraph, which is not described herein again, so that the complexity of data processing in the construction process of the entity classification model can be reduced, and the data processing efficiency can be improved.
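The two normalization methods mentioned above can be illustrated with a short sketch. Here the range transform is taken to mean min-max scaling to [0, 1] and 0-mean normalization is taken to mean z-score standardization using the population standard deviation; these readings, the function names, and the sample values are assumptions of the sketch.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Range transform: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical; no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def zero_mean_normalize(values):
    """0-mean normalization (z-score): subtract the mean and divide by
    the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one feature (e.g. entity data source number)
# across three designated entities.
source_counts = [10, 20, 30]
print(min_max_normalize(source_counts))    # [0.0, 0.5, 1.0]
print(zero_mean_normalize(source_counts))  # values centered on 0
```

Each of the 6 feature values would be normalized separately across the designated entities, so that features on the order of dozens and features on the order of thousands end up on a comparable scale before being input into the entity classification model.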
Optionally, in some possible embodiments, the number of hot entities (i.e., target entities) is often limited. Therefore, in order to improve the recall rate and recall efficiency of hot entities, the embodiments of the present application further obtain more hot entities through relationship diffusion. Here, the relationship diffusion is generally one-degree relationship diffusion, in which the entities obtained from a given entity are those closely or directly related to that entity. For example, taking a person's social circle (here specifically friends) as an illustration, the people who have a one-degree relationship with a person are that person's own, most familiar friends. Extending the relationship to friends of friends through a friend's introduction gives a two-degree relationship, extending it further to friends of friends of friends gives a three-degree relationship, and so on, which is not limited herein.
Referring to fig. 4, fig. 4 is a schematic view of a one-degree relationship diffusion provided in the embodiment of the present application, and the associated entities of the popular entities obtained through the one-degree relationship diffusion are associated entity 1, associated entity 2, and associated entity 3. Specifically, when identifying and extracting each entity included in the entity candidate data based on the NER algorithm, each extracted entity may be further classified. For example, we extract the entity "Zhang Mou yi" from news data of an entertainment channel and mark the entity type as "people class", or we extract the entity "wandering XX" from news data of an entertainment channel and mark the entity type as "movie & drama class". The designated entity set comprises each associated entity type set corresponding to each entity type. One entity type corresponds to a related entity type set, one related entity type set comprises at least one diffusible entity type of the entity type, and the diffusible entity type included in the related entity type set can be preset by a user. Therefore, the associated entity type set of the target entity type can be determined from the specified entity set by acquiring the target entity type to which the popular entity belongs from the specified entity set, wherein the associated entity type set of the target entity type comprises at least one diffusible entity type of the target entity type. For example, assuming that the target entity type is "person class", the set of associated entity types of the target entity type is set 1, wherein the diffusible entity types included in set 1 are "person class" and "movie and television series class". 
By determining the entity types to which all the designated entities having the association with the hot entity (target entity) respectively belong, the obtained entity type of each designated entity can be respectively compared with the diffusible entity types in the association entity type set, so as to determine the association between the entity type of each designated entity and the association entity type set of the target entity type, and if the entity type of any designated entity is contained in the association entity type set of the target entity type, any designated entity is determined as the associated entity of the hot entity.
For example, referring to fig. 5, fig. 5 is a schematic view of an application scenario of one-degree relationship diffusion provided in the embodiment of the present application. Assume that a certain hot entity is determined to be "Zhang Lin", whose target entity type is "person class". The associated entity type set of the target entity type "person class" may be determined as set 1, where the diffusible entity types included in set 1 are "person class" and "movie and television series class". In the designated entity set, the designated entities having a one-degree relationship with the hot entity "Zhang Lin" include the wife "Yuan Mouyi", the coworker "Zhou Mouhao", the birthday "27/8/1971", the native place "Hong Kong", and the acting work "anti-storm 4". The entity type of "Yuan Mouyi" and "Zhou Mouhao" is "person class", the entity type of "27/8/1971" is "date", the entity type of "Hong Kong" is "place name", and the entity type of "anti-storm 4" is "movie and television series class". After the entity types of all the designated entities having a one-degree relationship with the hot entity "Zhang Lin" are determined, the associated entities of the hot entity "Zhang Lin" are determined to be "Yuan Mouyi", "Zhou Mouhao" and "anti-storm 4" according to the relationship between the entity type of each designated entity and the diffusible entity types "person class" and "movie and television series class" included in set 1.
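The type-filtered one-degree diffusion in this example might be sketched as follows. The adjacency-list representation of the designated entity set, the function name `one_degree_associated_entities`, and the variable names are assumptions made for the illustration; the entity names and entity types follow the example above.

```python
def one_degree_associated_entities(target, neighbors, entity_types, diffusible_types):
    """Return the one-degree neighbors of `target` whose entity type is in
    the associated entity type set of the target's own entity type."""
    allowed = diffusible_types.get(entity_types[target], set())
    return [n for n in neighbors.get(target, []) if entity_types.get(n) in allowed]


# Associated entity type set ("set 1") for the "person class" target type.
diffusible_types = {"person class": {"person class", "movie and television series class"}}

# One-degree relationships of the hot entity in the designated entity set.
neighbors = {
    "Zhang Lin": ["Yuan Mouyi", "Zhou Mouhao", "27/8/1971", "Hong Kong", "anti-storm 4"],
}
entity_types = {
    "Zhang Lin": "person class",
    "Yuan Mouyi": "person class",
    "Zhou Mouhao": "person class",
    "27/8/1971": "date",
    "Hong Kong": "place name",
    "anti-storm 4": "movie and television series class",
}

print(one_degree_associated_entities("Zhang Lin", neighbors, entity_types, diffusible_types))
# ['Yuan Mouyi', 'Zhou Mouhao', 'anti-storm 4']
```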
Optionally, in some possible embodiments, in addition to taking the entity corresponding to the determined first degree relationship of the trending entity as the associated entity, an entity corresponding to the second degree relationship and/or the third degree relationship of the trending entity may also be determined as the associated entity. Referring to fig. 6, fig. 6 is a schematic diagram of two-degree relationship diffusion provided in the embodiment of the present application, and the associated entities of the popular entities obtained through the two-degree relationship diffusion are associated entity 4, associated entity 5, associated entity 6, and associated entity 7. Referring to fig. 7, fig. 7 is a schematic diagram of three-degree relationship diffusion provided in the embodiment of the present application, and the associated entities of the popular entities obtained through the three-degree relationship diffusion are an associated entity 8, an associated entity 9, and an associated entity 10. It is understood that the first degree relation is the entity most closely related to the hot entity, the second degree relation is the entity less closely related to the hot entity, the degree of closeness between the third degree relation and the hot entity is lower than that between the second degree relation and the hot entity, and the specific implementation manner of finding the associated entity based on the second degree relation and/or the third degree relation diffusion is shown in the implementation process of the first degree relation diffusion, and is not described herein again.
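Diffusion to two-degree and three-degree relationships can be viewed as a breadth-first traversal of the association relationships up to a chosen depth. The following is a minimal sketch under that reading; the function name `entities_within_degree`, the adjacency-list layout, and the small example graph are hypothetical, and the entity-type filtering described for one-degree diffusion is omitted for brevity.

```python
from collections import deque


def entities_within_degree(hot_entity, neighbors, max_degree):
    """Breadth-first traversal from `hot_entity`; returns a dict mapping
    each reached entity to its relationship degree (1, 2, ..., max_degree)."""
    degree_of = {hot_entity: 0}
    queue = deque([hot_entity])
    while queue:
        current = queue.popleft()
        if degree_of[current] == max_degree:
            continue  # do not expand beyond the requested degree
        for nxt in neighbors.get(current, []):
            if nxt not in degree_of:
                degree_of[nxt] = degree_of[current] + 1
                queue.append(nxt)
    del degree_of[hot_entity]  # the hot entity itself is not an associated entity
    return degree_of


# Hypothetical association relationships in a designated entity set.
neighbors = {
    "hot entity": ["A1", "A2"],
    "A1": ["A4"],
    "A4": ["A8"],
}
print(entities_within_degree("hot entity", neighbors, max_degree=2))
# {'A1': 1, 'A2': 1, 'A4': 2}
```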
104. Generating a target entity set according to the target entity and at least one associated entity of the target entity.
In some possible embodiments, after determining the trending entity (target entity) and at least one associated entity of the trending entity, all of the trending entities and associated entities of the trending entity may be utilized together to generate a target entity set, and for convenience of description, all entities included in the target entity set may be collectively referred to as a final trending entity. It should be understood that the target entity set may further include a Uniform Resource Locator (URL) address of at least one data source corresponding to each final hot entity, and by extracting the URL address of each data source corresponding to each final hot entity in the target entity set, entity candidate data corresponding to each final hot entity may be obtained, and the knowledge update on the final hot entity is completed based on the entity candidate data.
In the embodiment of the application, based on the named entity recognition algorithm, each entity included in the entity candidate data, together with the entity parameters corresponding to each entity, can be recognized and extracted from the entity candidate data of data sources such as news channels, search logs and/or social platforms. The entity parameters include any one of the number of entity occurrences, the number of entity updates, the number of entity views, and the number of entity data sources. Candidate entities can be determined from the entities according to the size relationship between the entity parameters of each entity and the entity parameter threshold. If a candidate entity is contained in the designated entity set, entity features of at least one designated entity including the candidate entity can be extracted from the designated entity set, where the entity features include the entity importance degree distinction value, the entity data source number, the entity attribute number, the entity occurrence number, the entity update number, and the entity browsing number. By inputting each normalized entity feature into the entity classification model, the target entity included in the at least one designated entity can be output based on the entity classification model. At least one associated entity of the target entity is then determined using the association relationship between the target entity and other designated entities in the designated entity set, and the target entity set is finally generated. By adopting the method provided by the embodiment of the application, hot entities can be found in time, the recall rate and accuracy of hot entities are improved, and the applicability is high.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an entity discovery apparatus according to an embodiment of the present application. The apparatus for entity discovery provided by the embodiment of the application comprises:
a candidate data obtaining module 31, configured to obtain entity candidate data of at least one data source;
a candidate entity determining module 32, configured to select a candidate entity from the entities according to the entity parameters of the entities included in the entity candidate data determined by the candidate data obtaining module 31;
an entity feature extracting module 33, configured to extract, if the candidate entity determined by the candidate entity determining module 32 is included in a specified entity set, an entity feature of at least one specified entity including the candidate entity from the specified entity set;
a target entity determining module 34, configured to determine a target entity from the at least one designated entity according to the entity feature of the at least one designated entity determined by the entity feature extracting module 33, and determine at least one associated entity of the target entity from the designated entity set based on an association relationship between the target entity and other designated entities in the designated entity set;
a first entity set generating module 35, configured to generate a target entity set according to the target entity determined by the target entity determining module 34 and the at least one associated entity of the target entity.
In some possible embodiments, the apparatus further comprises:
a second entity set generating module 36, configured to generate a target entity set according to the candidate entity and each designated entity included in the designated entity set, if the candidate entity determined by the candidate entity determining module 32 is not included in the designated entity set.
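The branch handled by the second entity set generating module 36 can be illustrated with a minimal Python sketch. It assumes that "generating a target entity set according to the candidate entity and each designated entity" means adding the new candidate to a copy of the designated entity set; the text does not spell out the combination rule, so this reading and the function name `generate_target_set_for_new_candidate` are hypothetical.

```python
def generate_target_set_for_new_candidate(candidate, designated_entities):
    """Return a target entity set for a candidate that is not yet designated.

    Assumption: the target set is the designated set plus the new candidate.
    """
    if candidate in designated_entities:
        raise ValueError("candidate is already a designated entity")
    target_set = set(designated_entities)  # copy, so the input is not mutated
    target_set.add(candidate)
    return target_set


designated = {"Entity A", "Entity B"}
print(generate_target_set_for_new_candidate("Entity C", designated))
```

Copying the designated set keeps the original collection unchanged, which makes it easier to compare the set before and after a discovery run.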
In some possible embodiments, the data source includes at least one of a news channel, a search log, and a social platform; the candidate data acquiring module 31 is specifically configured to:
acquiring one or more items of data in news headlines, news abstracts and news texts in a news channel, and determining the acquired data as entity candidate data; and/or
acquiring a search record in a search log, and determining the acquired search record as entity candidate data; and/or
acquiring discussion topics in the social platform, and determining the acquired discussion topics as entity candidate data.
In some possible embodiments, the apparatus further comprises:
an entity recognition module 37, configured to recognize and extract, based on a named entity recognition algorithm, each entity included in the entity candidate data and the entity parameters of each entity.
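As a rough illustration of how entity parameters might be derived from entity candidate data, the following Python sketch counts occurrences and data sources per entity. It does not implement a named entity recognition algorithm: a simple lookup against a list of known names stands in for the recognizer, and the function name, the input layout and the parameter keys are all assumptions made for the example.

```python
from collections import Counter


def extract_entity_parameters(candidate_data, known_names):
    """Count occurrences and distinct data sources for each known name.

    candidate_data: dict mapping a data source name to a list of text snippets.
    known_names: names to look for (a stand-in for a real NER model).
    """
    occurrences = Counter()
    sources = {}
    for source, snippets in candidate_data.items():
        for text in snippets:
            for name in known_names:
                hits = text.count(name)
                if hits:
                    occurrences[name] += hits
                    sources.setdefault(name, set()).add(source)
    return {
        name: {"occurrences": occurrences[name], "source_count": len(sources[name])}
        for name in occurrences
    }


data = {
    "news": ["Alpha wins cup", "Alpha and Beta meet"],
    "search": ["Alpha score"],
}
print(extract_entity_parameters(data, ["Alpha", "Beta"]))
```

In practice the lookup would be replaced by a trained recognizer; only the bookkeeping of the per-entity parameters is the point of the sketch.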
In some possible embodiments, the entity parameter includes any one of the number of entity occurrences, the number of entity updates, and the number of entity views; the candidate entity determining module 32 is specifically configured to:
if the entity candidate data comprises one or more first entities from a single data source, determining a first entity of the one or more first entities with an entity parameter greater than or equal to a first preset entity parameter threshold value as a candidate entity;
if the entity candidate data comprises one or more second entities from at least two data sources, the entity parameters of any second entity in each data source are summed, and the second entity with the sum of the entity parameters being greater than or equal to a second preset entity parameter threshold is determined as the candidate entity.
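The two threshold rules above can be sketched in Python as follows. The function name `select_candidates`, the nested-dictionary input layout and the threshold values in the usage example are assumptions chosen for illustration; only the comparison logic (a per-source threshold for a single data source, a threshold on the summed parameter for two or more data sources) comes from the text.

```python
from collections import defaultdict


def select_candidates(counts_by_source, first_threshold, second_threshold):
    """Select candidate entities from per-source entity parameters.

    counts_by_source: dict mapping data source -> {entity: parameter value}.
    first_threshold: applied directly when there is a single data source.
    second_threshold: applied to the per-entity sum across data sources.
    """
    if len(counts_by_source) == 1:
        (counts,) = counts_by_source.values()
        return {e for e, value in counts.items() if value >= first_threshold}

    totals = defaultdict(int)
    for counts in counts_by_source.values():
        for entity, value in counts.items():
            totals[entity] += value
    return {e for e, total in totals.items() if total >= second_threshold}


single = {"news": {"A": 5, "B": 2}}
multi = {"news": {"A": 5, "B": 2}, "search": {"A": 6, "C": 10}}
print(select_candidates(single, first_threshold=3, second_threshold=10))
print(select_candidates(multi, first_threshold=3, second_threshold=10))
```

Keeping the two thresholds separate reflects that a summed parameter across several sources is usually on a larger scale than a parameter from one source.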
In some possible embodiments, the entity parameter includes a number of entity data sources; the candidate entity determining module 32 is specifically configured to:
and determining one or more third entities from at least two data sources from the entity candidate data, and determining the third entities of which the entity data source quantity is not less than a preset data source quantity threshold value from the one or more third entities as candidate entities.
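The data-source-count rule can be sketched in Python as below. The function name `select_by_source_count` and the input layout (data source mapped to the entities seen in it) are assumptions for illustration; the rule itself, keeping entities whose number of data sources is not less than a preset threshold, follows the text.

```python
def select_by_source_count(entities_by_source, min_sources):
    """Keep entities that appear in at least `min_sources` distinct data sources.

    entities_by_source: dict mapping data source -> iterable of entity names.
    """
    sources_per_entity = {}
    for source, entities in entities_by_source.items():
        for entity in entities:
            sources_per_entity.setdefault(entity, set()).add(source)
    return {e for e, srcs in sources_per_entity.items() if len(srcs) >= min_sources}


observed = {
    "news": ["A", "B"],
    "search": ["A", "C"],
    "social": ["A", "C"],
}
print(select_by_source_count(observed, min_sources=2))
```

Using a set of sources per entity means repeated mentions within one source do not inflate the count.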
In some possible embodiments, the entity characteristics include at least two of an entity importance value, an entity data source number, an entity attribute number, an entity occurrence number, an entity update number, and an entity browsing number; the target entity determining module 34 includes:
a target entity discovery unit 3401, configured to perform normalization processing on each entity feature of any specified entity in the at least one specified entity respectively to obtain normalized entity features corresponding to each specified entity;
inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting a target entity included in the at least one designated entity based on the entity classification model;
the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
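The normalize-then-classify step can be illustrated with a minimal Python sketch. It assumes min-max normalization per feature across the designated entities and a logistic (linear) scorer; the text does not fix the normalization method or the model form beyond "linear and/or nonlinear", and the weights, bias and score threshold used here are hypothetical placeholders rather than trained values.

```python
import math


def min_max_normalize(feature_rows):
    """Scale each feature column to [0, 1] across all entities.

    feature_rows: dict mapping entity -> list of raw feature values.
    """
    names = list(feature_rows)
    columns = list(zip(*(feature_rows[n] for n in names)))
    normalized = {n: [] for n in names}
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        for name, value in zip(names, column):
            # A constant column carries no information; map it to 0.0.
            normalized[name].append((value - low) / span if span else 0.0)
    return normalized


def classify_hot(normalized, weights, bias, score_threshold=0.5):
    """Return entities whose logistic score reaches the score threshold."""
    hot = set()
    for name, features in normalized.items():
        z = bias + sum(w * x for w, x in zip(weights, features))
        score = 1.0 / (1.0 + math.exp(-z))
        if score >= score_threshold:
            hot.add(name)
    return hot


raw = {"A": [10, 100], "B": [0, 0], "C": [5, 50]}
scaled = min_max_normalize(raw)
print(classify_hot(scaled, weights=[2.0, 2.0], bias=-2.0, score_threshold=0.6))
```

Normalizing first keeps features with large raw ranges, such as view counts, from dominating features with small ranges, such as the number of data sources.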
In some possible embodiments, the target entity determining module 34 includes:
an associated entity discovering unit 3402, configured to obtain a target entity type of the target entity and determine an associated entity type set of the target entity type;
determining one or more designated entities of which the entity types are contained in the associated entity type set from each designated entity which is included in the designated entity set and has the associated relationship with the target entity;
and determining the determined one or more designated entities as the associated entities of the target entity.
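The type-based filtering of associated entities can be sketched in Python as follows. The three lookup tables (entity types, association relationships, and the associated entity type set for each type) and the function name `find_associated_entities` are hypothetical structures chosen for the example; the text only states that related designated entities are kept when their type is in the associated entity type set of the target entity type.

```python
def find_associated_entities(target, entity_types, relations, associated_types):
    """Return designated entities related to `target` whose type is allowed.

    entity_types: dict mapping entity -> entity type.
    relations: dict mapping entity -> set of designated entities related to it.
    associated_types: dict mapping entity type -> set of associated types.
    """
    allowed = associated_types.get(entity_types[target], set())
    return {
        entity
        for entity in relations.get(target, set())
        if entity_types.get(entity) in allowed
    }


types = {"Film X": "movie", "Actor Y": "person", "City Z": "place", "Director W": "person"}
links = {"Film X": {"Actor Y", "City Z", "Director W"}}
type_map = {"movie": {"person"}}
print(find_associated_entities("Film X", types, links, type_map))
```

Filtering by type keeps the target entity set focused: in this example a film pulls in related people but not the place it is merely linked to.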
In a specific implementation, the entity discovery apparatus may execute, through its built-in functional modules, the implementation manners provided by the steps in fig. 1. For example, the candidate data obtaining module 31 may be configured to execute the implementation manners of acquiring the entity candidate data of each data source in the above steps. The candidate entity determining module 32 may be configured to execute the implementation manners of determining candidate entities based on the entity parameters in the entity candidate data. The entity feature extracting module 33 may be configured to execute the implementation manners of determining whether a candidate entity belongs to the designated entity set and extracting the entity features of the designated entities. The target entity determining module 34 may be configured to execute the implementation manners of determining the target entity based on the entity features and determining the associated entities of the target entity based on the association relationships between entities. The first entity set generating module 35 may be configured to execute the implementation manners of generating the target entity set according to the target entity and the associated entities of the target entity.
The second entity set generating module 36 may be configured to execute the implementation manners of generating the target entity set based on the candidate entity and each designated entity in the designated entity set. The entity recognition module 37 may be configured to execute the implementation manners of extracting each entity included in the entity candidate data and determining the entity parameters of each entity. For each of these modules, reference may be made to the implementation manners provided in the above steps, which are not described again here.
In this embodiment of the application, the entity discovery apparatus may recognize and extract, based on a named entity recognition algorithm, each entity included in the entity candidate data, together with the entity parameters corresponding to each entity, from the entity candidate data of data sources such as news channels, search logs and/or social platforms. The entity parameters include any one of the number of entity occurrences, the number of entity updates, the number of entity views and the number of entity data sources. Candidate entities can be determined from the entities by comparing the entity parameters of each entity with an entity parameter threshold. If a candidate entity is contained in the designated entity set, entity features of at least one designated entity, including the candidate entity, can be extracted from the designated entity set, where the entity features include the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of entity occurrences, the number of entity updates and the number of entity views. By inputting the normalized entity features into the entity classification model, a target entity included in the at least one designated entity can be output based on the entity classification model. At least one associated entity of the target entity is then determined using the association relationships between the target entity and the other designated entities in the designated entity set, and the target entity set is finally generated. With the method provided by this embodiment, entities can be discovered in a timely manner, the recall rate and accuracy of entity discovery are improved, the flexibility is high, and the application range is wide.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 9, the terminal device in this embodiment may include: one or more processors 401 and memory 402. The processor 401 and the memory 402 are connected by a bus 403. The memory 402 is used to store a computer program comprising program instructions, and the processor 401 is used to execute the program instructions stored in the memory 402 to perform the following operations:
acquiring entity candidate data of at least one data source;
selecting candidate entities from each entity according to entity parameters of each entity included in the entity candidate data;
if the candidate entity is contained in a specified entity set, extracting entity characteristics of at least one specified entity including the candidate entity from the specified entity set;
determining a target entity from the at least one designated entity according to the entity characteristics of the at least one designated entity, and determining at least one associated entity of the target entity from the designated entity set based on the association relationship between the target entity and other designated entities in the designated entity set;
and generating a target entity set according to the target entity and the at least one associated entity of the target entity.
In some possible embodiments, the processor 401 is configured to:
and if the candidate entity is not contained in the specified entity set, generating a target entity set according to the candidate entity and each specified entity contained in the specified entity set.
In some possible embodiments, the data source includes at least one of a news channel, a search log, and a social platform; the processor 401 is configured to:
acquiring one or more items of data of news headlines, news abstracts and news texts in news channels, and determining the acquired data as entity candidate data; and/or
acquiring a search record in a search log, and determining the acquired search record as entity candidate data; and/or
acquiring discussion topics in the social platform, and determining the acquired discussion topics as entity candidate data.
In some possible embodiments, the processor 401 is configured to:
recognizing and extracting each entity included in the entity candidate data based on a named entity recognition algorithm;
and determining entity parameters respectively corresponding to the entities from the entity candidate data.
In some possible embodiments, the entity parameter includes any one of an entity occurrence number, an entity update number, and an entity browsing number; the processor 401 is configured to:
if the entity candidate data comprises one or more first entities from a single data source, determining a first entity of the one or more first entities with an entity parameter greater than or equal to a first preset entity parameter threshold value as a candidate entity;
if the entity candidate data comprises one or more second entities from at least two data sources, the entity parameters of any second entity in each data source are summed, and the second entity with the entity parameter sum larger than or equal to a second preset entity parameter threshold is determined as the candidate entity.
In some possible embodiments, the entity parameter includes a number of entity data sources; the processor 401 is configured to:
and determining one or more third entities from at least two data sources from the entity candidate data, and determining the third entities of which the entity data source quantity is not less than a preset data source quantity threshold value from the one or more third entities as candidate entities.
In some possible embodiments, the entity characteristics include at least two of an entity importance value, an entity data source number, an entity attribute number, an entity occurrence number, an entity update number, and an entity browsing number; the processor 401 is configured to:
performing normalization processing on each entity feature of any designated entity in the at least one designated entity respectively, to obtain the normalized entity features corresponding to each designated entity;
inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting a target entity included in the at least one designated entity based on the entity classification model;
the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
In some possible embodiments, the processor 401 is configured to:
acquiring a target entity type of the target entity and determining an associated entity type set of the target entity type;
determining one or more designated entities of which the entity types are contained in the associated entity type set from each designated entity which is included in the designated entity set and has the associated relationship with the target entity;
and determining the determined one or more designated entities as the associated entities of the target entity.
It should be appreciated that, in some possible implementations, the processor 401 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor or any conventional processor. The memory 402 may include both read-only memory and random access memory, and provides instructions and data to the processor 401. A portion of the memory 402 may also include non-volatile random access memory. For example, the memory 402 may also store device type information.
In a specific implementation, the terminal device may execute the implementation manners provided in the steps in fig. 1 through the built-in function modules, which may specifically refer to the implementation manners provided in the steps, and are not described herein again.
In this embodiment, the terminal device may recognize and extract, based on a named entity recognition algorithm, each entity included in the entity candidate data, together with the entity parameters corresponding to each entity, from the entity candidate data of data sources such as news channels, search logs and/or social platforms. The entity parameters include any one of the number of entity occurrences, the number of entity updates, the number of entity views and the number of entity data sources. Candidate entities can be determined from the entities by comparing the entity parameters of each entity with an entity parameter threshold. If a candidate entity is contained in the designated entity set, entity features of at least one designated entity, including the candidate entity, can be extracted from the designated entity set, where the entity features include the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of entity occurrences, the number of entity updates and the number of entity views. By inputting the normalized entity features into the entity classification model, a target entity included in the at least one designated entity can be output based on the entity classification model. At least one associated entity of the target entity is then determined using the association relationships between the target entity and the other designated entities in the designated entity set, and the target entity set is finally generated. With the method provided by this embodiment, entities can be discovered in a timely manner, the recall rate and accuracy of entity discovery are improved, the flexibility is high, and the application range is wide.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method for entity discovery provided in each step in fig. 1 is implemented.
The computer-readable storage medium may be an internal storage unit of the entity discovery apparatus or the terminal device provided in any of the foregoing embodiments, for example, a hard disk or a memory of an electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms "first", "second", "third", "fourth", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. Those of ordinary skill in the art will appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.

Claims (10)

1. An entity discovery method, the method comprising:
acquiring entity candidate data of at least one data source;
selecting candidate entities from each entity according to entity parameters of each entity included in the entity candidate data;
if the candidate entity is contained in a specified entity set, extracting entity characteristics of at least one specified entity including the candidate entity from the specified entity set, wherein the at least one specified entity comprises the candidate entity and the marked specified entity in the specified entity set;
determining a target entity from the at least one designated entity according to the entity characteristics of the at least one designated entity, and determining at least one associated entity of the target entity from the designated entity set based on the association relationship between the target entity and other designated entities in the designated entity set;
generating a target entity set according to the target entity and the at least one associated entity of the target entity;
the entity characteristics comprise at least two items of entity importance degree distinguishing value, entity data source quantity, entity attribute quantity, entity occurrence frequency, entity updating frequency and entity browsing frequency; the determining a target entity from the at least one designated entity according to the entity characteristics of the at least one designated entity includes:
respectively carrying out normalization processing on each entity characteristic of any one designated entity in the at least one designated entity to obtain the normalized entity characteristic corresponding to each designated entity;
inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting a target entity included in the at least one designated entity based on the entity classification model;
the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
2. The method of claim 1, further comprising:
and if the candidate entity is not contained in the specified entity set, generating a target entity set according to the candidate entity and each specified entity included in the specified entity set.
3. The method of claim 2, wherein the data sources comprise at least one of news channels, search logs, and social platforms; the acquiring entity candidate data of at least one data source comprises:
acquiring one or more items of data in news headlines, news abstracts and news texts in a news channel, and determining the acquired data as entity candidate data; and/or
acquiring a search record in a search log, and determining the acquired search record as entity candidate data; and/or
acquiring discussion topics in the social platform, and determining the acquired discussion topics as entity candidate data.
4. The method of claim 3, further comprising:
recognizing and extracting each entity included in the entity candidate data based on a named entity recognition algorithm;
and determining entity parameters respectively corresponding to the entities from the entity candidate data.
5. The method of claim 4, wherein the entity parameter includes any one of an entity occurrence number, an entity update number, and an entity browse number; the selecting a candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data includes:
if the entity candidate data comprises one or more first entities from a single data source, determining a first entity of the one or more first entities, of which an entity parameter is greater than or equal to a first preset entity parameter threshold value, as a candidate entity;
and if the entity candidate data comprises one or more second entities from at least two data sources, summing entity parameters of any second entity in each data source, and determining the second entity with the sum of the entity parameters being greater than or equal to a second preset entity parameter threshold value as the candidate entity.
6. The method of claim 4, wherein the entity parameter comprises an entity data source number; the selecting a candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data includes:
and determining one or more third entities from at least two data sources from the entity candidate data, and determining the third entities of which the entity data source quantity is not less than a preset data source quantity threshold value from the one or more third entities as candidate entities.
7. The method according to any one of claims 1 to 6, wherein the determining at least one associated entity of the target entity from the set of specified entities based on the association between the target entity and other specified entities in the set of specified entities comprises:
obtaining a target entity type of the target entity and determining an associated entity type set of the target entity type;
determining one or more designated entities of which the entity types are contained in the associated entity type set from each designated entity which is included in the designated entity set and has the associated relationship with the target entity;
and determining the one or more determined specified entities as the associated entities of the target entity.
8. An apparatus for entity discovery, the apparatus comprising:
a candidate data obtaining module, configured to obtain entity candidate data of at least one data source;
a candidate entity determining module, configured to select a candidate entity from each entity according to the entity parameter of each entity included in the entity candidate data determined by the candidate data obtaining module;
an entity feature extraction module, configured to extract, if the candidate entity determined by the candidate entity determination module is included in a specified entity set, an entity feature of at least one specified entity including the candidate entity from the specified entity set, where the at least one specified entity includes the candidate entity and a specified entity marked in the specified entity set;
a target entity determining module, configured to determine a target entity from the at least one designated entity according to the entity feature of the at least one designated entity determined by the entity feature extracting module, and determine at least one associated entity of the target entity from the designated entity set based on an association relationship between the target entity and other designated entities in the designated entity set;
a first entity set generating module, configured to generate a target entity set according to the target entity determined by the target entity determining module and the at least one associated entity of the target entity;
the entity characteristics comprise at least two items of entity importance degree distinguishing value, entity data source quantity, entity attribute quantity, entity occurrence frequency, entity updating frequency and entity browsing frequency; the target entity determination module comprises:
a target entity discovery unit, configured to perform normalization processing on each entity feature of any designated entity in the at least one designated entity respectively, to obtain the normalized entity features corresponding to each designated entity; and to input the normalized entity features corresponding to each designated entity into an entity classification model, and output a target entity included in the at least one designated entity based on the entity classification model;
the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
9. A terminal device, comprising a processor and a memory, said processor and memory being interconnected;
the memory for storing a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN201910242996.XA 2019-03-28 2019-03-28 Entity discovery method and device Active CN110008352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910242996.XA CN110008352B (en) 2019-03-28 2019-03-28 Entity discovery method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910242996.XA CN110008352B (en) 2019-03-28 2019-03-28 Entity discovery method and device

Publications (2)

Publication Number Publication Date
CN110008352A (en) 2019-07-12
CN110008352B (en) 2022-12-20

Family

ID=67168611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910242996.XA Active CN110008352B (en) 2019-03-28 2019-03-28 Entity discovery method and device

Country Status (1)

Country Link
CN (1) CN110008352B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625837B (en) * 2020-05-22 2023-07-04 北京金山云网络技术有限公司 Method, device and server for identifying system loopholes
CN112633000A (en) * 2020-12-25 2021-04-09 北京明略软件系统有限公司 Method and device for associating entities in text, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107992478A (en) * 2017-11-30 2018-05-04 百度在线网络技术(北京)有限公司 The method and apparatus for determining focus incident
CN108038183A (en) * 2017-12-08 2018-05-15 北京百度网讯科技有限公司 Architectural entities recording method, device, server and storage medium
CN108509479A (en) * 2017-12-13 2018-09-07 深圳市腾讯计算机系统有限公司 Entity recommendation method and device, terminal and readable storage medium
CN108536702A (en) * 2017-03-02 2018-09-14 腾讯科技(深圳)有限公司 Related entity determination method, apparatus and computing device
CN109189938A (en) * 2018-08-31 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for updating knowledge graph

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9229988B2 (en) * 2013-01-18 2016-01-05 Microsoft Technology Licensing, Llc Ranking relevant attributes of entity in structured knowledge base
US20160292281A1 (en) * 2015-04-01 2016-10-06 Microsoft Technology Licensing, Llc Obtaining content based upon aspect of entity
US10333816B2 (en) * 2015-09-22 2019-06-25 Ca, Inc. Key network entity detection

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN108536702A (en) * 2017-03-02 2018-09-14 腾讯科技(深圳)有限公司 Related entity determination method, apparatus and computing device
CN107992478A (en) * 2017-11-30 2018-05-04 百度在线网络技术(北京)有限公司 The method and apparatus for determining focus incident
CN108038183A (en) * 2017-12-08 2018-05-15 北京百度网讯科技有限公司 Architectural entities recording method, device, server and storage medium
CN108509479A (en) * 2017-12-13 2018-09-07 深圳市腾讯计算机系统有限公司 Entity recommendation method and device, terminal and readable storage medium
CN109189938A (en) * 2018-08-31 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for updating knowledge graph

Non-Patent Citations (2)

Title
Discovering emerging entities with ambiguous names; Johannes Hoffart et al.; 《Proceedings of the 23rd International World Wide Web Conference》; 20140430; 385-395 *
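Research on key technologies of knowledge search for natural language queries; Huang Pengcheng (黄鹏程); 《China Master's Theses Full-text Database (Information Science and Technology)》; 20160715 (No. 7); I138-1243 *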

Also Published As

Publication number Publication date
CN110008352A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN106874279B (en) Method and device for generating application category label
US9489401B1 (en) Methods and systems for object recognition
US9104979B2 (en) Entity recognition using probabilities for out-of-collection data
WO2017045443A1 (en) Image retrieval method and system
CN111241389B (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
US20160034514A1 (en) Providing search results based on an identified user interest and relevance matching
CN108595660A (en) Label information generation method, device, storage medium and the equipment of multimedia resource
Im et al. Linked tag: image annotation using semantic relationships between image tags
CN110929125A (en) Search recall method, apparatus, device and storage medium thereof
CN111797239A (en) Application program classification method and device and terminal equipment
CN109635073A (en) Forum's community application management method, device, equipment and computer readable storage medium
CN112148701A (en) File retrieval method and equipment
CN108197474A (en) The classification of mobile terminal application and detection method
CN110008352B (en) Entity discovery method and device
CN110209721A (en) Judgement document transfers method, apparatus, server and storage medium
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
CN112507167A (en) Method and device for identifying video collection, electronic equipment and storage medium
US11797617B2 (en) Method and apparatus for collecting information regarding dark web
CN111259975B (en) Method and device for generating classifier and method and device for classifying text
CN105512270B (en) Method and device for determining related objects
CN111737577A (en) Data query method, device, equipment and medium based on service platform
CN115437930B (en) Webpage application fingerprint information identification method and related equipment
CN107577667B (en) Entity word processing method and device
CN115168568B (en) Data content identification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant