CN110008352A - Entity discovery method and device - Google Patents

Info

Publication number
CN110008352A
Authority
CN
China
Prior art keywords
entity
candidate
designated entities
data
mentioned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910242996.XA
Other languages
Chinese (zh)
Other versions
CN110008352B (en
Inventor
徐程程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910242996.XA
Publication of CN110008352A
Application granted
Publication of CN110008352B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application disclose an entity discovery method and device. The method comprises: obtaining entity candidate data from at least one data source; selecting a candidate entity from the entities included in the entity candidate data according to the entity parameter of each entity; if the candidate entity is contained in a designated entity set, extracting from the designated entity set the entity features of at least one designated entity, the at least one designated entity including the candidate entity; determining a target entity from the at least one designated entity according to those entity features, and, based on the association relationships between the target entity and the other designated entities in the designated entity set, determining at least one associated entity of the target entity from the designated entity set; and generating a target entity set from the target entity and its at least one associated entity. The embodiments can discover popular entities in a timely manner, improve the recall rate and recall efficiency of popular entities, and are widely applicable.

Description

Entity discovery method and device
Technical field
This application relates to the field of data processing, and in particular to an entity discovery method and device.
Background
A knowledge graph must keep its knowledge comprehensive and up to date. Once the overall pipeline for building a knowledge graph is in place, the automatic discovery and downloading of entities is an important entry point for keeping the knowledge updated automatically. Websites typically see many new entities appear every day, but the prior art can only discover entities displayed on the home page, so popular entities are insufficiently recalled. Meanwhile, a knowledge graph contains many entities that already exist but remain important and need to be downloaded periodically for updating. Neither configured crawling rules nor manual operation can find them effectively, and updating every entity consumes substantial resources and is usually impractical. As a result, much of the knowledge is not kept up to date.
Summary of the invention
The embodiments of this application provide an entity discovery method and device that can discover popular entities in a timely manner, improve the recall rate and recall efficiency of popular entities, and are widely applicable.
In a first aspect, an embodiment of this application provides an entity discovery method, the method comprising:
Obtaining entity candidate data from at least one data source;
Selecting a candidate entity from the entities included in the entity candidate data according to the entity parameter of each of those entities;
If the candidate entity is contained in a designated entity set, extracting from the designated entity set the entity features of at least one designated entity, the at least one designated entity including the candidate entity;
Determining a target entity from the at least one designated entity according to the entity features of the at least one designated entity, and, based on the association relationships between the target entity and the other designated entities in the designated entity set, determining at least one associated entity of the target entity from the designated entity set;
Generating a target entity set from the target entity and the at least one associated entity of the target entity.
This embodiment can discover popular entities in a timely manner. Determining the target entity together with its associated entities improves the recall rate and recall efficiency of popular entities, and the approach is widely applicable.
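The first-aspect steps above can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the function name `discover`, the `is_target` callback (standing in for the feature-based target determination), and the `related` mapping (standing in for the association relationships) are all assumptions introduced here, and the sketch classifies only the candidate entity itself rather than a wider group of designated entities.

```python
from typing import Callable, Dict, Iterable, Set


def discover(
    candidates: Iterable[str],
    designated: Set[str],
    is_target: Callable[[str], bool],
    related: Dict[str, Set[str]],
) -> Set[str]:
    """Build a target entity set from candidate entities (illustrative only)."""
    targets: Set[str] = set()
    for entity in candidates:
        if entity not in designated:
            # The source handles this case in a separate implementation.
            continue
        if is_target(entity):
            targets.add(entity)
            # Keep only associated entities that are themselves designated.
            targets |= related.get(entity, set()) & designated
    return targets
```

Candidates outside the designated entity set are skipped here because the source treats them in a separate possible implementation.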
With reference to the first aspect, in a possible implementation, the method further comprises:
If the candidate entity is not contained in the designated entity set, generating a target entity set from the candidate entity and the designated entities included in the designated entity set.
This embodiment can promptly discover entities that are not yet in the designated entity set, improving the entity recall rate and recall efficiency, with strong applicability.
With reference to the first aspect, in a possible implementation, the data source includes at least one of a news channel, a search log, and a social platform; obtaining the entity candidate data from at least one data source comprises:
Obtaining one or more of the news headlines, news summaries, and news body text in the news channel, and determining the obtained data as entity candidate data; and/or
Obtaining the search records in the search log, and determining the obtained search records as entity candidate data; and/or
Obtaining the discussion topics on the social platform, and determining the obtained discussion topics as entity candidate data.
This embodiment can discover entities in a timely manner and increases the diversity of data sources, which in turn improves the recall rate of popular entities, with high flexibility and strong applicability.
With reference to the first aspect, in a possible implementation, the method further comprises:
Identifying and extracting, based on a named entity recognition algorithm, the entities included in the entity candidate data;
Determining the entity parameter corresponding to each of those entities from the entity candidate data.
This embodiment can improve the precision of entity recognition and thereby increase the recall rate and accuracy of popular entities, with strong applicability.
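As a rough illustration of extracting entities and deriving an entity parameter from candidate text, the sketch below counts mentions of known names. It is deliberately not a named entity recognition model: the gazetteer lookup, the function name `count_entity_mentions`, and the choice of occurrence count as the entity parameter are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_entity_mentions(texts: Iterable[str], gazetteer: List[str]) -> Dict[str, int]:
    """Count how often each known entity name appears in the candidate texts.

    A gazetteer (fixed list of names) stands in for a real NER algorithm.
    """
    counts: Counter = Counter()
    for text in texts:
        for name in gazetteer:
            # str.count returns the number of non-overlapping occurrences.
            counts[name] += text.count(name)
    # Drop names that never occurred.
    return {name: n for name, n in counts.items() if n > 0}
```

In practice the gazetteer would be replaced by a trained recognizer; the counting step that yields the entity parameter would stay the same.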
With reference to the first aspect, in a possible implementation, the entity parameter includes any one of an entity occurrence count, an entity update count, and an entity browse count; selecting a candidate entity from the entities according to the entity parameter of each entity included in the entity candidate data comprises:
If the entity candidate data includes one or more first entities from a single data source, determining each first entity whose entity parameter is greater than or equal to a first preset entity parameter threshold as a candidate entity;
If the entity candidate data includes one or more second entities from at least two data sources, summing the entity parameters of any second entity across the data sources, and determining each second entity whose summed entity parameter is greater than or equal to a second preset entity parameter threshold as a candidate entity.
This embodiment can increase the recall rate of popular entities, with high flexibility and a wide scope of application.
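The two threshold rules above can be sketched as follows, assuming the entity parameter is an integer count stored per data source. The function name `select_candidates` and the nested-dictionary layout are hypothetical choices for this example.

```python
from typing import Dict, List


def select_candidates(
    params_by_source: Dict[str, Dict[str, int]],
    single_threshold: int,
    multi_threshold: int,
) -> List[str]:
    """Apply the single-source and multi-source threshold rules."""
    totals: Dict[str, int] = {}
    source_count: Dict[str, int] = {}
    for per_entity in params_by_source.values():
        for entity, value in per_entity.items():
            totals[entity] = totals.get(entity, 0) + value
            source_count[entity] = source_count.get(entity, 0) + 1

    selected = []
    for entity, total in totals.items():
        # Entities seen in one source use the first threshold; entities seen
        # in two or more sources compare their summed value to the second.
        threshold = single_threshold if source_count[entity] == 1 else multi_threshold
        if total >= threshold:
            selected.append(entity)
    return sorted(selected)
```

For an entity seen in only one source the summed value equals its single-source value, so one accumulation pass serves both rules.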
With reference to the first aspect, in a possible implementation, the entity parameter includes the number of data sources of the entity; selecting a candidate entity from the entities according to the entity parameter of each entity included in the entity candidate data comprises:
Determining, from the entity candidate data, one or more third entities that come from at least two data sources, and determining each third entity whose number of data sources is not less than a preset data source count threshold as a candidate entity.
This embodiment can increase the recall rate of popular entities, with high flexibility and a wide scope of application.
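The data-source-count rule can be sketched in a few lines. The function name `select_by_source_count` and the input layout (source name mapped to the entities found there) are assumptions for illustration.

```python
from typing import Dict, List, Set


def select_by_source_count(
    entities_by_source: Dict[str, List[str]], min_sources: int
) -> List[str]:
    """Select entities that appear in enough distinct data sources."""
    sources_of: Dict[str, Set[str]] = {}
    for source, entities in entities_by_source.items():
        for entity in entities:
            sources_of.setdefault(entity, set()).add(source)
    # A "third entity" must come from at least two sources, and its source
    # count must not be less than the preset threshold.
    return sorted(
        entity
        for entity, sources in sources_of.items()
        if len(sources) >= 2 and len(sources) >= min_sources
    )
```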
With reference to the first aspect, in a possible implementation, the entity features include at least two of an entity importance discrimination value, the number of data sources of the entity, the number of entity attributes, an entity occurrence count, an entity update count, and an entity browse count; determining a target entity from the at least one designated entity according to the entity features of the at least one designated entity comprises:
Normalizing each entity feature of every designated entity among the at least one designated entity, to obtain the normalized entity features corresponding to each designated entity;
Inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting, based on the entity classification model, the target entity included in the at least one designated entity;
Wherein the entity classification model is obtained by training a linear model and/or a nonlinear model, and is able to identify entities whose popularity is greater than or equal to a preset popularity threshold.
This embodiment can improve the recall rate and accuracy of popular entities, is not error-prone, is simple to operate, and has strong applicability.
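A minimal sketch of the normalize-then-classify step follows. The source does not specify the normalization or the model, so min-max scaling and a logistic scoring function with hand-set weights are assumptions used here in place of a trained classifier; all function names are hypothetical.

```python
import math
from typing import Dict, List


def min_max_normalize(features: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each feature column to [0, 1] across all designated entities."""
    if not features:
        return {}
    dim = len(next(iter(features.values())))
    lows = [min(vec[i] for vec in features.values()) for i in range(dim)]
    highs = [max(vec[i] for vec in features.values()) for i in range(dim)]
    return {
        name: [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(vec, lows, highs)
        ]
        for name, vec in features.items()
    }


def popularity_score(x: List[float], weights: List[float], bias: float) -> float:
    """Logistic score; the weights would come from training in practice."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def find_targets(
    features: Dict[str, List[float]],
    weights: List[float],
    bias: float,
    heat_threshold: float,
) -> List[str]:
    """Return entities whose score reaches the preset popularity threshold."""
    normalized = min_max_normalize(features)
    return sorted(
        name
        for name, x in normalized.items()
        if popularity_score(x, weights, bias) >= heat_threshold
    )
```

Normalizing first keeps features with large raw ranges, such as browse counts, from dominating features with small ranges, such as the number of data sources.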
With reference to the first aspect, in a possible implementation, determining at least one associated entity of the target entity from the designated entity set, based on the association relationships between the target entity and the other designated entities in the designated entity set, comprises:
Obtaining the target entity type of the target entity and determining the associated entity type set of that target entity type;
Determining, from the designated entities in the designated entity set that have an association relationship with the target entity, one or more designated entities whose entity type is contained in the associated entity type set;
Determining the one or more designated entities so determined as the associated entities of the target entity.
This embodiment can increase the recall rate of popular entities and improve their recall efficiency, and is simple to operate, highly flexible, and strongly applicable.
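The type-based filtering of associated entities can be sketched as below. The three lookup tables (`entity_types`, `relations`, `associated_types`) and the function name are hypothetical data structures chosen for this example; the source does not prescribe a storage layout.

```python
from typing import Dict, List, Set


def find_associated(
    target: str,
    entity_types: Dict[str, str],
    relations: Dict[str, Set[str]],
    associated_types: Dict[str, Set[str]],
) -> List[str]:
    """Return related designated entities whose type is an allowed one."""
    # Associated entity type set for the target's own entity type.
    allowed = associated_types.get(entity_types[target], set())
    return sorted(
        entity
        for entity in relations.get(target, set())
        if entity_types.get(entity) in allowed
    )
```

Restricting the expansion to an allowed type set keeps the target entity set from growing to include every neighbor of a popular entity.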
In a second aspect, an embodiment of this application provides an entity discovery device, the device comprising:
A candidate data obtaining module, configured to obtain entity candidate data from at least one data source;
A candidate entity determining module, configured to select a candidate entity from the entities included in the entity candidate data obtained by the candidate data obtaining module, according to the entity parameter of each of those entities;
An entity feature extraction module, configured to, if the candidate entity determined by the candidate entity determining module is contained in a designated entity set, extract from the designated entity set the entity features of at least one designated entity, the at least one designated entity including the candidate entity;
A target entity determining module, configured to determine a target entity from the at least one designated entity according to the entity features extracted by the entity feature extraction module, and, based on the association relationships between the target entity and the other designated entities in the designated entity set, determine at least one associated entity of the target entity from the designated entity set;
A first entity set generation module, configured to generate a target entity set from the target entity determined by the target entity determining module and the at least one associated entity of the target entity.
With reference to the second aspect, in a possible implementation, the device further comprises:
A second entity set generation module, configured to, if the candidate entity determined by the candidate entity determining module is not contained in the designated entity set, generate a target entity set from the candidate entity and the designated entities included in the designated entity set.
With reference to the second aspect, in a possible implementation, the data source includes at least one of a news channel, a search log, and a social platform; the candidate data obtaining module is specifically configured to:
Obtain one or more of the news headlines, news summaries, and news body text in the news channel, and determine the obtained data as entity candidate data; and/or
Obtain the search records in the search log, and determine the obtained search records as entity candidate data; and/or
Obtain the discussion topics on the social platform, and determine the obtained discussion topics as entity candidate data.
With reference to the second aspect, in a possible implementation, the device further comprises:
An entity recognition module, configured to identify and extract, based on a named entity recognition algorithm, the entities included in the entity candidate data and the entity parameters of those entities.
With reference to the second aspect, in a possible implementation, the entity parameter includes any one of an entity occurrence count, an entity update count, and an entity browse count; the candidate entity determining module is specifically configured to:
If the entity candidate data includes one or more first entities from a single data source, determine each first entity whose entity parameter is greater than or equal to a first preset entity parameter threshold as a candidate entity;
If the entity candidate data includes one or more second entities from at least two data sources, sum the entity parameters of any second entity across the data sources, and determine each second entity whose summed entity parameter is greater than or equal to a second preset entity parameter threshold as a candidate entity.
With reference to the second aspect, in a possible implementation, the entity parameter includes the number of data sources of the entity; the candidate entity determining module is specifically configured to:
Determine, from the entity candidate data, one or more third entities that come from at least two data sources, and determine each third entity whose number of data sources is not less than a preset data source count threshold as a candidate entity.
With reference to the second aspect, in a possible implementation, the entity features include at least two of an entity importance discrimination value, the number of data sources of the entity, the number of entity attributes, an entity occurrence count, an entity update count, and an entity browse count; the target entity determining module comprises:
A target entity discovery unit, configured to normalize each entity feature of every designated entity among the at least one designated entity, to obtain the normalized entity features corresponding to each designated entity;
And to input the normalized entity features corresponding to each designated entity into an entity classification model, and output, based on the entity classification model, the target entity included in the at least one designated entity;
Wherein the entity classification model is obtained by training a linear model and/or a nonlinear model, and is able to identify entities whose popularity is greater than or equal to a preset popularity threshold.
With reference to the second aspect, in a possible implementation, the target entity determining module comprises:
An associated entity discovery unit, configured to obtain the target entity type of the target entity and determine the associated entity type set of that target entity type;
To determine, from the designated entities in the designated entity set that have an association relationship with the target entity, one or more designated entities whose entity type is contained in the associated entity type set;
And to determine the one or more designated entities so determined as the associated entities of the target entity.
In a third aspect, an embodiment of this application provides a terminal device comprising a processor and a memory that are connected to each other. The memory stores a computer program that supports the terminal device in executing the method provided by the first aspect and/or any possible implementation of the first aspect. The computer program includes program instructions, and the processor is configured to call those program instructions to execute the method provided by the first aspect and/or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, cause the processor to execute the method provided by the first aspect and/or any possible implementation of the first aspect.
Implementing the embodiments of this application has the following beneficial effects:
Based on the entity candidate data obtained from at least one data source, a candidate entity can be determined according to the entity parameter of each entity included in the entity candidate data. If the candidate entity is contained in the designated entity set, the entity features of at least one designated entity, including the candidate entity, can be extracted from the designated entity set, and a target entity can be determined according to those entity features. Using the association relationships between the target entity and the other designated entities in the designated entity set, at least one associated entity of the target entity can be determined, and a target entity set is finally generated. Entities can thus be discovered in a timely manner, and the recall rate and recall efficiency of entities are improved, with wide applicability.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the entity discovery method provided by the embodiments of this application;
Fig. 2 is a schematic diagram of the data sources provided by the embodiments of this application;
Fig. 3 is a schematic diagram of the entity features provided by the embodiments of this application;
Fig. 4 is a schematic diagram of first-degree relationship diffusion provided by the embodiments of this application;
Fig. 5 is a schematic diagram of an application scenario of first-degree relationship diffusion provided by the embodiments of this application;
Fig. 6 is a schematic diagram of second-degree relationship diffusion provided by the embodiments of this application;
Fig. 7 is a schematic diagram of third-degree relationship diffusion provided by the embodiments of this application;
Fig. 8 is a structural diagram of the entity discovery device provided by the embodiments of this application;
Fig. 9 is a structural diagram of the terminal device provided by the embodiments of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
The entity discovery method provided by the embodiments of this application is widely applicable to popular entity updating, popular entity recall, or popular entity discovery in various knowledge graphs (Knowledge Graph) or entity-relationship models (Entity-relationship model, ERD). For ease of description, popular entity updating, popular entity recall, or popular entity discovery in a knowledge graph is used as the example. The knowledge graph is a concept proposed by Google in 2012. It is essentially a semantic network, and for ease of understanding it can also be regarded as a multi-relational graph (Multi-relational Graph). As a data structure, a graph (Graph) consists of nodes (Vertex) and edges (Edge). Ordinary graphs usually contain only one type of node and one type of edge, whereas a multi-relational graph generally contains multiple types of nodes and multiple types of edges. In a knowledge graph, each node represents an "entity (Entity)" and each edge represents a "relationship (Relation)" between entities. An entity refers to a thing in the real world, such as a person's name, a place name, an organization name, a concept, or a proper noun. A relationship expresses some connection between different entities, for example a person "lives in" Beijing, Zhang San and Li Si are "friends", or logistic regression is "prerequisite knowledge" for deep learning. The popular entities discussed here generally fall into two classes: entities that have been mentioned frequently in a recent period, such as film and television stars or popular TV series; and relatively important entities whose knowledge is updated frequently, such as film and television stars or variety shows.
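To make the multi-relational graph description concrete, the sketch below stores a knowledge graph as (head, relation, tail) triples. The `KnowledgeGraph` class, its method names, and the relation labels are hypothetical; the source describes the concept only, not a storage format.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple


class KnowledgeGraph:
    """A tiny multi-relational graph: typed edges between entity nodes."""

    def __init__(self) -> None:
        self._edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, head: str, relation: str, tail: str) -> None:
        """Record the triple (head, relation, tail)."""
        self._edges[head].append((relation, tail))

    def neighbors(self, head: str, relation: Optional[str] = None) -> List[str]:
        """Tails linked from head, optionally restricted to one relation."""
        return [
            tail
            for rel, tail in self._edges.get(head, [])
            if relation is None or rel == relation
        ]


kg = KnowledgeGraph()
kg.add("Zhang San", "friend", "Li Si")
kg.add("Zhang San", "lives_in", "Beijing")
```

Because each edge carries its own relation label, one node can take part in several kinds of relationship at once, which is what distinguishes a multi-relational graph from an ordinary graph.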
The method provided by the embodiments of this application can be executed by a terminal device or system used for popular entity updating, popular entity recall, or popular entity discovery in a knowledge graph. The terminal device includes, but is not limited to, a smartphone, a tablet computer, a laptop, or a desktop computer, and no limitation is imposed here. For ease of description, a terminal device is used as the example below.
Based on the entity candidate data obtained from at least one data source, the method provided by the embodiments of this application can determine a candidate entity from the entities according to the entity parameter of each entity included in the entity candidate data. If the candidate entity is contained in the designated entity set (for example, a knowledge graph), the entity features of at least one designated entity, including the candidate entity, can be extracted from the designated entity set, and a target entity (for example, a popular entity) can be determined according to those entity features. Using the association relationships between the target entity and the other designated entities in the designated entity set, at least one associated entity of the target entity can then be determined, and a target entity set (for example, a popular entity set) is finally generated. The method can discover popular entities in a timely manner and improve the recall rate and recall efficiency of popular entities, with wide applicability.
The method and related devices provided by the embodiments of this application are described in detail below with reference to Fig. 1 to Fig. 9. The method may include the following data processing stages: obtaining entity candidate data; determining a candidate entity based on the entity parameters in the entity candidate data; determining a target entity based on the entity features of the designated entities extracted from the designated entity set; and determining the associated entities of the target entity based on the association relationships between entities and generating the target entity set. For the implementation of each stage, see the implementation shown in Fig. 1 below.
Referring to Fig. 1, Fig. 1 is a flow diagram of the entity discovery method provided by the embodiments of this application. The method may include the following steps 101 to 104:
101. Obtain entity candidate data from at least one data source, and select a candidate entity from the entities included in the entity candidate data according to the entity parameter of each entity.
In some possible implementations, an entity usually does not exist independently of text; in other words, an entity is usually contained in text. Therefore, to increase the recall rate of popular entities and to ensure that knowledge updates are comprehensive and that the recalled entities are diverse, data existing in web page form, log form, and/or text form can be obtained from multiple data sources as entity candidate data. The data sources include, but are not limited to, one or more of a news channel, a search log, and a social platform. The news channel is the preferred data source: news is timely, authentic, and accurate, which increases the timeliness and validity of the data obtained from the news channel as entity candidate data and gives higher applicability. Referring to Fig. 2, Fig. 2 is a schematic diagram of the data sources provided by the embodiments of this application. The news channel includes an entertainment channel, a technology channel, a military channel, a sports channel, and so on. The search log includes search logs in QQ Browser, search logs in TT Browser, or search logs in any other browser or search engine. The social platform may include microblogs, forums, discussion groups, and so on, which can be determined according to the actual application scenario and is not limited here.
Specifically, by obtaining one or more of the news headlines, news summaries, and news body text in a news channel, the obtained headlines, summaries, and/or body text can be determined as entity candidate data. By obtaining the search records and the returned search results in a search log, the obtained search records and returned results can be determined as entity candidate data. For example, if the search log records the search record "Who is the director of 'Wandering XX'" and the returned result "The director of 'Wandering XX' is Guo X", then both the search record and the returned result can be used as entity candidate data. By obtaining the discussion topics of users on a social platform, the obtained discussion topics can be determined as entity candidate data. A discussion topic here may be one whose discussion count exceeds a preset discussion count threshold, one whose read count exceeds a preset read count threshold, or one of the top few hot topics on a topic ranking list. Because multiple kinds of data from different data sources are obtained as entity candidate data and entities are extracted from that data, the data sources are more diverse and the data content is richer, which increases the recall rate of popular entities and makes the knowledge graph more complete.
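The three ways of picking discussion topics described above (discussion count over a threshold, read count over a threshold, or a top position on a ranking list) can be sketched as one filter. The `Topic` record, the function name, and the choice to rank the list by discussion count are assumptions; the source does not say how the topic list is ordered.

```python
from typing import List, NamedTuple


class Topic(NamedTuple):
    title: str
    discussions: int
    reads: int


def select_topics(
    topics: List[Topic], min_discussions: int, min_reads: int, top_n: int
) -> List[str]:
    """Keep topics that pass either threshold or rank in the top N."""
    # Assumption: the topic ranking list is ordered by discussion count.
    ranked = sorted(topics, key=lambda t: t.discussions, reverse=True)
    top_titles = {t.title for t in ranked[:top_n]}
    return [
        t.title
        for t in topics
        if t.discussions > min_discussions
        or t.reads > min_reads
        or t.title in top_titles
    ]
```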
In some possible implementations, besides entities, the entity candidate data also contains words that carry no entity meaning, such as verbs, adjectives, quantifiers, auxiliary words, and interjections. The entities included in the entity candidate data can therefore be identified with a named entity recognition (Named Entity Recognition, NER) algorithm. The identified entities include person names, place names, organization names, proper nouns, and so on, which can be determined according to the actual application scenario and is not limited here. Optionally, because using the NER algorithm alone may sometimes miss some entities, word segmentation and/or relation extraction techniques can also be used to identify the entities in the entity candidate data, improving the recall rate and accuracy of entities and yielding all the entities included in the entity candidate data. It should be understood that, with the development of the mobile internet and the continuous upgrading of business requirements, the data generated by information flows is growing explosively, so the number of entities extracted from the entity candidate data is also very large. To reduce the subsequent workload of classifying popular entities, all the obtained entities can first be coarsely screened or preliminarily filtered based on the entity parameter. In general, the more often an entity is mentioned, searched, viewed, or updated, the more popular it is and the more likely it is to be a popular entity. The counted entity occurrence count, entity update count, or entity browse count can therefore be used as the entity parameter, and the entity parameter of each entity is compared with a preset entity parameter threshold to choose candidate popular entities, referred to as candidate entities for ease of description. By setting an entity parameter threshold, all entities whose entity parameter is below the threshold can be filtered out, which is simple to operate and not error-prone.
Understandably, all entities extracted based on the NER algorithm and/or the word segmentation technique and/or the relation extraction technique come from the entity candidate data of the respective data sources. Therefore, by counting the substance parameter of each extracted entity and storing it in the entity candidate data, candidate entities can be selected from the entities based on the size relationship between the substance parameter of each entity included in the entity candidate data and an obtained preset substance parameter threshold, where the substance parameter includes but is not limited to any one of entity frequency of occurrence, entity update times, and entity browsing time. Specifically, if the entity candidate data includes one or more entities from a single data source (for convenience of description, an entity from a single data source is called a first entity), the first entities whose substance parameter is greater than or equal to a first preset substance parameter threshold may be determined as candidate entities. Here, the first preset substance parameter threshold may be set based on an empirical value, or determined based on the number of entities appearing in the entity candidate data and the magnitude of each entity's parameter, so that the candidate entities that attract more user attention are filtered out of the large number of entities in the entity candidate data. The single data source includes a single data source of one type. For example, assuming the news channels include an entertainment channel, a science and technology channel, a military channel, and a sports channel, a first entity may be any entity that comes only from the news data of the entertainment channel, or any entity that comes only from the news data of the sports channel.
Optionally, if the entity candidate data includes one or more entities from at least two data sources (for convenience of description, an entity from at least two data sources is called a second entity), the substance parameters of any second entity in the respective data sources may be summed, and the second entities whose summed substance parameter is greater than or equal to a second preset substance parameter threshold may be determined as candidate entities. Understandably, a popular entity usually appears in more than one kind of data source, so summing the substance parameters across data sources better reflects the popularity of the entity. Here, the second preset substance parameter threshold may be set based on actual conditions. In general, because a second entity comes from multiple data sources, its substance parameter combines the substance parameters of the same entity in those sources and is usually larger, so the second preset substance parameter threshold may be set greater than the first preset substance parameter threshold. The at least two data sources include different data sources of the same type or different data sources of different types. For example, assume the news channels include an entertainment channel, a science and technology channel, a military channel, and a sports channel, and the browsers include browser A, browser B, and browser C. If a second entity comes from different data sources of the same type, it may be any entity that appears in the news data of both the entertainment channel and the sports channel. If a second entity comes from different data sources of different types, it may be any entity that appears in both the news data of the entertainment channel and the search log of browser A. It is not difficult to understand that, because the substance parameter includes any one of entity frequency of occurrence, entity update times, and entity browsing time, if the substance parameter includes the entity frequency of occurrence, the first preset entity frequency of occurrence (i.e. the first preset substance parameter threshold) may be smaller than the second preset entity frequency of occurrence (i.e. the second preset substance parameter threshold); if the substance parameter includes the entity update times, the first preset entity update times (i.e. the first preset substance parameter threshold) may be smaller than the second preset entity update times (i.e. the second preset substance parameter threshold).
Optionally, if the entity candidate data includes one or more entities from at least two data sources (for convenience of description, an entity from at least two data sources is called a second entity), the maximum substance parameter among the substance parameters of any second entity in the respective data sources may be obtained. Here, the maximum substance parameter may also be used to measure the popularity of that second entity, so the second entities whose maximum substance parameter is greater than or equal to the first preset substance parameter threshold may be determined as candidate entities.
As an example, assume that the substance parameter of each entity in the entity candidate data includes the entity frequency of occurrence, and that candidate entities are determined based on the entity frequency of occurrence and a first preset entity frequency of occurrence threshold. Then, when candidate entities are selected based on the substance parameters of one or more first entities from a single data source in the entity candidate data, the first entities whose entity frequency of occurrence is greater than or equal to the first preset entity frequency of occurrence threshold may be determined as candidate entities. For example, assume the first preset entity frequency of occurrence threshold is 300, entity 1 comes from the news data of the entertainment channel and occurs 500 times, entity 2 comes from the news data of the sports channel and occurs 203 times, and entity 3 comes from the search log of browser A and occurs 150 times. Entity 1, whose entity frequency of occurrence (500) is greater than the first preset entity frequency of occurrence threshold (300), may therefore be determined as a candidate entity.
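The single-source example above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `select_candidates` and the dictionary layout are assumptions made here, not part of the described method, and only the numbers are taken from the example.

```python
def select_candidates(counts, threshold):
    """Keep entities whose count is greater than or equal to the threshold.

    `counts` maps an entity name to its entity frequency of occurrence in a
    single data source (hypothetical layout chosen for this sketch).
    """
    return [name for name, count in counts.items() if count >= threshold]


# Figures from the example: first preset threshold 300.
occurrences = {"entity 1": 500, "entity 2": 203, "entity 3": 150}
candidates = select_candidates(occurrences, 300)
print(candidates)  # ['entity 1']
```

The comparison uses "greater than or equal to", matching the wording of the embodiment, so an entity that occurs exactly 300 times would also be kept.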
As another example, assume that the substance parameter of each entity in the entity candidate data includes the entity frequency of occurrence, and that candidate entities are determined based on the entity frequency of occurrence and a second preset entity frequency of occurrence threshold. Then, when candidate entities are selected based on the substance parameters of one or more second entities from at least two data sources in the entity candidate data, the entity frequencies of occurrence of any second entity in the respective data sources may be summed, and the second entities whose summed entity frequency of occurrence is greater than or equal to the second preset entity frequency of occurrence threshold may be determined as candidate entities. For example, assume the second preset entity frequency of occurrence threshold is 1000. Entity 4 comes from the news data of the entertainment channel and the sports channel, where it occurs 800 times and 700 times respectively, so the sum of the entity frequencies of occurrence of entity 4 is 1500. Entity 5 comes from the news data of the entertainment channel and the search log of browser A, where it occurs 630 times and 270 times respectively, so the sum of the entity frequencies of occurrence of entity 5 is 900. Entity 4, whose summed entity frequency of occurrence (1500) is greater than or equal to the second preset entity frequency of occurrence threshold (1000), may therefore be determined as a candidate entity.
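The multi-source summation in this example can be sketched as follows. The helper name `select_by_sum` and the nested-dictionary layout are hypothetical choices for illustration; the counts and the threshold of 1000 come from the example.

```python
def select_by_sum(per_source_counts, threshold):
    """Sum each entity's counts over its data sources and keep the entities
    whose total is greater than or equal to the threshold."""
    selected = []
    for entity, by_source in per_source_counts.items():
        if sum(by_source.values()) >= threshold:
            selected.append(entity)
    return selected


# Figures from the example: second preset threshold 1000.
counts = {
    "entity 4": {"entertainment news": 800, "sports news": 700},         # total 1500
    "entity 5": {"entertainment news": 630, "browser A search log": 270},  # total 900
}
print(select_by_sum(counts, 1000))  # ['entity 4']
```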
Optionally, besides having a relatively high entity frequency of occurrence, entity update times, or entity browsing time, a popular entity can also be measured by its entity data source quantity. In general, the more data sources an entity has, the more popular it is, so the substance parameter may also include the entity data source quantity. Specifically, one or more entities from at least two data sources may be determined from the entity candidate data (for convenience of description, an entity from at least two data sources is called a third entity), and the third entities whose entity data source quantity is not less than a preset data source quantity threshold may be determined as candidate entities. The data source quantity threshold may be set based on an empirical value or actual conditions, which is not limited here. As an example, assume that candidate entities are determined based on the entity data source quantity and the preset data source quantity threshold. Then, when candidate entities are selected based on the substance parameters of one or more third entities from at least two data sources in the entity candidate data, the third entities whose entity data source quantity is not less than the preset data source quantity threshold may be determined as candidate entities. For example, assume the preset data source quantity threshold is 3. Entity 6 comes from the news data of the entertainment channel and the sports channel, so the entity data source quantity of entity 6 is 2. Entity 7 comes from the news data of the entertainment channel and the sports channel and also from the search log of browser A, so the entity data source quantity of entity 7 is 3. Entity 7, whose entity data source quantity (3) is greater than or equal to the preset data source quantity threshold (3), may therefore be determined as a candidate entity.
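A minimal sketch of screening by entity data source quantity, using the figures of this example. The function name `select_by_source_count` and the data layout are assumptions for illustration only.

```python
def select_by_source_count(entity_sources, min_sources):
    """Keep entities that appear in at least `min_sources` distinct data sources."""
    return [
        entity
        for entity, sources in entity_sources.items()
        if len(set(sources)) >= min_sources
    ]


# Figures from the example: preset data source quantity threshold 3.
entity_sources = {
    "entity 6": ["entertainment news", "sports news"],
    "entity 7": ["entertainment news", "sports news", "browser A search log"],
}
print(select_by_source_count(entity_sources, 3))  # ['entity 7']
```

Converting each source list to a set means a source that is listed twice is counted once, which is one reasonable reading of "data source quantity".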
Optionally, to increase the recall rate and accuracy of popular entities, multi-stage screening of entities may also be set. For example, two-stage screening may be set, where the condition is: when the sum of the substance parameters of any second entity is less than the second preset substance parameter threshold, the maximum substance parameter among the substance parameters of that second entity is obtained, and if the maximum substance parameter is not less than the first preset substance parameter threshold, the second entity may be determined as a candidate entity. This screening mode can select, as candidate entities, entities that come from multiple data sources but whose summed substance parameter is less than the second preset substance parameter threshold. It both achieves the preliminary filtering of non-popular entities and retains entities that are potentially or possibly popular. As another example, again with two-stage screening, the condition may be: when the sum of the substance parameters of any second entity is less than the second preset substance parameter threshold, the entity data source quantity of that second entity is obtained, and if the entity data source quantity is not less than the preset data source quantity threshold, the second entity may be determined as a candidate entity. This screening mode likewise selects entities that come from multiple data sources but whose summed substance parameter is less than the second preset substance parameter threshold, so that non-popular entities are filtered out while potentially popular entities are retained.
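The two variants of two-stage screening described above can be sketched together. This is an illustrative sketch under assumptions: the function name `two_stage_screen`, its keyword parameters, and the choice to combine both fallbacks in one function are not specified by the embodiment.

```python
def two_stage_screen(by_source, sum_threshold, first_threshold=None, min_sources=None):
    """Return True if a multi-source entity passes two-stage screening.

    Stage 1: the summed parameter reaches `sum_threshold`.
    Stage 2 (only if stage 1 fails): either the largest per-source parameter
    reaches `first_threshold`, or the number of sources reaches `min_sources`.
    """
    if not by_source:
        return False
    values = list(by_source.values())
    if sum(values) >= sum_threshold:
        return True
    if first_threshold is not None and max(values) >= first_threshold:
        return True
    if min_sources is not None and len(by_source) >= min_sources:
        return True
    return False


# An entity with counts 630 and 270 sums to 900, below 1000, but its
# maximum (630) is not less than a first threshold of 300, so it is kept.
print(two_stage_screen({"news": 630, "search log": 270}, 1000, first_threshold=300))  # True
```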
102, If the candidate entity is contained in the designated entities set, extract from the designated entities set the substance feature of at least one designated entity including the candidate entity.
In some possible embodiments, as information is updated ever faster, new entities may appear, where a new entity is an entity that occurs for the first time. In the embodiment of the present application, to keep the knowledge comprehensive, if a candidate entity is a new entity, the new entity may be determined as a target entity (popular entity). Specifically, by matching the candidate entity one by one against each designated entity included in the designated entities set, it can be determined whether the candidate entity is contained in the designated entities set, where the designated entities set includes a knowledge graph, an ERD, and the like, which is not limited here. If no designated entity identical to the candidate entity is matched in the designated entities set, the candidate entity is a new entity, and the target entity set (popular entity set) may then be generated from the candidate entity and the designated entities included in the designated entities set. If a designated entity identical to the candidate entity is matched in the designated entities set, the substance feature of the candidate entity is extracted. Here, the candidate entity is a designated entity that comes from any data source and is contained in the designated entities set.
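The branching described above (new entity versus entity already in the designated entities set) might be sketched as follows. Exact string matching is an assumption made for this sketch; the embodiment only says that entities are matched one by one, and the name `split_candidates` is hypothetical.

```python
def split_candidates(candidates, designated_entities):
    """Split candidates into new entities and entities already in the set.

    New entities are treated directly as target (popular) entities; known
    entities go on to substance feature extraction.
    """
    known_set = set(designated_entities)
    new_entities = [c for c in candidates if c not in known_set]
    known_entities = [c for c in candidates if c in known_set]
    return new_entities, known_entities


new_entities, known_entities = split_candidates(
    ["entity 1", "entity 4", "entity 7"],
    ["entity 4", "entity 7", "entity 9"],
)
print(new_entities)    # ['entity 1']
print(known_entities)  # ['entity 4', 'entity 7']
```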
Referring to Fig. 3, Fig. 3 is a schematic diagram of the substance feature provided by the embodiments of the present application. The substance feature of each designated entity may include at least two of the entity importance distinguishing value, entity data source quantity, entity attribute quantity, entity frequency of occurrence, entity update times, and entity browsing time, and each item of the substance feature may be counted in advance and stored in the designated entities set. Here, the entity importance distinguishing value is an index that measures how important an entity is, which matters especially for entities that share a name but have different meanings, since their importance differs. For example, the entity "Ma Yun" may correspond to the business figure Ma Yun, or to a singer named Ma Yun, so the entity importance distinguishing value can be used to distinguish the importance of the entities. In general, the entity importance distinguishing value may be an integer in the range 0 to 1000, and a higher value indicates a more important entity. The entity data source quantity refers to the number of original websites from which the entity and its related content can be extracted. For example, if the entity "chapter is happy" has corresponding introduction pages on websites such as so-and-so encyclopaedia, certain valve, and certain discussion bars, then the entity data source quantity of the entity "chapter is happy" is the number of these linked websites. Understandably, the entity data source quantity also indirectly reflects the importance or popularity of an entity: in general, the larger the entity data source quantity, the higher the importance or popularity of the entity. The entity attribute quantity refers to the number of other entities related to the entity; in general, the larger the entity attribute quantity, the higher the importance or popularity of the entity. The entity frequency of occurrence refers to the number of occurrences, or the sum of the numbers of occurrences, of the entity in the entity candidate data of one or more data sources. The entity update times refers to the total number of times the entity has been updated; understandably, the more often an entity was updated in the past, the more likely it is to keep being updated in the future. The entity browsing time refers to the number of times the entity has been viewed; understandably, the more often an entity is viewed, the more popular it is. For convenience of description, the substance feature mentioned below includes all 6 characteristic values: entity importance distinguishing value, entity data source quantity, entity attribute quantity, entity frequency of occurrence, entity update times, and entity browsing time.
Optionally, in some possible embodiments, some designated entities in the designated entities set do not come from any data source in the current operation, but are present in the designated entities set and have been marked. Such designated entities (marked designated entities, such as entities of certain particular categories) may therefore also be used as objects of substance feature extraction. The benefit is that the recall rate of entities is increased and the characteristics of the entities themselves are taken into account. Understandably, the reason for marking only some of the designated entities in the designated entities set rather than all of them is that the knowledge of designated entities such as the "term class" or "words class" in the designated entities set is usually not updated, whereas the knowledge of designated entities of particular categories such as "software class", "product class", "figure kind", "movie and television play class", or "novel class" is updated frequently and in many ways. Such designated entities can therefore be marked as important entities by marking or labelling, so that substance features can subsequently be extracted from the designated entities marked in advance.
103, Determine the target entity from the at least one designated entity according to the substance feature of the at least one designated entity, and determine at least one associated entity of the target entity from the designated entities set based on the incidence relation between the target entity and the other designated entities in the designated entities set.
In some possible embodiments, by obtaining the substance feature of the at least one designated entity, the entity importance distinguishing value, entity data source quantity, entity attribute quantity, entity frequency of occurrence, entity update times, entity browsing time, and the like included in the substance feature of each designated entity can be obtained. Understandably, the 6 characteristic values extracted by the above steps reflect the importance or popularity of a designated entity from different aspects, but in practice an entity does not have to be large in all 6 characteristic values to be a popular entity. In other words, some popular entities may be large in a few characteristic values but small in the others. Therefore, to improve the accuracy of the judgment while making the judgment process more operable and the result more reliable, the 6 characteristic values may be input into an entity classification model, and the target entity (i.e. popular entity) included in the at least one designated entity may be output based on the entity classification model. Understandably, judging whether a designated entity is a target entity is a binary classification problem, and building the entity classification model may include data processing stages such as collecting modeling data for the entity classification model, training the entity classification model, and testing the entity classification model. The modeling data of the entity classification model may be the substance features in a knowledge graph or an ERD, and the classification result may be set to 1 or 0. When the entity classification model is trained, feature pairs composed of substance features and classification results may be input into the initial network model of the entity classification model. The initial network model may be a linear model, such as logistic regression or a support vector machine (Support Vector Machine, SVM), or a nonlinear model, such as a tree-based model, a gradient boosted decision tree (Gradient Boosting Decision Tree, GBDT), or a random forest, which can be determined according to the actual application scenario and is not limited here. The initial network model learns from the substance features and classification results in the input feature pairs, so that an entity classification model is built that outputs the corresponding classification result when the substance feature of any designated entity is input. After the entity classification model is built, several groups of substance features with known classification results may be collected as test data for the entity classification model. Each group of test data is input into the built entity classification model, and the classification result output by the model is compared with the actual classification result of the designated entity. If the proportion of test data for which the output classification result is the same as the actual classification result is greater than or equal to a preset precision, the built entity classification model meets the building requirement. Otherwise, the built entity classification model does not meet the building requirement, and training of the entity classification model continues until the requirement is met.
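As a rough illustration of the train-then-test procedure above, the following sketch uses logistic regression, one of the linear models the text names, implemented in plain Python with stochastic gradient descent. Everything concrete here is an assumption for illustration: the two-feature toy data, the learning rate, the epoch count, the preset precision of 0.9, and all function names. A real system would use the 6 characteristic values and a much larger labelled set.

```python
import math


def sigmoid(z):
    # Clamp z so that math.exp cannot overflow.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=500):
    """Fit weights and bias on (substance feature, classification result) pairs."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(model, x):
    """Return 1 (popular) or 0 (not popular) for one feature vector."""
    weights, bias = model
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0


def accuracy(model, samples, labels):
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


# Toy, already-normalized features (two values per entity for brevity).
train_x = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [[0.8, 0.87], [0.2, 0.17]]
test_y = [1, 0]

PRESET_PRECISION = 0.9  # hypothetical acceptance level
model = train_logistic(train_x, train_y)
meets_requirement = accuracy(model, test_x, test_y) >= PRESET_PRECISION
```

If `meets_requirement` were false, the procedure in the text would continue training (for example with more data or more epochs) and test again.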
Optionally, in the field of machine learning, different evaluation indexes (the different characteristic values in the substance feature are such evaluation indexes) often have different dimensions and units, which affects the result of data analysis. For example, the entity data source quantity of a designated entity is generally a dozen or so, while the entity frequency of occurrence is generally in the thousands. To eliminate the influence of dimension between the characteristic values, data standardization is needed to make the data indexes comparable. In other words, the raw data is standardized so that all indexes are on the same order of magnitude, which allows a subsequent comprehensive comparative evaluation. The most typical data standardization method is normalization, after which the data is confined to a certain range (such as [0,1] or [-1,1]). In the embodiment of the present application, the range transformation method or the zero-mean standardization method may be used to normalize, respectively, the characteristic values included in the substance feature of any designated entity, such as the entity importance distinguishing value, entity data source quantity, entity attribute quantity, entity frequency of occurrence, entity update times, and entity browsing time (for example, 6 characteristic values), to obtain the normalized substance feature of each designated entity. By inputting the 6 normalized characteristic values into the entity classification model, the target entity (popular entity) included in the at least one designated entity can be output based on the entity classification model. Here, when the entity classification model is built, the collected modeling data should also be normalized data. The specific modeling process is as described in the preceding paragraph and is not repeated here. This reduces the data processing complexity in building the entity classification model and improves data processing efficiency.
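The two normalization methods mentioned above can be sketched per feature column as follows. The function names and the sample values are assumptions for illustration. Note that the range transformation maps values into [0,1], while zero-mean standardization centers the values at 0 with unit variance and does not bound them to a fixed interval.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Range transformation: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Zero-mean standardization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# One feature column per call, e.g. source quantities and occurrence counts
# of four designated entities (made-up values on very different scales).
source_quantities = [10, 12, 15, 20]
occurrence_counts = [1000, 3000, 2000, 5000]
print(min_max_normalize(source_quantities))
print(min_max_normalize(occurrence_counts))
```

After the range transformation both columns lie in [0,1], so neither dominates the classifier input merely because of its unit.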
Optionally, in some possible embodiments, the number of popular entities (i.e. target entities) discovered is often very limited. Therefore, to improve the recall rate of popular entities and the efficiency of recalling them, the embodiment of the present application also obtains more popular entities by means of relationship diffusion. Here, relationship diffusion is usually first-degree relationship diffusion, where the entities obtained from a given entity by first-degree relationship diffusion are the entities closely or directly related to that entity. For example, if the social circle of a person (specifically, friends) is used to explain the first-degree relationship, the people who have a first-degree relationship with oneself are one's most familiar friends. Extending the relationship to friends of friends through a friend's introduction gives the second-degree relationship, and extending it through friends of friends to their friends gives the third-degree relationship, and so on, which is not limited here.
Referring to Fig. 4, Fig. 4 is a schematic diagram of first-degree relationship diffusion provided by the embodiments of the present application, in which the associated entities of the popular entity obtained by first-degree relationship diffusion are associated entity 1, associated entity 2, and associated entity 3. Specifically, when each entity included in the entity candidate data is identified and extracted based on the NER algorithm as described above, each extracted entity may also be classified. For example, the entity "chapter is happy" is extracted from the news data of the entertainment channel and its entity type is labelled "figure kind", or the entity "wandering XX" is extracted from the news data of the entertainment channel and its entity type is labelled "movie and television play class". The designated entities set includes an associated entity type set corresponding to each entity type. Each entity type corresponds to one associated entity type set, which contains at least one diffusible entity type of that entity type; the diffusible entity types included in an associated entity type set may be preset by the user. Then, the target entity type to which the popular entity belongs is obtained from the designated entities set, and the associated entity type set of the target entity type can be determined from the designated entities set, where the associated entity type set of the target entity type contains at least one diffusible entity type of the target entity type. For example, assume the target entity type is "figure kind" and the associated entity type set of the target entity type is set 1, where the diffusible entity types included in set 1 are "figure kind" and "movie and television play class". By determining the entity type to which each designated entity related to the popular entity (target entity) belongs, the entity type of each such designated entity can be compared with the diffusible entity types in the associated entity type set to determine whether it belongs to the associated entity type set of the target entity type. If the entity type of any designated entity is contained in the associated entity type set of the target entity type, that designated entity is determined as an associated entity of the popular entity.
For example, referring to Fig. 5, Fig. 5 is a schematic diagram of an application scenario of first-degree relationship diffusion provided by the embodiments of the present application. Assume that a popular entity is determined to be "Zhang's continuous heavy rain", whose target entity type is "figure kind". The associated entity type set of the target entity type "figure kind" can then be determined to be set 1, where the diffusible entity types included in set 1 are "figure kind" and "movie and television play class". In the designated entities set, the designated entities that have a first-degree relationship with the popular entity "Zhang's continuous heavy rain" include the wife "Yuan's instrument", the colleague "certain person of outstanding talent of week", the birthday "on August 27th, 1971", the native place "Hong Kong", and the work performed in, "Anti-corruption storm 4". The entity type of "Yuan's instrument" and "certain person of outstanding talent of week" is "figure kind", the entity type of "on August 27th, 1971" is "date", the entity type of "Hong Kong" is "place name", and the entity type of "Anti-corruption storm 4" is "movie and television play class". After the entity types of all designated entities that have a first-degree relationship with the popular entity "Zhang's continuous heavy rain" are determined, the associated entities of the popular entity can be determined, according to whether the entity type of each designated entity belongs to the diffusible entity types "figure kind" and "movie and television play class" included in set 1, to be "Yuan's instrument", "certain person of outstanding talent of week", and "Anti-corruption storm 4".
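The type-filtered first-degree diffusion in this scenario can be sketched as below. The function name `first_degree_associates` and the three dictionaries are hypothetical structures chosen for this sketch; the entity names and types are copied from the example as they appear in the text.

```python
def first_degree_associates(hot_entity, relations, entity_types, diffusible_types):
    """Return first-degree neighbours whose type is diffusible for the hot entity's type."""
    allowed = diffusible_types.get(entity_types[hot_entity], set())
    return [
        neighbour
        for neighbour in relations.get(hot_entity, [])
        if entity_types.get(neighbour) in allowed
    ]


hot = "Zhang's continuous heavy rain"
relations = {
    hot: [
        "Yuan's instrument",
        "certain person of outstanding talent of week",
        "on August 27th, 1971",
        "Hong Kong",
        "Anti-corruption storm 4",
    ]
}
entity_types = {
    hot: "figure kind",
    "Yuan's instrument": "figure kind",
    "certain person of outstanding talent of week": "figure kind",
    "on August 27th, 1971": "date",
    "Hong Kong": "place name",
    "Anti-corruption storm 4": "movie and television play class",
}
# "Set 1" from the example: diffusible types for the type "figure kind".
diffusible_types = {"figure kind": {"figure kind", "movie and television play class"}}

associates = first_degree_associates(hot, relations, entity_types, diffusible_types)
print(associates)
```

The birthday and the native place are dropped because "date" and "place name" are not in set 1, which matches the outcome described in the example.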
Optionally, in some possible embodiments, in addition to using the entities that have a first-degree relationship with the determined popular entity as associated entities, the entities that have a second-degree and/or third-degree relationship with the popular entity may also be determined as associated entities. Referring to Fig. 6, Fig. 6 is a schematic diagram of second-degree relationship diffusion provided by the embodiments of the present application, in which the associated entities of the popular entity obtained by second-degree relationship diffusion are associated entity 4, associated entity 5, associated entity 6, and associated entity 7. Referring to Fig. 7, Fig. 7 is a schematic diagram of third-degree relationship diffusion provided by the embodiments of the present application, in which the associated entities of the popular entity obtained by third-degree relationship diffusion are associated entity 8, associated entity 9, and associated entity 10. Understandably, first-degree entities are the most closely related to the popular entity, second-degree entities are the next most closely related, and third-degree entities are less closely related than second-degree entities. The specific implementation of discovering associated entities based on second-degree and/or third-degree relationship diffusion follows the implementation of first-degree relationship diffusion described above and is not repeated here.
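Diffusion up to a given degree can be viewed as a breadth-first traversal of the relation graph that stops at a maximum depth. The sketch below is one possible way to do this, not the patent's stated implementation; the edge layout is invented for illustration and only the numbering of the associated entities (1 to 3 at the first degree, 4 to 7 at the second, 8 to 10 at the third) follows Figs. 4, 6, and 7. Type filtering as in the first-degree case could be applied to the result.

```python
from collections import deque


def entities_within_degree(graph, start, max_degree):
    """Breadth-first search returning {entity: degree} for degrees 1..max_degree."""
    degree_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if degree_of[node] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbour in graph.get(node, []):
            if neighbour not in degree_of:
                degree_of[neighbour] = degree_of[node] + 1
                queue.append(neighbour)
    del degree_of[start]
    return degree_of


# Hypothetical edges; only the entity numbering follows the figures.
graph = {
    "popular entity": ["associated entity 1", "associated entity 2", "associated entity 3"],
    "associated entity 1": ["associated entity 4", "associated entity 5"],
    "associated entity 2": ["associated entity 6"],
    "associated entity 3": ["associated entity 7"],
    "associated entity 4": ["associated entity 8"],
    "associated entity 5": ["associated entity 9"],
    "associated entity 6": ["associated entity 10"],
}
degrees = entities_within_degree(graph, "popular entity", 3)
second_degree = sorted(e for e, d in degrees.items() if d == 2)
print(second_degree)
```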
104, Generate the target entity set according to the target entity and at least one associated entity of the target entity.
In some possible embodiments, after the popular entities (target entities) and at least one associated entity of each popular entity are determined, the target entity set is generated from all the popular entities together with their associated entities. For convenience of description, all entities included in the target entity set may be called final popular entities. Understandably, the target entity set may also include the uniform resource locator (Uniform Resource Locator, URL) address of at least one data source corresponding to each final popular entity. By extracting the URL address of each data source corresponding to each final popular entity in the target entity set, the entity candidate data corresponding to each final popular entity can be obtained, and the knowledge update of the final popular entity can be completed based on that entity candidate data.
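Assembling the target entity set with its source URLs might look like the sketch below. The function name `build_target_entity_set`, the dictionary layout, and the `example.com` URLs are placeholders invented for illustration; the text does not specify a storage format.

```python
def build_target_entity_set(targets, associates_by_target, source_urls):
    """Merge target entities and their associated entities (without duplicates)
    and attach the data source URLs known for each final popular entity."""
    final_entities = []
    for target in targets:
        for entity in [target, *associates_by_target.get(target, [])]:
            if entity not in final_entities:
                final_entities.append(entity)
    return {entity: source_urls.get(entity, []) for entity in final_entities}


target_set = build_target_entity_set(
    targets=["entity 4", "entity 7"],
    associates_by_target={"entity 4": ["entity 7", "entity 9"]},
    source_urls={
        "entity 4": ["https://example.com/entertainment/entity-4"],  # placeholder URL
        "entity 9": ["https://example.com/sports/entity-9"],         # placeholder URL
    },
)
print(list(target_set))  # ['entity 4', 'entity 7', 'entity 9']
```

An entity with no known URL gets an empty list, so a later knowledge update step can skip it or look it up elsewhere.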
In the embodiments of the present application, each entity included in the entity candidate data, together with the entity parameter corresponding to each entity, can be identified and extracted, based on a named entity recognition algorithm, from the entity candidate data of data sources such as news channels, search logs and/or social platforms. The entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity, the number of views of the entity and the number of data sources of the entity. Candidate entities can be determined from the entities by comparing the entity parameter of each entity with an entity parameter threshold. If a candidate entity is included in the designated entity set, the entity features of at least one designated entity including the candidate entity can be extracted from the designated entity set, where the entity features include the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of occurrences, the number of updates and the number of views. The normalized entity features are input into an entity classification model, which outputs the target entity included in the at least one designated entity. Then, using the association relationships between the target entity and the other designated entities in the designated entity set, at least one associated entity of the target entity can be determined, and the target entity set is finally generated. With the method provided by the embodiments of the present application, entities can be discovered in a timely manner and the recall and accuracy of entity discovery are improved, with high applicability.
Referring to Fig. 8, Fig. 8 is a schematic structural diagram of the entity discovery device provided by an embodiment of the present application. The entity discovery device provided by the embodiment of the present application includes:
a candidate data obtaining module 31, configured to obtain entity candidate data of at least one data source;
a candidate entity determining module 32, configured to select a candidate entity from the entities included in the entity candidate data obtained by the candidate data obtaining module 31, according to the entity parameter of each of those entities;
an entity feature extraction module 33, configured to, if the candidate entity determined by the candidate entity determining module 32 is included in a designated entity set, extract from the designated entity set the entity features of at least one designated entity including the candidate entity;
a target entity determining module 34, configured to determine a target entity from the at least one designated entity according to the entity features of the at least one designated entity extracted by the entity feature extraction module 33, and to determine at least one associated entity of the target entity from the designated entity set based on the association relationships between the target entity and the other designated entities in the designated entity set;
a first entity set generation module 35, configured to generate a target entity set according to the target entity determined by the target entity determining module 34 and the at least one associated entity of the target entity.
In some possible embodiments, the device further includes:
a second entity set generation module 36, configured to, if the candidate entity determined by the candidate entity determining module 32 is not included in the designated entity set, generate a target entity set according to the candidate entity and each designated entity included in the designated entity set.
In some possible embodiments, the data source includes at least one of a news channel, a search log and a social platform; the candidate data obtaining module 31 is specifically configured to:
obtain one or more of the news headlines, news summaries and news body text in the news channel, and determine the obtained data as entity candidate data; and/or
obtain search records in the search log, and determine the obtained search records as entity candidate data; and/or
obtain discussion topics on the social platform, and determine the obtained discussion topics as entity candidate data.
In some possible embodiments, the device further includes:
an entity recognition module 37, configured to identify and extract, based on a named entity recognition algorithm, each entity included in the entity candidate data and the entity parameter of each entity.
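To make the idea of an entity parameter concrete, the sketch below counts how often each entity appears and in how many data sources. It deliberately replaces the named entity recognition algorithm with a plain dictionary lookup, which is a stand-in chosen for this example only; the function name, the `(source, text)` input layout and the sample texts are likewise assumptions.

```python
from collections import Counter, defaultdict

def extract_entity_parameters(documents, known_names):
    """Count occurrences and distinct data sources for each known entity name.

    documents: list of (source, text) pairs (hypothetical layout).
    known_names: names to look for; a stand-in for a real NER model.
    """
    occurrences = Counter()
    sources = defaultdict(set)
    for source, text in documents:
        for name in known_names:
            hits = text.count(name)  # simple substring matching, not real NER
            if hits:
                occurrences[name] += hits
                sources[name].add(source)
    source_counts = {name: len(found) for name, found in sources.items()}
    return occurrences, source_counts

# Placeholder candidate data from three hypothetical sources.
documents = [
    ("news", "Alpha Cup final: Alpha Cup draws record crowd"),
    ("search", "Alpha Cup tickets"),
    ("social", "Beta Phone launch"),
]
occurrences, source_counts = extract_entity_parameters(
    documents, ["Alpha Cup", "Beta Phone"]
)
```

A production system would swap the substring lookup for a trained recognizer; the counting step that follows recognition would stay much the same.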
In some possible embodiments, the entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity and the number of views of the entity; the candidate entity determining module 32 is specifically configured to:
if the entity candidate data includes one or more first entities from a single data source, determine, as candidate entities, those first entities among the one or more first entities whose entity parameter is greater than or equal to a first preset entity parameter threshold;
if the entity candidate data includes one or more second entities from at least two data sources, sum the entity parameters of each second entity across the data sources, and determine, as candidate entities, the second entities whose summed entity parameter is greater than or equal to a second preset entity parameter threshold.
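The two threshold rules above can be sketched as follows. The nested-dictionary input, the function name and the threshold values are assumptions for illustration; the embodiment does not prescribe a data layout or specific thresholds.

```python
def select_candidates(params_by_source, first_threshold, second_threshold):
    """Select candidate entities from per-source entity parameters.

    params_by_source: {entity: {data source: parameter value}} (hypothetical layout).
    Entities seen in a single source are compared with first_threshold;
    entities seen in two or more sources have their values summed and
    compared with second_threshold.
    """
    candidates = set()
    for entity, per_source in params_by_source.items():
        if len(per_source) == 1:
            value = next(iter(per_source.values()))
            if value >= first_threshold:
                candidates.add(entity)
        elif len(per_source) >= 2:
            if sum(per_source.values()) >= second_threshold:
                candidates.add(entity)
    return candidates

# Placeholder occurrence counts.
params_by_source = {
    "A": {"news": 12},
    "B": {"news": 3},
    "C": {"news": 4, "search": 5, "social": 3},
    "D": {"news": 1, "search": 2},
}
candidates = select_candidates(params_by_source, 10, 10)
```

Here "A" passes the single-source rule and "C" passes the summed rule (4 + 5 + 3 = 12), while "B" and "D" fall below their thresholds.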
In some possible embodiments, the entity parameter includes the number of entity data sources; the candidate entity determining module 32 is specifically configured to:
determine, from the entity candidate data, one or more third entities that come from at least two data sources, and determine, as candidate entities, those third entities among the one or more third entities whose number of entity data sources is not less than a preset data source number threshold.
In some possible embodiments, the entity features include at least two of the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of occurrences of the entity, the number of updates of the entity and the number of views of the entity; the target entity determining module 34 includes:
a target entity discovery unit 3401, configured to normalize each entity feature of any designated entity of the at least one designated entity, to obtain the normalized entity features corresponding to each designated entity;
and to input the normalized entity features corresponding to each designated entity into an entity classification model, and output, based on the entity classification model, the target entity included in the at least one designated entity;
wherein the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
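The normalize-then-classify step can be illustrated with min-max normalization followed by a logistic linear scorer. This is a sketch under stated assumptions: the embodiment does not say which normalization or which model is used, and the weights and bias below are hand-picked placeholders rather than trained values.

```python
import math

def min_max_normalize(feature_rows):
    """Scale each feature column to [0, 1] across all designated entities.

    feature_rows: {entity: [feature values in a fixed order]} (hypothetical layout).
    """
    columns = list(zip(*feature_rows.values()))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return {
        entity: [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for entity, row in feature_rows.items()
    }

def classify(normalized, weights, bias, threshold=0.5):
    """Return the entities whose logistic score reaches the threshold."""
    targets = set()
    for entity, row in normalized.items():
        z = bias + sum(w * x for w, x in zip(weights, row))
        score = 1.0 / (1.0 + math.exp(-z))
        if score >= threshold:
            targets.add(entity)
    return targets

# Placeholder features: [number of occurrences, number of data sources].
features = {"A": [100, 5], "B": [10, 1], "C": [55, 3]}
normalized = min_max_normalize(features)
targets = classify(normalized, weights=[2.0, 2.0], bias=-2.5)
```

Normalizing first keeps a feature with a large numeric range, such as the number of occurrences, from dominating features with a small range, such as the number of data sources.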
In some possible embodiments, the target entity determining module 34 includes:
an associated entity discovery unit 3402, configured to obtain the target entity type of the target entity and determine the associated entity type set of the target entity type;
determine, from the designated entities in the designated entity set that have an association relationship with the target entity, one or more designated entities whose entity type is included in the associated entity type set;
and determine the one or more determined designated entities as associated entities of the target entity.
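Filtering related entities by type can be sketched as below. The three lookup dictionaries, the function name and the sample entity and type names are assumptions introduced for the example.

```python
def find_associated(target, related, entity_types, associated_types):
    """Keep the entities related to target whose type is allowed for the target's type.

    related: {entity: [related designated entities]}
    entity_types: {entity: entity type}
    associated_types: {entity type: set of associated entity types}
    All three layouts are hypothetical.
    """
    allowed = associated_types.get(entity_types.get(target), set())
    return {
        entity
        for entity in related.get(target, ())
        if entity_types.get(entity) in allowed
    }

# Placeholder data.
related = {"Movie X": ["Actor Y", "Director Z", "City W"]}
entity_types = {
    "Movie X": "film",
    "Actor Y": "actor",
    "Director Z": "director",
    "City W": "place",
}
associated_types = {"film": {"actor", "director"}}
associated = find_associated("Movie X", related, entity_types, associated_types)
```

The type filter discards related entities such as "City W" whose type is not in the associated entity type set for films, even though they are linked to the target entity.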
In specific implementation, the entity discovery device can execute, through its built-in functional modules, the implementations provided by the steps in Fig. 1 above. For example, the candidate data obtaining module 31 can be used to execute implementations such as obtaining the entity candidate data of each data source. The candidate entity determining module 32 can be used to execute the implementations described in the related steps, such as determining the candidate entity based on the entity parameters in the entity candidate data. The entity feature extraction module 33 can be used to execute implementations such as determining whether the candidate entity belongs to the designated entity set and extracting the entity features of the designated entities. The target entity determining module 34 can be used to execute implementations such as determining the target entity based on the entity features and determining the associated entities of the target entity based on the association relationships between entities. The first entity set generation module 35 can be used to execute implementations such as generating the target entity set according to the target entity and the associated entities of the target entity. The second entity set generation module 36 can be used to execute implementations such as generating the target entity set based on the candidate entity and each designated entity in the designated entity set. The entity recognition module 37 can be used to execute implementations such as extracting each entity included in the entity candidate data and determining the entity parameter of each entity. For details of each module, refer to the implementations provided by the steps above; they are not repeated here.
In the embodiments of the present application, the entity discovery device can, based on a named entity recognition algorithm, identify and extract from the entity candidate data of data sources such as news channels, search logs and/or social platforms each entity included in the entity candidate data and the entity parameter corresponding to each entity. The entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity, the number of views of the entity and the number of data sources of the entity. Candidate entities can be determined from the entities by comparing the entity parameter of each entity with an entity parameter threshold. If a candidate entity is included in the designated entity set, the entity features of at least one designated entity including the candidate entity can be extracted from the designated entity set, where the entity features include the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of occurrences, the number of updates and the number of views. The normalized entity features are input into an entity classification model, which outputs the target entity included in the at least one designated entity. Then, using the association relationships between the target entity and the other designated entities in the designated entity set, at least one associated entity of the target entity can be determined, and the target entity set is finally generated. With the method provided by the embodiments of the present application, entities can be discovered in a timely manner and the recall and accuracy of entity discovery are improved, with high flexibility and a wide range of application.
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of the terminal device provided by an embodiment of the present application. As shown in Fig. 9, the terminal device in this embodiment may include one or more processors 401 and a memory 402. The processor 401 and the memory 402 are connected by a bus 403. The memory 402 is configured to store a computer program that includes program instructions, and the processor 401 is configured to execute the program instructions stored in the memory 402 to perform the following operations:
obtain entity candidate data of at least one data source;
select a candidate entity from the entities included in the entity candidate data according to the entity parameter of each of those entities;
if the candidate entity is included in a designated entity set, extract from the designated entity set the entity features of at least one designated entity including the candidate entity;
determine a target entity from the at least one designated entity according to the entity features of the at least one designated entity, and determine at least one associated entity of the target entity from the designated entity set based on the association relationships between the target entity and the other designated entities in the designated entity set;
generate a target entity set according to the target entity and the at least one associated entity of the target entity.
In some possible embodiments, the processor 401 is configured to:
if the candidate entity is not included in the designated entity set, generate a target entity set according to the candidate entity and each designated entity included in the designated entity set.
In some possible embodiments, the data source includes at least one of a news channel, a search log and a social platform; the processor 401 is configured to:
obtain one or more of the news headlines, news summaries and news body text in the news channel, and determine the obtained data as entity candidate data; and/or
obtain search records in the search log, and determine the obtained search records as entity candidate data; and/or
obtain discussion topics on the social platform, and determine the obtained discussion topics as entity candidate data.
In some possible embodiments, the processor 401 is configured to:
identify and extract, based on a named entity recognition algorithm, each entity included in the entity candidate data;
determine the entity parameter corresponding to each entity from the entity candidate data.
In some possible embodiments, the entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity and the number of views of the entity; the processor 401 is configured to:
if the entity candidate data includes one or more first entities from a single data source, determine, as candidate entities, those first entities among the one or more first entities whose entity parameter is greater than or equal to a first preset entity parameter threshold;
if the entity candidate data includes one or more second entities from at least two data sources, sum the entity parameters of each second entity across the data sources, and determine, as candidate entities, the second entities whose summed entity parameter is greater than or equal to a second preset entity parameter threshold.
In some possible embodiments, the entity parameter includes the number of entity data sources; the processor 401 is configured to:
determine, from the entity candidate data, one or more third entities that come from at least two data sources, and determine, as candidate entities, those third entities among the one or more third entities whose number of entity data sources is not less than a preset data source number threshold.
In some possible embodiments, the entity features include at least two of the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of occurrences of the entity, the number of updates of the entity and the number of views of the entity; the processor 401 is configured to:
normalize each entity feature of any designated entity of the at least one designated entity, to obtain the normalized entity features corresponding to each designated entity;
input the normalized entity features corresponding to each designated entity into an entity classification model, and output, based on the entity classification model, the target entity included in the at least one designated entity;
wherein the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
In some possible embodiments, the processor 401 is configured to:
obtain the target entity type of the target entity and determine the associated entity type set of the target entity type;
determine, from the designated entities in the designated entity set that have an association relationship with the target entity, one or more designated entities whose entity type is included in the associated entity type set;
determine the one or more determined designated entities as associated entities of the target entity.
It should be understood that, in some possible embodiments, the processor 401 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The memory 402 may include read-only memory and random access memory, and provides instructions and data to the processor 401. A part of the memory 402 may also include non-volatile random access memory. For example, the memory 402 may also store information on the device type.
In specific implementation, the terminal device can execute, through its built-in functional modules, the implementations provided by the steps in Fig. 1 above. For details, refer to the implementations provided by the steps above; they are not repeated here.
In the embodiments of the present application, the terminal device can, based on a named entity recognition algorithm, identify and extract from the entity candidate data of data sources such as news channels, search logs and/or social platforms each entity included in the entity candidate data and the entity parameter corresponding to each entity. The entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity, the number of views of the entity and the number of data sources of the entity. Candidate entities can be determined from the entities by comparing the entity parameter of each entity with an entity parameter threshold. If a candidate entity is included in the designated entity set, the entity features of at least one designated entity including the candidate entity can be extracted from the designated entity set, where the entity features include the entity importance distinguishing value, the number of entity data sources, the number of entity attributes, the number of occurrences, the number of updates and the number of views. The normalized entity features are input into an entity classification model, which outputs the target entity included in the at least one designated entity. Then, using the association relationships between the target entity and the other designated entities in the designated entity set, at least one associated entity of the target entity can be determined, and the target entity set is finally generated. With the method provided by the embodiments of the present application, entities can be discovered in a timely manner and the recall and accuracy of entity discovery are improved, with high flexibility and a wide range of application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, implement the entity discovery method provided by the steps in Fig. 1. For details, refer to the implementations provided by the steps above; they are not repeated here.
The computer-readable storage medium may be an internal storage unit of the entity discovery device provided by any of the foregoing embodiments or of the terminal device, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit of the electronic device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. It may also be used to temporarily store data that has been output or is to be output.
The terms "first", "second", "third", "fourth" and the like in the claims, specification and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments. The term "and/or" used in the specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
The method and related apparatus provided by the embodiments of the present application are described with reference to the method flowcharts and/or structural diagrams provided by the embodiments of the present application. Specifically, each flow and/or block of the method flowcharts and/or structural diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the structural diagram. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the structural diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the structural diagram.

Claims (11)

1. An entity discovery method, characterized in that the method comprises:
obtaining entity candidate data of at least one data source;
selecting a candidate entity from the entities included in the entity candidate data according to the entity parameter of each of those entities;
if the candidate entity is included in a designated entity set, extracting from the designated entity set the entity features of at least one designated entity including the candidate entity;
determining a target entity from the at least one designated entity according to the entity features of the at least one designated entity, and determining at least one associated entity of the target entity from the designated entity set based on the association relationships between the target entity and the other designated entities in the designated entity set;
generating a target entity set according to the target entity and the at least one associated entity of the target entity.
2. The method according to claim 1, characterized in that the method further comprises:
if the candidate entity is not included in the designated entity set, generating a target entity set according to the candidate entity and each designated entity included in the designated entity set.
3. The method according to claim 1 or 2, characterized in that the data source includes at least one of a news channel, a search log and a social platform; the obtaining entity candidate data of at least one data source comprises:
obtaining one or more of the news headlines, news summaries and news body text in the news channel, and determining the obtained data as entity candidate data; and/or
obtaining search records in the search log, and determining the obtained search records as entity candidate data; and/or
obtaining discussion topics on the social platform, and determining the obtained discussion topics as entity candidate data.
4. The method according to claim 3, characterized in that the method further comprises:
identifying and extracting, based on a named entity recognition algorithm, each entity included in the entity candidate data;
determining the entity parameter corresponding to each entity from the entity candidate data.
5. The method according to claim 4, characterized in that the entity parameter includes any one of the number of occurrences of the entity, the number of updates of the entity and the number of views of the entity; the selecting a candidate entity from the entities included in the entity candidate data according to the entity parameter of each of those entities comprises:
if the entity candidate data includes one or more first entities from a single data source, determining, as candidate entities, those first entities among the one or more first entities whose entity parameter is greater than or equal to a first preset entity parameter threshold;
if the entity candidate data includes one or more second entities from at least two data sources, summing the entity parameters of each second entity across the data sources, and determining, as candidate entities, the second entities whose summed entity parameter is greater than or equal to a second preset entity parameter threshold.
6. The method according to claim 4, characterized in that the entity parameter includes the number of entity data sources; the selecting a candidate entity from the entities included in the entity candidate data according to the entity parameter of each of those entities comprises:
determining, from the entity candidate data, one or more third entities that come from at least two data sources, and determining, as candidate entities, those third entities among the one or more third entities whose number of entity data sources is not less than a preset data source number threshold.
7. The method according to any one of claims 1-6, characterized in that the entity feature includes at least two of an entity importance discrimination value, an entity data source quantity, an entity attribute quantity, an entity occurrence count, an entity update count, and an entity browse count; and determining the target entity from the at least one designated entity according to the entity features of the at least one designated entity comprises:
Normalizing each entity feature of each designated entity in the at least one designated entity, to obtain the normalized entity features corresponding to each designated entity;
Inputting the normalized entity features corresponding to each designated entity into an entity classification model, and outputting, based on the entity classification model, the target entity included in the at least one designated entity;
Wherein the entity classification model is obtained by training a linear model and/or a nonlinear model, and is capable of identifying entities whose popularity is greater than or equal to a preset popularity threshold.
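The normalize-then-classify step can be sketched as follows. This is only an assumed stand-in: min-max scaling is one possible normalization, and a fixed-weight linear score replaces the trained entity classification model, whose form and training the claim does not specify. The names `min_max_normalize`, `pick_targets`, `weights`, and `popularity_threshold` are hypothetical.

```python
def min_max_normalize(feature_rows):
    """Scale every feature to [0, 1] across all designated entities."""
    if not feature_rows:
        return {}
    feature_names = list(next(iter(feature_rows.values())))
    normalized = {entity: {} for entity in feature_rows}
    for name in feature_names:
        column = [row[name] for row in feature_rows.values()]
        low, high = min(column), max(column)
        span = high - low
        for entity, row in feature_rows.items():
            # A constant column carries no information, so map it to 0.0.
            normalized[entity][name] = (row[name] - low) / span if span else 0.0
    return normalized


def pick_targets(feature_rows, weights, popularity_threshold):
    """Score normalized features linearly and keep entities above a threshold.

    The fixed weights stand in for a trained model (an assumption).
    """
    targets = []
    for entity, row in min_max_normalize(feature_rows).items():
        score = sum(weights[name] * value for name, value in row.items())
        if score >= popularity_threshold:
            targets.append(entity)
    return targets
```

Normalizing first keeps a large-valued feature, such as an occurrence count, from dominating a small-valued one, such as a data source quantity, when the features are combined.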
8. The method according to any one of claims 1-7, characterized in that determining at least one associated entity of the target entity from the designated entity set, based on the association relationship between the target entity and other designated entities in the designated entity set, comprises:
Obtaining the target entity type of the target entity, and determining the associated entity type set of the target entity type;
Determining, from the designated entities in the designated entity set that are related to the target entity, one or more designated entities whose entity type is contained in the associated entity type set;
Determining the one or more designated entities so determined as the associated entities of the target entity.
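The type-based filtering described above might look like the following sketch. The dictionaries `entity_types`, `relations`, and `associated_types`, and the example types used with them, are illustrative assumptions; the patent does not prescribe how the designated entity set or its relations are stored.

```python
def find_associated_entities(target, entity_types, relations, associated_types):
    """Return entities related to target whose type is an associated type.

    entity_types:     entity name -> entity type
    relations:        entity name -> names of related designated entities
    associated_types: entity type -> set of associated entity types
    All three structures are hypothetical.
    """
    # Look up the target entity type and its associated entity type set.
    allowed_types = associated_types.get(entity_types[target], set())
    # Keep only related entities whose type falls in that set.
    return [
        entity
        for entity in relations.get(target, [])
        if entity_types.get(entity) in allowed_types
    ]
```

For example, if the target is a film and the associated type set for films is assumed to contain "actor" and "director", a related cinema entity would be filtered out while the film's related actors and directors would be kept.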
9. An entity discovery device, characterized in that the device comprises:
A candidate data acquisition module, configured to obtain entity candidate data from at least one data source;
A candidate entity determining module, configured to select candidate entities from the entities according to the entity parameter of each entity included in the entity candidate data obtained by the candidate data acquisition module;
An entity feature extraction module, configured to, if a candidate entity determined by the candidate entity determining module is contained in a designated entity set, extract from the designated entity set the entity features of at least one designated entity that includes the candidate entity;
A target entity determining module, configured to determine a target entity from the at least one designated entity according to the entity features of the at least one designated entity extracted by the entity feature extraction module, and to determine at least one associated entity of the target entity from the designated entity set based on the association relationship between the target entity and other designated entities in the designated entity set;
A first entity set generation module, configured to generate a target entity set according to the target entity determined by the target entity determining module and the at least one associated entity of the target entity.
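One way to picture how the modules of the device hand data to one another is the sketch below. It is an assumed arrangement, not the patented device: the class name `EntityDiscoveryDevice`, the callable fields, and their signatures are hypothetical, and each module is reduced to a plain function supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class EntityDiscoveryDevice:
    """Chains the claimed modules; every field is a caller-supplied stand-in."""

    acquire_candidate_data: Callable[[], List[tuple]]
    determine_candidates: Callable[[List[tuple]], List[str]]
    extract_features: Callable[[List[str]], Dict[str, dict]]
    determine_targets: Callable[[Dict[str, dict]], Dict[str, List[str]]]

    def run(self) -> Set[str]:
        data = self.acquire_candidate_data()          # candidate data acquisition
        candidates = self.determine_candidates(data)  # candidate entity determination
        features = self.extract_features(candidates)  # entity feature extraction
        targets = self.determine_targets(features)    # targets mapped to associated entities
        # Entity set generation: merge each target with its associated entities.
        result: Set[str] = set()
        for target, associated in targets.items():
            result.add(target)
            result.update(associated)
        return result
```

Passing the modules in as callables keeps the sketch independent of any particular data source, feature store, or classification model.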
10. A terminal device, characterized in that it comprises a processor and a memory, the processor and the memory being connected to each other;
The memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to perform the method according to claim 1.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
CN201910242996.XA 2019-03-28 2019-03-28 Entity discovery method and device Active CN110008352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910242996.XA CN110008352B (en) 2019-03-28 2019-03-28 Entity discovery method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910242996.XA CN110008352B (en) 2019-03-28 2019-03-28 Entity discovery method and device

Publications (2)

Publication Number Publication Date
CN110008352A true CN110008352A (en) 2019-07-12
CN110008352B CN110008352B (en) 2022-12-20

Family

ID=67168611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910242996.XA Active CN110008352B (en) 2019-03-28 2019-03-28 Entity discovery method and device

Country Status (1)

Country Link
CN (1) CN110008352B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625837A (en) * 2020-05-22 2020-09-04 北京金山云网络技术有限公司 Method and device for identifying system vulnerability and server
CN112633000A (en) * 2020-12-25 2021-04-09 北京明略软件系统有限公司 Method and device for associating entities in text, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140207763A1 (en) * 2013-01-18 2014-07-24 Microsoft Corporation Ranking relevant attributes of entity in structured knowledge base
US20160292281A1 (en) * 2015-04-01 2016-10-06 Microsoft Technology Licensing, Llc Obtaining content based upon aspect of entity
US20170085456A1 (en) * 2015-09-22 2017-03-23 Ca, Inc. Key network entity detection
CN107992478A (en) * 2017-11-30 2018-05-04 百度在线网络技术(北京)有限公司 The method and apparatus for determining focus incident
CN108038183A (en) * 2017-12-08 2018-05-15 北京百度网讯科技有限公司 Architectural entities recording method, device, server and storage medium
CN108509479A (en) * 2017-12-13 2018-09-07 深圳市腾讯计算机系统有限公司 Entity recommends method and device, terminal and readable storage medium storing program for executing
CN108536702A (en) * 2017-03-02 2018-09-14 腾讯科技(深圳)有限公司 A kind of related entities determine method, apparatus and computing device
CN109189938A (en) * 2018-08-31 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for updating knowledge mapping

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOHANNES HOFFART等: "Discovering emerging entities with ambiguous names", 《PROCEEDINGS OF THE 23RD INTERNATIONAL WORLD WIDE WEB CONFERENCE》 *
黄鹏程: "面向自然语言查询的知识搜索关键技术研究", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》 *

Also Published As

Publication number Publication date
CN110008352B (en) 2022-12-20

Similar Documents

Publication Publication Date Title
CN104573054B (en) A kind of information-pushing method and equipment
WO2019192261A1 (en) Payment mode recommendation method and device and equipment
CN109063221A (en) Query intention recognition methods and device based on mixed strategy
CN104899273B (en) A kind of Web Personalization method based on topic and relative entropy
CN109614476A (en) Customer service system answering method, device, computer equipment and storage medium
TWI652584B (en) Method and device for matching text information and pushing business objects
CN110532479A (en) A kind of information recommendation method, device and equipment
CN110162695A (en) A kind of method and apparatus of information push
CN110020104A (en) News handles method, apparatus, storage medium and computer equipment
CN103729359A (en) Method and system for recommending search terms
CN109710841A (en) Comment on recommended method and device
CN107679082A (en) Question and answer searching method, device and electronic equipment
WO2020155877A1 (en) Information recommendation
CN109190007A (en) Data analysing method and device
CN105843796A (en) Microblog emotional tendency analysis method and device
CN106844341A (en) News in brief extracting method and device based on artificial intelligence
CN109325146A (en) A kind of video recommendation method, device, storage medium and server
CN106919575A (en) application program searching method and device
TW201923629A (en) Data processing method and apparatus
CN110619050A (en) Intention recognition method and equipment
CN110727857A (en) Method and device for identifying key features of potential users aiming at business objects
CN107832444A (en) Event based on search daily record finds method and device
CN105869058B (en) A kind of method that multilayer latent variable model user portrait extracts
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
CN108694183A (en) A kind of search method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant