CN110222156B - Method and device for discovering entity, electronic equipment and computer readable medium - Google Patents

Method and device for discovering entity, electronic equipment and computer readable medium

Info

Publication number
CN110222156B
Authority
CN
China
Prior art keywords
entity
selection
matching
retrieval
retrieval result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910516155.3A
Other languages
Chinese (zh)
Other versions
CN110222156A (en)
Inventor
林泽南
卢佳俊
李然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910516155.3A
Publication of CN110222156A
Application granted
Publication of CN110222156B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method of discovering entities, the method comprising: acquiring retrieval content and a high-selection retrieval result corresponding to the retrieval content, wherein the selection probability of the high-selection retrieval result is greater than a first threshold value; searching a preset first database for a matching entity that matches the high-selection retrieval result; and if the matching entity is not found, establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result. The disclosure also provides an apparatus, an electronic device and a computer readable medium for discovering entities.

Description

Method and device for discovering entity, electronic equipment and computer readable medium
Technical Field
The disclosed embodiments relate to the field of database technologies, and in particular, to a method and apparatus for discovering an entity, an electronic device, and a computer-readable medium.
Background
With the development of society, new entities (including new words or new meaning items of existing words) continuously appear. To keep knowledge graphs, knowledge encyclopedias and the like complete, these new entities need to be continuously discovered and included in the corresponding databases.
At present, whether an entity is new is mainly judged manually, but this approach is difficult to systematize, can hardly cover all new entities, and is inefficient, costly, and prone to human error.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for discovering an entity, an electronic device and a computer readable medium.
In a first aspect, an embodiment of the present disclosure provides a method for discovering an entity, including:
acquiring retrieval content and a high-selection retrieval result corresponding to the retrieval content, wherein the selection probability of the high-selection retrieval result is greater than a first threshold value;
searching a preset first database for a matching entity that matches the high-selection retrieval result;
and if the matching entity is not found, establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result.
In some embodiments, the high-choice search result is a transition-type high-choice search result;
in a first time period, the selection probability of a first retrieval result corresponding to the retrieval content is a first probability, and the selection probability of the transition type high selection retrieval result is a second probability;
in a second time period after the first time period, the selection probability of the first retrieval result is smaller than the first probability and its difference from the first probability is larger than a first threshold value; the selection probability of the transition type high selection retrieval result is larger than the second probability and its difference from the second probability is larger than a second threshold value; and the selection probability of the transition type high selection retrieval result is larger than the first threshold value.
In some embodiments, the first search result is an entity card.
In some embodiments, the searching for a matching entity matching the high-choice search result in the preset first database includes:
respectively calculating the matching degrees of at least part of entities in the first database and the high-selection retrieval result;
and taking, as the matching entity, the entity with the largest matching degree among the matching degrees greater than a third threshold value.
In some embodiments, before the separately calculating the matching degrees of at least some of the entities in the first database with the high-choice search result, the method further includes: screening out entities which are possibly matched with the high selection retrieval result from the first database as candidate entities through character matching;
the respectively calculating the matching degrees of at least part of the entities in the first database and the high-selection search result comprises: and respectively calculating the matching degree of each candidate entity and the high selection retrieval result.
In some embodiments, said separately calculating a degree of matching of at least some of the entities in said first database with said high-choice search result comprises:
respectively calculating the matching degrees of at least part of entities in the first database and the high-selection retrieval result by adopting a semantic matching neural network, wherein the semantic matching neural network comprises:
a first input terminal for inputting the high selection retrieval result;
a second input end for inputting information corresponding to the entity in the first database;
and the output end is used for outputting the matching degree of the entity and the high selection retrieval result.
In some embodiments, if the matching entity is found, taking an entity card corresponding to the matching entity as a recommended search result corresponding to the search content;
and if the matching entity is not found, after a new entity is established according to the retrieval content and the high-selection retrieval result corresponding to the retrieval content, taking an entity card corresponding to the new entity as a recommended retrieval result corresponding to the retrieval content.
In a second aspect, an embodiment of the present disclosure provides an apparatus for discovering an entity, including:
the acquisition module is used for acquiring retrieval content and a high selection retrieval result corresponding to the retrieval content, wherein the selection probability of the high selection retrieval result is greater than a first threshold;
the matching module is used for searching a matching entity matched with the high-selection retrieval result in a preset first database;
and the establishing module is used for establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result when the matching entity is not found.
In some embodiments, the high-choice search result is a transition-type high-choice search result;
in a first time period, the selection probability of a first retrieval result corresponding to the retrieval content is a first probability, and the selection probability of the transition type high selection retrieval result is a second probability;
in a second time period after the first time period, the selection probability of the first retrieval result is smaller than the first probability and its difference from the first probability is larger than a first threshold value; the selection probability of the transition type high selection retrieval result is larger than the second probability and its difference from the second probability is larger than a second threshold value; and the selection probability of the transition type high selection retrieval result is larger than the first threshold value.
In some embodiments, the first search result is an entity card.
In some embodiments, the matching module comprises:
the matching degree calculation unit is used for calculating the matching degree of at least part of entities in the first database and the high-selection retrieval result respectively;
a matching unit, configured to use the entity corresponding to the largest matching degree of the matching degrees greater than the third threshold as the matching entity.
In some embodiments, the matching module further comprises:
the candidate entity screening unit is used for screening out entities which are possibly matched with the high-selection retrieval result from the first database through character matching to serve as candidate entities; and
the matching degree calculation unit is used for calculating the matching degree of each candidate entity and the high selection retrieval result respectively.
In some embodiments, the matching degree calculation unit is configured to calculate the matching degree between at least some entities in the first database and the high-choice search result by using a semantic matching neural network, wherein the semantic matching neural network comprises:
a first input terminal for inputting the high selection retrieval result;
a second input end for inputting information corresponding to the entity in the first database;
and the output end is used for outputting the matching degree of the entity and the high selection retrieval result.
In some embodiments, the apparatus further comprises:
and the recommending module is used for taking the entity card corresponding to the matching entity as a recommended retrieval result corresponding to the retrieval content when the matching entity is found, and taking the entity card corresponding to the new entity as the recommended retrieval result corresponding to the retrieval content when the matching entity is not found.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
one or more processors;
a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement any of the above methods of discovering entities.
In a fourth aspect, the embodiments of the present disclosure provide a computer readable medium, on which a computer program is stored, the program, when executed by a processor, implementing any one of the above methods for discovering entities.
The method for discovering entities of the embodiments of the present disclosure discovers newly appearing entities from the searches and selections that a large number of users perform on their own (i.e., retrieval posterior information). The method can therefore run automatically without manual work, giving high efficiency, low cost, and high accuracy. Moreover, new entities appear over time and are usually reflected in retrieval records after they appear, so by monitoring a large number of retrieval records the method can track new entities comprehensively and in real time, achieving active, rapid, high-coverage discovery of new entities.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
fig. 1 is a flowchart of a method for discovering an entity according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a part of steps in another method for discovering an entity according to an embodiment of the present disclosure;
fig. 3 is a flowchart of some steps in another method for discovering entities according to an embodiment of the present disclosure;
fig. 4 is a block diagram illustrating an apparatus for discovering an entity according to an embodiment of the present disclosure;
fig. 5 is a block diagram of another apparatus for discovering an entity according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those skilled in the art, the method and apparatus for discovering entities, the electronic device, and the computer readable medium provided in the present disclosure are described in detail below with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but may be embodied in different forms and should not be construed as limited to the embodiments set forth in the disclosure. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As used in this disclosure, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
When the terms "comprises" and/or "comprising" are used in this disclosure, they specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Embodiments of the present disclosure may be described with reference to plan and/or cross-sectional views in light of idealized schematic illustrations of the present disclosure. Accordingly, the example illustrations can be modified in accordance with manufacturing techniques and/or tolerances.
Embodiments of the present disclosure are not limited to the embodiments shown in the drawings, but include modifications of configurations formed based on a manufacturing process. Thus, the regions illustrated in the figures have schematic properties, and the shapes of the regions shown in the figures illustrate specific shapes of regions of elements, but are not intended to be limiting.
Unless otherwise defined, all terms (including technical and scientific terms) used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Description of technical terms
In the embodiments of the present disclosure, unless otherwise specified, the following technical terms should be understood in accordance with the following explanations:
an entity, also referred to as "knowledge" or "concept," refers to the physical or abstract definition of a person, object, substance, structure, product, building, art, place, country, organization, event, technology, theorem, theory, or the like that exists or has existed at a time. Where the attributes of an entity include a name, it is understood that different entities may have the same name (a word may also be understood to have multiple "meanings"), and conversely, an entity may have multiple names (e.g., aliases); thus, in a database, entities should be distinguished by ID (number) rather than name.
A knowledge graph, which is a database representing relationships between different entities and attributes of the entities. In the knowledge graph, entities are taken as nodes; the entities are connected with each other through edges, and the entities are connected with the values (attribute-value) of the attributes corresponding to the entities through edges, so that the structured and network-shaped database is formed.
The database is a data set formed by one or more pieces of data organized in a certain form, such as a knowledge graph, a knowledge encyclopedia and the like.
Searching, which refers to a process of finding (e.g., searching by using a specific search engine) data related to the searched content in a certain range and showing the data in the form of a search result; the search may be a general search performed on all public web page data, or may be a specific search for data within a specific range (e.g., data of a specific website, or data of a specific type).
Search content, which refers to content on which the search is based, so that the search results should all be related to the search content; specifically, the search content may be only one keyword (search term), such as "newton"; alternatively, the search content may be a sentence or the like, such as "Newton is the unit of which physical quantity?". Of course, when the search content is a sentence, one or more core keywords can usually be extracted from it by semantic analysis techniques or the like.
The search results, which refer to the results found and shown by the search, are usually contents corresponding to a link, such as a web page, an article, and so on.
Selection probability: after a plurality of retrieval results are obtained through retrieval, different users may be interested in different retrieval results and therefore operate on different ones, for example by clicking to open the link corresponding to a retrieval result, or by hovering the cursor over a retrieval result to obtain a preview of its content; thus, across a plurality of searches for one search content, different search results are selected (e.g., "clicked") different numbers of times, and the ratio of the number of times a search result is selected to the total number of those searches is the "selection probability" of the search result, or its "click rate".
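The selection probability described above can be sketched as a simple count over search records. This is a minimal illustration only: the record format (one selected result, or `None`, per search) and the function name `selection_probabilities` are assumptions, and the total number of searches is taken as the denominator.

```python
from collections import Counter

def selection_probabilities(search_records):
    """Return {result: selection probability} for one search content.

    search_records holds one entry per search: the result the user selected
    in that search, or None if nothing was selected (assumed record format).
    """
    total_searches = len(search_records)
    if total_searches == 0:
        return {}
    # Count how many times each result was selected across all searches.
    counts = Counter(r for r in search_records if r is not None)
    # Assumption: the denominator is the total number of searches.
    return {result: n / total_searches for result, n in counts.items()}

# Five searches for the same search content (hypothetical data).
records = ["page_a", "page_b", "page_a", None, "page_a"]
probs = selection_probabilities(records)
```

Here `page_a` was selected in three of five searches, so its selection probability is 3/5.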
An entity card refers to a web page that specifically introduces basic related knowledge of an entity, such as a web page related to each entity in a knowledge encyclopedia website.
Matching means that the degree of correlation (degree of matching) between two pieces of information reaches a predetermined value, so that the two pieces of information can be regarded as being substantially directed to one content (entity).
The semantic matching neural network is a self-learning neural network used for judging the correlation degree (matching degree) of two pieces of information at a semantic level; specifically, the semantic matching neural network may take the form of a Siamese network (twin network), a triplet network, or the like.
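The two-input, one-output shape of such a matcher can be sketched as follows. This is not a trained neural network: the shared encoder here is a plain character-bigram counter, chosen only as a stand-in to show the Siamese-style structure in which both inputs pass through the same encoder and a comparison step outputs the matching degree. The function names are hypothetical.

```python
import math
from collections import Counter

def encode(text):
    """Shared encoder applied to both inputs: character-bigram counts.

    Stand-in for the learned encoder of a Siamese network (assumption).
    """
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def matching_degree(result_text, entity_text):
    """Cosine similarity of the two encoded inputs, between 0 and 1."""
    a, b = encode(result_text), encode(entity_text)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In a real system the encoder would be learned from labeled pairs; only the structure (first input, second input, one matching-degree output) carries over from this sketch.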
Recommended retrieval result: when a certain search content is searched, the plurality of obtained search results can be sorted according to selection probability, relevance, name, source, time and the like, so as to determine whether each search result is shown and in what order; a "recommended search result" is a search result that is set to always be given preferentially (e.g., always shown as the first search result).
Fig. 1 is a flowchart of a method for discovering an entity according to an embodiment of the present disclosure.
In a first aspect, referring to fig. 1, an embodiment of the present disclosure provides a method for discovering an entity, including:
S101, acquiring retrieval content and a high-selection retrieval result corresponding to the retrieval content, wherein the selection probability of the high-selection retrieval result is greater than a first threshold value.
By analyzing the search records of the search engine, a high-selection search result (or called "high-click result") with a high selection probability is determined from a plurality of search results of a search content, so as to establish the following binary group:
{ search content (query), high-choice search result }.
Since the retrieved content is the basis for retrieval, it usually represents an entity (or knowledge) or entities. Since the retrieved content represents the entity, the retrieved result obtained according to the retrieved content should generally be the content related to the entity represented by the retrieved content; and the high-choice search results are of interest to most people, and are generally the most reliable and hottest contents about the entity represented by the search contents.
Of course, if the search content is a word (keyword), the entity represented by the search content is usually an entity named by the keyword (including alias); if the search content is a sentence, the core keyword can be extracted from the sentence by a semantic analysis technique, and the entity corresponding to the search content, which usually takes the core keyword as a name (including an alias), can be searched.
It is clear that an entity represents a certain concept and has a unique ID (number) in the database, and that an entity should usually include at least one "name" attribute. However, different entities may have the same name; that is, a word may have multiple "meanings" and thus represent multiple different entities. For example, the word "Newton" may represent the entity "Newton, the British scientist" or the entity "newton, the unit of force". In addition, each entity may have multiple names (e.g., aliases); for example, the names of the entity "Newton, the British scientist" include the Chinese name "牛顿" and the English word "Newton", and the names of the entity "newton, the unit of force" also include the unit symbol "N".
Thus, a plurality of search results obtained by searching the same search content may be related to different entities; similarly, one search result for the same entity can be obtained by searching a plurality of different search contents.
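The many-to-many relation between names and entities can be illustrated with a small name index keyed by entity ID. The IDs `E1` and `E2` and the field names are hypothetical; the point is only that a name lookup returns a set of IDs, so entities must be told apart by ID rather than by name.

```python
from collections import defaultdict

# Hypothetical entities keyed by unique ID; each has one or more names.
entities = {
    "E1": {"names": ["牛顿", "Newton"], "description": "British scientist"},
    "E2": {"names": ["牛顿", "newton", "N"], "description": "unit of force"},
}

# Build a name -> set of entity IDs index.
name_index = defaultdict(set)
for entity_id, entity in entities.items():
    for name in entity["names"]:
        name_index[name].add(entity_id)

# One name can point at several entities, and one entity has several names.
shared = name_index["牛顿"]
```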
The selected probability of the same search result is obviously different according to different statistical ranges, and the high-selection search result is determined according to the specific statistical range and can be set according to needs. For example, a search result with a high selection probability can be selected as a high selection search result within a period of time (e.g., a period of time before the current time), which is beneficial to embodying the "timeliness" of the search result, because new entities tend to appear over time; alternatively, the search result with a high selection probability in a specific number of searches (e.g., the search of the specific number of searches before the current time) may be a high selection search result.
The specific operation method for determining the high-choice search result is also various. For example, a plurality of search records corresponding to a search content can be obtained, each search record comprises a selected search result, and the selection probability of each search result can be calculated by statistically analyzing the search records; if the retrieval engine automatically sorts the retrieval results according to the selected probability, the high-selection retrieval results can be directly determined according to the sorting; for another example, the selection probability of each search result for each search content may be continuously updated by real-time monitoring, and when the selection probability of a certain search result exceeds the first threshold, the search result is used as the high-selection search result of the corresponding search content.
Wherein, the number of the determined high selection search results can be set according to the requirement. For example, all search results with the selected probability exceeding the first threshold may be regarded as high-selection search results, so that there may be a plurality of high-selection search results; alternatively, only one search result with a selection probability exceeding the first threshold (e.g., the one with the highest selection probability) may be used as the high-selection search result.
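Both options described above (keeping every result above the first threshold, or keeping only the top one) can be sketched in a few lines. The function name, the `top_only` flag, and the sample probabilities are assumptions made for illustration.

```python
def high_selection_results(probs, first_threshold, top_only=False):
    """Pick high-selection results from {result: selection probability}.

    With top_only=False every result above the threshold is kept; with
    top_only=True only the one with the highest probability is kept.
    """
    above = {r: p for r, p in probs.items() if p > first_threshold}
    if top_only and above:
        best = max(above, key=above.get)
        return {best: above[best]}
    return above

# Hypothetical selection probabilities for one search content.
sample = {"page_a": 0.6, "page_b": 0.3, "page_c": 0.1}
all_high = high_selection_results(sample, first_threshold=0.25)
top_high = high_selection_results(sample, first_threshold=0.25, top_only=True)
```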
Although this step describes the process of determining the high-selection search result of one search content, it should be understood that, for a plurality of search contents, the method of the embodiment of the present disclosure may be used to determine their respective high-selection search results and perform the subsequent processing, so that the method of the embodiment of the present disclosure may be used to monitor a large number of search contents simultaneously.
S102, searching a matching entity matched with the high-selection search result in a preset first database.
In some embodiments, the first database may be in the form of a knowledge graph, a knowledge encyclopedia, or the like.
As before, the high-choice search result is about the entity (or knowledge) represented by the search content, and therefore, the content of the high-choice search result can be compared with the information of each entity in the first database to see whether there is an entity in the first database that matches the high-choice search result.
For example, if the first database currently includes only the entity "Newton, the British scientist" but not the entity "newton, the unit of force", then when a web page about the entity "newton, the unit of force" is a high-selection search result, no matching entity for it will be found in the first database.
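The lookup in S102 can be sketched as a two-stage search: first screen candidate entities by character matching, then score each candidate and keep the best one whose matching degree exceeds a third threshold. In this sketch the scorer is `difflib.SequenceMatcher` from the Python standard library, used purely as a stand-in for the semantic matching neural network; the database layout and function names are also assumptions.

```python
from difflib import SequenceMatcher

def stand_in_matching_degree(entity_text, result_text):
    # Stand-in scorer (assumption): plain string similarity, not a
    # semantic matching neural network.
    return SequenceMatcher(None, entity_text.lower(), result_text.lower()).ratio()

def find_matching_entity(first_database, result_text, third_threshold):
    """Return the ID of the best-matching entity, or None if none qualifies.

    first_database: {entity_id: {"names": [...], "summary": str}} (assumed).
    """
    lowered = result_text.lower()
    # Stage 1: character matching keeps entities whose name appears in the result.
    candidates = {
        eid: e for eid, e in first_database.items()
        if any(name.lower() in lowered for name in e["names"])
    }
    # Stage 2: keep the candidate with the largest degree above the threshold.
    best_id, best_score = None, third_threshold
    for eid, e in candidates.items():
        score = stand_in_matching_degree(e["summary"], result_text)
        if score > best_score:
            best_id, best_score = eid, score
    return best_id

database = {
    "E1": {"names": ["Newton", "Isaac Newton"],
           "summary": "Isaac Newton was an English physicist and mathematician"},
}
unit_page = "The newton is the SI unit of force, named after Isaac Newton."
no_match = find_matching_entity(database, unit_page, third_threshold=0.5)
```

The page about the unit passes the character screen (it contains the name "Newton") but scores too low against the scientist's summary, so no matching entity is returned, which mirrors the example above.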
S103, if the matched entity is not found, establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result.
If no matching entity is found in the first database, it indicates that the entity represented by the high-selection search result is not currently included in the database, so that a new entity can be established according to the content of the high-selection search result and the search contents.
In some embodiments, establishing a new entity may be establishing a new entity in the first database.
Generally, in a newly created entity, the search content (or its core keyword) may be the name (including alias) or a part of the name of the entity, and other attributes of the entity can be extracted from the high-choice search result, such as:
generating relevant abstract information for the new entity according to the high-selection retrieval result by an abstract generation technology;
mining SPO (Subject-Predicate-Object) triple information for the new entity according to the high-selection retrieval result by using an SPO mining technology;
and mining concepts, upper concepts, belonging categories and the like for the new entities according to the high-selection retrieval result by using a concept mining technology.
For example, if no matching entity can be found for the above high-selection search result about the entity "newton, the unit of force", a new entity, namely the entity "newton, the unit of force", can be established according to the content of the corresponding web page and the search content ("newton").
Of course, if the high-choice search result matches an entity in the first database (i.e., there is a matching entity), it indicates that the high-choice search result is about an already existing entity, and in this case no new entity needs to be established.
Of course, if necessary, when there is a matching entity, the information of the matching entity may also be updated according to the search content and the high-selection search result, such as adding a new alias, a new introduction, and other data thereto.
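The branch between S102 and S103 can be sketched as follows: if a matching entity exists, optionally add the search content as an alias; otherwise create a new entity from the search content and the high-selection result. The matcher is passed in as a function so the sketch stays self-contained; `exact_summary_match`, the ID scheme, and the field names are all hypothetical.

```python
def discover_entity(first_database, retrieval_content, high_result_text, find_match):
    """Return (entity_id, created) for one {search content, high-selection result} pair."""
    match_id = find_match(first_database, high_result_text)
    if match_id is not None:
        # Matching entity exists: optionally record the search content as an alias.
        names = first_database[match_id]["names"]
        if retrieval_content not in names:
            names.append(retrieval_content)
        return match_id, False
    # No matching entity: establish a new one (hypothetical ID scheme).
    new_id = f"NEW-{len(first_database) + 1}"
    first_database[new_id] = {"names": [retrieval_content], "summary": high_result_text}
    return new_id, True

def exact_summary_match(database, text):
    # Toy matcher (assumption): an entity matches only if its summary equals the text.
    for entity_id, entity in database.items():
        if entity["summary"] == text:
            return entity_id
    return None

db = {"E1": {"names": ["Newton"], "summary": "British scientist"}}
first = discover_entity(db, "newton", "unit of force", exact_summary_match)
second = discover_entity(db, "N", "unit of force", exact_summary_match)
```

The first call creates a new entity; the second call finds it and only adds the alias "N".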
The method for discovering entities of the embodiments of the present disclosure discovers newly appearing entities from the searches and selections that a large number of users perform on their own (i.e., retrieval posterior information). The method can therefore run automatically without manual work, giving high efficiency, low cost, and high accuracy. Moreover, new entities appear over time and are usually reflected in retrieval records after they appear, so by monitoring a large number of retrieval records the method can track new entities comprehensively and in real time, achieving active, rapid, high-coverage discovery of new entities.
In some embodiments, the above high-choice search result is a transition-type high-choice search result; wherein,
in a first time period, the selection probability of a first retrieval result corresponding to the retrieval content is a first probability, and the selection probability of a transition type high selection retrieval result is a second probability;
in a second time period after the first time period, the selection probability of the first retrieval result is smaller than the first probability and its difference from the first probability is larger than a first threshold value; the selection probability of the transition type high selection retrieval result is larger than the second probability and its difference from the second probability is larger than a second threshold value; and the selection probability of the transition type high selection retrieval result is larger than the first threshold value.
With the development of the technology, most existing entities are already recorded in the database, so many high-selection search results correspond to existing entities, and processing them yields nothing new. Thus, in order to improve the accuracy of the processing, the high-selection search result may be further limited to a transition-type high-selection search result (i.e., a "meaning item transfer high click result").
According to the above conditions, in the earlier searches (first time period) for a given search content, there is a first search result that is frequently selected (e.g., its selection probability may be greater than a certain threshold), while the selection probability of another search result is low (that search result may not even exist yet, in which case its selection probability is 0). In the later searches (second time period) for the same search content, the selection probability of the originally frequently selected first search result drops significantly, while the selection probability of the originally rarely selected search result rises significantly and exceeds the first threshold value; that search result is then a transition-type high-selection search result. Thus, the following triples may be established:
{ search content (query), search result with decreased selection probability (first search result), high-selection search result with increased selection probability (transition type high-selection search result) }.
That is, if a large number of selections for the same search content "shift" from one search result (the first search result) to another (the transition-type high-selection search result), this indicates that the "direction of attention" of people searching for that content has changed, or that the relative "popularity" of the different meaning items (senses) of the keyword (or core keyword) of the search content has changed; that is, a "meaning item transfer" has occurred.
Of course, such a "meaning item transfer" is time-sensitive, and the method according to the embodiment of the present disclosure can detect each "meaning item transfer" promptly, so as to discover the corresponding new entity quickly.
Specifically, for example, a certain name A was originally the name of a singer B, so after searching with the name A as the search content, most users originally selected the search result related to singer B (the first search result). Recently, however, a new ball player C with the same name has appeared, so many users now select the search result about ball player C after searching with the name A. As a result, the selection probability of the search result corresponding to singer B decreases, while that of the search result corresponding to ball player C increases, making the latter a transition-type high-selection search result.
As can be seen from this, since the search result that is the target of the transition (the transition-type high-selection search result) is "newly of interest" to people, the entity corresponding to it is usually not yet included in the database and is very likely a new entity. Therefore, by further limiting the high-selection search result to the transition-type high-selection search result, the probability of finding a new entity in each round of processing increases, which improves operating efficiency and reduces the required amount of computation.
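The transition condition and the {search content, first search result, transition-type high-selection search result} triple described above can be illustrated with a minimal sketch. The function name, the dictionary layout of the selection probabilities, and the sample values are assumptions made for illustration only; the disclosure does not prescribe a data format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TransitionTriple:
    query: str               # search content
    first_result: str        # result whose selection probability dropped
    transition_result: str   # result whose selection probability rose


def find_transition_triples(
    query: str,
    first_result: str,
    period1: Dict[str, float],  # result id -> selection probability, first period
    period2: Dict[str, float],  # result id -> selection probability, second period
    first_threshold: float,
    second_threshold: float,
) -> List[TransitionTriple]:
    """Return one triple per result that satisfies the transition condition."""
    p1_first = period1.get(first_result, 0.0)
    p2_first = period2.get(first_result, 0.0)
    # The first result must drop by more than the first threshold.
    if p1_first - p2_first <= first_threshold:
        return []
    triples = []
    for result, p2 in period2.items():
        if result == first_result:
            continue
        p1 = period1.get(result, 0.0)  # 0 if the result did not exist earlier
        # The other result must rise by more than the second threshold
        # and itself exceed the first threshold (i.e. be "high-selection").
        if p2 - p1 > second_threshold and p2 > first_threshold:
            triples.append(TransitionTriple(query, first_result, result))
    return triples


# Hypothetical data for the "name A / singer B / ball player C" example.
triples = find_transition_triples(
    "A", "card_singer_B",
    {"card_singer_B": 0.7, "page_player_C": 0.05},
    {"card_singer_B": 0.2, "page_player_C": 0.6},
    first_threshold=0.3, second_threshold=0.3,
)
```

In this sketch a result missing from the first period is treated as having selection probability 0, matching the remark that the later-selected result may not have existed earlier.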
In some embodiments, the first search result is an entity card.
The above entity card may be a web page that specifically introduces the basic knowledge of a certain entity, such as the page for that entity on an encyclopedia website. When searching for an entity, a large share of people first select its entity card to learn the basics of the entity; therefore, if the selection probability of the entity card decreases, it usually means that the entity (entity card) to which the retrieval content originally corresponded is no longer people's main concern, and that a new entity has appeared (i.e., a "meaning item transfer" has occurred). Therefore, using the entity card as the first retrieval result can further improve the probability of finding a new entity in each round of processing.
In some embodiments, referring to fig. 2, the searching for a matching entity matching the high-choice search result in the preset first database (step S102) includes:
S1021, respectively calculating the matching degree of at least part of entities in the first database and the high-selection retrieval result.
S1022, the entity corresponding to the maximum matching degree of the matching degrees greater than the third threshold is used as the matching entity.
To search for the matching entity, the matching degree (similarity) between each existing entity in the first database and the high-selection search result can be calculated; the maximum is then selected from all matching degrees greater than the third threshold, and the corresponding entity is used as the matching entity (if only one matching degree is greater than the third threshold, that one is selected directly). If no matching degree is greater than the third threshold, there is currently no matching entity.
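Steps S1021 and S1022 amount to a thresholded maximum over matching degrees, which can be sketched as follows. The scoring function here is a simple word-overlap (Jaccard) measure used purely as a stand-in; it is an assumption for illustration and not the matching degree calculation of the disclosure.

```python
from typing import Callable, Iterable, Optional


def jaccard(a: str, b: str) -> float:
    """Toy matching degree: word overlap between two texts (stand-in only)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_matching_entity(
    entities: Iterable[str],
    result: str,
    score_fn: Callable[[str, str], float],
    third_threshold: float,
) -> Optional[str]:
    """Return the entity with the largest matching degree above the threshold."""
    best, best_score = None, third_threshold
    for entity in entities:
        score = score_fn(entity, result)  # S1021: matching degree
        if score > best_score:            # S1022: keep the maximum above threshold
            best, best_score = entity, score
    return best  # None means no matching entity exists


entities = ["singer B pop music", "ball player C football"]
result = "ball player C scores in football match"
match = find_matching_entity(entities, result, jaccard, third_threshold=0.5)
no_match = find_matching_entity(entities, result, jaccard, third_threshold=0.9)
```

Returning `None` corresponds to the case in which no matching degree exceeds the third threshold, so a new entity would be established.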
In some embodiments, referring to fig. 3, before the respectively calculating the matching degrees of at least some of the entities in the first database with the high-choice search result (S1021), the method further includes:
S10210, screening out entities which are possible to be matched with the high-selection retrieval result from the first database as candidate entities through character matching.
Calculating the matching degree between at least part of the entities in the first database and the high-choice search result (S1021), respectively, includes:
S10211, respectively calculating the matching degree of each candidate entity and the high selection search result.
Obviously, most of the entities in the first database are almost irrelevant to the high-choice search result, and there is no possibility that the entities match the high-choice search result, so that it is not necessary to calculate the matching degree of the entities with the high-choice search result.
Therefore, before the matching degree is calculated, the entities which are possibly matched with the high-selection search result are screened from the first database through simple character matching to serve as candidate entities (for example, a candidate entity set is formed), and therefore only the matching degree of the candidate entities in the candidate entity set and the high-selection search result needs to be calculated subsequently, and the operation amount is greatly reduced.
Specifically, candidate entities may be determined in various ways, which may include, for example:
name matching, namely selecting an entity whose name is the retrieval content (or the core keyword thereof) as a candidate entity;
alias matching, namely mining aliases of entities by using the results of intelligent retrieval (wise retrieval) and an alias mining technology, and selecting the entities having the retrieval content (or the core keyword thereof) as an alias as candidate entities;
error correction matching, namely correcting the retrieval content (or the core keyword thereof) by using an error correction technology, and then taking an entity whose name (or alias) is the corrected retrieval content (or the core keyword thereof) as a candidate entity;
content matching, namely using stem analysis, term importance analysis, component analysis, intention recognition and other techniques to select the core keyword in the retrieval content, and then taking the entities whose name (or alias) is the core keyword as candidate entities.
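The name-matching and alias-matching strategies listed above can be sketched as a simple pre-filter. The `Entity` structure and the normalization rule are hypothetical; error-correction matching and content matching are omitted because they depend on external techniques that the disclosure does not detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str
    aliases: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def screen_candidates(core_keyword: str, entities: List[Entity]) -> List[Entity]:
    """Keep entities whose name or any alias equals the (normalized) keyword."""
    key = normalize(core_keyword)
    candidates = []
    for entity in entities:
        names = [entity.name] + entity.aliases
        if any(normalize(n) == key for n in names):
            candidates.append(entity)
    return candidates


# Hypothetical database: two same-name entities, one alias hit, one unrelated.
database = [Entity("A", ["Singer A"]), Entity("A"), Entity("Z", ["a"]), Entity("Q")]
candidates = screen_candidates(" a ", database)
```

Only the entities in `candidates` would then be passed to the more expensive matching degree calculation.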
In some embodiments, the respectively calculating the matching degree of at least some of the entities in the first database with the high-selection search result (S1021) comprises: respectively calculating the matching degree of at least part of entities in the first database and the high-selection retrieval result by adopting a semantic matching neural network, wherein the semantic matching neural network comprises:
a first input terminal for inputting a high-selection retrieval result;
the second input end is used for inputting information of a corresponding entity in the first database;
and the output end is used for outputting the matching degree of the entity and the high selection retrieval result.
That is, the pre-trained semantic matching neural network can be used to calculate the matching degree between each entity (e.g., candidate entity) and the high-choice search result.
The semantic matching neural network is a self-learning neural network and is used for judging the correlation degree (matching degree) of two pieces of information in a semantic level, and comprises two input ends and an output end, wherein the two input ends are respectively used for inputting a high-selection retrieval result (such as the content of a webpage of the high-selection retrieval result) and information of an entity (such as all related information including abstract information, SPO triple information, concepts, upper concepts, belonging categories and the like); and the output end is used for outputting the matching degree of the two pieces of information obtained by analysis.
Of course, the semantic matching neural network should also have one or more intermediate layers, whose specific configuration can be set as required. For example, the intermediate layers may include an embedding layer, a bidirectional LSTM layer, a plurality of hidden layers, an output layer, etc.; the output layer may be a dense (fully connected) layer using a sigmoid activation function, or may be a layer for calculating a Manhattan distance, a Euclidean distance, a cosine distance, or the like. In short, it only needs to output a value representing the matching degree between the two pieces of information.
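The output-layer options mentioned above can be illustrated on two already-encoded vectors. This is only a sketch of the final scoring step: the encoders (embedding and bidirectional LSTM layers) are assumed to exist elsewhere, and the mapping `exp(-distance)` and the element-wise difference features are assumptions chosen for illustration, not details given in the disclosure.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v, in [-1, 1]; 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def manhattan_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Map the Manhattan distance to (0, 1]; identical vectors give 1.0."""
    return math.exp(-sum(abs(a - b) for a, b in zip(u, v)))


def dense_sigmoid(
    u: Sequence[float],
    v: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Single dense unit with sigmoid activation over |u - v| features."""
    features = [abs(a - b) for a, b in zip(u, v)]
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Whichever variant is used, the output is a single number that can be compared against the third threshold as the matching degree.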
The semantic matching neural network can be obtained through training, wherein the training specifically comprises the steps of enabling the semantic matching neural network to process a large number of positive sample pairs (namely matched entity information and retrieval results) and negative sample pairs (namely unmatched entity information and retrieval results), and adjusting parameters of the semantic matching neural network according to processing results.
For example, positive and negative sample pairs can be created by means of intelligent (wise) retrieval. The search results of the intelligent retrieval are ranked by relevance (e.g., using a learning-to-rank algorithm), so the text information of its natural results represents the entity that most users are interested in. Accordingly, for a given search content, "text information of the natural result of intelligent retrieval - information of the corresponding entity" may be used as a positive sample pair, and "text information of the natural result of intelligent retrieval - information of a similar entity (with the same or a similar name)" may be used as a negative sample pair for training. The positive and negative sample pairs used for training can thus be obtained directly from the intelligent retrieval engine and the entity database without manual construction, which is convenient.
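The construction of positive and negative sample pairs described above can be sketched as follows. The `SamplePair` structure, the label convention (1 for matched, 0 for unmatched), and the sample strings are hypothetical; how the natural result text and the entity information are actually retrieved is outside this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SamplePair:
    result_text: str   # text information of the natural search result
    entity_info: str   # information of an entity from the entity database
    label: int         # 1 = matched (positive), 0 = unmatched (negative)


def build_training_pairs(
    result_text: str,
    corresponding_entity: str,
    similar_entities: List[str],
) -> List[SamplePair]:
    """One positive pair plus one negative pair per same- or similar-name entity."""
    pairs = [SamplePair(result_text, corresponding_entity, 1)]
    for info in similar_entities:
        if info != corresponding_entity:  # never label the true entity as negative
            pairs.append(SamplePair(result_text, info, 0))
    return pairs


pairs = build_training_pairs(
    "A releases a new album this week",
    "A: singer, born 1990",
    ["A: ball player, forward", "A: singer, born 1990", "A: novelist"],
)
```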
In some embodiments, after step S102:
if the matching entity is found, taking the entity card corresponding to the matching entity as a recommended retrieval result of the corresponding retrieval content;
if the matched entity is not found, after a new entity is established in the first database according to the retrieval content and the high-selection retrieval result corresponding to the retrieval content, the entity card corresponding to the new entity is used as the recommended retrieval result corresponding to the retrieval content.
That is to say, when a matching entity matching the high-selection search result is found, the entity card corresponding to the matching entity can be used as the recommended search result for the search content. When a user searches with that search content again, the entity card of the matching entity is then preferentially returned, which ensures that the search results always include the entity card of the entity currently receiving the most attention. For example, if the original search results contain no entity card, the entity card of the matching entity can be added to them; if the original search results contain other entity cards (as in the transition-type case above), those cards can be replaced by the entity card of the matching entity, or the entity card of the matching entity can be added alongside them; and if the entity card of the matching entity is already in the original search results, its ranking can be advanced.
Similarly, if no matching entity is found and a new entity is therefore established, the entity card of the new entity can be used as the recommended search result. The specific way of recommending it may be the same as or different from that used for the entity card of a matching entity, and is not described in detail here.
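The three recommendation cases described above (add the card, replace or supplement other cards, advance the card's ranking) can be sketched as one list operation. Placing the recommended card at the top of the list is an assumption made for illustration; the disclosure only says the card is given preferentially. All identifiers are hypothetical.

```python
from typing import List, Set


def recommend_entity_card(
    results: List[str],
    card_id: str,
    entity_card_ids: Set[str],
    replace_existing: bool = True,
) -> List[str]:
    """Return a new result list with the recommended entity card ranked first."""
    if card_id in results:
        # Card already present: advance its ranking.
        return [card_id] + [r for r in results if r != card_id]
    rest = list(results)
    if replace_existing:
        # Other entity cards present: drop them in favour of the new card.
        rest = [r for r in rest if r not in entity_card_ids]
    # No card (or cards kept): add the recommended card.
    return [card_id] + rest


cards = {"card_singer_B", "card_player_C"}
added = recommend_entity_card(["news_1", "news_2"], "card_player_C", cards)
replaced = recommend_entity_card(["card_singer_B", "news_1"], "card_player_C", cards)
kept = recommend_entity_card(["card_singer_B", "news_1"], "card_player_C", cards,
                             replace_existing=False)
advanced = recommend_entity_card(["news_1", "card_player_C"], "card_player_C", cards)
```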
Fig. 4 is a block diagram of an apparatus for discovering an entity according to an embodiment of the disclosure.
In a second aspect, referring to fig. 4, an embodiment of the present disclosure provides an apparatus for discovering an entity, including:
the acquisition module is used for acquiring the retrieval content and the high selection retrieval result corresponding to the retrieval content, and the selection probability of the high selection retrieval result is greater than a first threshold value;
the matching module is used for searching a matching entity matched with the high-selection retrieval result in a preset first database;
and the establishing module is used for establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result when the matching entity is not found.
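The cooperation of the acquisition, matching, and establishing modules can be sketched as a small pipeline. The function names, the log format, and the token-based matcher are assumptions for illustration; in particular, the matcher is a trivial stand-in for the matching module, not the semantic matching described in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Entity:
    name: str
    description: str


Matcher = Callable[[List[Entity], str], Optional[Entity]]


def acquire(
    selection_probs: Dict[Tuple[str, str], float], first_threshold: float
) -> List[Tuple[str, str]]:
    """Acquisition module: keep (query, result) pairs above the first threshold."""
    return [pair for pair, p in selection_probs.items() if p > first_threshold]


def token_matcher(database: List[Entity], result: str) -> Optional[Entity]:
    """Stand-in matching module: an entity matches if its name is a result token."""
    tokens = set(result.split())
    for entity in database:
        if entity.name in tokens:
            return entity
    return None


def discover(
    pairs: List[Tuple[str, str]], database: List[Entity], match: Matcher
) -> List[Entity]:
    """Establishing module: create an entity whenever no match is found."""
    created = []
    for query, result in pairs:
        if match(database, result) is None:
            entity = Entity(name=query, description=result)
            database.append(entity)
            created.append(entity)
    return created


database = [Entity("B", "singer B")]
log = {("B", "B releases album"): 0.8, ("C", "C joins club"): 0.7, ("D", "D rumor"): 0.1}
new_entities = discover(acquire(log, 0.5), database, token_matcher)
```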
In some embodiments, the high-choice search result is a transition-type high-choice search result;
in a first time period, the selection probability of a first retrieval result corresponding to the retrieval content is a first probability, and the selection probability of a transition type high selection retrieval result is a second probability;
in a second time period after the first time period, the selection probability of the first retrieval result is smaller than the first probability and its difference from the first probability is larger than a first threshold value; the selection probability of the transition type high selection retrieval result is larger than the second probability and its difference from the second probability is larger than a second threshold value; and the selection probability of the transition type high selection retrieval result is larger than the first threshold value.
In some embodiments, the first search result is an entity card.
In some embodiments, referring to fig. 5, the matching module comprises:
the matching degree calculation unit is used for calculating the matching degree of at least part of entities in the first database and the high-selection retrieval result respectively;
and the matching unit is used for taking the entity corresponding to the maximum matching degree in the matching degrees larger than the third threshold value as the matching entity.
In some embodiments, referring to fig. 5, the matching module further comprises:
the candidate entity screening unit is used for screening out entities which are possibly matched with the high-selection retrieval result from the first database through character matching to serve as candidate entities;
and,
and the matching degree calculation unit is used for calculating the matching degree of each candidate entity and the high selection retrieval result respectively.
In some embodiments, the matching degree calculating unit is configured to calculate the matching degree of at least some entities in the first database with the high-selection search result respectively by using a semantic matching neural network, wherein the semantic matching neural network comprises:
a first input terminal for inputting a high-selection retrieval result;
the second input end is used for inputting information of a corresponding entity in the first database;
and the output end is used for outputting the matching degree of the entity and the high selection retrieval result.
In some embodiments, referring to fig. 5, the apparatus further comprises:
and the recommending module is used for taking the entity card corresponding to the matched entity as a recommended retrieval result of the corresponding retrieval content when the matched entity is found, and taking the entity card corresponding to the new entity as a recommended retrieval result of the corresponding retrieval content when the matched entity is not found.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
one or more processors;
a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement any of the methods for discovering entities described above.
In a fourth aspect, the embodiments of the present disclosure provide a computer readable medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any one of the methods for discovering an entity described above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The present disclosure has disclosed example embodiments and, although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (12)

1. A method of discovering entities, comprising:
acquiring retrieval content and a high-selection retrieval result corresponding to the retrieval content, wherein the selection probability of the high-selection retrieval result is greater than a first threshold value;
searching a matching entity matched with the high-selection retrieval result in a preset first database;
if the matching entity is not found, establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result;
wherein the high selection retrieval result is a transition type high selection retrieval result; in a first time period, the selection probability of a first retrieval result corresponding to the retrieval content is a first probability, and the selection probability of the transition type high selection retrieval result is a second probability; in a second time period after the first time period, the selection probability of the first retrieval result is smaller than the first probability and its difference from the first probability is larger than a first threshold value, the selection probability of the transition type high selection retrieval result is larger than the second probability and its difference from the second probability is larger than a second threshold value, and the selection probability of the transition type high selection retrieval result is larger than the first threshold value; the first retrieval result is an entity card.
2. The method of claim 1, wherein the searching for the matching entity matching the high-choice search result in the preset first database comprises:
respectively calculating the matching degrees of at least part of entities in the first database and the high-selection retrieval result;
and the entity corresponding to the maximum matching degree in the matching degrees larger than the third threshold value is the matching entity.
3. The method of claim 2, wherein,
before the respectively calculating the matching degrees of at least part of the entities in the first database and the high-selection search result, the method further comprises the following steps: screening out entities which are possibly matched with the high selection retrieval result from the first database as candidate entities through character matching;
the respectively calculating the matching degrees of at least part of the entities in the first database and the high-selection search result comprises: and respectively calculating the matching degree of each candidate entity and the high selection retrieval result.
4. The method of claim 2, wherein said separately calculating a degree of matching of at least some of the entities in the first database with the high-selection search result comprises:
respectively calculating the matching degrees of at least part of entities in the first database and the high-selection retrieval result by adopting a semantic matching neural network, wherein the semantic matching neural network comprises:
a first input terminal for inputting the high selection retrieval result;
a second input end for inputting information corresponding to the entity in the first database;
and the output end is used for outputting the matching degree of the entity and the high selection retrieval result.
5. The method of claim 1, wherein,
if the matching entity is found, taking an entity card corresponding to the matching entity as a recommended retrieval result corresponding to the retrieval content;
and if the matching entity is not found, after a new entity is established according to the retrieval content and the high-selection retrieval result corresponding to the retrieval content, taking an entity card corresponding to the new entity as a recommended retrieval result corresponding to the retrieval content.
6. An apparatus for discovering entities, comprising:
the acquisition module is used for acquiring retrieval content and a high selection retrieval result corresponding to the retrieval content, wherein the selection probability of the high selection retrieval result is greater than a first threshold;
the matching module is used for searching a matching entity matched with the high-selection retrieval result in a preset first database;
the establishment module is used for establishing a new entity according to the retrieval content and the corresponding high-selection retrieval result when the matching entity is not found;
wherein the high selection retrieval result is a transition type high selection retrieval result; in a first time period, the selection probability of a first retrieval result corresponding to the retrieval content is a first probability, and the selection probability of the transition type high selection retrieval result is a second probability; in a second time period after the first time period, the selection probability of the first retrieval result is smaller than the first probability and its difference from the first probability is larger than a first threshold value, the selection probability of the transition type high selection retrieval result is larger than the second probability and its difference from the second probability is larger than a second threshold value, and the selection probability of the transition type high selection retrieval result is larger than the first threshold value; the first retrieval result is an entity card.
7. The apparatus of claim 6, wherein the matching module comprises:
the matching degree calculation unit is used for calculating the matching degree of at least part of entities in the first database and the high-selection retrieval result respectively;
a matching unit, configured to use the entity corresponding to the largest matching degree of the matching degrees greater than the third threshold as the matching entity.
8. The apparatus of claim 7, wherein the matching module further comprises:
the candidate entity screening unit is used for screening out entities which are possibly matched with the high-selection retrieval result from the first database through character matching to serve as candidate entities;
and,
the matching degree calculation unit is used for calculating the matching degree of each candidate entity and the high selection retrieval result respectively.
9. The apparatus of claim 7, wherein,
the matching degree calculation unit is used for respectively calculating the matching degree of at least part of entities in the first database and the high-selection retrieval result by adopting a semantic matching neural network, wherein the semantic matching neural network comprises:
a first input terminal for inputting the high selection retrieval result;
a second input end for inputting information corresponding to the entity in the first database;
and the output end is used for outputting the matching degree of the entity and the high selection retrieval result.
10. The apparatus of claim 6, further comprising:
and the recommending module is used for taking the entity card corresponding to the matching entity as a recommended retrieval result corresponding to the retrieval content when the matching entity is found, and taking the entity card corresponding to the new entity as the recommended retrieval result corresponding to the retrieval content when the matching entity is not found.
11. An electronic device, comprising:
one or more processors;
storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of discovering entities according to any of claims 1 to 5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method of discovering entities according to any one of claims 1 to 5.
CN201910516155.3A 2019-06-14 2019-06-14 Method and device for discovering entity, electronic equipment and computer readable medium Active CN110222156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910516155.3A CN110222156B (en) 2019-06-14 2019-06-14 Method and device for discovering entity, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910516155.3A CN110222156B (en) 2019-06-14 2019-06-14 Method and device for discovering entity, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN110222156A CN110222156A (en) 2019-09-10
CN110222156B true CN110222156B (en) 2021-11-16

Family

ID=67817319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910516155.3A Active CN110222156B (en) 2019-06-14 2019-06-14 Method and device for discovering entity, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN110222156B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807102B (en) * 2021-08-20 2022-11-01 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for establishing semantic representation model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9665643B2 (en) * 2011-12-30 2017-05-30 Microsoft Technology Licensing, Llc Knowledge-based entity detection and disambiguation
CN105468652A (en) * 2014-09-12 2016-04-06 北大方正集团有限公司 Retrieval sorting method and system
CN105095399B (en) * 2015-07-06 2019-06-28 百度在线网络技术(北京)有限公司 Search result method for pushing and device
CN106844603B (en) * 2017-01-16 2021-05-11 竹间智能科技(上海)有限公司 Entity popularity calculation method and device, and application method and device
CN107391673B (en) * 2017-07-21 2020-11-03 苏州狗尾草智能科技有限公司 Method and device for generating Chinese universal knowledge graph with timestamp
CN108415902B (en) * 2018-02-10 2021-10-26 合肥工业大学 Named entity linking method based on search engine
CN108920588B (en) * 2018-06-26 2021-02-26 北京光年无限科技有限公司 Knowledge graph updating method and system for man-machine interaction
CN109033464A (en) * 2018-08-31 2018-12-18 北京字节跳动网络技术有限公司 Method and apparatus for handling information

Also Published As

Publication number Publication date
CN110222156A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
US11580104B2 (en) Method, apparatus, device, and storage medium for intention recommendation
US11645317B2 (en) Recommending topic clusters for unstructured text documents
US9864808B2 (en) Knowledge-based entity detection and disambiguation
US9449271B2 (en) Classifying resources using a deep network
US9646260B1 (en) Using existing relationships in a knowledge base to identify types of knowledge for addition to the knowledge base
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN112148889A (en) Recommendation list generation method and device
US20170212899A1 (en) Method for searching related entities through entity co-occurrence
US20060184517A1 (en) Answers analytics: computing answers across discrete data
US20100191758A1 (en) System and method for improved search relevance using proximity boosting
CN112100396B (en) Data processing method and device
CN112115232A (en) Data error correction method and device and server
CN113515589B (en) Data recommendation method, device, equipment and medium
US10147095B2 (en) Chain understanding in search
US20190370402A1 (en) Profile spam removal in search results from social network
CN113254671B (en) Atlas optimization method, device, equipment and medium based on query analysis
CN110222156B (en) Method and device for discovering entity, electronic equipment and computer readable medium
CN113886535A (en) Knowledge graph-based question and answer method and device, storage medium and electronic equipment
US11726972B2 (en) Directed data indexing based on conceptual relevance
CN113704462A (en) Text processing method and device, computer equipment and storage medium
Cooper et al. Knowledge-based fast web query engine using NoSQL
CN114491294B (en) Data recommendation method and device based on graph neural network and electronic equipment
Jeon et al. Random forest algorithm for linked data using a parallel processing environment
Albawi et al. Review Previous Solutions To Query Based Uncertain Object Determining
CN112925873A (en) Formalized expression method and device for text search requirement and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant