CN106294418A - Search method and searching system - Google Patents

Publication number
CN106294418A
Authority
CN
China
Prior art keywords
model
target entity
microblogging
entity
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510272225.7A
Other languages
Chinese (zh)
Other versions
CN106294418B (en)
Inventor
强闰伟
范非凡
吕超
杨建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201510272225.7A
Publication of CN106294418A
Application granted
Publication of CN106294418B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a retrieval method and a retrieval system. The method includes: upon receiving a query statement for retrieving microblog documents in a microblog corpus, creating an original query model corresponding to the query statement; identifying the target entity in the query statement; expanding the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model built from each microblog document in the microblog document collection, to obtain an expanded query model; and computing the similarity between the expanded query model and the microblog document language model, so as to determine the target retrieval result of the query statement according to the similarity. With this technical solution, users can accurately retrieve the target retrieval result from microblog documents, which improves retrieval accuracy and also strengthens retrieval robustness.

Description

Search method and searching system
Technical Field
The present invention relates to the field of retrieval technology, and in particular to a retrieval method and a retrieval system.
Background
A microblog is a lightweight information dissemination platform built on user relationships, on which users can broadcast and share information about their activities and status. The popularity of microblogs has created demand for retrieving microblog documents, and users have gradually become accustomed to searching microblog documents for all kinds of content.
Unlike traditional Web retrieval, retrieving microblog documents faces considerable challenges. First, because microblog documents are limited in length, microblog retrieval suffers from a severe vocabulary mismatch problem. In addition, the same entity may have several aliases, so different users may search for the same entity using different aliases. For example, the entity "Zhou Jielun" has aliases such as "Zhou Dong", "Jie Lun", and "Lun Bao". Target retrieval results obtained by searching microblog documents with an alias are therefore often inaccurate, and retrieval efficiency is low. Moreover, microblog documents themselves contain many entities, which further reduces the accuracy of the target retrieval results.
Therefore, how to enable users to accurately retrieve target retrieval results from microblog documents has become an urgent problem to be solved.
Summary of the Invention
In view of the above problem, the present invention proposes a new technical solution that addresses the technical problem that users cannot accurately retrieve target retrieval results from microblog documents.
In view of this, one aspect of the present invention proposes a retrieval method, including: upon receiving a query statement for retrieving microblog documents in a microblog corpus, creating an original query model corresponding to the query statement according to the query statement; identifying the target entity in the query statement; expanding the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model built from each microblog document in the microblog document collection, to obtain an expanded query model; and computing the similarity between the expanded query model and the microblog document language model, so as to determine the target retrieval result of the query statement according to the similarity.
In this technical solution, when a query statement is used to retrieve microblog documents in the microblog corpus, the query statement may contain an alias of the target entity, so identifying the target entity in the query statement effectively improves retrieval performance. In addition, the original query model corresponding to the query statement is expanded to obtain an expanded query model. Retrieving microblog documents with the expanded query model returns a large number of microblog documents related to the query statement, including the information the user is interested in, which effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined by computing the similarity between the expanded query model and the microblog document language model of each microblog document, which makes the target retrieval result more accurate and also improves retrieval robustness. With this technical solution, users can therefore accurately retrieve the target retrieval result from microblog documents, which improves retrieval accuracy. Here, the target entity is the keyword in the query statement; for example, the target entity in the query statement "Zhou Jielun new film" is "Zhou Jielun".
In the above technical solution, preferably, the similarity between the expanded query model and the microblog document language model is computed by the following formula, and the target microblog documents whose similarity is greater than or equal to a preset similarity are taken as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}\left(\hat{\theta}_{Q'} \,\|\, \hat{\theta}_{D}\right) \propto \sum_{w \in V} p(w \mid \hat{\theta}_{Q'}) \times \log p(w \mid \hat{\theta}_{D})$$
where $\mathrm{Score}(Q, D)$ represents the similarity, $V$ represents all entities in the microblog document language model, $\hat{\theta}_{Q'}$ represents the expanded query model, $\hat{\theta}_{D}$ represents the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ represents the probability of entity $w$ in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ represents the probability of entity $w$ in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these may include much information the user cares little about, or the information may not be ordered by priority, so that information of little interest may be ranked ahead of the information the user cares most about. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to how high the similarity is, filters out information that is unimportant, weakly related, or of little interest to the user. This technical solution therefore improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The above formula is based on the KL divergence (Kullback-Leibler divergence, also called relative entropy). "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun's new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film", and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
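The scoring step can be illustrated with a minimal Python sketch. It assumes that each model is a plain dictionary mapping words to probabilities; the function names `kl_score` and `retrieve`, the `floor` value used in place of smoothing, and the example threshold are illustrative assumptions and are not specified in the text above.

```python
import math

def kl_score(query_model, doc_model, floor=1e-10):
    """Rank-equivalent form of -KL(Q' || D): sum over w of p(w|Q') * log p(w|D).

    `floor` is an assumed stand-in for smoothing, so that words missing
    from the document model do not produce log(0).
    """
    return sum(
        p_q * math.log(max(doc_model.get(w, 0.0), floor))
        for w, p_q in query_model.items()
        if p_q > 0.0
    )

def retrieve(query_model, doc_models, threshold):
    """Keep documents whose score reaches the preset similarity, best first."""
    scored = [(doc_id, kl_score(query_model, model))
              for doc_id, model in doc_models.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: one expanded query model and two document models.
expanded_query = {"zhou": 0.5, "film": 0.5}
documents = {
    "doc1": {"zhou": 0.5, "film": 0.5},
    "doc2": {"zhou": 0.9, "concert": 0.1},
}
results = retrieve(expanded_query, documents, threshold=-1.0)
```

Only the cross term of the divergence is computed, because the entropy of the query model is the same for every document and does not change the ranking.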
In the above technical solution, preferably, the expanded query model is computed according to the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E})$$
where $\hat{\theta}_{Q'}$ represents the expanded query model, $\hat{\theta}_{Q}$ represents the original query model, $\hat{\theta}_{E}$ represents the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ represents the probability of entity $w$ in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ represents the probability of entity $w$ in the original query model, $p(w \mid \hat{\theta}_{E})$ represents the probability of entity $w$ in the target entity topic model, and $\alpha$ represents the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs. The original query model therefore needs to be expanded into an expanded query model. Retrieving microblog documents with the expanded query model returns a large number of microblog documents related to the query statement, including the information the user is interested in, which effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval performance.
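The interpolation can be sketched in a few lines of Python. The dictionary representation of the models and the function name `expand_query_model` are assumptions made for illustration only.

```python
def expand_query_model(query_model, entity_model, alpha):
    """Linear interpolation: p(w|Q') = (1 - alpha) * p(w|Q) + alpha * p(w|E)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    vocabulary = set(query_model) | set(entity_model)
    return {
        w: (1.0 - alpha) * query_model.get(w, 0.0)
           + alpha * entity_model.get(w, 0.0)
        for w in vocabulary
    }

# Hypothetical toy models for the query "Zhou Jielun new film".
original_query = {"zhou": 0.5, "film": 0.5}
entity_topic = {"zhou": 0.4, "concert": 0.6}
expanded = expand_query_model(original_query, entity_topic, alpha=0.5)
```

Words that appear only in the entity topic model, such as "concert" in this toy example, receive non-zero probability in the expanded model, which is how the expansion reaches documents the original query would miss.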
In the above technical solution, preferably, in response to a received update command, $\alpha$ is updated according to the following formula to obtain $\alpha'$:
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
where $w$ represents the target entity, $E$ represents all entities in the target entity model, $Q$ represents all entities in the query statement, $w_1$ represents any entity in the query statement, $\mathrm{IDF}(w)$ represents the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ represents the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the importance of the same target entity differs between query statements, and the initial interpolation parameter $\alpha$ is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter $\alpha$ therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated $\alpha'$, which makes the expanded query model more accurate. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun's new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film", and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
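A minimal Python sketch of the adaptive update follows. It assumes that the corpus is a list of word sets, that IDF is the logarithm of the total document count divided by the number of documents containing the word, and that a word absent from the corpus contributes zero; the names `idf` and `adaptive_alpha` and the fallback when the denominator is zero are illustrative assumptions.

```python
import math

def idf(term, corpus):
    """log(total documents / documents containing term); 0.0 if the term is unseen (assumed)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def adaptive_alpha(alpha, entity_terms, query_terms, corpus):
    """alpha' = alpha * sum of IDF over entity terms / sum of IDF over query terms."""
    denominator = sum(idf(w, corpus) for w in query_terms)
    if denominator == 0.0:
        return alpha  # assumed fallback: keep the initial parameter
    return alpha * sum(idf(w, corpus) for w in entity_terms) / denominator

# Hypothetical toy corpus of four microblog documents, each a set of words.
corpus = [{"zhou", "film"}, {"film"}, {"film", "new"}, {"zhou"}]
alpha_prime = adaptive_alpha(0.5, ["zhou"], ["zhou", "film"], corpus)
```

In this toy corpus "zhou" is rarer than "film", so it carries most of the IDF mass of the query and the updated parameter stays relatively close to the initial value.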
In the above technical solution, preferably, when there are multiple target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model, and the microblog document language model.
In this technical solution, when the query statement contains multiple target entities, the final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus, and retrieval is performed with the expanded query model obtained from the final entity topic model. The resulting target retrieval result is more accurate: it contains microblog documents related to each of the multiple target entities, so the target retrieval result consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, according to a received first creation command, the final entity topic model is determined by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\hat{\theta}_{E'}$ represents the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ represents the probability of entity $w$ in the final entity topic model, $n$ represents the number of target entities, $\hat{\theta}_{E_i}$ represents the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ represents the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ represents the probability of entity $w$ in the target entity topic model corresponding to that target entity, and $E_i$ represents the $i$-th target entity among the multiple target entities.
In this technical solution, when the query statement contains multiple target entities, the formula shows that the final entity topic model is computed from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus. Because the inverse document frequency of each target entity in the microblog corpus indicates how important that target entity is in the corpus, retrieving with the expanded query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the multiple target entities, determined according to the importance of each target entity in the microblog corpus. The target retrieval result is thus the information the user wants to retrieve, which improves retrieval performance. The inverse document frequency (IDF) measures the importance of a target entity: the IDF of a target entity is obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents containing that target entity and taking the logarithm of the quotient. The IDF of the target entity also affects the updated interpolation parameter.
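The IDF-weighted combination can be sketched as follows in Python. The sketch assumes that each target entity topic model is a dictionary of word probabilities and that the IDF values have already been computed; the function name `combine_entity_models` and the toy numbers are hypothetical.

```python
def combine_entity_models(entity_models, entity_idf):
    """p(w|E') = sum_i IDF(E_i) * p(w|E_i) / sum_i IDF(E_i)."""
    total_idf = sum(entity_idf[name] for name in entity_models)
    if total_idf <= 0.0:
        raise ValueError("the IDF values must sum to a positive number")
    vocabulary = set().union(*(model.keys() for model in entity_models.values()))
    return {
        w: sum(entity_idf[name] * model.get(w, 0.0)
               for name, model in entity_models.items()) / total_idf
        for w in vocabulary
    }

# Hypothetical toy input: two target entities with precomputed IDF values.
models = {
    "entity_a": {"album": 1.0},
    "entity_b": {"album": 0.5, "tour": 0.5},
}
idf_values = {"entity_a": 3.0, "entity_b": 1.0}
final_model = combine_entity_models(models, idf_values)
```

Because the weights are normalized by their sum, the result is again a probability distribution whenever each input model is one, and rarer (higher-IDF) entities contribute more to it.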
In the above technical solution, preferably, according to a received second creation command, the target entity topic model corresponding to the target entity is created by the following process: when the corpus database holding the microblog corpus receives the target entity, M microblog documents related to the target entity are extracted from the microblog corpus according to the target entity; according to the target domain to which the target entity belongs, multiple keywords related to the target domain are searched for in a target domain knowledge base connected to the corpus database, where the keywords include the target entity; a virtual document corresponding to the target domain is generated from the keywords; a domain language model is built from the virtual document, and a background language model is built from all entities in each microblog document in the microblog corpus; and the M microblog documents are traversed using the domain language model, the background language model, and an initial entity model corresponding to the target entity, and N iterations are performed to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the domain language model, the background language model, and the initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and thereby purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is performed with the expanded query model obtained by expansion with the target entity topic model, a large number of microblog documents related to the query statement can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and improves retrieval performance. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun's new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film", and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
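The text does not specify the iterative computation itself. One plausible reading, shown here purely as an assumption, is an expectation-maximization procedure over a three-component mixture in which the domain and background language models are held fixed and only the entity model is re-estimated. The fixed mixing weights, the `floor` value, and the function name `estimate_entity_topic_model` are all hypothetical.

```python
from collections import Counter

def estimate_entity_topic_model(docs, domain_model, background_model,
                                n_iters=20, weights=(0.5, 0.2, 0.3),
                                floor=1e-12):
    """Assumed EM sketch: entity model mixed with fixed domain and background models."""
    lam_e, lam_d, lam_b = weights          # assumed fixed mixing weights
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    # Initial entity model: relative word frequency in the M related documents.
    entity = {w: c / total for w, c in counts.items()}
    for _ in range(n_iters):
        # E-step: expected count of each word attributed to the entity component.
        expected = {}
        for w, c in counts.items():
            p_e = lam_e * entity[w]
            p_d = lam_d * domain_model.get(w, floor)
            p_b = lam_b * background_model.get(w, floor)
            expected[w] = c * p_e / (p_e + p_d + p_b)
        # M-step: renormalize the expected counts into a distribution.
        norm = sum(expected.values())
        entity = {w: v / norm for w, v in expected.items()}
    return entity

# Hypothetical toy data: M = 2 related documents and two fixed noise models.
related_docs = [["zhou", "film", "the"], ["zhou", "concert", "the"]]
background = {"the": 0.9, "zhou": 0.01, "film": 0.05, "concert": 0.04}
domain = {"film": 0.5, "concert": 0.5}
topic_model = estimate_entity_topic_model(related_docs, domain, background)
```

In this sketch, words that the background model explains well (such as "the") or that the domain model explains well (such as "film") lose probability mass across iterations, which corresponds to the control of "background noise" and "domain-related noise" described above.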
In the above technical solution, preferably, the method further includes: after the virtual document corresponding to the target domain is generated, counting a first number of occurrences of the target entity in the virtual document corresponding to the target domain, and a second number of occurrences of each of the keywords in the virtual document corresponding to the target domain; determining a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and updating the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the keywords in that virtual document, and the domain language model is updated according to the domain prior value. The resulting domain language model is more accurate, that is, it covers each domain to which the target entity relates, which improves retrieval performance.
Another aspect of the present invention proposes a retrieval system, including: a first model creation unit, which, upon receiving a query statement for retrieving microblog documents in a microblog corpus, creates an original query model corresponding to the query statement according to the query statement; an entity recognition unit, which identifies the target entity in the query statement; a model expansion unit, which expands the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model built from each microblog document in the microblog document collection, to obtain an expanded query model; and a retrieval result determination unit, which computes the similarity between the expanded query model and the microblog document language model, so as to determine the target retrieval result of the query statement according to the similarity.
In this technical solution, when a query statement is used to retrieve microblog documents in the microblog corpus, the query statement may contain an alias of the target entity, so identifying the target entity in the query statement effectively improves retrieval performance. In addition, the original query model corresponding to the query statement is expanded to obtain an expanded query model. Retrieving microblog documents with the expanded query model returns a large number of microblog documents related to the query statement, including the information the user is interested in, which effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined by computing the similarity between the expanded query model and the microblog document language model of each microblog document, which makes the target retrieval result more accurate and also improves retrieval robustness. With this technical solution, users can therefore accurately retrieve the target retrieval result from microblog documents, which improves accuracy. Here, the target entity is the target keyword in the query statement that the user wants to query; for example, the target entity in the query statement "Zhou Jielun new film" is "Zhou Jielun", while "new" and "film" are other entities, that is, words in the ordinary sense.
In the above technical solution, preferably, the retrieval result determination unit includes: a similarity computation unit, which computes the similarity between the expanded query model and the microblog document language model by the following formula, and takes the target microblog documents whose similarity is greater than or equal to a preset similarity as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}\left(\hat{\theta}_{Q'} \,\|\, \hat{\theta}_{D}\right) \propto \sum_{w \in V} p(w \mid \hat{\theta}_{Q'}) \times \log p(w \mid \hat{\theta}_{D})$$
where $\mathrm{Score}(Q, D)$ represents the similarity, $V$ represents all entities in the microblog document language model, $\hat{\theta}_{Q'}$ represents the expanded query model, $\hat{\theta}_{D}$ represents the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ represents the probability of entity $w$ in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ represents the probability of entity $w$ in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these may include much information the user cares little about, or the information may not be ordered by priority, so that information of little interest may be ranked ahead of the information the user cares most about. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to how high the similarity is, filters out information that is unimportant, weakly related, or of little interest to the user. This technical solution therefore improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The above formula is based on the KL divergence (Kullback-Leibler divergence, also called relative entropy). "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun's new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film", and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit is specifically configured to compute the expanded query model according to the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E})$$
where $\hat{\theta}_{Q'}$ represents the expanded query model, $\hat{\theta}_{Q}$ represents the original query model, $\hat{\theta}_{E}$ represents the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ represents the probability of entity $w$ in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ represents the probability of entity $w$ in the original query model, $p(w \mid \hat{\theta}_{E})$ represents the probability of entity $w$ in the target entity topic model, and $\alpha$ represents the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs. The original query model therefore needs to be expanded into an expanded query model. Retrieving microblog documents with the expanded query model returns a large number of microblog documents related to the query statement, including the information the user is interested in, which effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval performance.
In the above technical solution, preferably, the system further includes: a parameter update unit, which, in response to a received update command, updates $\alpha$ according to the following formula to obtain $\alpha'$:
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
where $w$ represents the target entity, $E$ represents all entities in the target entity model, $Q$ represents all entities in the query statement, $w_1$ represents any entity in the query statement, $\mathrm{IDF}(w)$ represents the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ represents the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the importance of the same target entity differs between query statements, and the initial interpolation parameter $\alpha$ is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter $\alpha$ therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated $\alpha'$, which makes the expanded query model more accurate. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun's new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film", and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit is further configured to: when there are multiple target entities, determine a final entity topic model according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model, and the microblog document language model.
In this technical solution, when the query statement contains multiple target entities, the final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus, and retrieval is performed with the expanded query model obtained from the final entity topic model. The resulting target retrieval result is more accurate: it contains microblog documents related to each of the multiple target entities, so the target retrieval result consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, the model expansion unit is specifically configured to: according to a received first creation command, determine the final entity topic model by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\hat{\theta}_{E'}$ represents the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ represents the probability of entity $w$ in the final entity topic model, $n$ represents the number of target entities, $\hat{\theta}_{E_i}$ represents the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ represents the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ represents the probability of entity $w$ in the target entity topic model corresponding to that target entity, and $E_i$ represents the $i$-th target entity among the multiple target entities.
In this technical solution, when the query statement contains multiple target entities, the formula shows that the final entity topic model is computed from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus. Because the inverse document frequency of a target entity in the microblog corpus represents the importance of that target entity in the corpus, retrieving with the expanded query model obtained from the final entity topic model makes the target retrieval result contain microblog documents related to each of the multiple target entities, and the result is determined according to the importance of each target entity in the corpus. The target retrieval result is thus the information the user wants to retrieve, which improves the retrieval effect. Here, the inverse document frequency (Inverse Document Frequency, IDF) measures the importance of a target entity. The IDF of a target entity can be obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents that contain the target entity and then taking the logarithm of the quotient. The IDF of the target entity also affects the updated initial interpolation parameter.
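The IDF-weighted combination in the formula above can be sketched in Python as follows. This is only an illustration: the function name `combine_entity_models`, the dictionary representation of a topic model and the sample numbers are assumptions, not part of the patent text.

```python
from typing import Dict, List


def combine_entity_models(
    models: List[Dict[str, float]], idfs: List[float]
) -> Dict[str, float]:
    """IDF-weighted average of the per-entity topic models (hypothetical helper)."""
    if not models or len(models) != len(idfs):
        raise ValueError("need one IDF value per entity topic model")
    total_idf = sum(idfs)
    if total_idf <= 0:
        raise ValueError("the IDF values must sum to a positive number")
    # Union of all words that appear in any of the entity topic models.
    vocab = set().union(*models)
    return {
        w: sum(idf * m.get(w, 0.0) for m, idf in zip(models, idfs)) / total_idf
        for w in vocab
    }


# Two made-up entity topic models with IDF values 1.0 and 3.0.
theta_a = {"film": 0.5, "music": 0.5}
theta_b = {"film": 1.0}
final_model = combine_entity_models([theta_a, theta_b], [1.0, 3.0])
```

Because each input model sums to 1 and the weights are normalised by the total IDF, the combined model also sums to 1; the entity with the larger IDF dominates the result.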
In the above technical solution, preferably, the system further includes: a second model creating unit, configured to, according to a received second creation command, create the target entity topic model corresponding to the target entity by the following process: when the corpus database holding the microblog corpus receives the target entity, extract M microblog documents related to the target entity from the microblog corpus according to the target entity; according to the target domain to which the target entity belongs, search a target domain knowledge base connected to the corpus database for multiple keywords related to the target domain, where the multiple keywords include the target entity; generate a virtual document corresponding to the target domain from the multiple keywords; build a domain language model from the virtual document, and build a background language model from all the entities in each microblog document in the microblog corpus; traverse the M microblog documents using the domain language model, the background language model and an initial entity model corresponding to the target entity, and perform N iterations to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are both positive integers.
In this technical solution, the domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is then performed with the expanded query model obtained by extension with the target entity topic model, a large number of microblog documents related to the query statement can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and improves the retrieval effect. Here, "all entities" means all the words in each microblog document of the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the second model creating unit further includes: a count statistics unit, configured to, after the virtual document corresponding to the target domain is generated, count a first number of occurrences of the target entity in the virtual document corresponding to the target domain, and a second number of occurrences of each of the multiple keywords in the virtual document corresponding to the target domain; a prior value determining unit, configured to determine a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and a domain model updating unit, configured to update the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the multiple keywords in that virtual document. The domain language model is then updated according to the domain prior value, which makes the domain language model more accurate, that is, the domain language model covers each domain of the target entity, and this in turn improves the retrieval effect.
With the technical solution of the present invention, the user can accurately retrieve the target retrieval result from microblog documents, which improves retrieval efficiency and accuracy and also enhances the robustness of retrieval.
Brief description of the drawings
Fig. 1 shows a schematic flowchart of a search method according to an embodiment of the present invention;
Fig. 2 shows a schematic flowchart of a search method according to another embodiment of the present invention;
Fig. 3 shows a schematic flowchart of preliminarily acquiring microblog documents according to an embodiment of the present invention;
Fig. 4 shows a schematic flowchart of determining the target entity topic model according to an embodiment of the present invention;
Fig. 5 shows a schematic diagram of the principle of the target entity topic model according to an embodiment of the present invention;
Fig. 6 shows a schematic flowchart of determining the expanded query model and the target retrieval result according to an embodiment of the present invention;
Fig. 7 shows a schematic structural diagram of a search system according to an embodiment of the present invention;
Fig. 8 shows a schematic structural diagram of a search system according to another embodiment of the present invention.
Detailed description of the embodiments
In order that the above objects, features and advantages of the present invention can be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, so the scope of protection of the present invention is not limited by the specific embodiments disclosed below.
Fig. 1 shows a schematic flowchart of a search method according to an embodiment of the present invention.
As shown in Fig. 1, a search method according to an embodiment of the present invention includes:
Step 102, upon receiving a query statement for retrieving microblog documents in the microblog corpus, create an original query model corresponding to the query statement according to the query statement.
Step 104, identify the target entity in the query statement.
Step 106, extend the original query model according to the target entity topic model corresponding to the target entity, the original query model, and the microblog document language model built from each microblog document in the microblog document collection, to obtain the expanded query model.
Step 108, compute the similarity between the expanded query model and the microblog document language model, and determine the target retrieval result of the query statement according to the similarity.
In this technical solution, when a query statement is used to retrieve microblog documents in the microblog corpus, the query statement includes an alias of the target entity, so identifying the target entity in the query statement effectively improves the retrieval effect. In addition, the original query model corresponding to the query statement is extended to obtain the expanded query model. When microblog documents are retrieved with the expanded query model, a large number of microblog documents related to the query statement can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and makes the retrieval more comprehensive. The target retrieval result is determined by computing the similarity between the expanded query model and the microblog document language model corresponding to each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, the user can therefore accurately retrieve the target retrieval result from microblog documents, which improves retrieval accuracy. Here, the target entity is a keyword in the query statement; for example, the target entity in the query statement "Zhou Jielun new film" is "Zhou Jielun".
In the above technical solution, preferably, the similarity between the expanded query model and the microblog document language model is computed by the following formula, and the target microblog documents whose similarity is greater than or equal to a preset similarity are taken as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}(\hat{\theta}_{Q'} \,\|\, \hat{\theta}_{D}) \propto \sum_{w \in V} p(w \mid \hat{\theta}_{Q'}) \times \log p(w \mid \hat{\theta}_{D})$$
Where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all the entities in the microblog document language model, $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{D}$ denotes the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability that the target entity takes in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ denotes the probability that the target entity takes in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these documents may contain much information that the user cares little about, or the information may not be arranged in any order of priority, so that information the user cares little about may be ranked ahead of information the user cares about most. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to how high that similarity is, filters out information that is unimportant, weakly related or of little interest to the user. This technical solution therefore improves the matching accuracy of the retrieval result and further improves the accuracy of the target retrieval result. The formula above is the KL distance (Kullback-Leibler divergence, also known as relative entropy) calculation. Here, "all entities" means all the words in each microblog document of the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
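A minimal Python sketch of the scoring and thresholding described above is given below. It computes the rank-equivalent sum on the right-hand side of the formula. The names `score` and `retrieve`, the probability `floor` used in place of smoothing for unseen words, and the sample data are assumptions made for illustration; the patent does not specify how zero probabilities are handled.

```python
import math
from typing import Dict, List, Tuple


def score(
    query_model: Dict[str, float], doc_model: Dict[str, float], floor: float = 1e-10
) -> float:
    """Sum of p(w | query) * log p(w | document) over the query vocabulary."""
    # The floor is an assumed stand-in for smoothing; it avoids log(0).
    return sum(
        p_q * math.log(max(doc_model.get(w, 0.0), floor))
        for w, p_q in query_model.items()
        if p_q > 0.0
    )


def retrieve(
    query_model: Dict[str, float],
    doc_models: Dict[str, Dict[str, float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep documents whose score reaches the preset similarity, best first."""
    scored = [(doc_id, score(query_model, m)) for doc_id, m in doc_models.items()]
    kept = [(doc_id, s) for doc_id, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


expanded_query = {"Zhou Jielun": 0.5, "film": 0.5}
documents = {
    "d1": {"Zhou Jielun": 0.5, "film": 0.5},
    "d2": {"film": 0.5, "excellent": 0.5},
}
results = retrieve(expanded_query, documents, threshold=-2.0)
```

In this toy run only `d1` passes the threshold, because `d2` never mentions the target entity and is penalised heavily by the floor probability.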
In the above technical solution, preferably, the expanded query model is computed according to the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E})$$
Where $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{Q}$ denotes the original query model, $\hat{\theta}_{E}$ denotes the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability that the target entity takes in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ denotes the probability that the target entity takes in the original query model, $p(w \mid \hat{\theta}_{E})$ denotes the probability that the target entity takes in the target entity model, and $\alpha$ denotes the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs, so the original query model must be extended to obtain the expanded query model. When microblog documents are retrieved with the expanded query model, a large number of microblog documents related to the query statement can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents, makes the retrieval more comprehensive and further improves the retrieval effect.
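The linear interpolation in the formula above can be sketched as follows. The function name `expand_query_model`, the dictionary representation of the models and the example probabilities are illustrative assumptions.

```python
from typing import Dict


def expand_query_model(
    query_model: Dict[str, float], entity_model: Dict[str, float], alpha: float
) -> Dict[str, float]:
    """Interpolate the original query model with the target entity topic model."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Words missing from one model contribute probability 0 from that model.
    vocab = set(query_model) | set(entity_model)
    return {
        w: (1.0 - alpha) * query_model.get(w, 0.0) + alpha * entity_model.get(w, 0.0)
        for w in vocab
    }


original = {"Zhou Jielun": 0.5, "film": 0.5}
entity_topics = {"Zhou Jielun": 0.4, "album": 0.6}
expanded = expand_query_model(original, entity_topics, alpha=0.5)
```

The expanded model gives weight to "album", a word that never occurs in the query, which is how the extension lets the query match documents the original query model would miss.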
In the above technical solution, preferably, according to a received update command, α is updated according to the following formula to obtain α':
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
Where $w$ denotes the target entity, $E$ denotes all the entities in the target entity model, $Q$ denotes all the entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the importance of the same target entity differs between query statements, and the initial interpolation parameter α is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter α therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined from the updated α', which makes the expanded query model more accurate. Here, "all entities" means all the words in each microblog document of the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
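The update of α can be sketched as below. The sketch assumes that the set E is supplied as the list of target-entity terms and Q as the list of all terms in the query statement, and that IDF values have been computed beforehand; the name `update_alpha` and the sample IDF values are hypothetical.

```python
from typing import Dict, Iterable


def update_alpha(
    alpha: float,
    entity_terms: Iterable[str],
    query_terms: Iterable[str],
    idf: Dict[str, float],
) -> float:
    """Scale alpha by the ratio of summed entity IDF to summed query IDF."""
    numerator = sum(idf[w] for w in entity_terms)
    denominator = sum(idf[w] for w in query_terms)
    if denominator <= 0:
        raise ValueError("the query terms must have a positive total IDF")
    return alpha * numerator / denominator


# Made-up IDF values for the query "Zhou Jielun new film".
idf_table = {"Zhou Jielun": 3.0, "new": 0.5, "film": 1.5}
alpha_updated = update_alpha(
    0.5, ["Zhou Jielun"], ["Zhou Jielun", "new", "film"], idf_table
)
```

With these numbers the target entity carries 3.0 of the query's total IDF of 5.0, so α = 0.5 becomes α' = 0.3; a query in which the entity is relatively less distinctive would shrink α further.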
In the above technical solution, preferably, when there are multiple target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created from the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when the query statement contains multiple target entities, the final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus, and retrieval is performed with the expanded query model obtained from the final entity topic model. The target retrieval result is therefore more accurate: it contains microblog documents related to each of the multiple target entities, so that the result consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, according to a received first creation command, the final entity topic model is determined by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
Where $\hat{\theta}_{E'}$ denotes the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ denotes the probability that each target entity takes in the final entity topic model, $n$ denotes the number of target entities, $\hat{\theta}_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ denotes the probability that each target entity takes in the target entity topic model corresponding to that target entity, and $E_i$ denotes the i-th target entity among the multiple target entities.
In this technical solution, when the query statement contains multiple target entities, the formula shows that the final entity topic model is computed from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus. Because the inverse document frequency of a target entity in the microblog corpus represents the importance of that target entity in the corpus, retrieving with the expanded query model obtained from the final entity topic model makes the target retrieval result contain microblog documents related to each of the multiple target entities, and the result is determined according to the importance of each target entity in the corpus. The target retrieval result is thus the information the user wants to retrieve, which improves the retrieval effect. Here, the inverse document frequency (Inverse Document Frequency, IDF) measures the importance of a target entity. The IDF of a target entity can be obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents that contain the target entity and then taking the logarithm of the quotient. The IDF of the target entity also affects the updated initial interpolation parameter.
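The IDF computation described in the preceding paragraph (total number of documents divided by the number of documents containing the entity, then the logarithm) can be sketched as follows. The natural logarithm is an assumption, since the text does not state the base, and the function name and toy corpus are illustrative only.

```python
import math
from typing import List, Set


def inverse_document_frequency(entity: str, corpus: List[Set[str]]) -> float:
    """log(total documents / documents containing the entity)."""
    containing = sum(1 for doc_entities in corpus if entity in doc_entities)
    if containing == 0:
        raise ValueError("the entity does not occur in the corpus")
    # Natural logarithm: an assumed choice, the base is not given in the text.
    return math.log(len(corpus) / containing)


# Each microblog document is represented by the set of entities it contains.
toy_corpus = [
    {"Zhou Jielun", "new", "film"},
    {"film", "excellent"},
    {"concert"},
    {"Zhou Jielun", "concert"},
]
idf_entity = inverse_document_frequency("Zhou Jielun", toy_corpus)
idf_rare = inverse_document_frequency("excellent", toy_corpus)
```

An entity that occurs in fewer documents receives a larger IDF, which matches the text's reading of IDF as a measure of importance.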
In the above technical solution, preferably, according to a received second creation command, the target entity topic model corresponding to the target entity is created by the following process: when the corpus database holding the microblog corpus receives the target entity, M microblog documents related to the target entity are extracted from the microblog corpus according to the target entity; according to the target domain to which the target entity belongs, a target domain knowledge base connected to the corpus database is searched for multiple keywords related to the target domain, where the multiple keywords include the target entity; a virtual document corresponding to the target domain is generated from the multiple keywords; a domain language model is built from the virtual document, and a background language model is built from all the entities in each microblog document in the microblog corpus; the M microblog documents are traversed using the domain language model, the background language model and an initial entity model corresponding to the target entity, and N iterations are performed to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are both positive integers.
In this technical solution, the domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is then performed with the expanded query model obtained by extension with the target entity topic model, a large number of microblog documents related to the query statement can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and improves the retrieval effect. Here, "all entities" means all the words in each microblog document of the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the method further includes: after the virtual document corresponding to the target domain is generated, counting a first number of occurrences of the target entity in the virtual document corresponding to the target domain, and a second number of occurrences of each of the multiple keywords in the virtual document corresponding to the target domain; determining a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and updating the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the multiple keywords in that virtual document. The domain language model is then updated according to the domain prior value, which makes the domain language model more accurate, that is, the domain language model covers each domain of the target entity, and this in turn improves the retrieval effect.
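The text does not give the formula that turns the two occurrence counts into the domain prior value. One plausible reading, shown here purely as an assumption, is the ratio of the target entity's count to the total count of all keywords in the virtual document. The function name `domain_prior` and the sample virtual document are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def domain_prior(
    virtual_document: List[str], target_entity: str, keywords: Iterable[str]
) -> float:
    """Assumed prior: entity count divided by the total keyword count."""
    counts = Counter(virtual_document)
    first_count = counts[target_entity]  # occurrences of the target entity
    # Sum of the occurrences of every distinct keyword (the "second" counts).
    keyword_total = sum(counts[k] for k in set(keywords))
    if keyword_total == 0:
        return 0.0
    return first_count / keyword_total


virtual_doc = ["Zhou Jielun", "film", "Zhou Jielun", "album", "concert"]
domain_keywords = ["Zhou Jielun", "film", "album", "concert"]
prior = domain_prior(virtual_doc, "Zhou Jielun", domain_keywords)
```

Under this assumed definition the prior is 2 / 5 = 0.4; a different normalisation would be equally compatible with the wording of the text.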
Fig. 2 shows a schematic flowchart of a search method according to another embodiment of the present invention.
As shown in Fig. 2, a search method according to another embodiment of the present invention includes:
Step 202, obtain all microblog documents in the microblog stream.
Step 204, build the microblog document language model from each microblog document, and go to step 218.
Step 206, obtain the microblog corpus from the microblog stream, where the microblog corpus includes microblog documents.
Step 208, identify all entities in the microblog documents, for example with the entity recognition tool TwitterNLP, and build an entity index for each of the entities, where each entity corresponds to a list of microblog documents sorted in chronological order.
Step 210, identify the target entity in the query statement.
Step 212, estimate the target entity topic model of the target entity, and go to step 216.
Step 214, upon receiving a query statement for retrieving microblog documents in the microblog corpus, create the original query model corresponding to the query statement from the query statement by maximum likelihood estimation.
Step 216, extend the original query model according to the target entity topic model and the original query model (that is, according to the target entity topic model corresponding to the target entity, the original query model, and the microblog document language model built from each microblog document in the microblog document collection) to obtain the expanded query model.
Step 218, perform the KL distance calculation on the expanded query model and the microblog document language model built from each microblog document in the microblog document collection (that is, compute the similarity between the expanded query model and the microblog document language model).
Step 220, determine the target retrieval result of the query statement according to the similarity.
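Step 214 builds the original query model by maximum likelihood estimation. A minimal sketch is given below, assuming the query statement has already been split into entities; the function name `mle_language_model` and the tokenised query are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


def mle_language_model(tokens: List[str]) -> Dict[str, float]:
    """Maximum likelihood estimate: relative frequency of each token."""
    if not tokens:
        raise ValueError("cannot estimate a language model from an empty input")
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


# The query "Zhou Jielun new film", assumed to be segmented into three entities.
original_query_model = mle_language_model(["Zhou Jielun", "new", "film"])
```

Each of the three entities receives probability 1/3. The same estimator could be applied to a microblog document in step 204, although in practice a document model is usually smoothed, which the text does not describe.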
Fig. 3 shows a schematic flowchart of preliminarily acquiring microblog documents according to an embodiment of the present invention.
As shown in Fig. 3, preliminarily acquiring microblog documents according to an embodiment of the present invention includes:
Step 302, identify all entities in the microblog corpus.
Step 304, build an entity index for each of the entities, where each entity corresponds to a list of microblog documents sorted in chronological order.
Step 306, search the entity index for the M microblog documents related to the target entity, where these M microblog documents are the most recently published microblog documents in the entity index.
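Steps 302 to 306 can be sketched as a small in-memory index. Entity recognition is assumed to have been done already, so each document arrives with its entity set; the `Microblog` record, the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class Microblog(NamedTuple):
    doc_id: str
    timestamp: int  # publication time, larger means more recent
    entities: FrozenSet[str]  # entities already recognised in the document


def build_entity_index(docs: List[Microblog]) -> Dict[str, List[Microblog]]:
    """Map each entity to its documents, sorted in chronological order."""
    index: Dict[str, List[Microblog]] = defaultdict(list)
    for doc in docs:
        for entity in doc.entities:
            index[entity].append(doc)
    for postings in index.values():
        postings.sort(key=lambda d: d.timestamp)
    return dict(index)


def latest_documents(
    index: Dict[str, List[Microblog]], target_entity: str, m: int
) -> List[Microblog]:
    """Return the M most recently published documents for the target entity."""
    postings = index.get(target_entity, [])
    return postings[-m:] if m > 0 else []


stream = [
    Microblog("d1", 1, frozenset({"Zhou Jielun", "film"})),
    Microblog("d2", 3, frozenset({"Zhou Jielun"})),
    Microblog("d3", 2, frozenset({"Zhou Jielun", "concert"})),
    Microblog("d4", 4, frozenset({"film"})),
]
entity_index = build_entity_index(stream)
recent = latest_documents(entity_index, "Zhou Jielun", 2)
```

Keeping each posting list in time order makes the "most recent M" lookup a simple slice from the end of the list.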
Fig. 4 shows a schematic flowchart of determining the target entity topic model according to an embodiment of the present invention; Fig. 5 shows a schematic diagram of the principle of the target entity topic model according to an embodiment of the present invention.
The technical solution is described in detail below with reference to Fig. 4 and Fig. 5:
As shown in Fig. 4, determining the target entity topic model according to an embodiment of the present invention includes:
Step 402, identify the target entity in the query statement.
Step 404, according to the target domain to which the target entity belongs, search the target domain knowledge base connected to the corpus database for multiple keywords related to the target domain, where the multiple keywords include the target entity.
Step 406, generate the virtual document corresponding to the target domain from the multiple keywords, and build the domain language model from the virtual document; build the background language model from all the entities in each microblog document in the microblog corpus, and build the initial entity model corresponding to the target entity. A mixture model is thus built from the domain language model, the background language model and the initial entity model, as shown in Fig. 5, and the target entity model of the target entity is derived through the process of building the mixture model. In Fig. 5, λC and λE are preset parameters, γ1 and γk denote the weights of the 1st and the k-th domain language models, EF denotes the M microblog documents of Fig. 3, and the remaining symbols denote the initial entity model, the background language model and the k domain language models, respectively.
Step 408 (equivalent to step 306), search the entity index for the M microblog documents related to the target entity (that is, extract the M microblog documents related to the target entity from the microblog corpus according to the target entity).
Step 410, traverse the M microblog documents with the EM algorithm to iteratively compute the model parameters, where EM denotes the expectation-maximization algorithm (Expectation Maximization Algorithm).
Step 412, iterate the mixture model according to the iteratively computed model parameters to obtain the target entity topic model, where the number of iterations is the preset number N. In the first iteration, the initial entity model corresponding to the target entity may be taken as approximately equal to the background language model. M ≥ 1, N ≥ 1, and M and N are both positive integers.
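Steps 406 to 412 can be illustrated with a simplified EM sketch. Fig. 5 is not reproduced in the text, so the exact mixture is an assumption: here each word of the M documents is taken to be generated with weight λC by the background model, with weight λE by the entity model, and with the remaining weight by the γ-weighted domain models, and only the entity model is re-estimated. All names and numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def estimate_entity_topic_model(
    docs: List[List[str]],
    background: Dict[str, float],
    domain_models: List[Dict[str, float]],
    gammas: List[float],
    lambda_c: float,
    lambda_e: float,
    n_iter: int,
) -> Dict[str, float]:
    """EM for the entity component of an assumed three-part mixture."""
    lambda_d = 1.0 - lambda_c - lambda_e  # weight left for the domain models
    if lambda_c < 0 or lambda_e <= 0 or lambda_d < -1e-9:
        raise ValueError("mixture weights must be valid probabilities")
    lambda_d = max(lambda_d, 0.0)
    counts = Counter(w for doc in docs for w in doc)
    if not counts:
        raise ValueError("no words in the M documents")
    vocab = list(counts)
    # Fixed "noise" part: background plus weighted domain language models.
    noise = {
        w: lambda_c * background.get(w, 0.0)
        + lambda_d * sum(g * m.get(w, 0.0) for g, m in zip(gammas, domain_models))
        for w in vocab
    }
    # First iteration: entity model approximately equal to the background model.
    init = {w: max(background.get(w, 0.0), 1e-12) for w in vocab}
    norm = sum(init.values())
    entity = {w: p / norm for w, p in init.items()}
    for _ in range(n_iter):
        # E-step: probability that an occurrence of w came from the entity model.
        posterior = {
            w: lambda_e * entity[w] / (lambda_e * entity[w] + noise[w]) for w in vocab
        }
        # M-step: re-estimate the entity model from the expected counts.
        weighted = {w: counts[w] * posterior[w] for w in vocab}
        total = sum(weighted.values())
        entity = {w: weighted[w] / total for w in vocab}
    return entity


m_docs = [["zhou", "film", "the"], ["zhou", "album", "the"]]
background_model = {"the": 0.5, "film": 0.2, "zhou": 0.1, "album": 0.2}
domain_model = {"film": 0.5, "album": 0.5}
entity_topic_model = estimate_entity_topic_model(
    m_docs, background_model, [domain_model], [1.0], 0.4, 0.4, 10
)
```

In this toy run the common word "the" is absorbed by the background model and "film" and "album" by the domain model, so the estimated entity model concentrates on "zhou". This is the noise-control effect the text attributes to the mixture, though the actual model of Fig. 5 may differ in detail.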
Fig. 6 shows a schematic flowchart of determining the expanded query model and the target retrieval result according to an embodiment of the present invention.
As shown in Fig. 6, determining the expanded query model and the target retrieval result according to an embodiment of the present invention includes:
Step 602, identify the target entity in the query statement.
Step 604, build the target entity topic model corresponding to the target entity, and go to step 610.
Step 606, update the initial interpolation parameter α to obtain α', and go to step 610.
Step 608, create the original query model corresponding to the query statement according to the query statement, and go to step 610.
Step 610, linearly superpose the target entity topic model, the updated interpolation parameter α' and the original query model to determine the expanded query model.
Step 612, obtain microblog documents from the microblog stream.
Step 614, build the microblog document language model from each microblog document in the microblog document collection.
Step 616, perform the KL distance calculation on the expanded query model and the microblog document language model (that is, compute the similarity between the expanded query model and the microblog document language model).
Step 618, take the target microblog documents whose similarity is greater than or equal to the preset similarity as the target retrieval result.
Fig. 7 shows a schematic structural diagram of a search system according to an embodiment of the present invention.
As shown in Fig. 7, a search system 700 according to an embodiment of the present invention includes a first model creating unit 702, an entity recognition unit 704, a model extension unit 706 and a retrieval result determining unit 708. The first model creating unit 702 is configured to, upon receiving a query statement for retrieving microblog documents in the microblog corpus, create an original query model corresponding to the query statement according to the query statement; the entity recognition unit 704 is configured to identify the target entity in the query statement; the model extension unit 706 is configured to extend the original query model according to the target entity topic model corresponding to the target entity, the original query model, and the microblog document language model built from each microblog document in the microblog document collection, to obtain the expanded query model; and the retrieval result determining unit 708 is configured to compute the similarity between the expanded query model and the microblog document language model, and to determine the target retrieval result of the query statement according to the similarity.
In this technical solution, when a query statement is used to retrieve microblog documents in the microblog corpus, the query statement includes an alias of the target entity, so identifying the target entity in the query statement effectively improves the retrieval effect. In addition, the original query model corresponding to the query statement is extended to obtain the expanded query model. When microblog documents are retrieved with the expanded query model, a large number of microblog documents related to the query statement can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and makes the retrieval more comprehensive. The target retrieval result is determined by computing the similarity between the expanded query model and the microblog document language model corresponding to each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, the user can therefore accurately retrieve the target retrieval result from microblog documents, which improves accuracy. Here, the target entity is the target keyword in the query statement that the user wants to query; for example, the target entity in the query statement "Zhou Jielun new film" is "Zhou Jielun", while "new" and "film" are other entities, that is, words in the ordinary sense.
In the above technical solution, preferably, the retrieval result determining unit 708 includes: a similarity statistics unit 7082, configured to compute the similarity between the expanded query model and the microblog document language model by the following formula, and to take the target microblog documents whose similarity is greater than or equal to a preset similarity as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}(\hat{\theta}_{Q'} \,\|\, \hat{\theta}_{D}) \propto \sum_{w \in V} p(w \mid \hat{\theta}_{Q'}) \times \log p(w \mid \hat{\theta}_{D});$$
where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all entities in the microblog document language model, $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{D}$ denotes the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ denotes the probability of the target entity in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these may contain much information the user cares little about, or the information may not be ordered by priority, so that information of little interest to the user is ranked ahead of information the user cares about most. Therefore, the similarity between the expanded query model and the microblog document language model is computed and the target retrieval result is determined according to how high that similarity is, which filters out information that is unimportant, weakly related, or of little interest to the user. This technical solution thus improves the matching accuracy of the retrieval result and further improves the accuracy of the target retrieval result. The above formula computes the KL distance (Kullback-Leibler divergence, also called relative entropy). "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all entities in this microblog document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
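The scoring and thresholding step described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `kl_score` and `retrieve`, the dictionary representation of the models, and the `floor` value used to avoid `log(0)` are all assumptions made here.

```python
import math

def kl_score(expanded_query, doc_model, floor=1e-12):
    """Rank-equivalent score: sum over w of p(w|theta_Q') * log p(w|theta_D).

    Both models are dicts mapping a word to its probability. Words with zero
    query probability contribute nothing, so iterating over the query model
    is equivalent to summing over the whole vocabulary V.
    """
    score = 0.0
    for word, p_q in expanded_query.items():
        if p_q <= 0.0:
            continue
        # floor is an assumed guard for words missing from the document model
        p_d = max(doc_model.get(word, 0.0), floor)
        score += p_q * math.log(p_d)
    return score

def retrieve(expanded_query, doc_models, threshold):
    """Return (doc_id, score) pairs with score >= threshold, best first."""
    scored = [(doc_id, kl_score(expanded_query, model))
              for doc_id, model in doc_models.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Because the score is a sum of logarithms of probabilities, it is never positive; a document whose language model matches the query model closely scores nearer to zero.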
In the above technical solution, preferably, the model extension unit 706 is specifically configured to compute the expanded query model according to the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E});$$
where $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{Q}$ denotes the original query model, $\hat{\theta}_{E}$ denotes the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ denotes the probability of the target entity in the original query model, $p(w \mid \hat{\theta}_{E})$ denotes the probability of the target entity in the target entity model, and $\alpha$ denotes the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs. The original query model is therefore extended to obtain the expanded query model. When microblog documents are retrieved with the expanded query model, a large number of microblog documents related to the query statement can be found, including the information the user is interested in. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval performance.
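A minimal Python sketch of this linear interpolation is given below. The function name `expand_query` and the dictionary representation of the models are assumptions for illustration only.

```python
def expand_query(query_model, entity_topic_model, alpha):
    """Interpolate p(w|theta_Q') = (1 - alpha) * p(w|theta_Q) + alpha * p(w|theta_E).

    Both inputs are dicts mapping a word to its probability; a word missing
    from one model is treated as having probability 0 in that model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocabulary = set(query_model) | set(entity_topic_model)
    return {
        word: (1.0 - alpha) * query_model.get(word, 0.0)
              + alpha * entity_topic_model.get(word, 0.0)
        for word in vocabulary
    }
```

Words that occur only in the entity topic model enter the expanded query with weight `alpha * p(w|theta_E)`, which is how the expansion brings in documents that the original query terms alone would miss.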
In the above technical solution, preferably, the system further includes a parameter updating unit 710, which, according to a received update command, updates $\alpha$ by the following formula to obtain $\alpha'$:
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
where $w$ denotes the target entity, $E$ denotes all entities in the target entity model, $Q$ denotes all entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the same target entity differs in importance across different query statements, and the initial interpolation parameter $\alpha$ is related to the target entity model corresponding to the target entity. Therefore, when different query statements are retrieved, the initial interpolation parameter $\alpha$ needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined using the updated $\alpha'$, which makes the expanded query model more accurate. "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all entities in this microblog document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
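The adaptive update of the interpolation parameter can be sketched as below. The function name `adaptive_alpha` and the pre-computed `idf` lookup table are assumptions; the IDF values in the usage note are made up for illustration.

```python
def adaptive_alpha(alpha, entity_terms, query_terms, idf):
    """Scale alpha by the share of the query's total IDF carried by the entity terms.

    entity_terms: words belonging to the target entity (the set E).
    query_terms:  all words in the query statement (the set Q).
    idf:          dict mapping a word to its inverse document frequency.
    """
    query_mass = sum(idf[word] for word in query_terms)
    if query_mass == 0:
        raise ValueError("query terms carry no IDF mass")
    entity_mass = sum(idf[word] for word in entity_terms)
    return alpha * entity_mass / query_mass
```

For instance, with hypothetical IDF values of 3.0 for "Zhou Jielun", 1.0 for "new" and 2.0 for "film", an initial `alpha` of 0.5 becomes 0.5 × 3.0 / 6.0 = 0.25 for the query "Zhou Jielun new film".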
In the above technical solution, preferably, the model extension unit 706 is further configured to: when there are multiple target entities, determine a final entity topic model according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so as to create the expanded query model using the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when the query statement contains multiple target entities, the final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus. Retrieval with the expanded query model obtained from the final entity topic model yields a more accurate target retrieval result: the result contains microblog documents related to each of the multiple target entities, so that it consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, the model extension unit 706 is specifically configured to determine, according to a received first creation command, the final entity topic model by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\hat{\theta}_{E'}$ denotes the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ denotes the probability of each target entity in the final entity topic model, $n$ denotes the number of target entities, $\hat{\theta}_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the $i$-th target entity among the multiple target entities.
In this technical solution, when the query statement contains multiple target entities, the formula shows that the final entity topic model is computed from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus. Since the inverse document frequency of a target entity in the microblog corpus represents its importance in that corpus, retrieval with the expanded query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the multiple target entities, and the result is determined according to the importance of each target entity in the microblog corpus. The target retrieval result is therefore the information the user wants to retrieve, which improves retrieval performance. The inverse document frequency (IDF) measures the importance of a target entity. The IDF of a target entity can be obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents containing the target entity and taking the logarithm of the quotient. The IDF of the target entity affects the updated initial interpolation parameter.
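The IDF computation and the IDF-weighted average of several entity topic models can be sketched as follows. The function names, the representation of each document as a set of words, and the use of the natural logarithm are assumptions; the text above does not fix the logarithm base.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF = log(total number of documents / number of documents containing term).

    documents is a list of sets of words, one set per microblog document.
    """
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(documents) / containing)

def combine_entity_models(entity_models, idf_values):
    """IDF-weighted average of the per-entity topic models (dicts word -> probability)."""
    total_idf = sum(idf_values)
    vocabulary = set().union(*entity_models)
    return {
        word: sum(weight * model.get(word, 0.0)
                  for weight, model in zip(idf_values, entity_models)) / total_idf
        for word in vocabulary
    }
```

Because the weights are normalised by their sum, the combined model is again a probability distribution whenever each input model is one, and rarer entities (higher IDF) contribute more to it.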
In the above technical solution, preferably, the system further includes a second model creating unit 712, configured to create, according to a received second creation command, the target entity topic model corresponding to the target entity by the following process. When the corpus database holding the microblog corpus receives the target entity, M microblog documents related to the target entity are extracted from the microblog corpus according to the target entity. According to the target domain to which the target entity belongs, multiple keywords related to the target domain are searched for in a target domain knowledge base connected to the corpus database, where the multiple keywords include the target entity. A virtual document corresponding to the target domain is generated from the multiple keywords, a domain language model is built from the virtual document, and a background language model is built from all entities in each microblog document in the microblog corpus. The domain language model, the background language model and an initial entity model corresponding to the target entity are then used to traverse the M microblog documents, and N iterations are performed to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the established domain language model, background language model and initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is performed with the expanded query model obtained by extension with the target entity topic model, a large number of microblog documents related to the query statement can be found, including the information the user is interested in, which effectively avoids missing relevant microblog documents and improves retrieval performance. "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new film is excellent", then all entities in this microblog document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the second model creating unit further includes: a count statistics unit 7122, which, after the virtual document corresponding to the target domain is generated, counts the first occurrence number of the target entity in the virtual document corresponding to the target domain and the second occurrence number of each of the multiple keywords in that virtual document; a prior value determining unit 7124, which determines the domain prior value of the target entity from the first occurrence number and the second occurrence number; and a domain model updating unit 7126, which updates the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first occurrence number of the target entity in the virtual document corresponding to the target domain and the second occurrence number of each of the multiple keywords in that virtual document. The domain language model is then updated according to the domain prior value, so that the resulting domain language model is more accurate, that is, the domain language model covers each domain of the target entity, which improves retrieval performance.
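The counting step that produces the domain prior value amounts to a maximum likelihood estimate over the virtual document. A small sketch follows, assuming the virtual document is already available as a list of keywords; the function name `domain_prior` is hypothetical.

```python
from collections import Counter

def domain_prior(target_entity, virtual_document):
    """Domain prior p(w|d): occurrences of the target entity divided by the
    total occurrences of all keywords in the domain's virtual document."""
    counts = Counter(virtual_document)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("virtual document is empty")
    # counts[target_entity] is the first occurrence number; total is the sum
    # of the second occurrence numbers over all keywords.
    return counts[target_entity] / total
```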
Fig. 8 shows a structural schematic diagram of a retrieval system according to another embodiment of the invention.
As shown in Fig. 8, a retrieval system 800 according to another embodiment of the invention (equivalent to the retrieval system 700 of the embodiment shown in Fig. 7) includes: an entity microblog set acquisition module 802, for collecting the microblog documents related to the target entity; an entity topic model estimation module 804 (equivalent to the second model creating unit 712 of the embodiment shown in Fig. 7), for estimating the target entity topic model; and an adaptive query expansion module 806 (equivalent to the model extension unit 706 of the embodiment shown in Fig. 7), for incorporating the target entity topic model into the microblog document language model.
The modules of the retrieval system 800 are described in detail below:
1. The entity microblog set acquisition module 802 is specifically configured to identify the target entity in the query statement, build the entity index, and select the microblog documents related to the target entity.
2. The entity topic model estimation module 804 includes a knowledge base link module 8042, a prior value computing module 8044 (equivalent to the prior value determining unit 7124 of the embodiment shown in Fig. 7) and a generative model construction module 8046. The knowledge base link module 8042 links the target entity to the Freebase knowledge base and obtains the target domain to which the target entity belongs in Freebase (the domains in Freebase can be thought of as the different sections of a popular newspaper, such as business, lifestyle, art, entertainment, politics, economy and so on). The prior value computing module 8044 obtains multiple keywords related to the target domain, where the multiple keywords include the target entity, generates a virtual document corresponding to the target domain from the multiple keywords, and performs maximum likelihood estimation on this virtual document to generate the domain prior value. The generative model construction module 8046 builds the initial entity model corresponding to the target entity, the background language model and the domain language model, and uses the EM algorithm to iterate over the microblog documents to obtain the target entity topic model.
3. The adaptive query expansion module 806 models the query statement to obtain the original query model, models each microblog document in the microblog document set to obtain the microblog document language model, extends the original query model with the target entity topic model to obtain the expanded query model, and computes the KL distance between the expanded query model and the microblog document language model so as to obtain the target retrieval result from the computed values. The technical solution is explained in further detail below:
One: Identify entities.
1. Use the entity recognition tool TwitterNLP to identify all entities in the microblog documents.
2. Build the entity index, in which each entity corresponds to a chronologically ordered list of microblog documents.
3. Identify the target entity in the query statement, and obtain from the entity index the M most recently published microblog documents that contain this target entity.
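The entity index and the lookup of the M most recent documents described in the steps above can be sketched as follows. Entity recognition itself is left outside the sketch: each document is assumed to arrive with its entities already extracted by whatever tool is used, and the tuple layout and function names are hypothetical.

```python
from collections import defaultdict

def build_entity_index(documents):
    """Map each entity to a time-ordered list of (timestamp, doc_id) postings.

    documents is an iterable of (doc_id, timestamp, entities) tuples, where
    entities is the list of entities already recognised in that document.
    """
    index = defaultdict(list)
    for doc_id, timestamp, entities in documents:
        for entity in set(entities):  # index each document once per entity
            index[entity].append((timestamp, doc_id))
    for postings in index.values():
        postings.sort()  # oldest first
    return index

def latest_documents(index, entity, m):
    """Return the ids of the m most recently published documents for entity."""
    if m <= 0:
        return []
    postings = index.get(entity, [])
    return [doc_id for _, doc_id in postings[-m:]][::-1]  # newest first
```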
Two: Build the target entity topic model.
1. Link the target entity to the Freebase knowledge base (the target domain knowledge base) and read the entity information of the target entity in Freebase to obtain the target domains to which the target entity belongs (for example the music domain, the art domain, the books domain). In particular, if no entity information is linked for the target entity, the target entity is considered to belong to any domain.
2. Compute the domain prior values. All entities in the entity index are linked to the Freebase knowledge base by trying the Freebase search interface, and the attribute words and type words under each domain are assembled into a virtual document (that is, multiple keywords related to the target domain are searched for in the target domain knowledge base connected to the corpus database, where the multiple keywords include the target entity, and a virtual document corresponding to the target domain is generated from the multiple keywords). Maximum likelihood estimation is performed on this virtual document with the following formula to generate the domain prior value:
$$p(w \mid d) = \frac{c(w, d)}{\sum_{n} c(w_2, d)}$$
where $w$ denotes the target entity, $d$ denotes the target domain to which the target entity belongs, $w_2$ denotes each of the multiple keywords, $c(w, d)$ denotes the first occurrence number of $w$ in the virtual document corresponding to the target domain $d$, $c(w_2, d)$ denotes the second occurrence number of each of the multiple keywords in the virtual document corresponding to the target domain, and $n$ denotes the total number of keywords.
3. Build the target entity topic model. A domain language model is built from the virtual document, a background language model is built from all entities in each microblog document in the microblog corpus, and an initial entity model corresponding to the target entity is built, where the initial entity model may be similar to the background language model. The domain language model, the background language model and the initial entity model together form a mixture model.
4. Use the EM algorithm for model estimation. According to the mixture model shown in Fig. 5, the log-likelihood function of the returned set EF of M microblog documents can be expressed as:
$$\log p(EF \mid \hat{\theta}) = \sum_{i} \sum_{w} c(w, D_i) \times \log \Big\{ \lambda_E \big[ (1 - \lambda_C) \times p(w \mid \hat{\theta}_E) + \lambda_C \times p(w \mid \hat{\theta}_C) \big] + (1 - \lambda_E) \times \sum_{d=1}^{k} \gamma_d \, p(w \mid \hat{\theta}_d) \Big\}$$
where EF denotes the M microblog documents retrieved above, $i$ traverses all microblog documents in the microblog corpus, $w$ denotes each of the entities in each microblog document in the microblog corpus, $D_i$ denotes the $i$-th microblog document in the microblog corpus, $k$ denotes the number of target domains to which the target entity belongs, $p(w \mid \hat{\theta}_E)$ denotes the probability of $w$ in the target entity model, $p(w \mid \hat{\theta}_C)$ denotes the frequency of word $w$ in the background language model, $p(w \mid \hat{\theta}_d)$ denotes the frequency of word $w$ in the domain language model, $c(w, D_i)$ is the number of times word $w$ appears in $D_i$, $\lambda_C$ denotes the first preset parameter, $\lambda_E$ denotes the second preset parameter, $\lambda_C$ and $\lambda_E$ are used to control background noise and domain-related noise respectively, and $\gamma_d$ denotes the weight of the target domain language model.
Maximum likelihood estimation is performed on the mixture model with the EM algorithm, iteratively updating the parameters over the microblog set EF, which yields the following formulas:
$$t_d^{(n)}(w) = \frac{(1 - \lambda_E) \times \gamma_d^{(n)} \times p^{(n)}(w \mid \hat{\theta}_d)}{\lambda_E \times \big[ (1 - \lambda_C) \times p^{(n)}(w \mid \hat{\theta}_E) + \lambda \times p(w \mid \hat{\theta}_C) \big] + (1 - \lambda_E) \times \sum_{d'=1}^{k} \gamma_{d'}^{(n)} \times p^{(n)}(w \mid \hat{\theta}_{d'})}$$
$$s^{(n)}(w) = \frac{\lambda_E \times \big[ (1 - \lambda_C) \times p^{(n)}(w \mid \hat{\theta}_E) + \lambda \times p(w \mid \hat{\theta}_C) \big]}{\lambda_E \times \big[ (1 - \lambda_C) \times p^{(n)}(w \mid \hat{\theta}_E) + \lambda \times p(w \mid \hat{\theta}_C) \big] + (1 - \lambda_E) \times \sum_{d'=1}^{k} \gamma_{d'}^{(n)} \times p^{(n)}(w \mid \hat{\theta}_{d'})}$$
$$r^{(n)}(w) = \frac{(1 - \lambda_C) \times p^{(n)}(w \mid \hat{\theta}_E)}{(1 - \lambda_C) \times p^{(n)}(w \mid \hat{\theta}_E) + \lambda \times p(w \mid \hat{\theta}_C)}$$
$$p^{(n+1)}(w \mid \hat{\theta}_d) = \frac{\sum_{i} c(w, D_i) \times t_d^{(n)}(w)}{\sum_{w'} \sum_{i} \sum_{d'=1}^{k} c(w', D_i) \times t_{d'}^{(n)}(w')}$$
$$p^{(n+1)}(w \mid \hat{\theta}_E) = \frac{\sum_{i} c(w, D_i) \times r^{(n)}(w) \times s^{(n)}(w)}{\sum_{w'} \sum_{i} c(w', D_i) \times r^{(n)}(w') \times s^{(n)}(w')}$$
$$\gamma_d^{(n+1)} = \frac{\sum_{w} \sum_{i} c(w, D_i) \times t_d^{(n)}(w)}{\sum_{w} \sum_{i} \sum_{d'=1}^{k} c(w, D_i) \times t_{d'}^{(n)}(w)}$$
where $n$ denotes the number of the current iteration, $w$ denotes the target entity, $w'$ denotes each of the entities in the microblog corpus, $d'$ denotes each of the domains, $t_d^{(n)}(w)$, $s^{(n)}(w)$ and $r^{(n)}(w)$ are intermediate variables introduced for convenience of calculation, $p^{(n+1)}(w \mid \hat{\theta}_d)$ denotes the probability of $w$ in the domain language model at iteration $(n+1)$, $p^{(n+1)}(w \mid \hat{\theta}_E)$ denotes the probability of $w$ in the entity topic model at iteration $(n+1)$, and $\gamma_d^{(n+1)}$ denotes the weight of the domain language model at iteration $(n+1)$. In the summation subscripts, $w$ and $w'$ traverse all entities in the microblog corpus, $i$ traverses all microblog documents in the feedback microblog set, and $d$ and $d'$ traverse all domains. $k$ denotes the number of target domains to which the target entity E belongs, and $\lambda$ denotes a preset iteration parameter (it appears in the position that $\lambda_C$ occupies in the log-likelihood function).
In addition, the domain prior value $p(w \mid d)$ of the target entity can be used when updating $p(w \mid \hat{\theta}_d)$. A conjugate prior (that is, a Dirichlet prior) is defined on each language model $p(w \mid d)$, and maximum a posteriori (MAP) estimation is then used to estimate all parameters. Only a small change to the update formula of the domain language model is needed, and MAP estimation is performed with the following formula:
$$p^{(n+1)}(w \mid \hat{\theta}_d) = \frac{\sigma_d \cdot p(w \mid d) + \sum_{i} c(w, D_i) \cdot t_d^{(n)}(w)}{\sigma_d + \sum_{w'} \sum_{i} \sum_{d'=1}^{k} c(w', D_i) \cdot t_{d'}^{(n)}(w')}$$
After the above formulas have been iterated for a number of rounds (for example 100 rounds), the target entity topic model $\hat{\theta}_E$ is obtained.
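The EM iteration above can be sketched in Python as follows. This is an illustrative sketch under several assumptions rather than the method as claimed: the function name, the default parameter values, the uniform initialisation of the entity model, and the small constant `eps` are all chosen here. It uses $\lambda_C$ wherever the printed update formulas show a bare $\lambda$, treats $\sigma_d$ as a single prior weight `sigma` shared by all domains, and normalises each domain model over that domain's own expected counts (the usual EM form), whereas the printed formula sums over all domains in the denominator.

```python
from collections import Counter

def estimate_entity_topic_model(feedback_docs, background, domain_priors,
                                lambda_e=0.5, lambda_c=0.5, sigma=1.0, rounds=100):
    """Estimate p(w|theta_E) from the feedback microblogs with EM.

    feedback_docs: list of token lists (the M documents in EF).
    background:    dict word -> p(w|theta_C).
    domain_priors: list of dicts word -> p(w|d), one per target domain.
    """
    if not (0.0 < lambda_e < 1.0 and 0.0 < lambda_c < 1.0):
        raise ValueError("lambda_e and lambda_c must lie strictly between 0 and 1")
    # t, s and r depend only on w, so counts can be pooled over documents.
    counts = Counter(w for doc in feedback_docs for w in doc)
    if not counts or not domain_priors:
        raise ValueError("need at least one token and one domain")
    vocab = list(counts)
    k = len(domain_priors)
    eps = 1e-12  # assumed floor for words unseen by a component model
    p_e = {w: 1.0 / len(vocab) for w in vocab}       # uniform start (assumption)
    p_d = [{w: prior.get(w, eps) for w in vocab} for prior in domain_priors]
    gamma = [1.0 / k] * k
    for _ in range(rounds):
        ent_counts = {}
        dom_counts = [{} for _ in range(k)]
        for w in vocab:
            # E-step: responsibilities of each mixture component for word w
            ent = (1.0 - lambda_c) * p_e[w]
            bg = lambda_c * background.get(w, eps)
            dom = [gamma[d] * p_d[d][w] for d in range(k)]
            denom = lambda_e * (ent + bg) + (1.0 - lambda_e) * sum(dom)
            s = lambda_e * (ent + bg) / denom   # entity-or-background branch
            r = ent / (ent + bg)                # entity within that branch
            ent_counts[w] = counts[w] * r * s
            for d in range(k):
                t_d = (1.0 - lambda_e) * dom[d] / denom
                dom_counts[d][w] = counts[w] * t_d
        # M-step: renormalise expected counts
        z = sum(ent_counts.values())
        p_e = {w: v / z for w, v in ent_counts.items()}
        mass = [sum(dc.values()) for dc in dom_counts]
        total = sum(mass)
        gamma = [m / total for m in mass]
        # MAP update of each domain model with the domain prior as Dirichlet mean
        p_d = [{w: (sigma * domain_priors[d].get(w, 0.0) + dom_counts[d][w])
                   / (sigma + mass[d]) for w in vocab} for d in range(k)]
    return p_e
```

The returned dictionary is the estimated entity topic model over the words of the feedback documents. Words that the background or domain models already explain well receive a smaller share of the entity model.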
Three: Adaptive query expansion.
1. When a query statement for retrieving microblog documents in the microblog corpus is received, create the original query model corresponding to the query statement, and build the microblog document language model from each microblog document in the microblog document set.
2. Extend the original query model with the target entity topic model to obtain the expanded query model. The expanded query model is computed by the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E});$$
where $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{Q}$ denotes the original query model, $\hat{\theta}_{E}$ denotes the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ denotes the probability of the target entity in the original query model, $p(w \mid \hat{\theta}_{E})$ denotes the probability of the target entity in the target entity model, and $\alpha$ denotes the initial interpolation parameter, which controls the importance of the target entity topic model.
In the related art, the initial interpolation parameter $\alpha$ is set to one fixed value for all query statements. However, since the same target entity differs in importance in different query statements, the initial interpolation parameter can be updated. $\alpha$ is updated by the following formula to obtain $\alpha'$:
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
where $w$ denotes the target entity, $E$ denotes all entities in the target entity model, $Q$ denotes all entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of that entity in the microblog corpus.
In particular, when multiple target entities are identified in the query statement, the final entity topic model is determined as the weighted average of the target entity topic models of the individual target entities. Specifically, the final entity topic model is determined by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\hat{\theta}_{E'}$ denotes the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ denotes the probability of each target entity in the final entity topic model, $n$ denotes the number of target entities, $\hat{\theta}_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the $i$-th target entity among the multiple target entities.
3. Compute the KL distance (that is, the similarity between the expanded query model and the microblog document language model). The similarity between the expanded query model and the microblog document language model is computed by the following formula, and the target microblog documents whose similarity is greater than or equal to a preset similarity are taken as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}(\hat{\theta}_{Q'} \,\|\, \hat{\theta}_{D}) \propto \sum_{w \in V} p(w \mid \hat{\theta}_{Q'}) \times \log p(w \mid \hat{\theta}_{D});$$
where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all entities in the microblog document language model, $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{D}$ denotes the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ denotes the probability of the target entity in the microblog document language model.
The invention is further described below with reference to an embodiment:
1) Preprocessing stage: an entity recognition tool is applied to every microblog document in the microblog stream to identify all the entities it contains. For example, if the microblog document is "Zhou Jielun's new film is really well made" and the entity "Zhou Jielun" is identified, the number (id) of this microblog is stored under the corresponding entity item in the entity index. For the target entity, the M most recently added microblog documents are obtained from the entity index as the microblog corpus.
2) First, for the target entity "Zhou Jielun", the Freebase search interface is used to try to link it to an object in the Freebase knowledge base and to obtain the target domains to which it belongs, namely film, music, TV, people, media and awards.
Build the mixture model, which includes the initial entity topic model corresponding to "Zhou Jielun", the background language model and six domain language models.
Use the domain language model, the background language model and the initial entity model corresponding to the target entity to traverse the M microblog documents, and perform N iterations to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are positive integers.
3) Maximum likelihood modeling is performed on the query statement and on each microblog document. For example, for the query statement "Zhou Jielun new film", word segmentation yields ["Zhou Jielun", "new", "film"], and the original query model is created by maximum likelihood estimation: p(Zhou Jielun) = 0.33, p(new) = 0.33, p(film) = 0.33. The microblog document language model is built from each microblog document, where the maximum likelihood modeling of each microblog document is similar to the modeling of the original query model.
Identify the target entity in the query statement. For example, for the query statement "Zhou Jielun new film", the identified target entity is "Zhou Jielun".
Use the target entity topic model of "Zhou Jielun" to extend the original query model and obtain the expanded query model, and compute the initial interpolation parameter.
Extend the original query model according to the linear interpolation formula above. Since the query statement "Zhou Jielun new film" contains only one target entity, "Zhou Jielun", the target entity topic model of this target entity can be used directly for the extension.
Use the KL distance formula to compute the similarity between the expanded query model and the microblog document language model. The microblog document language model uses the maximum likelihood estimate of the microblog document, with Dirichlet smoothing applied.
Determine the target retrieval result of the query statement according to the similarity.
The technical solution of the invention has been described in detail above with reference to the accompanying drawings. It enables the user to accurately retrieve the target retrieval result from the microblog documents, which improves retrieval accuracy, and it can also effectively enhance the robustness of retrieval.
The foregoing is only the preferred embodiments of the invention and is not intended to limit the invention. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principle of the invention shall be included within the protection scope of the invention.
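The maximum likelihood modeling of the query and the Dirichlet smoothing of the document model mentioned in step 3) can be sketched as follows. The smoothing parameter `mu`, its default value, and the use of a collection-wide model as the smoothing distribution are assumptions; the text above names Dirichlet smoothing without giving its parameters.

```python
from collections import Counter

def mle_model(tokens):
    """Maximum likelihood unigram model: relative frequency of each token."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def dirichlet_smoothed_model(doc_tokens, collection_model, mu=100.0):
    """Document model p(w|D) = (c(w, D) + mu * p(w|C)) / (|D| + mu).

    collection_model (dict word -> p(w|C)) is assumed to cover every word
    that needs a probability; mu is an assumed smoothing parameter.
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {
        word: (counts[word] + mu * p_c) / (length + mu)
        for word, p_c in collection_model.items()
    }
```

Applied to the segmented query ["Zhou Jielun", "new", "film"], `mle_model` assigns each word a probability of one third, matching the 0.33 values in the example. Smoothing gives every word in the collection model a non-zero document probability, so the logarithm in the KL score is always defined.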

Claims (16)

1. a search method, it is characterised in that including:
When receiving the query statement that the microblogging document in microblogging language material set is retrieved, according to Described query statement creates original query model corresponding with described query statement;
Identify the target entity in described query statement;
According to target entity topic model corresponding with described target entity, described original query model and The microblogging document language model set up according to every microblogging document in described microblogging collection of document, to institute State original query model to be extended, with the interrogation model that is expanded;
Add up the similarity between described expanding query model and described microblogging document language model, with root The target retrieval result of described query statement is determined according to described similarity.
Search method the most according to claim 1, it is characterised in that united by below equation Count the described similarity between described expanding query model and described microblogging document language model, and by phase Like degree more than or equal to presetting the target microblogging document of similarity as described target retrieval result:
Score ( Q , D ) = - KL ( θ ^ Q ′ | | θ ^ D ) ∝ Σ w ∈ V p ( w | θ ^ Q ′ ) × log p ( w | θ ^ D ) ;
where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all entities in the microblog document language model, $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{D}$ denotes the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ denotes the probability of the target entity in the microblog document language model.
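The scoring rule of this claim can be illustrated with a minimal Python sketch. It assumes that each language model is represented as a dictionary mapping a term to its probability; the function names `score` and `retrieve`, the `floor` value used in place of smoothing for unseen terms, and the toy data are hypothetical and do not come from the patent.

```python
import math

def score(expanded_query, doc_model, floor=1e-10):
    """Rank-equivalent form of -KL(query || doc): sum_w p(w|Q') * log p(w|D).

    Terms with zero query probability contribute nothing, so the sum only
    needs to run over the support of the expanded query model.
    `floor` (an assumption) avoids log(0) for terms missing from the document.
    """
    return sum(
        p_q * math.log(max(doc_model.get(w, 0.0), floor))
        for w, p_q in expanded_query.items()
        if p_q > 0.0
    )

def retrieve(expanded_query, doc_models, threshold):
    """Keep documents whose score is >= the preset similarity, best first."""
    scored = [(doc_id, score(expanded_query, m)) for doc_id, m in doc_models.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical).
query = {"apple": 0.5, "iphone": 0.5}
docs = {
    "d1": {"apple": 0.4, "iphone": 0.4, "fruit": 0.2},
    "d2": {"apple": 0.1, "fruit": 0.9},
}
results = retrieve(query, docs, threshold=-5.0)
```

With this data only `d1` passes the threshold, because `d2` assigns almost no probability to "iphone" and is penalized heavily by the logarithm.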
3. The search method according to claim 1, characterized in that the expanded query model is calculated according to the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E});$$
where $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{Q}$ denotes the original query model, $\hat{\theta}_{E}$ denotes the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ denotes the probability of the target entity in the original query model, $p(w \mid \hat{\theta}_{E})$ denotes the probability of the target entity in the target entity model, and $\alpha$ denotes the initial interpolation parameter.
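As an illustration of the interpolation in this claim, the following sketch mixes two term distributions. The dictionary representation, the function name `expand_query`, and the example values are assumptions made for illustration only.

```python
def expand_query(original, entity_topic, alpha):
    """Linear interpolation: p(w|Q') = (1 - alpha) * p(w|Q) + alpha * p(w|E)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocab = set(original) | set(entity_topic)
    return {
        w: (1.0 - alpha) * original.get(w, 0.0) + alpha * entity_topic.get(w, 0.0)
        for w in vocab
    }

# Toy models (hypothetical).
original = {"apple": 0.5, "launch": 0.5}
entity_topic = {"apple": 0.4, "iphone": 0.6}
expanded = expand_query(original, entity_topic, alpha=0.5)
```

If both inputs are proper probability distributions, the result is one as well; here "iphone" enters the query with weight 0.3 even though the user never typed it.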
4. The search method according to claim 3, characterized in that,
according to a received update command, $\alpha$ is updated by the following formula to obtain $\alpha'$:
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
where $w$ denotes the target entity, $E$ denotes all entities in the target entity model, $Q$ denotes all entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of said any entity in the microblog corpus.
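The parameter update above can be sketched as follows. The claim does not state how IDF is computed, so the common definition log(N / df) is assumed here; the function names, the representation of documents as sets of terms, and the toy corpus are likewise hypothetical.

```python
import math

def idf(term, docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def update_alpha(alpha, entity_terms, query_terms, docs):
    """alpha' = alpha * sum_{w in E} IDF(w) / sum_{w1 in Q} IDF(w1)."""
    denominator = sum(idf(w1, docs) for w1 in query_terms)
    if denominator == 0.0:
        return alpha  # assumption: leave alpha unchanged when the ratio is undefined
    return alpha * sum(idf(w, docs) for w in entity_terms) / denominator

# Toy corpus (hypothetical): each document is a set of terms.
docs = [{"apple", "iphone"}, {"apple", "fruit"}, {"launch"}, {"fruit"}]
new_alpha = update_alpha(0.6, entity_terms=["apple"], query_terms=["apple", "launch"], docs=docs)
```

Here IDF("apple") = log 2 and IDF("launch") = log 4, so the ratio is 1/3 and the updated parameter is 0.2. The sketch does not clamp the result to [0, 1]; whether that is needed depends on how E and Q are populated.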
5. The search method according to claim 1, characterized in that,
when there are multiple target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model, and the microblog document language model.
6. The search method according to claim 5, characterized in that,
according to a received first creation command, the final entity topic model is determined by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\hat{\theta}_{E'}$ denotes the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ denotes the probability of each target entity in the final entity topic model, $n$ denotes the number of target entities, $\hat{\theta}_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the $i$-th target entity among the multiple target entities.
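The IDF-weighted combination in this claim amounts to a weighted average of the per-entity topic models. The sketch below assumes dictionary-based models and precomputed IDF values; the function name `combine_entity_models` and the example entities are hypothetical.

```python
def combine_entity_models(models, idf_weights):
    """p(w|E') = sum_i IDF(E_i) * p(w|E_i) / sum_i IDF(E_i).

    `models` maps an entity name to its topic model (term -> probability);
    `idf_weights` maps the same entity names to their IDF values.
    """
    total = sum(idf_weights[entity] for entity in models)
    if total <= 0.0:
        raise ValueError("the IDF weights must sum to a positive value")
    vocab = set().union(*(model.keys() for model in models.values()))
    return {
        w: sum(idf_weights[e] * models[e].get(w, 0.0) for e in models) / total
        for w in vocab
    }

# Toy models (hypothetical): the rarer entity "samsung" gets the larger weight.
models = {
    "apple": {"iphone": 0.5, "mac": 0.5},
    "samsung": {"galaxy": 1.0},
}
idf_weights = {"apple": 1.0, "samsung": 3.0}
final_model = combine_entity_models(models, idf_weights)
```

Because the weights are normalized by their sum, the combined model remains a probability distribution, and entities that are rarer in the corpus contribute more to it.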
7. The search method according to any one of claims 1 to 6, characterized in that, according to a received second creation command, the target entity topic model corresponding to the target entity is created by the following process:
when the corpus database holding the microblog corpus receives the target entity, extracting from the microblog corpus, according to the target entity, M microblog documents related to the target entity;
searching, according to the target domain to which the target entity belongs, a target domain knowledge base connected to the corpus database for multiple keywords related to the target domain, wherein the multiple keywords include the target entity;
generating a virtual document corresponding to the target domain from the multiple keywords;
establishing a domain language model from the virtual document, and establishing a background language model from all entities in every microblog document in the microblog corpus;
traversing the M microblog documents using the domain language model, the background language model, and an initial entity model corresponding to the target entity, and performing N iterations, to obtain the target entity topic model, wherein M ≥ 1, N ≥ 1, and M and N are both positive integers.
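The model-building steps of this process can be sketched as below. The sketch covers only document selection, the virtual document, and the two language models; the N-iteration estimation is left out because the claim does not give its update rule. Treating "related to the target entity" as "contains the entity", using maximum-likelihood estimates, and all names and data are assumptions.

```python
from collections import Counter

def mle_language_model(tokens):
    """Maximum-likelihood unigram model: p(w) = count(w) / total count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def select_entity_docs(corpus, entity, m):
    """Pick up to M documents mentioning the entity (a simplifying assumption)."""
    return [doc for doc in corpus if entity in doc][:m]

# Toy corpus and knowledge-base keywords (hypothetical).
corpus = [
    ["apple", "iphone", "launch"],
    ["apple", "pie", "recipe"],
    ["weather", "today"],
]
keywords = ["apple", "iphone", "ios", "apple"]

entity_docs = select_entity_docs(corpus, "apple", m=2)
virtual_document = list(keywords)                      # keywords joined into one document
domain_lm = mle_language_model(virtual_document)       # domain language model
background_lm = mle_language_model([w for doc in corpus for w in doc])  # background model
```

The resulting `domain_lm` and `background_lm`, together with an initial entity model, would then be the inputs to the iterative step that traverses `entity_docs`.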
8. The search method according to claim 7, characterized by further comprising:
after generating the virtual document corresponding to the target domain, counting a first occurrence count of the target entity in the virtual document corresponding to the target domain, and a second occurrence count of each of the multiple keywords in the virtual document corresponding to the target domain;
determining a domain prior value of the target entity according to the first occurrence count and the second occurrence counts;
updating the domain language model according to the domain prior value.
9. A searching system, characterized by comprising:
a first model creating unit, configured to, upon receiving a query statement for retrieving microblog documents in a microblog corpus, create, according to the query statement, an original query model corresponding to the query statement;
an entity recognition unit, configured to identify a target entity in the query statement;
a model extension unit, configured to extend the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model established from each microblog document in the microblog document collection, to obtain an expanded query model;
a retrieval result determining unit, configured to compute the similarity between the expanded query model and the microblog document language model, so as to determine a target retrieval result for the query statement according to the similarity.
10. The searching system according to claim 9, characterized in that the retrieval result determining unit comprises:
a similarity computing unit, configured to compute the similarity between the expanded query model and the microblog document language model by the following formula, and to take target microblog documents whose similarity is greater than or equal to a preset similarity as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}\left(\hat{\theta}_{Q'} \,\|\, \hat{\theta}_{D}\right) \propto \sum_{w \in V} p(w \mid \hat{\theta}_{Q'}) \times \log p(w \mid \hat{\theta}_{D});$$
where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all entities in the microblog document language model, $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{D}$ denotes the microblog document language model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, and $p(w \mid \hat{\theta}_{D})$ denotes the probability of the target entity in the microblog document language model.
11. The searching system according to claim 9, characterized in that the model extension unit is specifically configured to:
calculate the expanded query model according to the following formula:
$$p(w \mid \hat{\theta}_{Q'}) = (1 - \alpha) \times p(w \mid \hat{\theta}_{Q}) + \alpha \times p(w \mid \hat{\theta}_{E});$$
where $\hat{\theta}_{Q'}$ denotes the expanded query model, $\hat{\theta}_{Q}$ denotes the original query model, $\hat{\theta}_{E}$ denotes the target entity topic model, $p(w \mid \hat{\theta}_{Q'})$ denotes the probability of the target entity in the expanded query model, $p(w \mid \hat{\theta}_{Q})$ denotes the probability of the target entity in the original query model, $p(w \mid \hat{\theta}_{E})$ denotes the probability of the target entity in the target entity model, and $\alpha$ denotes the initial interpolation parameter.
12. The searching system according to claim 11, characterized by further comprising:
a parameter updating unit, configured to update $\alpha$, according to a received update command, by the following formula to obtain $\alpha'$:
$$\alpha' = \alpha \times \frac{\sum_{w \in E} \mathrm{IDF}(w)}{\sum_{w_1 \in Q} \mathrm{IDF}(w_1)}$$
where $w$ denotes the target entity, $E$ denotes all entities in the target entity model, $Q$ denotes all entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of said any entity in the microblog corpus.
13. The searching system according to claim 9, characterized in that the model extension unit is further configured to:
when there are multiple target entities, determine a final entity topic model according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model, and the microblog document language model.
14. The searching system according to claim 13, characterized in that the model extension unit is specifically configured to: determine, according to a received first creation command, the final entity topic model by the following formula:
$$p(w \mid \hat{\theta}_{E'}) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i) \times p(w \mid \hat{\theta}_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\hat{\theta}_{E'}$ denotes the final entity topic model, $p(w \mid \hat{\theta}_{E'})$ denotes the probability of each target entity in the final entity topic model, $n$ denotes the number of target entities, $\hat{\theta}_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \hat{\theta}_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the $i$-th target entity among the multiple target entities.
15. The searching system according to any one of claims 9 to 14, characterized by further comprising:
a second model creating unit, configured to create, according to a received second creation command, the target entity topic model corresponding to the target entity by the following process:
when the corpus database holding the microblog corpus receives the target entity, extracting from the microblog corpus, according to the target entity, M microblog documents related to the target entity,
searching, according to the target domain to which the target entity belongs, a target domain knowledge base connected to the corpus database for multiple keywords related to the target domain, wherein the multiple keywords include the target entity,
generating a virtual document corresponding to the target domain from the multiple keywords,
establishing a domain language model from the virtual document, and establishing a background language model from all entities in every microblog document in the microblog corpus,
traversing the M microblog documents using the domain language model, the background language model, and an initial entity model corresponding to the target entity, and performing N iterations, to obtain the target entity topic model, wherein M ≥ 1, N ≥ 1, and M and N are both positive integers.
16. The searching system according to claim 15, characterized in that
the second model creating unit further comprises:
a count statistics unit, configured to, after the virtual document corresponding to the target domain is generated, count a first occurrence count of the target entity in the virtual document corresponding to the target domain, and a second occurrence count of each of the multiple keywords in the virtual document corresponding to the target domain;
a prior value determining unit, configured to determine a domain prior value of the target entity according to the first occurrence count and the second occurrence counts;
a domain model updating unit, configured to update the domain language model according to the domain prior value.
CN201510272225.7A 2015-05-25 2015-05-25 Search method and searching system Expired - Fee Related CN106294418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510272225.7A CN106294418B (en) 2015-05-25 2015-05-25 Search method and searching system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510272225.7A CN106294418B (en) 2015-05-25 2015-05-25 Search method and searching system

Publications (2)

Publication Number Publication Date
CN106294418A (en) 2017-01-04
CN106294418B (en) 2019-08-30

Family

ID=57634572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510272225.7A Expired - Fee Related CN106294418B (en) 2015-05-25 2015-05-25 Search method and searching system

Country Status (1)

Country Link
CN (1) CN106294418B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609152A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query formula
CN109388743A (en) * 2017-08-11 2019-02-26 阿里巴巴集团控股有限公司 The determination method and apparatus of language model
CN111061839A (en) * 2019-12-19 2020-04-24 过群 Combined keyword generation method and system based on semantics and knowledge graph
CN111309869A (en) * 2020-02-28 2020-06-19 中国工商银行股份有限公司 Real-time text stream information retrieval method and system
CN111460079A (en) * 2020-03-06 2020-07-28 华南理工大学 Topic generation method based on concept information and word weight
CN111566637A (en) * 2018-02-01 2020-08-21 国际商业机器公司 Dynamically building and configuring a session proxy learning model
CN113407574A (en) * 2021-07-20 2021-09-17 广州博冠信息科技有限公司 Multi-table paging query method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999560A (en) * 2011-10-26 2013-03-27 微软公司 Improvement of relevance of search engine result page between name and other search queries by using social network features
CN103377226A (en) * 2012-04-25 2013-10-30 中国移动通信集团公司 Intelligent search method and system thereof
CN103885985A (en) * 2012-12-24 2014-06-25 北京大学 Real-time microblog search method and device
US8949263B1 (en) * 2012-05-14 2015-02-03 NetBase Solutions, Inc. Methods and apparatus for sentiment analysis
WO2015016784A1 (en) * 2013-08-01 2015-02-05 National University Of Singapore A method and apparatus for tracking microblog messages for relevancy to an entity identifiable by an associated text and an image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘挺等: "《信息检索系统导论》", 31 December 2008 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388743B (en) * 2017-08-11 2021-11-23 阿里巴巴集团控股有限公司 Language model determining method and device
CN109388743A (en) * 2017-08-11 2019-02-26 阿里巴巴集团控股有限公司 The determination method and apparatus of language model
CN107609152A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query formula
CN107609152B (en) * 2017-09-22 2021-03-09 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query expressions
US11886823B2 (en) 2018-02-01 2024-01-30 International Business Machines Corporation Dynamically constructing and configuring a conversational agent learning model
CN111566637A (en) * 2018-02-01 2020-08-21 国际商业机器公司 Dynamically building and configuring a session proxy learning model
CN111061839A (en) * 2019-12-19 2020-04-24 过群 Combined keyword generation method and system based on semantics and knowledge graph
CN111061839B (en) * 2019-12-19 2024-01-23 过群 Keyword joint generation method and system based on semantics and knowledge graph
CN111309869A (en) * 2020-02-28 2020-06-19 中国工商银行股份有限公司 Real-time text stream information retrieval method and system
CN111309869B (en) * 2020-02-28 2023-09-22 中国工商银行股份有限公司 Real-time text stream information retrieval method and system
CN111460079B (en) * 2020-03-06 2023-03-28 华南理工大学 Topic generation method based on concept information and word weight
CN111460079A (en) * 2020-03-06 2020-07-28 华南理工大学 Topic generation method based on concept information and word weight
CN113407574A (en) * 2021-07-20 2021-09-17 广州博冠信息科技有限公司 Multi-table paging query method, device, equipment and storage medium
CN113407574B (en) * 2021-07-20 2024-04-26 广州博冠信息科技有限公司 Multi-table paging query method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106294418B (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN106294418A (en) Search method and searching system
CN111428147B (en) Social recommendation method of heterogeneous graph volume network combining social and interest information
CN106598950B (en) A kind of name entity recognition method based on hybrid laminated model
CN102929942B (en) The overlapping community discovery method of a kind of community network based on integrated study
CN103268348B (en) A kind of user's query intention recognition methods
CN104598611B (en) The method and system being ranked up to search entry
CN106156145A (en) The management method of a kind of address date and device
CN103778227A (en) Method for screening useful images from retrieved images
CN104008165A (en) Club detecting method based on network topology and node attribute
CN106970910A (en) A kind of keyword extracting method and device based on graph model
CN103399858A (en) Socialization collaborative filtering recommendation method based on trust
CN104484343A (en) Topic detection and tracking method for microblog
CN102982107A (en) Recommendation system optimization method with information of user and item and context attribute integrated
CN103870474A (en) News topic organizing method and device
CN106021366A (en) API (Application Programing Interface) tag recommendation method based on heterogeneous information
CN110059220A (en) A kind of film recommended method based on deep learning Yu Bayesian probability matrix decomposition
CN106294662A (en) Inquiry based on context-aware theme represents and mixed index method for establishing model
CN107943919A (en) A kind of enquiry expanding method of session-oriented formula entity search
CN104089774A (en) Gear fault diagnosis method based on orthogonal match between multiple parallel dictionaries
CN106708929A (en) Video program search method and device
CN110083703A (en) A kind of document clustering method based on citation network and text similarity network
CN103795592B (en) Online water navy detection method and device
CN106599227A (en) Method and apparatus for obtaining similarity between objects based on attribute values
CN103488637A (en) Method for carrying out expert search based on dynamic community mining
CN105869058A (en) Method for user portrait extraction based on multilayer latent variable model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220617

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190830