Summary of the invention
In view of the problems described above, the present invention proposes a new technical solution that addresses the technical problem that a user cannot accurately retrieve the target retrieval result from microblog documents.
In view of this, one aspect of the present invention proposes a retrieval method, including: upon receiving a query for retrieving microblog documents in a microblog corpus, creating an original query model corresponding to the query; identifying a target entity in the query; expanding the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model built for each microblog document in the microblog corpus, to obtain an expanded query model; and computing the similarity between the expanded query model and the microblog document language model, so as to determine the target retrieval result of the query according to the similarity.
In this technical solution, when a query is used to retrieve microblog documents in the microblog corpus, the query may contain an alias of the target entity, so identifying the target entity in the query effectively improves retrieval effectiveness. In addition, the original query model corresponding to the query is expanded into an expanded query model, so that retrieval with the expanded query model returns a large number of microblog documents related to the query, including the information the user is interested in. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined from the similarity between the expanded query model and the microblog document language model of each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, the user can therefore accurately retrieve the target retrieval result from microblog documents, which improves retrieval accuracy. Here, the target entity is the keyword in the query; for example, in the query "Zhou Jielun new film", the target entity is "Zhou Jielun".
In the above technical solution, preferably, the similarity between the expanded query model and the microblog document language model is computed by the following formula, and the target microblog documents whose similarity is greater than or equal to a preset similarity are taken as the target retrieval result:

$$\mathrm{Score}(Q, D) = -\sum_{w \in V} P(w \mid \theta_{Q'}) \log \frac{P(w \mid \theta_{Q'})}{P(w \mid \theta_D)}$$

where Score(Q, D) denotes the similarity, V denotes all entities in the microblog document language model, $\theta_{Q'}$ denotes the expanded query model, $\theta_D$ denotes the microblog document language model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, and $P(w \mid \theta_D)$ denotes the probability of the target entity $w$ in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these documents may contain much information that the user cares little about, or the information may not be ordered by priority, so that information of little interest is ranked ahead of the information the user cares about most. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to that similarity, filters out information that is unimportant, weakly related, or of little interest to the user. This technical solution therefore improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The above formula is a KL distance (Kullback-Leibler divergence, also known as relative entropy) computation. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document is "Zhou Jielun's new film is excellent", the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
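The KL-based scoring and threshold filtering described here can be sketched as follows. This is a minimal illustration, assuming that the score is the negative KL divergence from the expanded query model to the document language model and that each model is a plain dictionary mapping an entity to its probability. The names `kl_score`, `retrieve`, `floor` and `min_score` are hypothetical, and `floor` only stands in for whatever smoothing the document language model actually uses.

```python
import math


def kl_score(query_model, doc_model, floor=1e-10):
    """Negative KL divergence between a query model and a document model.

    Both models map an entity (word) to a probability. `floor` is an assumed
    stand-in for document-model smoothing, so that unseen entities do not
    cause a division by zero.
    """
    score = 0.0
    for w, p_q in query_model.items():
        if p_q <= 0.0:
            continue  # entities with zero query probability contribute nothing
        p_d = max(doc_model.get(w, 0.0), floor)
        score -= p_q * math.log(p_q / p_d)
    return score


def retrieve(query_model, doc_models, min_score):
    """Return (doc_id, score) pairs whose score is >= min_score, best first."""
    scored = [(doc_id, kl_score(query_model, model))
              for doc_id, model in doc_models.items()]
    kept = [(doc_id, s) for doc_id, s in scored if s >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Because entities with zero probability in the query model add nothing to the sum, iterating over the entities of the query model gives the same value as summing over the full vocabulary V.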
In the above technical solution, preferably, the expanded query model is calculated according to the following formula:

$$P(w \mid \theta_{Q'}) = (1 - \alpha)\, P(w \mid \theta_Q) + \alpha\, P(w \mid \theta_E)$$

where $\theta_{Q'}$ denotes the expanded query model, $\theta_Q$ denotes the original query model, $\theta_E$ denotes the target entity topic model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, $P(w \mid \theta_Q)$ denotes the probability of the target entity $w$ in the original query model, $P(w \mid \theta_E)$ denotes the probability of the target entity $w$ in the target entity model, and α denotes the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs. The original query model therefore needs to be expanded into an expanded query model, so that retrieval with the expanded query model returns a large number of microblog documents related to the query, including the information the user is interested in. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval effectiveness.
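The expansion step can be illustrated with a short sketch. It assumes that the expanded query model is a linear interpolation of the original query model and the target entity topic model, with the interpolation parameter α weighting the entity topic model. That placement of α is an assumption of this sketch, and the function name `expand_query_model` is hypothetical.

```python
def expand_query_model(original, entity_topic, alpha):
    """Interpolate the original query model with an entity topic model.

    Both models map an entity (word) to a probability. `alpha` is assumed to
    be the weight of the entity topic model, so alpha = 0 returns the
    original query model unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocabulary = set(original) | set(entity_topic)
    return {
        w: (1.0 - alpha) * original.get(w, 0.0) + alpha * entity_topic.get(w, 0.0)
        for w in vocabulary
    }
```

For the query "Zhou Jielun new film", an entity topic model that gives weight to a related word such as "album" introduces that word into the expanded query model, which is how documents that never mention "film" can still be retrieved.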
In the above technical solution, preferably, α is updated according to a received update command by the following formula, to obtain α':
where w denotes the target entity, E denotes all entities in the target entity model, Q denotes all entities in the query, w1 denotes any entity in the query, IDF(w) denotes the inverse document frequency of the target entity in the microblog corpus, and IDF(w1) denotes the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the importance of the same target entity differs from query to query, and the initial interpolation parameter α is related to the target entity and to the target entity model corresponding to the target entity. When different queries are retrieved, the initial interpolation parameter α therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated α', which makes the expanded query model more accurate. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document is "Zhou Jielun's new film is excellent", the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, when there are multiple target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when the query contains multiple target entities, the final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus, and retrieval uses the expanded query model obtained from the final entity topic model. The resulting target retrieval result is more accurate: it contains microblog documents related to each of the multiple target entities, so that it consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, according to a received first creation command, the final entity topic model is determined by the following formula:
where $\theta_E$ denotes the final entity topic model, $P(w \mid \theta_E)$ denotes the probability of each target entity in the final entity topic model, n denotes the number of target entities, $\theta_{E_i}$ denotes the target entity topic model of each target entity, $IDF(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $P(w \mid \theta_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the i-th target entity among the multiple target entities.
In this technical solution, when the query contains multiple target entities, it can be seen from the formula that the final entity topic model is calculated from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus. Because the inverse document frequency of a target entity in the microblog corpus represents the importance of that target entity in the microblog corpus, retrieval with the expanded query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the multiple target entities, and the target retrieval result is determined according to the importance of each target entity in the microblog corpus. The target retrieval result is thus the information the user wants to retrieve, which improves retrieval effectiveness. The inverse document frequency (IDF) measures the importance of a target entity: the IDF of a target entity can be obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents containing the target entity and taking the logarithm of the quotient. The IDF of the target entity also affects the updated interpolation parameter.
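The IDF definition above, and one way of using it to combine several entity topic models, can be sketched as follows. The `idf` function follows the definition in the text (logarithm of the total number of documents divided by the number of documents containing the entity). The combination in `combine_entity_topic_models` is only an assumed form: an IDF-weighted mixture whose weights are normalized to sum to 1, so that the result remains a probability distribution. Both function names are hypothetical.

```python
import math


def idf(entity, documents):
    """Inverse document frequency: log(total documents / documents containing entity).

    `documents` is a list of collections (e.g. sets) of entities.
    """
    containing = sum(1 for doc in documents if entity in doc)
    if containing == 0:
        raise ValueError("entity does not occur in the corpus")
    return math.log(len(documents) / containing)


def combine_entity_topic_models(topic_models, idf_weights):
    """IDF-weighted mixture of per-entity topic models (assumed normalization).

    `topic_models` maps each target entity to its topic model (a dict of
    entity -> probability); `idf_weights` maps each target entity to its IDF.
    """
    total = sum(idf_weights[e] for e in topic_models)
    if total <= 0.0:
        raise ValueError("IDF weights must have a positive sum")
    combined = {}
    for entity, model in topic_models.items():
        weight = idf_weights[entity] / total
        for w, p in model.items():
            combined[w] = combined.get(w, 0.0) + weight * p
    return combined
```

Under this assumed form, a rare target entity (high IDF) contributes more to the final entity topic model than a common one, which matches the stated role of the IDF as a measure of importance.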
In the above technical solution, preferably, according to a received second creation command, the target entity topic model corresponding to the target entity is created by the following process: when the corpus database holding the microblog corpus receives the target entity, extracting from the microblog corpus, according to the target entity, M microblog documents related to the target entity; according to the target domain to which the target entity belongs, searching a target domain knowledge base connected to the corpus database for multiple keywords related to the target domain, where the multiple keywords include the target entity; generating a virtual document corresponding to the target domain from the multiple keywords; building a domain language model from the virtual document, and building a background language model from all entities in each microblog document in the microblog corpus; and traversing the M microblog documents using the domain language model, the background language model and an initial entity model corresponding to the target entity, and performing N iterations, to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval then uses the expanded query model obtained by expansion with the target entity topic model, a large number of microblog documents related to the query can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and improves retrieval effectiveness. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document is "Zhou Jielun's new film is excellent", the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
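The iterative estimation over the M related microblog documents can be illustrated with an expectation-maximization sketch. The text does not spell out the iteration, so everything below is an assumption: each word occurrence is treated as generated by a fixed-weight mixture of the background language model, the domain language model and the entity topic model, and only the entity topic model is re-estimated. The function name and the parameters `lambda_bg`, `lambda_dom` and `floor` are hypothetical.

```python
from collections import Counter


def estimate_entity_topic_model(docs, background, domain, n_iter=20,
                                lambda_bg=0.5, lambda_dom=0.2,
                                initial=None, floor=1e-10):
    """EM sketch for an entity topic model under a three-component mixture.

    `docs` is a list of M tokenized microblog documents. `background` and
    `domain` map an entity to a probability and stay fixed. `initial` is the
    initial entity model; if omitted, relative frequencies in `docs` are used.
    """
    lambda_ent = 1.0 - lambda_bg - lambda_dom
    if lambda_ent <= 0.0:
        raise ValueError("lambda_bg + lambda_dom must be below 1")
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    topic = dict(initial) if initial else {w: c / total for w, c in counts.items()}
    for _ in range(n_iter):
        # E-step: expected number of occurrences of w generated by the entity topic
        expected = {}
        for w, c in counts.items():
            p_ent = lambda_ent * topic.get(w, 0.0)
            p_all = (p_ent
                     + lambda_bg * background.get(w, floor)
                     + lambda_dom * domain.get(w, floor))
            expected[w] = c * p_ent / p_all
        # M-step: renormalize the expected counts into a distribution
        norm = sum(expected.values())
        topic = {w: e / norm for w, e in expected.items()}
    return topic
```

In this sketch, words that the background or domain model already explains well lose probability mass in the entity topic model over the iterations, which is one way to read the control of "background noise" and "domain-related noise" described above.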
In the above technical solution, preferably, the method further includes: after generating the virtual document corresponding to the target domain, counting a first number of occurrences of the target entity in the virtual document corresponding to the target domain, and a second number of occurrences of each of the multiple keywords in the virtual document corresponding to the target domain; determining a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and updating the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the multiple keywords in that virtual document. The domain language model is then updated according to the domain prior value, so that the resulting domain language model is more accurate, that is, it covers each domain to which the target entity relates, which improves retrieval effectiveness.
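The counting step can be sketched as follows. The text states only that the domain prior value is determined from the two kinds of occurrence counts; this sketch assumes the simplest such rule, namely the target entity's count divided by the summed counts of all keywords. The function name `domain_prior` is hypothetical.

```python
from collections import Counter


def domain_prior(target_entity, keywords, virtual_document):
    """Assumed domain prior: count(target entity) / sum of counts of all keywords.

    `virtual_document` is the tokenized virtual document for the target
    domain; `keywords` are the domain keywords, including the target entity.
    """
    counts = Counter(virtual_document)
    first = counts[target_entity]                        # first number of occurrences
    second_total = sum(counts[k] for k in set(keywords))  # summed second numbers
    if second_total == 0:
        raise ValueError("no keyword occurs in the virtual document")
    return first / second_total
```

For a virtual document `["jay", "film", "jay", "album"]` with keywords `["jay", "film", "album"]`, the target entity "jay" occurs twice out of four keyword occurrences in total.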
Another aspect of the present invention proposes a retrieval system, including: a first model creation unit, which, upon receiving a query for retrieving microblog documents in a microblog corpus, creates an original query model corresponding to the query; an entity recognition unit, which identifies a target entity in the query; a model expansion unit, which expands the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model built for each microblog document in the microblog corpus, to obtain an expanded query model; and a retrieval result determination unit, which computes the similarity between the expanded query model and the microblog document language model, so as to determine the target retrieval result of the query according to the similarity.
In this technical solution, when a query is used to retrieve microblog documents in the microblog corpus, the query may contain an alias of the target entity, so identifying the target entity in the query effectively improves retrieval effectiveness. In addition, the original query model corresponding to the query is expanded into an expanded query model, so that retrieval with the expanded query model returns a large number of microblog documents related to the query, including the information the user is interested in. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined from the similarity between the expanded query model and the microblog document language model of each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, the user can therefore accurately retrieve the target retrieval result from microblog documents, which improves accuracy. Here, the target entity is the target keyword in the query that the user wants to look up; for example, in the query "Zhou Jielun new film", the target entity is "Zhou Jielun", while "new" and "film" are other entities, that is, words in the ordinary sense.
In the above technical solution, preferably, the retrieval result determination unit includes: a similarity computation unit, which computes the similarity between the expanded query model and the microblog document language model by the following formula, and takes the target microblog documents whose similarity is greater than or equal to a preset similarity as the target retrieval result:

$$\mathrm{Score}(Q, D) = -\sum_{w \in V} P(w \mid \theta_{Q'}) \log \frac{P(w \mid \theta_{Q'})}{P(w \mid \theta_D)}$$

where Score(Q, D) denotes the similarity, V denotes all entities in the microblog document language model, $\theta_{Q'}$ denotes the expanded query model, $\theta_D$ denotes the microblog document language model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, and $P(w \mid \theta_D)$ denotes the probability of the target entity $w$ in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these documents may contain much information that the user cares little about, or the information may not be ordered by priority, so that information of little interest is ranked ahead of the information the user cares about most. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to that similarity, filters out information that is unimportant, weakly related, or of little interest to the user. This technical solution therefore improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The above formula is a KL distance (Kullback-Leibler divergence, also known as relative entropy) computation. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document is "Zhou Jielun's new film is excellent", the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit is specifically configured to calculate the expanded query model according to the following formula:

$$P(w \mid \theta_{Q'}) = (1 - \alpha)\, P(w \mid \theta_Q) + \alpha\, P(w \mid \theta_E)$$

where $\theta_{Q'}$ denotes the expanded query model, $\theta_Q$ denotes the original query model, $\theta_E$ denotes the target entity topic model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, $P(w \mid \theta_Q)$ denotes the probability of the target entity $w$ in the original query model, $P(w \mid \theta_E)$ denotes the probability of the target entity $w$ in the target entity model, and α denotes the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs. The original query model therefore needs to be expanded into an expanded query model, so that retrieval with the expanded query model returns a large number of microblog documents related to the query, including the information the user is interested in. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval effectiveness.
In the above technical solution, preferably, the system further includes: a parameter update unit, which updates α according to a received update command by the following formula, to obtain α':
where w denotes the target entity, E denotes all entities in the target entity model, Q denotes all entities in the query, w1 denotes any entity in the query, IDF(w) denotes the inverse document frequency of the target entity in the microblog corpus, and IDF(w1) denotes the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the importance of the same target entity differs from query to query, and the initial interpolation parameter α is related to the target entity and to the target entity model corresponding to the target entity. When different queries are retrieved, the initial interpolation parameter α therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated α', which makes the expanded query model more accurate. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document is "Zhou Jielun's new film is excellent", the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit is further configured to: when there are multiple target entities, determine a final entity topic model according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when the query contains multiple target entities, the final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus, and retrieval uses the expanded query model obtained from the final entity topic model. The resulting target retrieval result is more accurate: it contains microblog documents related to each of the multiple target entities, so that it consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, the model expansion unit is specifically configured to: according to a received first creation command, determine the final entity topic model by the following formula:
where $\theta_E$ denotes the final entity topic model, $P(w \mid \theta_E)$ denotes the probability of each target entity in the final entity topic model, n denotes the number of target entities, $\theta_{E_i}$ denotes the target entity topic model of each target entity, $IDF(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus, $P(w \mid \theta_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the i-th target entity among the multiple target entities.
In this technical solution, when the query contains multiple target entities, it can be seen from the formula that the final entity topic model is calculated from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus. Because the inverse document frequency of a target entity in the microblog corpus represents the importance of that target entity in the microblog corpus, retrieval with the expanded query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the multiple target entities, and the target retrieval result is determined according to the importance of each target entity in the microblog corpus. The target retrieval result is thus the information the user wants to retrieve, which improves retrieval effectiveness. The inverse document frequency (IDF) measures the importance of a target entity: the IDF of a target entity can be obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents containing the target entity and taking the logarithm of the quotient. The IDF of the target entity also affects the updated interpolation parameter.
In the above technical solution, preferably, the system further includes: a second model creation unit, configured to create, according to a received second creation command, the target entity topic model corresponding to the target entity by the following process: when the corpus database holding the microblog corpus receives the target entity, extracting from the microblog corpus, according to the target entity, M microblog documents related to the target entity; according to the target domain to which the target entity belongs, searching a target domain knowledge base connected to the corpus database for multiple keywords related to the target domain, where the multiple keywords include the target entity; generating a virtual document corresponding to the target domain from the multiple keywords; building a domain language model from the virtual document, and building a background language model from all entities in each microblog document in the microblog corpus; and traversing the M microblog documents using the domain language model, the background language model and an initial entity model corresponding to the target entity, and performing N iterations, to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval then uses the expanded query model obtained by expansion with the target entity topic model, a large number of microblog documents related to the query can be retrieved, including the information the user is interested in, which effectively avoids missing relevant microblog documents and improves retrieval effectiveness. "All entities" refers to all the words in each microblog document in the microblog document language model. For example, if a microblog document is "Zhou Jielun's new film is excellent", the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the second model creation unit further includes: an occurrence counting unit, which, after the virtual document corresponding to the target domain is generated, counts a first number of occurrences of the target entity in the virtual document corresponding to the target domain, and a second number of occurrences of each of the multiple keywords in the virtual document corresponding to the target domain; a prior value determination unit, which determines a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and a domain model update unit, which updates the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the multiple keywords in that virtual document. The domain language model is then updated according to the domain prior value, so that the resulting domain language model is more accurate, that is, it covers each domain to which the target entity relates, which improves retrieval effectiveness.
With the technical solution of the present invention, the user can accurately retrieve the target retrieval result from microblog documents, which improves retrieval efficiency and accuracy and also enhances the robustness of retrieval.
Detailed description of the invention
In order that the above objects, features and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, and therefore the scope of protection of the present invention is not limited by the specific embodiments disclosed below.
Fig. 1 shows a schematic flowchart of a retrieval method according to an embodiment of the present invention.
As shown in Fig. 1, a retrieval method according to an embodiment of the present invention includes: step 102, upon receiving a query for retrieving microblog documents in a microblog corpus, creating an original query model corresponding to the query according to the query; step 104, identifying a target entity in the query; step 106, expanding the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model built from each microblog document in the microblog document collection, to obtain an expanded query model; step 108, computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result of the query according to the similarity.
In this technical solution, when a query is used to retrieve microblog documents in the microblog corpus, the query may contain an alias of the target entity, so identifying the target entity in the query can effectively improve retrieval effectiveness. In addition, the original query model corresponding to the query is expanded to obtain an expanded query model. When microblog documents are retrieved according to the expanded query model, a large number of microblog documents relevant to the query, that is, documents containing the information the user is interested in, can be retrieved. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. Furthermore, the target retrieval result is determined by computing the similarity between the expanded query model and the microblog document language model of each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, a user can therefore accurately retrieve the target retrieval result from microblog documents, which improves retrieval accuracy. Here, the target entity is the keyword in the query; for example, if the query is "Zhou Jielun new film", the target entity is "Zhou Jielun".
In the above technical solution, preferably, the similarity between the expanded query model and the microblog document language model is computed by the following formula, and the target microblog documents whose similarity is greater than or equal to a preset similarity are taken as the target retrieval result:
where Score(Q, D) denotes the similarity, V denotes all entities in the microblog document language model, and the remaining terms of the formula denote, in order, the expanded query model, the microblog document language model, the probability of the target entity in the expanded query model, and the probability of the target entity in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these documents may contain much information that the user cares little about, or the information may not be ranked in any order of priority, so that information the user cares little about may be ranked ahead of the information the user cares about most. Therefore, by computing the similarity between the expanded query model and the microblog document language model and determining the target retrieval result according to how high the similarity is, information that is unimportant, weakly related, or of little interest to the user can be filtered out. This technical solution thus improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The above formula is a KL divergence (Kullback-Leibler divergence, also known as relative entropy) calculation. Here, "all entities" refers to all of the words in each microblog document represented in the microblog document language model. For example, if a microblog document is "Zhou Jielun new film is excellent", then the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is simply a word in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
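As a rough illustration of the KL-divergence scoring described above, the following Python sketch ranks microblog documents by the negative KL divergence between a query model and each document language model, and keeps those at or above a preset similarity. The formula itself is not reproduced in this text, so the form used here, Score(Q, D) = -sum over w in V of P(w|Q) * log(P(w|Q) / P(w|D)), is an assumption based on the standard definition of relative entropy. The function names, the additive smoothing constant and the threshold are hypothetical and are not taken from the specification.

```python
import math
from collections import Counter


def doc_language_model(tokens, vocab, mu=0.01):
    """Unigram document model over `vocab`.

    Additive smoothing with `mu` is an assumption; it only keeps every
    probability strictly positive so the logarithm below is defined.
    """
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def kl_score(query_model, doc_model):
    """Negative KL divergence (assumed form); higher means more similar.

    Every word of `query_model` must also be a key of `doc_model`.
    """
    score = 0.0
    for w, p_q in query_model.items():
        if p_q > 0.0:
            score -= p_q * math.log(p_q / doc_model[w])
    return score


def retrieve(query_model, docs, vocab, threshold):
    """Return (doc_id, score) pairs with score >= threshold, best first."""
    results = []
    for doc_id, tokens in docs.items():
        s = kl_score(query_model, doc_language_model(tokens, vocab))
        if s >= threshold:
            results.append((doc_id, s))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


vocab = {"zhou_jielun", "new", "film", "excellent", "weather"}
docs = {
    "d1": ["zhou_jielun", "new", "film", "excellent"],
    "d2": ["weather", "weather"],
}
query_model = {"zhou_jielun": 0.5, "film": 0.5}
ranked = retrieve(query_model, docs, vocab, threshold=-10.0)
```

In this toy setup the document that mentions both query entities is ranked ahead of the unrelated one; raising `threshold` plays the role of the preset similarity.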
In the above technical solution, preferably, the expanded query model is calculated according to the following formula:
where the terms of the formula denote, in order, the expanded query model, the original query model, the target entity topic model, the probability of the target entity in the expanded query model, the probability of the target entity in the original query model, and the probability of the target entity in the target entity topic model, and α denotes the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs to retrieve. The original query model therefore needs to be expanded to obtain an expanded query model. When microblog documents are retrieved according to the expanded query model, a large number of microblog documents relevant to the query, that is, documents containing the information the user is interested in, can be retrieved. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval effectiveness.
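The expansion step can be sketched as a linear interpolation of the two models. The expansion formula is not reproduced in this text, so the form below, P(w|expanded) = (1 - α) * P(w|original) + α * P(w|entity topic), is an assumption; in particular, which model receives α and which receives 1 - α is a guess. The function name is hypothetical.

```python
def expand_query_model(original, entity_topic, alpha):
    """Linearly interpolate two unigram models (assumed form).

    `original` and `entity_topic` map words to probabilities; `alpha`
    is the interpolation parameter, assumed to weight the entity topic model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocab = set(original) | set(entity_topic)
    return {
        w: (1.0 - alpha) * original.get(w, 0.0) + alpha * entity_topic.get(w, 0.0)
        for w in vocab
    }


original = {"zhou_jielun": 0.5, "film": 0.5}
entity_topic = {"zhou_jielun": 0.4, "concert": 0.3, "album": 0.3}
expanded = expand_query_model(original, entity_topic, alpha=0.4)
```

The expanded model now gives non-zero probability to words such as "concert" that never appear in the query, which is what lets it match additional relevant documents.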
In the above technical solution, preferably, upon receiving an update command, α is updated according to the following formula to obtain α':
where w denotes the target entity, E denotes all entities in the target entity topic model, Q denotes all entities in the query, w_1 denotes any entity in the query, IDF(w) denotes the inverse document frequency of the target entity in the microblog corpus, and IDF(w_1) denotes the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the same target entity has a different degree of importance in different queries, and the initial interpolation parameter α is related to the target entity topic model corresponding to the target entity. Therefore, when different queries are retrieved, the initial interpolation parameter α needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated α', which makes the expanded query model more accurate. Here, "all entities" refers to all of the words in each microblog document represented in the microblog document language model. For example, if a microblog document is "Zhou Jielun new film is excellent", then the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is simply a word in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, when there are multiple target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus and the target entity topic model of each target entity, so that the expanded query model is created using the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when the query contains multiple target entities, the final entity topic model is determined according to the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus, and retrieval is performed with the expanded query model obtained from the final entity topic model. The resulting target retrieval result is therefore more accurate, that is, it contains microblog documents relevant to each of the multiple target entities, so that the target retrieval result consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, upon receiving a first creation command, the final entity topic model is determined by the following formula:
where n denotes the number of target entities, E_i denotes the i-th target entity among the multiple target entities, IDF(E_i) denotes the inverse document frequency of each target entity in the microblog corpus, and the remaining terms of the formula denote, in order, the final entity topic model, the probability of each target entity in the final entity topic model, the target entity topic model of each target entity, and the probability of each target entity in the target entity topic model corresponding to that target entity.
In this technical solution, when the query contains multiple target entities, it can be seen from the formula that the final entity topic model is calculated from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus. Because the inverse document frequency of each target entity in the microblog corpus represents the importance of that target entity in the microblog corpus, retrieving with the expanded query model obtained from the final entity topic model makes the target retrieval result contain microblog documents related to each of the multiple target entities, and determines the target retrieval result according to the importance of each target entity in the microblog corpus. The target retrieval result is therefore the information the user wants to retrieve, which improves retrieval effectiveness. Here, the inverse document frequency (IDF) measures the importance of a target entity. The IDF of a target entity can be obtained by dividing the total number of microblog documents in the microblog corpus by the number of microblog documents that contain the target entity, and then taking the logarithm of the quotient. The IDF of the target entity also affects the updated interpolation parameter.
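The IDF computation described above (logarithm of the total number of documents divided by the number of documents containing the entity) can be sketched directly. The second function shows one possible way to merge several target entity topic models using those IDF values; since the combining formula is not reproduced in this text, the normalized IDF-weighted average used here is an assumption, and all function names are hypothetical.

```python
import math


def idf(entity, corpus):
    """IDF = log(total documents / documents containing the entity).

    `corpus` is a list of sets of entities, one set per microblog document.
    """
    n_containing = sum(1 for doc in corpus if entity in doc)
    if n_containing == 0:
        raise ValueError(f"entity {entity!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)


def combine_entity_topic_models(models, idfs):
    """IDF-weighted average of per-entity topic models (assumed form).

    `models` maps each target entity to its topic model (word -> probability);
    `idfs` maps each target entity to its IDF.
    """
    total = sum(idfs[e] for e in models)
    if total <= 0.0:
        raise ValueError("IDF weights must have a positive sum")
    combined = {}
    for entity, model in models.items():
        weight = idfs[entity] / total
        for w, p in model.items():
            combined[w] = combined.get(w, 0.0) + weight * p
    return combined


corpus = [
    {"zhou_jielun", "film"},
    {"zhou_jielun"},
    {"film", "new"},
    {"weather"},
]
models = {
    "zhou_jielun": {"film": 0.5, "music": 0.5},
    "weather": {"rain": 1.0},
}
idfs = {e: idf(e, corpus) for e in models}
final_model = combine_entity_topic_models(models, idfs)
```

With this weighting, a rarer target entity (higher IDF) contributes more to the final entity topic model than a common one.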
In the above technical solution, preferably, upon receiving a second creation command, the target entity topic model corresponding to the target entity is created by the following process: when the corpus database in which the microblog corpus is stored receives the target entity, extracting from the microblog corpus, according to the target entity, M microblog documents relevant to the target entity; according to the target domain to which the target entity belongs, searching a target domain knowledge base connected to the corpus database for a plurality of keywords relevant to the target domain, where the plurality of keywords include the target entity; generating a virtual document corresponding to the target domain according to the plurality of keywords; building a domain language model according to the virtual document, and building a background language model according to all entities in each microblog document in the microblog corpus; and traversing the M microblog documents using the domain language model, the background language model and an initial entity model corresponding to the target entity, and performing N iterations, to obtain the target entity topic model, where M >= 1, N >= 1, and M and N are positive integers.
In this technical solution, the domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control "background noise" and "domain-related noise" and thus purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is then performed with the expanded query model obtained by expansion with the target entity topic model, a large number of microblog documents relevant to the query, that is, documents containing the information the user is interested in, can be retrieved, which effectively avoids missing relevant microblog documents and improves retrieval effectiveness. Here, "all entities" refers to all of the words in each microblog document represented in the microblog document language model. For example, if a microblog document is "Zhou Jielun new film is excellent", then the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is simply a word in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the method further includes: after the virtual document corresponding to the target domain is generated, counting a first number of occurrences of the target entity in the virtual document corresponding to the target domain, and a second number of occurrences of each of the plurality of keywords in that virtual document; determining a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and updating the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the plurality of keywords in that virtual document. The domain language model is then updated according to the domain prior value, so that the resulting domain language model is more accurate, that is, it covers each domain to which the target entity relates, which in turn improves retrieval effectiveness.
Fig. 2 shows a schematic flowchart of a retrieval method according to another embodiment of the present invention.
As shown in Fig. 2, a retrieval method according to another embodiment of the present invention includes:
Step 202, obtaining all microblog documents in the microblog stream.
Step 204, building a microblog document language model from each microblog document, and proceeding to step 218.
Step 206, obtaining a microblog corpus from the microblog stream, where the microblog corpus includes microblog documents.
Step 208, identifying all entities in the microblog documents, for example by using the entity recognition tool TwitterNLP, and building an entity index for each of the entities, where each entity corresponds to a list of microblog documents sorted in chronological order.
Step 210, identifying the target entity in the query.
Step 212, estimating the target entity topic model of the target entity, and proceeding to step 216.
Step 214, upon receiving a query for retrieving microblog documents in the microblog corpus, creating the original query model corresponding to the query from the query by maximum likelihood estimation.
Step 216, expanding the original query model according to the target entity topic model and the original query model (that is, according to the target entity topic model corresponding to the target entity, the original query model, and the microblog document language model built from each microblog document in the microblog document collection), to obtain the expanded query model.
Step 218, performing a KL divergence calculation on the expanded query model and the microblog document language model built from each microblog document in the microblog document collection (that is, computing the similarity between the expanded query model and the microblog document language model).
Step 220, determining the target retrieval result of the query according to the similarity.
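Step 214 builds the original query model by maximum likelihood estimation. A minimal sketch, assuming the query has already been split into entities (words) and that the model is a unigram distribution in which each entity's probability is its relative frequency in the query; the function name is hypothetical.

```python
from collections import Counter


def mle_query_model(query_entities):
    """Maximum likelihood unigram model: P(w | Q) = count(w, Q) / |Q|."""
    if not query_entities:
        raise ValueError("the query must contain at least one entity")
    counts = Counter(query_entities)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


query_model = mle_query_model(["zhou_jielun", "new", "film"])
```

Because a short query assigns probability only to its own few words, this model on its own matches few documents, which is the motivation for the expansion in step 216.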
Fig. 3 shows a schematic flowchart of the preliminary acquisition of microblog documents according to an embodiment of the present invention.
As shown in Fig. 3, the preliminary acquisition of microblog documents according to an embodiment of the present invention includes:
Step 302, identifying all entities in the microblog corpus.
Step 304, building an entity index for each of the entities, where each entity corresponds to a list of microblog documents sorted in chronological order.
Step 306, searching the entity index according to the target entity to find M microblog documents relevant to the target entity, where these M microblog documents are the most recently published microblog documents in the entity index.
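Steps 302 to 306 can be sketched as a small inverted index whose posting lists are kept in chronological order. The sketch assumes that entity recognition has already been done (the specification mentions TwitterNLP for that step, which is not called here) and that each document carries a sortable timestamp; the tuple layout and function names are hypothetical.

```python
from collections import defaultdict


def build_entity_index(docs):
    """Map each entity to its documents, sorted from oldest to newest.

    `docs` is an iterable of (doc_id, timestamp, entities) tuples.
    """
    index = defaultdict(list)
    for doc_id, timestamp, entities in docs:
        for entity in set(entities):
            index[entity].append((timestamp, doc_id))
    for postings in index.values():
        postings.sort()
    return dict(index)


def latest_docs(index, target_entity, m):
    """Return the ids of the M most recent documents for the entity, newest first."""
    if m < 1:
        raise ValueError("M must be a positive integer")
    postings = index.get(target_entity, [])
    return [doc_id for _, doc_id in reversed(postings[-m:])]


docs = [
    ("a", 1, ["zhou_jielun", "film"]),
    ("b", 3, ["zhou_jielun", "concert"]),
    ("c", 2, ["zhou_jielun"]),
    ("d", 4, ["weather"]),
]
index = build_entity_index(docs)
recent = latest_docs(index, "zhou_jielun", m=2)
```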
Fig. 4 shows a schematic flowchart of determining the target entity topic model according to an embodiment of the present invention; Fig. 5 shows a schematic diagram of the principle of the target entity topic model according to an embodiment of the present invention.
The technical solution is described in detail below with reference to Fig. 4 and Fig. 5:
As shown in Fig. 4, determining the target entity topic model according to an embodiment of the present invention includes:
Step 402, identifying the target entity in the query.
Step 404, according to the target domain to which the target entity belongs, searching the target domain knowledge base connected to the corpus database for a plurality of keywords relevant to the target domain, where the plurality of keywords include the target entity.
Step 406, generating a virtual document corresponding to the target domain according to the plurality of keywords, building a domain language model according to the virtual document, building a background language model according to all entities in each microblog document in the microblog corpus, and building an initial entity model corresponding to the target entity, so that a mixture model is built from the domain language model, the background language model and the initial entity model, as shown in Fig. 5, and the target entity topic model of the target entity is derived through the process of building the mixture model. In Fig. 5, λ_C and λ_E are preset parameters, γ_1 and γ_k denote the weight of the 1st domain language model and the weight of the k-th domain language model, EF denotes the M microblog documents of Fig. 3, and the remaining symbols denote the initial entity model, the background language model and the k domain language models.
Step 408 (equivalent to step 306), searching the entity index according to the target entity to find the M microblog documents relevant to the target entity (that is, extracting from the microblog corpus, according to the target entity, the M microblog documents relevant to the target entity).
Step 410, traversing the M microblog documents with the EM algorithm to iteratively compute the model parameters, where the EM algorithm denotes the expectation-maximization algorithm.
Step 412, iterating the mixture model according to the iteratively computed model parameters to obtain the target entity topic model, where the number of iterations is a preset number N; in the first iteration, the initial entity model corresponding to the target entity may be taken as approximately equal to the background language model; M >= 1, N >= 1, and M and N are positive integers.
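The EM estimation in steps 406 to 412 can be sketched as follows. The exact mixture of Fig. 5 is not reproduced in this text, so the structure below is an assumption: each word of the M feedback documents is generated by the entity model with weight λ_E, by the background model with weight λ_C, and by the domain models with the remaining weight 1 - λ_E - λ_C, shared among them by fixed weights γ. Only the entity model is re-estimated here; whether the specification also re-estimates γ is not known. Following step 412, the entity model is initialized from the background model. All names and default values are hypothetical.

```python
from collections import Counter


def estimate_entity_topic_model(feedback_docs, background, domain_models, gammas,
                                lambda_e=0.5, lambda_c=0.3, n_iters=20):
    """EM for the entity component of an assumed three-part mixture.

    `background` must give a positive probability to every observed word;
    `gammas` are assumed to sum to 1.
    """
    lambda_d = 1.0 - lambda_e - lambda_c
    if lambda_e <= 0.0 or lambda_c < 0.0 or lambda_d < 0.0:
        raise ValueError("mixture weights must be valid probabilities")
    counts = Counter(w for doc in feedback_docs for w in doc)

    # Initial entity model: the background model restricted to observed words.
    norm = sum(background[w] for w in counts)
    entity = {w: background[w] / norm for w in counts}

    for _ in range(n_iters):
        expected = {}
        for w, c in counts.items():
            p_e = lambda_e * entity[w]
            p_c = lambda_c * background[w]
            p_d = lambda_d * sum(g * dm.get(w, 0.0)
                                 for g, dm in zip(gammas, domain_models))
            # E-step: expected number of occurrences of w drawn from the entity model.
            expected[w] = c * p_e / (p_e + p_c + p_d)
        # M-step: renormalize the expected counts into a distribution.
        total = sum(expected.values())
        entity = {w: v / total for w, v in expected.items()}
    return entity


background = {"the": 0.5, "new": 0.2, "zhou_jielun": 0.05,
              "concert": 0.05, "music": 0.1, "film": 0.1}
music_domain = {"music": 0.6, "concert": 0.3, "zhou_jielun": 0.1}
feedback = [
    ["zhou_jielun", "concert", "the"],
    ["zhou_jielun", "new", "music", "the"],
    ["zhou_jielun", "the", "film"],
]
topic_model = estimate_entity_topic_model(feedback, background, [music_domain], [1.0])
```

In this toy example "the" and "zhou_jielun" occur equally often in the feedback documents, but the background model explains most occurrences of "the", so the estimated entity model gives "zhou_jielun" the larger probability. This is the sense in which the background and domain components absorb noise.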
Fig. 6 shows a schematic flowchart of determining the expanded query model and the target retrieval result according to an embodiment of the present invention.
As shown in Fig. 6, determining the expanded query model and the target retrieval result according to an embodiment of the present invention includes:
Step 602, identifying the target entity in the query.
Step 604, building the target entity topic model corresponding to the target entity, and proceeding to step 610.
Step 606, updating the initial interpolation parameter α to obtain α', and proceeding to step 610.
Step 608, creating the original query model corresponding to the query according to the query, and proceeding to step 610.
Step 610, linearly combining the target entity topic model and the original query model using the updated interpolation parameter α', to determine the expanded query model.
Step 612, obtaining microblog documents from the microblog stream.
Step 614, building a microblog document language model from each microblog document in the microblog document collection.
Step 616, performing a KL divergence calculation on the expanded query model and the microblog document language model (that is, computing the similarity between the expanded query model and the microblog document language model).
Step 618, taking the target microblog documents whose similarity is greater than or equal to the preset similarity as the target retrieval result.
Fig. 7 shows a schematic structural diagram of a retrieval system according to an embodiment of the present invention.
As shown in Fig. 7, a retrieval system 700 according to an embodiment of the present invention includes: a first model creating unit 702, an entity recognition unit 704, a model expansion unit 706 and a retrieval result determining unit 708. The first model creating unit 702 is configured to, upon receiving a query for retrieving microblog documents in the microblog corpus, create an original query model corresponding to the query according to the query; the entity recognition unit 704 identifies the target entity in the query; the model expansion unit 706 expands the original query model according to the target entity topic model corresponding to the target entity, the original query model, and the microblog document language model built from each microblog document in the microblog document collection, to obtain an expanded query model; and the retrieval result determining unit 708 computes the similarity between the expanded query model and the microblog document language model, and determines the target retrieval result of the query according to the similarity.
In this technical solution, when a query is used to retrieve microblog documents in the microblog corpus, the query may contain an alias of the target entity, so identifying the target entity in the query can effectively improve retrieval effectiveness. In addition, the original query model corresponding to the query is expanded to obtain an expanded query model. When microblog documents are retrieved according to the expanded query model, a large number of microblog documents relevant to the query, that is, documents containing the information the user is interested in, can be retrieved. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. Furthermore, the target retrieval result is determined by computing the similarity between the expanded query model and the microblog document language model of each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, a user can therefore accurately retrieve the target retrieval result from microblog documents, which improves accuracy. Here, the target entity is the target keyword in the query that the user wants to query; for example, if the query is "Zhou Jielun new film", the target entity is "Zhou Jielun", while "new" and "film" are other entities, that is, words in the ordinary sense.
In the above technical solution, preferably, the retrieval result determining unit 708 includes: a similarity computing unit 7082, which computes the similarity between the expanded query model and the microblog document language model by the following formula, and takes the target microblog documents whose similarity is greater than or equal to a preset similarity as the target retrieval result:
where Score(Q, D) denotes the similarity, V denotes all entities in the microblog document language model, and the remaining terms of the formula denote, in order, the expanded query model, the microblog document language model, the probability of the target entity in the expanded query model, and the probability of the target entity in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these documents may contain much information that the user cares little about, or the information may not be ranked in any order of priority, so that information the user cares little about may be ranked ahead of the information the user cares about most. Therefore, by computing the similarity between the expanded query model and the microblog document language model and determining the target retrieval result according to how high the similarity is, information that is unimportant, weakly related, or of little interest to the user can be filtered out. This technical solution thus improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The above formula is a KL divergence (Kullback-Leibler divergence, also known as relative entropy) calculation. Here, "all entities" refers to all of the words in each microblog document represented in the microblog document language model. For example, if a microblog document is "Zhou Jielun new film is excellent", then the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is simply a word in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit 706 is specifically configured to calculate the expanded query model according to the following formula:
where the terms of the formula denote, in order, the expanded query model, the original query model, the target entity topic model, the probability of the target entity in the expanded query model, the probability of the target entity in the original query model, and the probability of the target entity in the target entity topic model, and α denotes the initial interpolation parameter.
In this technical solution, the original query model yields relatively few retrieval results, which may not even contain the information the user needs to retrieve. The original query model therefore needs to be expanded to obtain an expanded query model. When microblog documents are retrieved according to the expanded query model, a large number of microblog documents relevant to the query, that is, documents containing the information the user is interested in, can be retrieved. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves retrieval effectiveness.
In the above technical solution, preferably, the system further includes: a parameter updating unit 710, which, upon receiving an update command, updates α according to the following formula to obtain α':
where w denotes the target entity, E denotes all entities in the target entity topic model, Q denotes all entities in the query, w_1 denotes any entity in the query, IDF(w) denotes the inverse document frequency of the target entity in the microblog corpus, and IDF(w_1) denotes the inverse document frequency of that entity in the microblog corpus.
In this technical solution, the same target entity has a different degree of importance in different queries, and the initial interpolation parameter α is related to the target entity topic model corresponding to the target entity. Therefore, when different queries are retrieved, the initial interpolation parameter α needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated α', which makes the expanded query model more accurate. Here, "all entities" refers to all of the words in each microblog document represented in the microblog document language model. For example, if a microblog document is "Zhou Jielun new film is excellent", then the entities in that document are "Zhou Jielun", "new", "film" and "excellent". In short, an entity is simply a word in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit 706 is further configured to: when
there are multiple target entities, determine a final entity topic model according to the inverse
document frequency of each target entity in the microblog corpus and the target entity topic model of
each target entity, so that the expanded query model is created from the final entity topic model, the
original query model and the microblog document language model.
In this technical solution, when the query statement contains multiple target entities, the final entity
topic model is determined from the target entity topic model of each target entity and the inverse
document frequency of each target entity in the microblog corpus, and retrieval is performed with the
expanded query model obtained from the final entity topic model. The target retrieval result is therefore
more accurate: it contains the microblog documents related to each of the multiple target entities, so
that it consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, the model expansion unit 706 is specifically configured to
determine, according to a received first creation command, the final entity topic model by the following
formula:
$$p(w \mid \theta_E) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i)\, p(w \mid \theta_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\theta_E$ denotes the final entity topic model, $p(w \mid \theta_E)$ denotes the probability of
each target entity in the final entity topic model, n denotes the number of target entities,
$\theta_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes
the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \theta_{E_i})$
denotes the probability of each target entity in the target entity topic model corresponding to that
target entity, and $E_i$ denotes the i-th target entity among the multiple target entities.
In this technical solution, when the query statement contains multiple target entities, it can be seen
from the formula that the final entity topic model is calculated from the target entity topic model of
each target entity and the inverse document frequency of each target entity in the microblog corpus.
Because the inverse document frequency of a target entity in the microblog corpus expresses the
importance of that target entity in the corpus, retrieving with the expanded query model obtained from
the final entity topic model yields a target retrieval result that contains the microblog documents
related to each of the multiple target entities and that is determined according to the importance of
each target entity in the corpus. The target retrieval result is thus the information the user wants to
retrieve, and the retrieval effect is improved. Here, the inverse document frequency (IDF) measures the
importance of a target entity. The IDF of a target entity can be obtained by dividing the total number
of microblog documents in the microblog corpus by the number of microblog documents containing the
target entity and then taking the logarithm of the quotient. The IDF of the target entity affects the
updated interpolation parameter.
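The IDF computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the natural logarithm, the function name `inverse_document_frequency` and the toy corpus are assumptions, and each microblog document is represented simply as a list of its entities.

```python
import math

def inverse_document_frequency(entity, corpus):
    """IDF = log(total documents / documents containing the entity)."""
    total = len(corpus)
    containing = sum(1 for doc in corpus if entity in doc)
    if containing == 0:
        raise ValueError(f"entity {entity!r} does not occur in the corpus")
    # The base of the logarithm is not stated in the text; natural log is assumed.
    return math.log(total / containing)

# Hypothetical toy corpus: each microblog document is a list of entities (words).
corpus = [
    ["Zhou Jielun", "new", "film", "excellent"],
    ["new", "phone", "released"],
    ["film", "festival", "opens"],
    ["weather", "today"],
]

idf_entity = inverse_document_frequency("Zhou Jielun", corpus)  # log(4 / 1)
idf_common = inverse_document_frequency("new", corpus)          # log(4 / 2)
```

An entity that occurs in few documents receives a larger IDF than a common word, which is why the text treats IDF as a measure of importance.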
In the above technical solution, preferably, the system further includes a second model creation unit
712, configured to create, according to a received second creation command, the target entity topic
model corresponding to the target entity by the following process: when the corpus database in which the
microblog corpus is stored receives the target entity, extracting from the microblog corpus, according
to the target entity, M microblog documents related to the target entity; searching, according to the
target domain to which the target entity belongs, a target domain knowledge base connected to the corpus
database for multiple keywords related to the target domain, the multiple keywords including the target
entity; generating a virtual document corresponding to the target domain from the multiple keywords;
establishing a domain language model from the virtual document, and establishing a background language
model from all entities in each microblog document in the microblog corpus; and traversing the M
microblog documents with the domain language model, the background language model and an initial entity
model corresponding to the target entity, and performing N iterations, to obtain the target entity topic
model, where M ≥ 1, N ≥ 1, and M and N are both positive integers.
In this technical solution, the established domain language model, background language model and initial
entity model corresponding to the target entity make it possible to control "background noise" and
"domain-related noise" and thus to purify the microblog documents, so that the target entity topic model
of the target entity is determined accurately. When retrieval is performed with the expanded query model
obtained by expansion with the target entity topic model, a large number of microblog documents relevant
to the query statement, i.e. documents containing the information the user is interested in, can be
retrieved. Missed microblog documents are thereby effectively avoided and the retrieval effect is
improved. Here, "all entities" means all the words in each microblog document in the microblog document
language model. For example, if a microblog document in the microblog document language model is "Zhou
Jielun new film is excellent", then all entities in this microblog document are "Zhou Jielun", "new",
"film" and "excellent". In short, an entity is a word in the ordinary sense, and the target entity is
the keyword the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the second model creation unit 712 further includes: a
count statistics unit 7122, configured to count, after the virtual document corresponding to the target
domain is generated, a first number of occurrences of the target entity in the virtual document
corresponding to the target domain, and a second number of occurrences of each of the multiple keywords
in the virtual document corresponding to the target domain; a prior value determination unit 7124,
configured to determine the domain prior value of the target entity from the first number of occurrences
and the second number of occurrences; and a domain model updating unit 7126, configured to update the
domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the
first number of occurrences of the target entity in the virtual document corresponding to the target
domain and the second number of occurrences of each of the multiple keywords in that virtual document.
The domain language model is then updated according to the domain prior value, so that the resulting
domain language model is more accurate, i.e. it covers each domain of the target entity, which improves
the retrieval effect.
Fig. 8 shows a schematic structural diagram of a retrieval system according to another embodiment of the
present invention.
As shown in Fig. 8, a retrieval system 800 according to another embodiment of the present invention
(equivalent to the retrieval system 700 of the embodiment shown in Fig. 7) includes: an entity microblog
set acquisition module 802, configured to collect the microblog documents related to the target entity;
an entity topic model estimation module 804 (equivalent to the second model creation unit 712 of the
embodiment shown in Fig. 7), configured to estimate the target entity topic model; and an adaptive query
expansion module 806 (equivalent to the model expansion unit 706 of the embodiment shown in Fig. 7),
configured to incorporate the target entity topic model into the microblog document language model.
These modules of the retrieval system 800 are described in detail below:
1. The entity microblog set acquisition module 802 is specifically configured to: identify the target
entity in the query statement, build the entity index, and select the microblog documents related to the
target entity.
2. The entity topic model estimation module 804 includes a knowledge base link module 8042, a prior value
calculation module 8044 (equivalent to the prior value determination unit 7124 of the embodiment shown in
Fig. 7) and a generative model construction module 8046. The knowledge base link module 8042 is
configured to link the target entity to the Freebase knowledge base and to obtain the target domain to
which the target entity belongs in the Freebase knowledge base (a domain in Freebase can be regarded as
one of the sections of a popular newspaper, such as business, lifestyle, art, entertainment, politics,
economics, etc.). The prior value calculation module 8044 is configured to obtain multiple keywords
related to the target domain, the multiple keywords including the target entity, to generate a virtual
document corresponding to the target domain from the multiple keywords, and to perform maximum likelihood
estimation on this virtual document to generate the domain prior value. The generative model
construction module 8046 is configured to build the initial entity model corresponding to the target
entity, the background language model and the domain language model, and to iterate over the microblog
documents with the EM algorithm to obtain the target entity topic model.
3. The adaptive query expansion module 806 is configured to model the query statement to obtain the
original query model, to model each microblog document in the microblog document set to obtain the
microblog document language model, to expand the original query model with the target entity topic model
to obtain the expanded query model, to calculate the KL divergence (KL distance) between the expanded
query model and the microblog document language model, and to obtain the target retrieval result from
the calculation result.
The technical solution is explained in further detail below:
I. Entity recognition.
1. Use the entity recognition tool TwitterNLP to identify all entities in the microblog documents.
2. Build an entity index in which each of the entities corresponds to a list of microblog documents
sorted by time.
3. Identify the target entity in the query statement, and obtain from the entity index the M most
recently published microblog documents that contain this target entity.
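The entity index of steps 2 and 3 can be sketched as a mapping from each entity to a time-sorted posting list. This is only an illustration under stated assumptions: entity recognition is taken as already done (each document arrives with its entity list), and the names `build_entity_index`, `latest_documents` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_entity_index(documents):
    """Map each entity to a list of (timestamp, doc_id) pairs sorted by time."""
    index = defaultdict(list)
    for doc_id, timestamp, entities in documents:
        for entity in set(entities):  # index each document once per entity
            index[entity].append((timestamp, doc_id))
    for postings in index.values():
        postings.sort()
    return index

def latest_documents(index, target_entity, m):
    """Return the ids of the M most recently published documents containing the entity."""
    if m < 1:
        raise ValueError("M must be a positive integer")
    postings = index.get(target_entity, [])
    return [doc_id for _, doc_id in reversed(postings[-m:])]

# Hypothetical documents: (id, publication time, entities already recognized).
documents = [
    ("d1", 1, ["Zhou Jielun", "new", "film"]),
    ("d2", 2, ["new", "phone"]),
    ("d3", 3, ["Zhou Jielun", "concert"]),
    ("d4", 4, ["Zhou Jielun", "film", "excellent"]),
]

index = build_entity_index(documents)
latest = latest_documents(index, "Zhou Jielun", 2)
```

Keeping the posting lists sorted by time means the M latest documents for a target entity are simply the tail of its list.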
II. Establishing the target entity topic model.
1. Link the target entity to the Freebase knowledge base (the target domain knowledge base) and read the
entity information of the target entity in the Freebase knowledge base, to obtain the target domain to
which the target entity belongs (for example the music domain, the art domain or the book domain). In
particular, if no entity information is linked for the target entity, the target entity is considered to
belong to any one of the domains.
2. Calculate the domain prior value. All entities in the entity index are linked to the Freebase
knowledge base by trying the Freebase search interface, and the attribute words and type words under
each domain are assembled into a virtual document (that is, multiple keywords related to the target
domain, including the target entity, are searched for in the target domain knowledge base connected to
the corpus database, and the virtual document corresponding to the target domain is generated from these
keywords). Maximum likelihood estimation is performed on this virtual document with the following
formula to generate the domain prior value:
$$p(w \mid d) = \frac{c(w, d)}{\sum_{w_2} c(w_2, d)}$$
where w denotes the target entity, d denotes the target domain to which the target entity belongs,
$w_2$ denotes each of the multiple keywords, c(w, d) denotes the first number of occurrences of w in the
virtual document corresponding to the target domain d, $c(w_2, d)$ denotes the second number of
occurrences of each of the multiple keywords in the virtual document corresponding to the target domain,
and n denotes the total number of keywords, over which the sum runs.
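As a worked illustration of this maximum likelihood estimate, the sketch below divides the count of each keyword in a virtual document by the total keyword count. The function name `domain_prior` and the sample keywords are assumptions made for the example; a real virtual document would be assembled from the knowledge base as described above.

```python
from collections import Counter

def domain_prior(virtual_document):
    """Maximum likelihood estimate p(w|d) = c(w, d) / sum of c(w2, d) over all keywords."""
    counts = Counter(virtual_document)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical virtual document for a music domain, given as a flat keyword list.
virtual_document = [
    "Zhou Jielun", "album", "singer", "album",
    "concert", "Zhou Jielun", "album", "lyrics",
]

prior = domain_prior(virtual_document)
# "Zhou Jielun" occurs 2 times out of 8 keywords, so its domain prior is 0.25.
```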
3. Establish the target entity topic model. A domain language model is established from the virtual
document, a background language model is established from all entities in each microblog document in
the microblog corpus, and an initial entity model corresponding to the target entity is established,
where the initial entity model may be similar to the background language model. The domain language
model, the background language model and the initial entity model together form a mixture model.
4. Perform model estimation with the EM algorithm. According to the mixture model shown in Fig. 5, the
log-likelihood function of the returned set EF of M microblog documents can be expressed as:
where EF denotes the M microblog documents retrieved above; i traverses all microblog documents in the
microblog corpus; w denotes each of the entities in each microblog document in the microblog corpus;
$D_i$ denotes the i-th microblog document in the microblog corpus; k denotes the number of target
domains to which the target entity belongs; $p(w \mid \theta_E)$ denotes the probability of w in the
target entity model; $p(w \mid \theta_C)$ denotes the frequency of word w in the background language
model; $p(w \mid \theta_d)$ denotes the frequency of word w in the domain language model; $c(w, D_i)$ is
the number of times word w occurs in $D_i$; $\lambda_C$ denotes a first preset parameter and $\lambda_E$
a second preset parameter, used to control background noise and domain-related noise respectively; and
$\gamma_d$ denotes the weight of the target domain language model.
The EM algorithm is used to perform maximum likelihood estimation on the mixture model, iteratively
updating the parameters on the microblog set EF, which yields the following formulas:
where n denotes the index of the current iteration; w denotes the target entity; w′ denotes each of the
entities in the microblog corpus; d′ denotes each of the domains; $s^{(n)}(w)$ and $r^{(n)}(w)$ are
intermediate variables introduced for convenience of calculation; $p^{(n+1)}(w \mid \theta_d)$ denotes
the probability of w in the domain language model at iteration (n+1); $p^{(n+1)}(w \mid \theta_E)$
denotes the probability of w in the entity topic model at iteration (n+1); $\gamma_d^{(n+1)}$ denotes
the weight of the domain language model at iteration (n+1); in the summation subscripts, w and w′
traverse all entities in the microblog corpus, i traverses all microblog documents in the feedback
microblog set, and d and d′ traverse all domains; k denotes the number of target domains to which the
target entity E belongs; and λ denotes a preset iteration parameter.
In addition, the domain prior value p(w|d) of the target entity can be used when updating
$p(w \mid \theta_d)$. A conjugate prior (i.e. a Dirichlet prior) is defined on each language model
p(w|d), and maximum a posteriori (MAP) estimation is then used to estimate all parameters. Only a small
change to the update formula of the domain language model is needed, and the MAP estimation is performed
by the following formula:
At this point, after several rounds of iteration with the above formulas (for example 100 rounds), the
target entity topic model $p(w \mid \theta_E)$ is obtained.
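The EM estimation can be illustrated with a deliberately simplified sketch. It is not the formula set of this document: it assumes a nested mixture in which a word is generated by the background model with weight λ_C and otherwise by the domain models with weight λ_E (mixed by γ_d) or by the entity topic model, it keeps the background and domain models fixed, and it omits the Dirichlet prior and MAP step. All names and sample values are hypothetical.

```python
def estimate_entity_topic_model(feedback_docs, background, domains,
                                lambda_c=0.5, lambda_e=0.3, iterations=100):
    """Simplified EM: estimate p(w|theta_E) and the domain weights gamma_d only."""
    counts = {}  # pooled counts c(w) over the M feedback documents
    for doc in feedback_docs:
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
    vocab = list(counts)
    p_entity = {w: 1.0 / len(vocab) for w in vocab}   # uniform initial entity model
    gamma = {d: 1.0 / len(domains) for d in domains}  # uniform initial domain weights
    tiny = 1e-12  # floor for words missing from a fixed model

    for _ in range(iterations):
        exp_entity = {w: 0.0 for w in vocab}
        exp_gamma = {d: 0.0 for d in domains}
        for w in vocab:
            from_bg = lambda_c * background.get(w, tiny)
            from_dom = {d: (1 - lambda_c) * lambda_e * gamma[d] * domains[d].get(w, tiny)
                        for d in domains}
            from_ent = (1 - lambda_c) * (1 - lambda_e) * p_entity[w]
            total = from_bg + sum(from_dom.values()) + from_ent
            # E-step: expected counts attributed to the entity model and to each domain.
            exp_entity[w] = counts[w] * from_ent / total
            for d in domains:
                exp_gamma[d] += counts[w] * from_dom[d] / total
        # M-step: renormalize the expected counts.
        z_entity = sum(exp_entity.values())
        p_entity = {w: v / z_entity for w, v in exp_entity.items()}
        z_gamma = sum(exp_gamma.values())
        gamma = {d: v / z_gamma for d, v in exp_gamma.items()}
    return p_entity, gamma

# Hypothetical toy data.
feedback_docs = [
    ["Zhou Jielun", "new", "film", "the"],
    ["Zhou Jielun", "album", "the"],
    ["Zhou Jielun", "film", "the", "the"],
]
background = {"the": 0.6, "new": 0.2, "film": 0.08, "album": 0.08, "Zhou Jielun": 0.04}
domains = {
    "film": {"film": 0.7, "new": 0.1, "Zhou Jielun": 0.2},
    "music": {"album": 0.7, "Zhou Jielun": 0.3},
}
p_entity, gamma = estimate_entity_topic_model(feedback_docs, background, domains)
```

In this toy run the word "the" is largely explained by the background model, so it ends up with less probability in the entity topic model than "Zhou Jielun", which is the noise-control effect the mixture is meant to achieve.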
III. Adaptive query expansion.
1. When a query statement for retrieving the microblog documents in the microblog corpus is received,
create the original query model corresponding to the query statement, and establish the microblog
document language model from each microblog document in the microblog document set.
2. Expand the original query model with the target entity topic model to obtain the expanded query
model. The expanded query model is calculated according to the following formula:
$$p(w \mid \theta_{Q'}) = (1 - \alpha)\, p(w \mid \theta_Q) + \alpha\, p(w \mid \theta_E)$$
where $\theta_{Q'}$ denotes the expanded query model, $\theta_Q$ denotes the original query model,
$\theta_E$ denotes the target entity topic model, $p(w \mid \theta_{Q'})$ denotes the probability of the
target entity in the expanded query model, $p(w \mid \theta_Q)$ denotes the probability of the target
entity in the original query model, $p(w \mid \theta_E)$ denotes the probability of the target entity in
the target entity model, and α denotes the initial interpolation parameter, which controls the
importance of the target entity topic model.
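A minimal sketch of this expansion step follows, assuming the expansion is the linear interpolation the text refers to, with α weighting the target entity topic model and 1 − α the original query model. The function names, the value α = 0.4 and the entity topic model probabilities are hypothetical; the query tokens are those of the example query used later in the text.

```python
def maximum_likelihood_model(tokens):
    """Unigram model estimated by relative frequency."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return {token: count / len(tokens) for token, count in counts.items()}

def expand_query_model(query_model, entity_topic_model, alpha):
    """Assumed linear interpolation of the query model and the entity topic model."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocabulary = set(query_model) | set(entity_topic_model)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * entity_topic_model.get(w, 0.0)
        for w in vocabulary
    }

# Original query model: each of the three tokens gets probability 1/3.
query_model = maximum_likelihood_model(["Zhou Jielun", "new", "film"])

# Hypothetical target entity topic model for "Zhou Jielun".
entity_topic_model = {"Zhou Jielun": 0.5, "film": 0.2, "album": 0.2, "concert": 0.1}

expanded = expand_query_model(query_model, entity_topic_model, alpha=0.4)
```

The expanded model now assigns probability to "album" and "concert", words absent from the query, which is how documents that never mention the literal query terms can still be retrieved.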
In the related art, the initial interpolation parameter α is set to one fixed value for all query
statements. However, because the importance of the same target entity differs between query statements,
the initial interpolation parameter can be updated. α is updated according to the following formula to
obtain α′:
where w denotes the target entity, E denotes all entities in the target entity model, Q denotes all
entities in the query statement, $w_1$ denotes any entity in the query statement, IDF(w) denotes the
inverse document frequency of the target entity in the microblog corpus, and $\mathrm{IDF}(w_1)$ denotes
the inverse document frequency of that entity in the microblog corpus.
In particular, when multiple target entities are identified in the query statement, the final entity
topic model is determined as the weighted average of the target entity topic models of the target
entities. Specifically, the final entity topic model is determined by the following formula:
$$p(w \mid \theta_E) = \frac{\sum_{i=1}^{n} \mathrm{IDF}(E_i)\, p(w \mid \theta_{E_i})}{\sum_{i=1}^{n} \mathrm{IDF}(E_i)}$$
where $\theta_E$ denotes the final entity topic model, $p(w \mid \theta_E)$ denotes the probability of
each target entity in the final entity topic model, n denotes the number of target entities,
$\theta_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes
the inverse document frequency of each target entity in the microblog corpus, $p(w \mid \theta_{E_i})$
denotes the probability of each target entity in the target entity topic model corresponding to that
target entity, and $E_i$ denotes the i-th target entity among the multiple target entities.
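The weighted average over several target entities can be sketched as follows, assuming the weights are the IDF values normalized to sum to one. The second entity, the IDF values and the topic model probabilities are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def final_entity_topic_model(topic_models, idf):
    """IDF-weighted average of per-entity topic models (normalization is an assumption)."""
    total_idf = sum(idf[entity] for entity in topic_models)
    if total_idf <= 0:
        raise ValueError("the IDF values must sum to a positive number")
    final = {}
    for entity, model in topic_models.items():
        weight = idf[entity] / total_idf
        for word, probability in model.items():
            final[word] = final.get(word, 0.0) + weight * probability
    return final

# Hypothetical topic models for two target entities found in one query statement.
topic_models = {
    "Zhou Jielun": {"album": 0.6, "film": 0.4},
    "film festival": {"film": 0.7, "award": 0.3},
}
idf = {"Zhou Jielun": 3.0, "film festival": 1.0}

final_model = final_entity_topic_model(topic_models, idf)
# Weights are 0.75 and 0.25, so p(film) = 0.75 * 0.4 + 0.25 * 0.7 = 0.475.
```

The rarer entity, which has the higher IDF, dominates the final model, matching the idea that IDF expresses how important an entity is in the corpus.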
3. KL divergence (KL distance) calculation, i.e. computing the similarity between the expanded query
model and the microblog document language model. The similarity between the expanded query model and the
microblog document language model is computed by the following formula, and the target microblog
documents whose similarity is greater than or equal to a preset similarity are taken as the target
retrieval result:
where Score(Q, D) denotes the similarity, V denotes all entities in the microblog document language
model, $\theta_{Q'}$ denotes the expanded query model, $\theta_D$ denotes the microblog document
language model, $p(w \mid \theta_{Q'})$ denotes the probability of the target entity in the expanded
query model, and $p(w \mid \theta_D)$ denotes the probability of the target entity in the microblog
document language model.
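One common way to turn a KL divergence into a similarity is to score each document by the negative divergence from the expanded query model to the document model. The sketch below assumes that form, which is not confirmed by this text, and smooths the document model with a Dirichlet prior; the value of `mu`, the collection model and the sample documents are hypothetical, and the collection model is assumed to cover every word used.

```python
import math

def dirichlet_smoothed_model(doc_tokens, collection_model, mu=10.0):
    """p(w|D) = (c(w, D) + mu * p(w|C)) / (|D| + mu) for every word in the vocabulary."""
    counts = {}
    for token in doc_tokens:
        counts[token] = counts.get(token, 0) + 1
    length = len(doc_tokens)
    return {w: (counts.get(w, 0) + mu * p_c) / (length + mu)
            for w, p_c in collection_model.items()}

def kl_score(expanded_query_model, document_model):
    """Negative KL divergence; a higher (less negative) score means more similar."""
    score = 0.0
    for w, p_q in expanded_query_model.items():
        if p_q > 0.0:
            score -= p_q * math.log(p_q / document_model[w])
    return score

# Hypothetical collection model and expanded query model over a tiny vocabulary.
collection_model = {"Zhou Jielun": 0.1, "new": 0.3, "film": 0.2, "album": 0.1, "weather": 0.3}
expanded_query = {"Zhou Jielun": 0.5, "new": 0.2, "film": 0.3}

doc_a = dirichlet_smoothed_model(["Zhou Jielun", "new", "film", "film"], collection_model)
doc_b = dirichlet_smoothed_model(["weather", "weather", "new"], collection_model)

score_a = kl_score(expanded_query, doc_a)
score_b = kl_score(expanded_query, doc_b)
```

Documents whose score reaches a preset threshold would then be returned as the target retrieval result; smoothing keeps the logarithm finite for query words a document does not contain.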
The present invention is further described below with reference to an embodiment:
1) In the preprocessing stage, an entity recognition tool is applied to every microblog document in the
microblog stream to identify all the entities it contains. For example, for the microblog document "The
new film of Zhou Jielun is really well made", the entity "Zhou Jielun" is identified, and the number
(id) of this microblog is stored in the corresponding entity item of the entity index. For the target
entity, the M most recently added microblog documents are obtained from the entity index as the
microblog corpus.
2) First, for the target entity "Zhou Jielun", the Freebase search interface is used to try to link it
to an object in the Freebase knowledge base and to obtain the target domains it belongs to, namely film,
music, TV, people, media and awards.
A mixture model is built, comprising the initial entity topic model corresponding to "Zhou Jielun", the
background language model and six domain language models.
The M microblog documents are traversed with the domain language models, the background language model
and the initial entity model corresponding to the target entity, and N iterations are performed, to
obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are both positive integers.
3) Maximum likelihood modeling is performed on the query statement and on each microblog document. For
example, the query statement "Zhou Jielun new film" is segmented into ["Zhou Jielun", "new", "film"],
and the original query model is created by maximum likelihood estimation: p(Zhou Jielun) = 0.33,
p(new) = 0.33, p(film) = 0.33. The microblog document language model is established from each microblog
document; the maximum likelihood modeling of each microblog document is similar to that of the original
query model.
The target entity in the query statement is identified. For example, for the query statement "Zhou
Jielun new film", the identified target entity is "Zhou Jielun".
The original query model is expanded with the target entity topic model of "Zhou Jielun" to obtain the
expanded query model, and the initial interpolation parameter is calculated:
The original query model is expanded according to the linear interpolation formula above. Because the
query statement "Zhou Jielun new film" contains only one target entity, "Zhou Jielun", the target entity
topic model of this target entity can be used directly for the expansion.
The similarity between the expanded query model and the microblog document language model is calculated
with the KL divergence formula. The microblog document language model uses the maximum likelihood
estimate of the microblog document, with Dirichlet smoothing applied.
The target retrieval result of the query statement is determined according to the similarity.
The technical solution of the present invention has been described in detail above with reference to the
accompanying drawings. It enables a user to accurately retrieve the target retrieval result from
microblog documents, thereby improving retrieval accuracy, and can also effectively enhance the
robustness of retrieval.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit
the present invention. For those skilled in the art, the present invention may have various
modifications and variations. Any modification, equivalent replacement or improvement made within the
spirit and principle of the present invention shall fall within the protection scope of the present
invention.