Disclosure of Invention
In view of the above problems, the present invention provides a new technical solution that addresses the technical problem that a user cannot accurately retrieve a target retrieval result from microblog documents.
In view of the above, an aspect of the present invention provides a retrieval method, including: when a query statement for searching microblog documents in a microblog corpus set is received, creating an original query model corresponding to the query statement according to the query statement; identifying a target entity in the query statement; expanding the original query model according to a target entity topic model corresponding to the target entity, the original query model and a microblog document language model established from each microblog document in the microblog corpus set, so as to obtain an extended query model; and computing the similarity between the extended query model and the microblog document language model, so as to determine a target retrieval result of the query statement according to the similarity.
In this technical solution, a query statement used to search the microblog documents in the microblog corpus set includes an alias of the target entity, so identifying the target entity in the query statement effectively improves the retrieval effect. In addition, the extended query model is obtained by expanding the original query model corresponding to the query statement, so that retrieval with the extended query model returns a large number of microblog documents related to the query statement, that is, information of interest to the user. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined by computing the similarity between the extended query model and the microblog document language model corresponding to each microblog document, which makes the target retrieval result more accurate and also improves retrieval robustness. With this technical solution, the user can therefore accurately retrieve the target retrieval result from the microblog documents, which improves retrieval accuracy. The target entity is a keyword in the query statement; for example, the target entity in the query statement "Zhou Jielun new movie" is "Zhou Jielun".
In the above technical solution, preferably, the similarity between the extended query model and the microblog document language model is computed by the following formula, and a target microblog document whose similarity is greater than or equal to a preset similarity is taken as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}\left(\theta_{Q'} \,\|\, \theta_D\right) = -\sum_{w \in V} P(w \mid \theta_{Q'}) \log \frac{P(w \mid \theta_{Q'})}{P(w \mid \theta_D)}$$
wherein $\mathrm{Score}(Q, D)$ represents the similarity, $V$ represents all entities in the microblog document language model, $\theta_{Q'}$ represents the extended query model, $\theta_D$ represents the microblog document language model, $P(w \mid \theta_{Q'})$ represents the probability of the target entity $w$ in the extended query model, and $P(w \mid \theta_D)$ represents the probability of the target entity in the microblog document language model.
In this technical solution, the extended query model retrieves a large number of microblog documents, but these may include much information that the user cares little about, or the information may not be ordered by any priority, so that information of little interest may be ranked ahead of information of great interest. Computing the similarity between the extended query model and the microblog document language model, and determining the target retrieval result according to that similarity, improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The formula computes the KL distance (also called relative entropy). "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new movie too good", all entities in that microblog document are "Zhou Jielun", "new", "movie" and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
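As an illustration of the similarity computation described above, the following is a minimal Python sketch that scores documents by negative KL divergence and keeps those at or above a preset threshold. The function names, the additive smoothing constant `mu` and the threshold value are assumptions made for this example and are not taken from the specification.

```python
import math
from collections import Counter


def language_model(tokens, vocabulary, mu=0.01):
    """Additively smoothed unigram model over a fixed vocabulary.

    The smoothing scheme is an assumption; it only keeps every
    probability strictly positive so the logarithm is defined.
    """
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocabulary)
    return {w: (counts[w] + mu) / total for w in vocabulary}


def kl_score(query_model, doc_model):
    """Negative KL divergence KL(query || document); higher is more similar."""
    score = 0.0
    for w, p_q in query_model.items():
        if p_q > 0.0:
            score -= p_q * math.log(p_q / doc_model[w])
    return score


def retrieve(query_model, doc_models, threshold):
    """Return (doc_id, score) pairs with score >= threshold, best first."""
    scored = [(doc_id, kl_score(query_model, m)) for doc_id, m in doc_models.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

A document whose language model equals the query model receives the maximum score of zero, and the score decreases as the two distributions diverge.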
In the above technical solution, preferably, the extended query model is obtained by calculation according to the following formula:
wherein $\theta_{Q'}$ represents the extended query model, $\theta_Q$ represents the original query model, $\theta_E$ represents the target entity topic model, $P(w \mid \theta_{Q'})$ represents the probability of the target entity $w$ in the extended query model, $P(w \mid \theta_Q)$ represents the probability of the target entity in the original query model, $P(w \mid \theta_E)$ represents the probability of the target entity in the target entity topic model, and $\alpha$ represents the initial interpolation parameter.
In this technical solution, the original query model yields few retrieval results, which may not even contain the information the user needs. The original query model is therefore expanded to obtain the extended query model, so that retrieval with the extended query model returns a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves the retrieval effect.
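The expansion step can be sketched as a linear interpolation of two word distributions. The sketch below assumes that the interpolation parameter weights the entity topic model and that its complement weights the original query model; the specification names an interpolation parameter α but the assignment of weights here is an assumption, as is the function name.

```python
def expand_query_model(original_model, entity_topic_model, alpha):
    """Interpolate the original query model with an entity topic model.

    Assumption: alpha weights the entity topic model and (1 - alpha)
    weights the original query model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocabulary = set(original_model) | set(entity_topic_model)
    return {
        w: (1.0 - alpha) * original_model.get(w, 0.0)
        + alpha * entity_topic_model.get(w, 0.0)
        for w in vocabulary
    }
```

Because both inputs are probability distributions, the interpolated model is also a distribution, and words that occur only in the entity topic model (for example related albums or concerts) gain non-zero weight in the query.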
In the above technical solution, preferably, according to the received update command, α is updated according to the following formula to obtain α':
wherein $w$ represents the target entity, $E$ represents all entities in the target entity model, $Q$ represents all entities in the query statement, $w_1$ represents any entity in the query statement, $IDF(w)$ represents the inverse document frequency of the target entity in the microblog corpus set, and $IDF(w_1)$ represents the inverse document frequency of that entity in the microblog corpus set.
In this technical solution, the same target entity differs in importance across different query statements, and the initial interpolation parameter α is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter α is therefore updated into an adaptive interpolation parameter α', and the extended query model is determined according to the updated α', which makes the extended query model more accurate. "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new movie too good", all entities in that microblog document are "Zhou Jielun", "new", "movie" and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
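The adaptive update can be illustrated with a short Python sketch. The exact update rule is not recoverable from the text above, so the form used here is an assumption: α is scaled by the share of the query's total IDF mass that belongs to entities also present in the target entity model. The function name and the zero-denominator fallback are likewise hypothetical.

```python
def adapt_alpha(alpha, query_terms, entity_model_terms, idf):
    """Scale alpha by the IDF share of query terms found in the entity model.

    Assumed form (not confirmed by the specification):
        alpha' = alpha * sum(IDF(w) for w in E & Q) / sum(IDF(w1) for w1 in Q)
    """
    query = set(query_terms)
    denominator = sum(idf[w] for w in query)
    if denominator == 0:
        return alpha  # fallback chosen for this sketch only
    shared = query & set(entity_model_terms)
    return alpha * sum(idf[w] for w in shared) / denominator
```

Under this assumed form, a query dominated by a rare, informative target entity keeps most of the initial α, while a query whose other words carry most of the IDF mass relies less on the entity topic model.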
In the foregoing technical solution, preferably, when there are a plurality of target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus set and the target entity topic model of each target entity, so that the extended query model is created using the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when a query statement contains a plurality of target entities, a final entity topic model is determined according to the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus set, and retrieval is performed with the extended query model obtained from the final entity topic model. The target retrieval result obtained is therefore more accurate: it includes the microblog documents related to each of the plurality of target entities, so it consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, according to the received first creation command, the final entity topic model is determined by the following formula:
wherein $\theta_E$ represents the final entity topic model, $P(w \mid \theta_E)$ represents the probability of each target entity in the final entity topic model, $n$ represents the number of target entities, $\theta_{E_i}$ represents the target entity topic model of each target entity, $IDF(E_i)$ represents the inverse document frequency of each target entity in the microblog corpus set, $P(w \mid \theta_{E_i})$ represents the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ represents the $i$-th target entity of the plurality of target entities.
In this technical solution, when a query statement has a plurality of target entities, the final entity topic model is calculated by the formula from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus set. Because the inverse document frequency of each target entity in the microblog corpus set represents the importance of that target entity in the microblog corpus set, retrieval with the extended query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the plurality of target entities and that is determined according to the importance of each target entity in the microblog corpus set. The target retrieval result is therefore the information the user wants to retrieve, which improves the retrieval effect. The inverse document frequency (IDF) measures the importance of the target entity: the IDF of the target entity may be obtained by dividing the total number of microblog documents in the microblog corpus set by the number of microblog documents containing the target entity and taking the logarithm of the quotient. The IDF of the target entity also affects the update of the initial interpolation parameter.
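The IDF computation described above, and one way of combining several entity topic models with it, can be sketched in Python as follows. The IDF function follows the description (logarithm of total documents over documents containing the entity). The combination rule, an IDF-weighted average normalised by the sum of the IDF values, is an assumption made for illustration, and both function names are hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(documents) / containing)


def combine_topic_models(topic_models, idf_weights):
    """IDF-weighted mixture of per-entity topic models.

    Assumption: weights are normalised by their sum so that the result
    is again a probability distribution.
    """
    total = sum(idf_weights[e] for e in topic_models)
    if total <= 0:
        raise ValueError("IDF weights must sum to a positive value")
    vocabulary = set().union(*(m.keys() for m in topic_models.values()))
    return {
        w: sum(idf_weights[e] * topic_models[e].get(w, 0.0) for e in topic_models) / total
        for w in vocabulary
    }
```

With this weighting, a rare target entity (high IDF) contributes more to the final entity topic model than a target entity that appears in most microblog documents.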
In the foregoing technical solution, preferably, according to a received second creation command, the target entity topic model corresponding to the target entity is created through the following process: when the corpus set database in which the microblog corpus set is located receives the target entity, extracting M microblog documents related to the target entity from the microblog corpus set according to the target entity; searching a target domain knowledge base connected to the corpus set database, according to the target domain to which the target entity belongs, for a plurality of keywords related to the target domain, wherein the plurality of keywords include the target entity; generating a virtual document corresponding to the target domain from the plurality of keywords; establishing a domain language model from the virtual document, and establishing a background language model from all entities in each microblog document in the microblog corpus set; and traversing the M microblog documents using the domain language model, the background language model and an initial entity model corresponding to the target entity, and performing N iterations to obtain the target entity topic model, wherein M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the established domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control background noise and domain-related noise and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. Retrieval with the extended query model obtained by expansion with this target entity topic model then returns a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user, which effectively avoids missing relevant microblog documents and improves the retrieval effect. "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new movie too good", all entities in that microblog document are "Zhou Jielun", "new", "movie" and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
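The iterative estimation described above can be sketched as an expectation-maximisation loop over a three-component mixture. The specification states only that the M documents are traversed with the three models for N iterations; the use of EM, the fixed mixture weights and the probability floor below are assumptions chosen for this illustration, and the function name is hypothetical.

```python
from collections import Counter


def estimate_entity_topic_model(docs, domain_model, background_model,
                                initial_entity_model, n_iterations=10,
                                weights=(0.4, 0.3, 0.3)):
    """Estimate an entity topic model from M tokenised documents.

    Each word is assumed to be generated by the entity, domain or
    background model; only the entity model is re-estimated.
    """
    w_entity, w_domain, w_background = weights
    counts = Counter(token for doc in docs for token in doc)
    if not counts:
        raise ValueError("at least one non-empty document is required")
    floor = 1e-9  # avoids zero probabilities for unseen words
    entity_model = dict(initial_entity_model)
    for _ in range(n_iterations):
        expected = {}
        for word, count in counts.items():
            p_e = w_entity * entity_model.get(word, floor)
            p_d = w_domain * domain_model.get(word, floor)
            p_b = w_background * background_model.get(word, floor)
            # E-step: share of this word's occurrences credited to the entity model
            expected[word] = count * p_e / (p_e + p_d + p_b)
        # M-step: renormalise the expected counts into a distribution
        total = sum(expected.values())
        entity_model = {word: value / total for word, value in expected.items()}
    return entity_model
```

Words that the background or domain model already explains well lose weight in each iteration, which is one way to realise the noise control that the paragraph above describes.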
In the above technical solution, preferably, the method further includes: after the virtual document corresponding to the target domain is generated, counting a first number of occurrences of the target entity in the virtual document corresponding to the target domain and a second number of occurrences of each of the plurality of keywords in that virtual document; determining a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and updating the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the plurality of keywords in that virtual document, and the domain language model is updated according to the domain prior value. The resulting domain language model is more accurate, that is, it better reflects each domain related to the target entity, which improves the retrieval effect.
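The counting step can be sketched in Python as below. The specification says only that the domain prior value is determined from the two occurrence counts; taking the ratio of the target entity's count to the summed keyword counts is an assumption made for this example, and the function name is hypothetical.

```python
from collections import Counter


def domain_prior(virtual_document, target_entity, keywords):
    """Ratio of the target entity's occurrences to all keyword occurrences.

    Assumption: the prior is first_count / sum(second_counts).
    """
    counts = Counter(virtual_document)
    first_count = counts[target_entity]
    second_count = sum(counts[k] for k in set(keywords))
    if second_count == 0:
        return 0.0
    return first_count / second_count
```

Because the keywords include the target entity itself, the value lies between 0 and 1 and grows as the target entity becomes more prominent in the virtual document.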
Another aspect of the present invention provides a retrieval system, including: a first model creating unit, configured to create, when a query statement for searching microblog documents in a microblog corpus set is received, an original query model corresponding to the query statement according to the query statement; an entity identification unit, configured to identify a target entity in the query statement; a model extension unit, configured to expand the original query model according to a target entity topic model corresponding to the target entity, the original query model and a microblog document language model established from each microblog document in the microblog corpus set, so as to obtain an extended query model; and a retrieval result determining unit, configured to compute the similarity between the extended query model and the microblog document language model, so as to determine a target retrieval result of the query statement according to the similarity.
In this technical solution, a query statement used to search the microblog documents in the microblog corpus set includes an alias of the target entity, so identifying the target entity in the query statement effectively improves the retrieval effect. In addition, the extended query model is obtained by expanding the original query model corresponding to the query statement, so that retrieval with the extended query model returns a large number of microblog documents related to the query statement, that is, information of interest to the user. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined by computing the similarity between the extended query model and the microblog document language model corresponding to each microblog document, which makes the target retrieval result more accurate and also improves retrieval robustness. With this technical solution, the user can therefore accurately retrieve the target retrieval result from the microblog documents, which improves retrieval accuracy. The target entity is the target keyword in the query statement that the user wants to query; for example, the target entity in the query statement "Zhou Jielun new movie" is "Zhou Jielun", while "new" and "movie" are other entities, that is, words in the ordinary sense.
In the foregoing technical solution, preferably, the retrieval result determining unit includes: a similarity computing unit, configured to compute the similarity between the extended query model and the microblog document language model by the following formula, and to take a target microblog document whose similarity is greater than or equal to a preset similarity as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\mathrm{KL}\left(\theta_{Q'} \,\|\, \theta_D\right) = -\sum_{w \in V} P(w \mid \theta_{Q'}) \log \frac{P(w \mid \theta_{Q'})}{P(w \mid \theta_D)}$$
wherein $\mathrm{Score}(Q, D)$ represents the similarity, $V$ represents all entities in the microblog document language model, $\theta_{Q'}$ represents the extended query model, $\theta_D$ represents the microblog document language model, $P(w \mid \theta_{Q'})$ represents the probability of the target entity $w$ in the extended query model, and $P(w \mid \theta_D)$ represents the probability of the target entity in the microblog document language model.
In this technical solution, the extended query model retrieves a large number of microblog documents, but these may include much information that the user cares little about, or the information may not be ordered by any priority, so that information of little interest may be ranked ahead of information of great interest. Computing the similarity between the extended query model and the microblog document language model, and determining the target retrieval result according to that similarity, improves the matching accuracy of the retrieval results and further improves the accuracy of the target retrieval result. The formula computes the KL distance (also called relative entropy). "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new movie too good", all entities in that microblog document are "Zhou Jielun", "new", "movie" and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model extension unit is specifically configured to calculate the extended query model according to the following formula:
wherein $\theta_{Q'}$ represents the extended query model, $\theta_Q$ represents the original query model, $\theta_E$ represents the target entity topic model, $P(w \mid \theta_{Q'})$ represents the probability of the target entity $w$ in the extended query model, $P(w \mid \theta_Q)$ represents the probability of the target entity in the original query model, $P(w \mid \theta_E)$ represents the probability of the target entity in the target entity topic model, and $\alpha$ represents the initial interpolation parameter.
In this technical solution, the original query model yields few retrieval results, which may not even contain the information the user needs. The original query model is therefore expanded to obtain the extended query model, so that retrieval with the extended query model returns a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user. This effectively avoids missing relevant microblog documents, makes retrieval more comprehensive, and further improves the retrieval effect.
In the above technical solution, preferably, the system further includes: a parameter updating unit, configured to update α according to the following formula, in response to a received update command, so as to obtain α':
wherein $w$ represents the target entity, $E$ represents all entities in the target entity model, $Q$ represents all entities in the query statement, $w_1$ represents any entity in the query statement, $IDF(w)$ represents the inverse document frequency of the target entity in the microblog corpus set, and $IDF(w_1)$ represents the inverse document frequency of that entity in the microblog corpus set.
In this technical solution, the same target entity differs in importance across different query statements, and the initial interpolation parameter α is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter α is therefore updated into an adaptive interpolation parameter α', and the extended query model is determined according to the updated α', which makes the extended query model more accurate. "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new movie too good", all entities in that microblog document are "Zhou Jielun", "new", "movie" and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model extension unit is further configured to: when there are a plurality of target entities, determine a final entity topic model according to the inverse document frequency of each target entity in the microblog corpus set and the target entity topic model of each target entity, and create the extended query model using the final entity topic model, the original query model and the microblog document language model.
In this technical solution, when a query statement contains a plurality of target entities, a final entity topic model is determined according to the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus set, and retrieval is performed with the extended query model obtained from the final entity topic model. The target retrieval result obtained is therefore more accurate: it includes the microblog documents related to each of the plurality of target entities, so it consists of the microblog documents the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, the model extension unit is specifically configured to determine the final entity topic model, according to a received first creation command, by the following formula:
wherein $\theta_E$ represents the final entity topic model, $P(w \mid \theta_E)$ represents the probability of each target entity in the final entity topic model, $n$ represents the number of target entities, $\theta_{E_i}$ represents the target entity topic model of each target entity, $IDF(E_i)$ represents the inverse document frequency of each target entity in the microblog corpus set, $P(w \mid \theta_{E_i})$ represents the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ represents the $i$-th target entity of the plurality of target entities.
In this technical solution, when a query statement has a plurality of target entities, the final entity topic model is calculated by the formula from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus set. Because the inverse document frequency of each target entity in the microblog corpus set represents the importance of that target entity in the microblog corpus set, retrieval with the extended query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the plurality of target entities and that is determined according to the importance of each target entity in the microblog corpus set. The target retrieval result is therefore the information the user wants to retrieve, which improves the retrieval effect. The inverse document frequency (IDF) measures the importance of the target entity: the IDF of the target entity may be obtained by dividing the total number of microblog documents in the microblog corpus set by the number of microblog documents containing the target entity and taking the logarithm of the quotient. The IDF of the target entity also affects the update of the initial interpolation parameter.
In the above technical solution, preferably, the system further includes: a second model creating unit, configured to create, according to a received second creation command, the target entity topic model corresponding to the target entity through the following process: when the corpus set database in which the microblog corpus set is located receives the target entity, extracting M microblog documents related to the target entity from the microblog corpus set according to the target entity; searching a target domain knowledge base connected to the corpus set database, according to the target domain to which the target entity belongs, for a plurality of keywords related to the target domain, wherein the plurality of keywords include the target entity; generating a virtual document corresponding to the target domain from the plurality of keywords; establishing a domain language model from the virtual document, and establishing a background language model from all entities in each microblog document in the microblog corpus set; and traversing the M microblog documents using the domain language model, the background language model and an initial entity model corresponding to the target entity, and performing N iterations to obtain the target entity topic model, wherein M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the established domain language model, the background language model and the initial entity model corresponding to the target entity make it possible to control background noise and domain-related noise and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. Retrieval with the extended query model obtained by expansion with this target entity topic model then returns a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user, which effectively avoids missing relevant microblog documents and improves the retrieval effect. "All entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is "Zhou Jielun new movie too good", all entities in that microblog document are "Zhou Jielun", "new", "movie" and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the second model creating unit further includes: a counting unit, configured to count, after the virtual document corresponding to the target domain is generated, a first number of occurrences of the target entity in the virtual document corresponding to the target domain and a second number of occurrences of each of the plurality of keywords in that virtual document; a prior value determining unit, configured to determine a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and a domain model updating unit, configured to update the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the plurality of keywords in that virtual document, and the domain language model is updated according to the domain prior value. The resulting domain language model is more accurate, that is, it better reflects each domain related to the target entity, which improves the retrieval effect.
By the technical scheme, the user can accurately retrieve the microblog document to obtain the target retrieval result, so that the retrieval efficiency and the retrieval accuracy are improved, and the retrieval robustness can be enhanced.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Fig. 1 shows a schematic flow diagram of a retrieval method according to an embodiment of the invention.
As shown in Fig. 1, a retrieval method according to an embodiment of the present invention includes: step 102, when a query statement for searching microblog documents in a microblog corpus set is received, creating an original query model corresponding to the query statement according to the query statement; step 104, identifying a target entity in the query statement; step 106, expanding the original query model according to a target entity topic model corresponding to the target entity, the original query model and a microblog document language model established from each microblog document in the microblog corpus set, so as to obtain an extended query model; and step 108, computing the similarity between the extended query model and the microblog document language model, so as to determine a target retrieval result of the query statement according to the similarity.
In this technical solution, a query statement used to search the microblog documents in the microblog corpus set includes an alias of the target entity, so identifying the target entity in the query statement effectively improves the retrieval effect. In addition, the extended query model is obtained by expanding the original query model corresponding to the query statement, so that retrieval with the extended query model returns a large number of microblog documents related to the query statement, that is, information of interest to the user. This effectively avoids missing relevant microblog documents and makes retrieval more comprehensive. The target retrieval result is then determined by computing the similarity between the extended query model and the microblog document language model corresponding to each microblog document, which makes the target retrieval result more accurate and also improves retrieval robustness. With this technical solution, the user can therefore accurately retrieve the target retrieval result from the microblog documents, which improves retrieval accuracy. The target entity is a keyword in the query statement; for example, the target entity in the query statement "Zhou Jielun new movie" is "Zhou Jielun".
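Steps 102 to 108 can be tied together in a compact Python sketch. Everything named here is hypothetical: the alias table `ALIASES`, the precomputed `ENTITY_TOPICS`, the smoothing constant and the fixed `alpha` are assumptions for illustration, and only the first recognised entity is used for expansion.

```python
import math
from collections import Counter

ALIASES = {"jay_chou": "zhou_jielun"}  # hypothetical alias table
ENTITY_TOPICS = {  # hypothetical precomputed entity topic models
    "zhou_jielun": {"zhou_jielun": 0.5, "album": 0.3, "concert": 0.2},
}


def unigram(tokens, vocabulary, eps=0.01):
    """Additively smoothed unigram model (smoothing is an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocabulary)
    return {w: (counts[w] + eps) / total for w in vocabulary}


def search(query_tokens, corpus, alpha=0.5):
    """Rank document ids in `corpus` (id -> token list) for a query."""
    vocabulary = set(query_tokens) | {t for doc in corpus.values() for t in doc}
    for topic in ENTITY_TOPICS.values():
        vocabulary |= set(topic)
    # Step 102: original query model from the query statement
    original = unigram(query_tokens, vocabulary)
    # Step 104: identify target entities, resolving aliases
    entities = [ALIASES.get(t, t) for t in query_tokens
                if ALIASES.get(t, t) in ENTITY_TOPICS]
    # Step 106: expand the query model with the entity topic model
    expanded = dict(original)
    if entities:
        topic = ENTITY_TOPICS[entities[0]]
        expanded = {w: (1 - alpha) * original[w] + alpha * topic.get(w, 0.0)
                    for w in vocabulary}
    # Step 108: score each document by negative KL divergence
    scores = {}
    for doc_id, tokens in corpus.items():
        doc_model = unigram(tokens, vocabulary)
        scores[doc_id] = -sum(p * math.log(p / doc_model[w])
                              for w, p in expanded.items() if p > 0)
    return sorted(scores, key=scores.get, reverse=True)
```

In this sketch a query that mentions the entity only by its alias still ranks documents about the entity first, because the expansion step adds the entity's topic words to the query model.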
In the above technical solution, preferably, the similarity between the expanded query model and the microblog document language model is computed by the following formula, and a target microblog document whose similarity is greater than or equal to a preset similarity is taken as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\sum_{w \in V} P(w \mid \theta_{Q'}) \log \frac{P(w \mid \theta_{Q'})}{P(w \mid \theta_D)}$$
where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all entities in the microblog document language model, $\theta_{Q'}$ denotes the expanded query model, $\theta_D$ denotes the microblog document language model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, and $P(w \mid \theta_D)$ denotes the probability of the target entity $w$ in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these may include much information that the user cares little about, or may not be ordered by any priority, so that information of little interest may be ranked ahead of information of greater interest. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to that similarity, improves the matching accuracy of the retrieval result and further improves the accuracy of the target retrieval result. The formula is a KL distance (also called relative entropy) computation. "All entities" refers to all words in each microblog document in the microblog document language model; for example, if a microblog document is "Zhou Jielun new movie is too good", all entities in that document are "Zhou Jielun", "new", "movie", and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
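The KL-distance scoring and thresholding described above can be illustrated with a minimal Python sketch. It assumes that the similarity is the negative KL divergence from the expanded query model to the document language model, that each model is a dictionary mapping words to probabilities, and that unseen words receive a small floor probability; the names `kl_score`, `retrieve`, `floor`, and `min_score` are hypothetical and are not taken from the source.

```python
import math

def kl_score(query_model, doc_model, floor=1e-12):
    """Return the negative KL divergence of doc_model from query_model.

    Words with zero query probability contribute nothing, so the sum runs
    over the query model's support. `floor` is an assumed stand-in for
    whatever smoothing the document language model would normally use.
    """
    score = 0.0
    for word, p_q in query_model.items():
        if p_q <= 0.0:
            continue
        p_d = max(doc_model.get(word, 0.0), floor)  # avoid log of zero
        score -= p_q * math.log(p_q / p_d)
    return score

def retrieve(query_model, doc_models, min_score):
    """Keep documents whose score reaches min_score, best first."""
    scored = [(doc_id, kl_score(query_model, m)) for doc_id, m in doc_models.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s >= min_score]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Toy data, for illustration only.
query = {"Zhou Jielun": 0.5, "movie": 0.5}
docs = {
    "d1": {"Zhou Jielun": 0.5, "movie": 0.5},
    "d2": {"weather": 0.9, "movie": 0.1},
}
results = retrieve(query, docs, min_score=-1.0)
```

In this toy run only the document whose language model matches the query model passes the assumed threshold; a production system would smooth the document models rather than rely on a fixed floor.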
In the above technical solution, preferably, the expanded query model is calculated according to the following formula:
$$P(w \mid \theta_{Q'}) = (1 - \alpha)\, P(w \mid \theta_Q) + \alpha\, P(w \mid \theta_E)$$
where $\theta_{Q'}$ denotes the expanded query model, $\theta_Q$ denotes the original query model, $\theta_E$ denotes the target entity topic model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, $P(w \mid \theta_Q)$ denotes the probability of the target entity $w$ in the original query model, $P(w \mid \theta_E)$ denotes the probability of the target entity $w$ in the target entity topic model, and $\alpha$ denotes the initial interpolation parameter.
In this technical solution, the original query model yields few retrieval results and may not even cover the information the user needs, so the original query model is expanded to obtain an expanded query model. Retrieval with the expanded query model can then find a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user. This effectively avoids missed microblog documents, makes retrieval more comprehensive, and further improves the retrieval effect.
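The expansion step can be sketched as a linear interpolation of two word distributions. This is a minimal illustration that assumes the interpolation parameter weights the target entity topic model and its complement weights the original query model; the function name `expand_query_model` and the toy probabilities are hypothetical.

```python
def expand_query_model(original, entity_topic, alpha):
    """Linearly interpolate the original query model with an entity topic model.

    Assumption: alpha is the weight of the entity topic model and
    (1 - alpha) is the weight of the original query model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocabulary = set(original) | set(entity_topic)
    return {
        word: (1.0 - alpha) * original.get(word, 0.0)
        + alpha * entity_topic.get(word, 0.0)
        for word in vocabulary
    }

# Toy distributions, for illustration only.
original = {"Zhou Jielun": 0.5, "movie": 0.5}
entity_topic = {"Zhou Jielun": 0.4, "concert": 0.3, "album": 0.3}
expanded = expand_query_model(original, entity_topic, alpha=0.5)
```

Because both inputs are probability distributions, the interpolated model also sums to one, and words that appear only in the entity topic model (here "concert" and "album") gain non-zero weight in the expanded query.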
In the above technical solution, preferably, according to the received update command, α is updated according to the following formula to obtain α':
where $w$ denotes the target entity, $E$ denotes all entities in the target entity model, $Q$ denotes all entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus set, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of that entity in the microblog corpus set.
In this technical solution, the same target entity has different degrees of importance in different query statements, and the initial interpolation parameter α is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter α therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated α', which makes the expanded query model more accurate. "All entities" refers to all words in each microblog document in the microblog document language model; for example, if a microblog document is "Zhou Jielun new movie is too good", all entities in that document are "Zhou Jielun", "new", "movie", and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the foregoing technical solution, preferably, when there are a plurality of target entities, a final entity topic model is determined according to the inverse document frequency of each target entity in the microblog corpus set and the target entity topic model of each target entity, so that the expanded query model is created from the final entity topic model, the original query model, and the microblog document language model.
In this technical solution, when a query statement contains a plurality of target entities, a final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus set, and retrieval is performed with the expanded query model obtained from the final entity topic model. The resulting target retrieval result is more accurate: it contains microblog documents related to each of the target entities, so it consists of the microblog documents that the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, according to the received first creation command, the final entity topic model is determined by the following formula:
where $\theta_E$ denotes the final entity topic model, $P(w \mid \theta_E)$ denotes the probability of each target entity in the final entity topic model, $n$ denotes the number of target entities, $\theta_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus set, $P(w \mid \theta_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the $i$-th target entity among the plurality of target entities.
In this technical solution, when a query statement has a plurality of target entities, the final entity topic model is calculated by the formula from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus set. Because the inverse document frequency of a target entity reflects its importance in the microblog corpus set, retrieval with the expanded query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the target entities and that reflects the importance of each target entity in the microblog corpus set. The target retrieval result is therefore the information that the user wants to retrieve, and the retrieval effect is improved. The inverse document frequency (IDF) measures the importance of a target entity: the IDF of a target entity may be obtained by dividing the total number of microblog documents in the microblog corpus set by the number of microblog documents that contain the target entity and then taking the logarithm of the quotient. The IDF of the target entity may also affect the updated interpolation parameter.
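The IDF computation described above, and one possible way of combining several target entity topic models with it, can be sketched in Python. The IDF function follows the description (logarithm of the total document count divided by the count of documents containing the entity). The combination is only an assumption: it takes an IDF-weighted average normalized by the sum of the IDF values, which may differ from the formula of this embodiment. The names `idf` and `final_entity_topic_model` and the toy corpus are hypothetical.

```python
import math

def idf(entity, documents):
    """log(total documents / documents containing the entity)."""
    containing = sum(1 for doc in documents if entity in doc)
    if containing == 0:
        raise ValueError(f"{entity!r} does not occur in the corpus")
    return math.log(len(documents) / containing)

def final_entity_topic_model(topic_models, idf_values):
    """Assumed form: IDF-weighted average of the per-entity topic models."""
    total = sum(idf_values[e] for e in topic_models)
    if total <= 0.0:
        raise ValueError("the IDF values must have a positive sum")
    vocabulary = set().union(*topic_models.values())
    return {
        word: sum(idf_values[e] * model.get(word, 0.0)
                  for e, model in topic_models.items()) / total
        for word in vocabulary
    }

# Toy corpus: each document is the set of entities it contains.
corpus = [{"A", "x"}, {"A", "B"}, {"A"}, {"y"}]
topic_models = {"A": {"x": 1.0}, "B": {"y": 1.0}}
idf_values = {entity: idf(entity, corpus) for entity in topic_models}
final_model = final_entity_topic_model(topic_models, idf_values)
```

In this toy corpus entity "B" is rarer than entity "A", so its topic model receives the larger weight in the combined model.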
In the foregoing technical solution, preferably, according to a received second creation command, the target entity topic model corresponding to the target entity is created through the following process: when the corpus set database in which the microblog corpus set is stored receives the target entity, extracting M microblog documents related to the target entity from the microblog corpus set; searching a target domain knowledge base connected to the corpus set database, according to the target domain to which the target entity belongs, for a plurality of keywords related to the target domain, where the keywords include the target entity; generating a virtual document corresponding to the target domain from the plurality of keywords; establishing a domain language model from the virtual document, and establishing a background language model from all entities in each microblog document in the microblog corpus set; and traversing the M microblog documents using the domain language model, the background language model, and an initial entity model corresponding to the target entity, and performing N iterations to obtain the target entity topic model, where M ≥ 1, N ≥ 1, and M and N are positive integers.
In this technical solution, the established domain language model, the background language model, and the initial entity model corresponding to the target entity make it possible to control background noise and domain-related noise and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is performed with the expanded query model obtained from the target entity topic model, a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user, can be retrieved. This effectively avoids missed microblog documents and improves the retrieval effect. "All entities" refers to all words in each microblog document in the microblog document language model; for example, if a microblog document is "Zhou Jielun new movie is too good", all entities in that document are "Zhou Jielun", "new", "movie", and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the method further includes: after the virtual document corresponding to the target domain is generated, counting a first number of occurrences of the target entity in the virtual document corresponding to the target domain and a second number of occurrences of each of the plurality of keywords in that virtual document; determining a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and updating the domain language model according to the domain prior value.
In this technical solution, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the plurality of keywords in that virtual document, and the domain language model is updated according to the domain prior value. The resulting domain language model is more accurate, that is, it better reflects each domain related to the target entity, and the retrieval effect is improved.
Fig. 2 shows a schematic flow diagram of a retrieval method according to another embodiment of the invention.
As shown in fig. 2, a retrieval method according to another embodiment of the present invention includes:
Step 202, acquiring all microblog documents in the microblog stream.
Step 204, establishing a microblog document language model from each microblog document, and proceeding to step 218.
Step 206, acquiring a microblog corpus set from the microblog stream, where the microblog corpus set includes microblog documents.
Step 208, identifying all entities in the microblog documents, for example by using the entity recognition tool twitter nlp, and establishing an entity index for each entity, where each entity corresponds to a list of microblog documents sorted in chronological order.
Step 210, identifying a target entity in the query statement.
Step 212, estimating the target entity topic model of the target entity, and proceeding to step 216.
Step 214, when a query statement for retrieving microblog documents in the microblog corpus set is received, creating an original query model corresponding to the query statement by maximum likelihood estimation.
Step 216, expanding the original query model according to the target entity topic model corresponding to the target entity, the original query model, and the microblog document language model established from each microblog document in the microblog document set, to obtain an expanded query model.
Step 218, performing a KL distance calculation on the expanded query model and the microblog document language model established from each microblog document in the microblog document set, that is, computing the similarity between the expanded query model and the microblog document language model.
Step 220, determining a target retrieval result of the query statement according to the similarity.
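The maximum likelihood estimation used to create the original query model can be illustrated as follows. This is a minimal sketch that assumes the query statement has already been split into entities (words); under maximum likelihood estimation the probability of a word is its count divided by the total number of words. The function name `mle_language_model` is hypothetical.

```python
from collections import Counter

def mle_language_model(tokens):
    """Maximum likelihood unigram model: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot estimate a model from an empty token list")
    return {word: count / total for word, count in counts.items()}

# The example query from the description, already segmented (assumed).
original_query_model = mle_language_model(["Zhou Jielun", "new", "movie"])
```

For this three-word query each word receives probability one third, which illustrates why an unexpanded query model covers so little vocabulary.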
FIG. 3 is a schematic flow chart illustrating a preliminary microblog document acquisition according to one embodiment of the invention.
As shown in Fig. 3, the preliminary acquisition of microblog documents according to one embodiment of the invention includes:
Step 302, identifying all entities in the microblog corpus set.
Step 304, establishing an entity index for each entity, where each entity corresponds to a list of microblog documents sorted in chronological order.
Step 306, searching the entity index for M microblog documents related to the target entity, where the M microblog documents are the most recently published microblog documents for the target entity in the entity index.
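The entity index and the lookup of the M latest documents can be sketched as a dictionary of chronologically sorted posting lists. This is a minimal illustration under assumed data shapes: each document is a tuple of identifier, timestamp, and recognized entities, with entity recognition taken as already done; the names `build_entity_index` and `latest_documents` are hypothetical.

```python
from collections import defaultdict

def build_entity_index(documents):
    """Map each entity to its documents, sorted oldest first.

    documents: iterable of (doc_id, timestamp, entities) tuples.
    """
    index = defaultdict(list)
    for doc_id, timestamp, entities in documents:
        for entity in set(entities):  # index each entity once per document
            index[entity].append((timestamp, doc_id))
    for postings in index.values():
        postings.sort()  # chronological order
    return index

def latest_documents(index, target_entity, m):
    """Return up to m most recent document ids for the entity, newest first."""
    if m < 1:
        raise ValueError("m must be at least 1")
    postings = index.get(target_entity, [])
    return [doc_id for _, doc_id in postings[-m:]][::-1]

# Toy corpus, for illustration only.
corpus = [
    ("d1", 1, ["Zhou Jielun", "movie"]),
    ("d2", 3, ["Zhou Jielun"]),
    ("d3", 2, ["Zhou Jielun", "album"]),
]
index = build_entity_index(corpus)
recent = latest_documents(index, "Zhou Jielun", 2)
```

Keeping the posting lists sorted by time means that the M most recent documents are simply the tail of the list, so no per-query sorting is needed.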
FIG. 4 illustrates a flow diagram for determining a target entity topic model in accordance with one embodiment of the present invention; FIG. 5 illustrates a schematic diagram of a target entity topic model in accordance with one embodiment of the present invention.
The technical solution of the invention is explained in detail below with reference to Figs. 4 and 5.
As shown in Fig. 4, determining a target entity topic model according to one embodiment of the invention includes:
Step 402, identifying a target entity in the query statement.
Step 404, searching a target domain knowledge base connected to the corpus set database, according to the target domain to which the target entity belongs, for a plurality of keywords related to the target domain, where the keywords include the target entity.
Step 406, generating a virtual document corresponding to the target domain from the plurality of keywords; establishing a domain language model from the virtual document; establishing a background language model from all entities in each microblog document in the microblog corpus set; and establishing an initial entity model corresponding to the target entity, so that a mixture model is built from the domain language model, the background language model, and the initial entity model, as shown in Fig. 5. The target entity model of the target entity is inferred through the construction of this mixture model. In Fig. 5, $\lambda_C$ and $\lambda_E$ are preset parameters, $\gamma_1$ and $\gamma_k$ denote the weights of the 1st and the k-th domain language models, and EF denotes the M microblog documents of Fig. 3; Fig. 5 also shows the initial entity model, the background language model, and the k domain language models.
Step 408 (the same as step 306), searching the entity index for M microblog documents related to the target entity, that is, extracting M microblog documents related to the target entity from the microblog corpus set.
Step 410, traversing the M microblog documents with the EM algorithm to iteratively compute the model parameters, where EM stands for Expectation Maximization.
Step 412, iteratively computing the mixture model according to the iteratively computed model parameters to obtain the target entity topic model, where the number of iterations is a preset number N. In the first iteration, the initial entity model corresponding to the target entity may be taken as approximately equal to the background language model. M ≥ 1, N ≥ 1, and M and N are positive integers.
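The EM estimation of steps 406 to 412 can be sketched in Python under explicit assumptions, since the exact mixture of Fig. 5 is not reproduced in the text. The sketch assumes that each word of the M documents is generated by the entity topic model with weight `lambda_e`, by a `gammas`-weighted combination of fixed domain language models with weight `lambda_c`, or by the fixed background language model with the remaining weight, and that only the entity topic model is re-estimated. All function and parameter names, default values, and toy data are hypothetical.

```python
from collections import Counter

def estimate_entity_topic_model(docs, background, domains, gammas,
                                lambda_e=0.5, lambda_c=0.2, n_iter=20):
    """Estimate an entity topic model by EM over an assumed mixture model."""
    lambda_b = 1.0 - lambda_e - lambda_c  # assumed weight of the background model
    if lambda_e <= 0.0 or lambda_c < 0.0 or lambda_b < 0.0:
        raise ValueError("mixture weights must be non-negative, lambda_e positive")
    counts = Counter(word for doc in docs for word in doc)
    tiny = 1e-9  # floor for words missing from the background model

    # First iteration: entity model set approximately equal to the background.
    theta_e = {w: background.get(w, tiny) for w in counts}
    norm = sum(theta_e.values())
    theta_e = {w: p / norm for w, p in theta_e.items()}

    for _ in range(n_iter):
        expected = {}
        for word, count in counts.items():
            p_domain = sum(g * d.get(word, 0.0) for g, d in zip(gammas, domains))
            p_entity = lambda_e * theta_e[word]
            mixture = (p_entity + lambda_c * p_domain
                       + lambda_b * background.get(word, tiny))
            # E-step: expected number of occurrences drawn from the entity model.
            expected[word] = count * p_entity / mixture
        # M-step: renormalize the expected counts into a distribution.
        total = sum(expected.values())
        theta_e = {w: v / total for w, v in expected.items()}
    return theta_e

# Toy data, for illustration only.
docs = [["Zhou Jielun", "concert", "the"],
        ["Zhou Jielun", "movie", "the"],
        ["concert", "the", "the"]]
background = {"the": 0.7, "movie": 0.1, "concert": 0.05,
              "Zhou Jielun": 0.05, "weather": 0.1}
domains = [{"movie": 0.6, "film": 0.4}]
theta = estimate_entity_topic_model(docs, background, domains, gammas=[1.0])
```

In this toy run the common word "the" and the domain word "movie" end up with less probability than their raw frequency in the documents, while "Zhou Jielun" gains probability, which is the noise-control effect the mixture is meant to provide.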
FIG. 6 illustrates a flow diagram for determining an extended query model and a target search result according to one embodiment of the invention.
As shown in FIG. 6, determining an extended query model and a target search result according to one embodiment of the invention includes:
Step 602, identifying a target entity in the query statement.
Step 604, establishing a target entity topic model corresponding to the target entity, and proceeding to step 610.
Step 606, updating the initial interpolation parameter α to obtain α', and proceeding to step 610.
Step 608, creating an original query model corresponding to the query statement, and proceeding to step 610.
Step 610, linearly combining the target entity topic model and the original query model using the updated interpolation parameter α' to determine the expanded query model.
Step 612, acquiring microblog documents from the microblog stream.
Step 614, establishing a microblog document language model from each microblog document in the microblog document set.
Step 616, performing a KL distance calculation on the expanded query model and the microblog document language model, that is, computing the similarity between the expanded query model and the microblog document language model.
Step 618, taking the target microblog documents whose similarity is greater than or equal to the preset similarity as the target retrieval result.
Fig. 7 shows a schematic structural diagram of a retrieval system according to an embodiment of the present invention.
As shown in Fig. 7, a retrieval system 700 according to one embodiment of the present invention includes a first model creating unit 702, an entity identification unit 704, a model expansion unit 706, and a retrieval result determining unit 708. The first model creating unit 702 is configured to create, when a query statement for retrieving microblog documents in a microblog corpus set is received, an original query model corresponding to the query statement. The entity identification unit 704 is configured to identify a target entity in the query statement. The model expansion unit 706 is configured to expand the original query model according to a target entity topic model corresponding to the target entity, the original query model, and a microblog document language model established from each microblog document in the microblog document set, to obtain an expanded query model. The retrieval result determining unit 708 is configured to compute the similarity between the expanded query model and the microblog document language model, so as to determine a target retrieval result of the query statement according to the similarity.
In this technical solution, a query statement used to retrieve microblog documents in the microblog corpus set may contain an alias of the target entity, so identifying the target entity in the query statement effectively improves the retrieval effect. In addition, because the expanded query model is obtained by expanding the original query model corresponding to the query statement, retrieval with the expanded query model can find a large number of microblog documents related to the query statement, that is, information of interest to the user. This effectively avoids missed microblog documents and makes retrieval more comprehensive. The target retrieval result is determined by computing the similarity between the expanded query model and the microblog document language model corresponding to each microblog document, which makes the target retrieval result more accurate and also improves the robustness of retrieval. With this technical solution, the user can therefore accurately retrieve the target retrieval result from the microblog documents, which improves accuracy. The target entity is the target keyword in the query statement that the user wants to query; for example, the target entity in the query statement "Zhou Jielun new movie" is "Zhou Jielun", while "new" and "movie" are other entities, that is, words in the ordinary sense.
In the above technical solution, preferably, the retrieval result determining unit 708 includes a similarity calculation unit 7082, configured to compute the similarity between the expanded query model and the microblog document language model by the following formula, and to take a target microblog document whose similarity is greater than or equal to a preset similarity as the target retrieval result:
$$\mathrm{Score}(Q, D) = -\sum_{w \in V} P(w \mid \theta_{Q'}) \log \frac{P(w \mid \theta_{Q'})}{P(w \mid \theta_D)}$$
where $\mathrm{Score}(Q, D)$ denotes the similarity, $V$ denotes all entities in the microblog document language model, $\theta_{Q'}$ denotes the expanded query model, $\theta_D$ denotes the microblog document language model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, and $P(w \mid \theta_D)$ denotes the probability of the target entity $w$ in the microblog document language model.
In this technical solution, the expanded query model can retrieve a large number of microblog documents, but these may include much information that the user cares little about, or may not be ordered by any priority, so that information of little interest may be ranked ahead of information of greater interest. Computing the similarity between the expanded query model and the microblog document language model, and determining the target retrieval result according to that similarity, improves the matching accuracy of the retrieval result and further improves the accuracy of the target retrieval result. The formula is a KL distance (also called relative entropy) computation. "All entities" refers to all words in each microblog document in the microblog document language model; for example, if a microblog document is "Zhou Jielun new movie is too good", all entities in that document are "Zhou Jielun", "new", "movie", and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the foregoing technical solution, preferably, the model expansion unit 706 is specifically configured to calculate the expanded query model according to the following formula:
$$P(w \mid \theta_{Q'}) = (1 - \alpha)\, P(w \mid \theta_Q) + \alpha\, P(w \mid \theta_E)$$
where $\theta_{Q'}$ denotes the expanded query model, $\theta_Q$ denotes the original query model, $\theta_E$ denotes the target entity topic model, $P(w \mid \theta_{Q'})$ denotes the probability of the target entity $w$ in the expanded query model, $P(w \mid \theta_Q)$ denotes the probability of the target entity $w$ in the original query model, $P(w \mid \theta_E)$ denotes the probability of the target entity $w$ in the target entity topic model, and $\alpha$ denotes the initial interpolation parameter.
In this technical solution, the original query model yields few retrieval results and may not even cover the information the user needs, so the original query model is expanded to obtain an expanded query model. Retrieval with the expanded query model can then find a large number of microblog documents related to the query statement, that is, microblog documents containing information of interest to the user. This effectively avoids missed microblog documents, makes retrieval more comprehensive, and further improves the retrieval effect.
In the above technical solution, preferably, the system further includes a parameter updating unit 710, configured to update α according to the following formula, in response to a received update command, to obtain α':
where $w$ denotes the target entity, $E$ denotes all entities in the target entity model, $Q$ denotes all entities in the query statement, $w_1$ denotes any entity in the query statement, $\mathrm{IDF}(w)$ denotes the inverse document frequency of the target entity in the microblog corpus set, and $\mathrm{IDF}(w_1)$ denotes the inverse document frequency of that entity in the microblog corpus set.
In this technical solution, the same target entity has different degrees of importance in different query statements, and the initial interpolation parameter α is related to the target entity model corresponding to the target entity. When different query statements are retrieved, the initial interpolation parameter α therefore needs to be updated so that it becomes an adaptive interpolation parameter, and the expanded query model is determined according to the updated α', which makes the expanded query model more accurate. "All entities" refers to all words in each microblog document in the microblog document language model; for example, if a microblog document is "Zhou Jielun new movie is too good", all entities in that document are "Zhou Jielun", "new", "movie", and "too good". In short, entities are words in the ordinary sense, and the target entity is the keyword that the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the model expansion unit 706 is further configured to: when there are a plurality of target entities, determine a final entity topic model according to the inverse document frequency of each target entity in the microblog corpus set and the target entity topic model of each target entity, and create the expanded query model from the final entity topic model, the original query model, and the microblog document language model.
In this technical solution, when a query statement contains a plurality of target entities, a final entity topic model is determined from the target entity topic model of each target entity and the inverse document frequency of each target entity in the microblog corpus set, and retrieval is performed with the expanded query model obtained from the final entity topic model. The resulting target retrieval result is more accurate: it contains microblog documents related to each of the target entities, so it consists of the microblog documents that the user wants to retrieve, which improves the user experience.
In the above technical solution, preferably, the model expansion unit 706 is specifically configured to determine the final entity topic model, according to a received first creation command, by the following formula:
where $\theta_E$ denotes the final entity topic model, $P(w \mid \theta_E)$ denotes the probability of each target entity in the final entity topic model, $n$ denotes the number of target entities, $\theta_{E_i}$ denotes the target entity topic model of each target entity, $\mathrm{IDF}(E_i)$ denotes the inverse document frequency of each target entity in the microblog corpus set, $P(w \mid \theta_{E_i})$ denotes the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ denotes the $i$-th target entity among the plurality of target entities.
In this technical solution, when a query statement has a plurality of target entities, the final entity topic model is calculated by the formula from the target entity topic model corresponding to each target entity and the inverse document frequency of each target entity in the microblog corpus set. Because the inverse document frequency of a target entity reflects its importance in the microblog corpus set, retrieval with the expanded query model obtained from the final entity topic model yields a target retrieval result that contains microblog documents related to each of the target entities and that reflects the importance of each target entity in the microblog corpus set. The target retrieval result is therefore the information that the user wants to retrieve, and the retrieval effect is improved. The inverse document frequency (IDF) measures the importance of a target entity: the IDF of a target entity may be obtained by dividing the total number of microblog documents in the microblog corpus set by the number of microblog documents that contain the target entity and then taking the logarithm of the quotient. The IDF of the target entity may also affect the updated interpolation parameter.
In the above technical solution, preferably, the method further includes: a second model creating unit 712, configured to create, according to the received second creation command, a target entity topic model corresponding to the target entity through the following processes: when a corpus set database where the microblog corpus set is located receives the target entity, extracting M microblog documents related to the target entity from the microblog corpus set according to the target entity, searching a target domain knowledge base connected with the corpus set database for a plurality of keywords related to the target domain according to the target domain to which the target entity belongs, wherein the keywords comprise the target entity, generating a virtual document corresponding to the target domain according to the keywords, establishing a domain language model according to the virtual document, establishing a background language model according to all entities in each microblog document in the microblog corpus set, and traversing the M microblog documents by using the domain language model, the background language model and an initial entity model corresponding to the target entity, and carrying out N times of iterative operation to obtain the target entity topic model, wherein M is more than or equal to 1, N is more than or equal to 1, and M and N are positive integers.
In this technical scheme, the established domain language model, background language model and initial entity model corresponding to the target entity make it possible to control background noise and domain-related noise and to purify the microblog documents, so that the target entity topic model of the target entity is determined accurately. When retrieval is performed with the expanded query model obtained by expanding with the target entity topic model, a large number of microblog documents related to the query statement, that is, documents containing information of interest to the user, can be retrieved; missed detection of microblog documents is effectively avoided and the retrieval effect is improved. Here, "all entities" refers to all words in each microblog document in the microblog document language model. For example, if a microblog document in the microblog document language model is 'Zhou Jielun new movie Tai', all entities in that microblog document are "Zhou Jielun", "New", "movie" and "Taikung". In short, entities are words in the ordinary sense, whereas target entities are the keywords the user wants to query, such as "Zhou Jielun".
In the above technical solution, preferably, the second model creating unit 712 further includes: a number counting unit 7122, configured to count, after the virtual document corresponding to the target domain is generated, a first number of occurrences of the target entity in the virtual document and a second number of occurrences of each of the plurality of keywords in the virtual document; a prior value determining unit 7124, configured to determine a domain prior value of the target entity according to the first number of occurrences and the second number of occurrences; and a domain model updating unit 7126, configured to update the domain language model according to the domain prior value.
According to this technical scheme, the domain prior value of the target entity is determined by counting the first number of occurrences of the target entity in the virtual document corresponding to the target domain and the second number of occurrences of each of the plurality of keywords in that virtual document, and the domain language model is updated according to the domain prior value. The resulting domain language model is more accurate, that is, it reflects each domain related to the target entity, and the retrieval effect is improved.
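As a minimal sketch of this counting step, assuming the virtual document is simply the list of keywords gathered for one target domain, the domain prior value can be computed as a relative frequency (the maximum likelihood estimate). The function name `domain_prior` and the sample keywords are hypothetical.

```python
from collections import Counter


def domain_prior(virtual_document):
    """Relative frequency of every keyword in the virtual document of one domain."""
    counts = Counter(virtual_document)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the virtual document is empty")
    return {word: count / total for word, count in counts.items()}


# Hypothetical virtual document for a music domain; it includes the target entity.
virtual_doc = ["Zhou Jielun", "album", "singer", "album", "concert"]
prior = domain_prior(virtual_doc)
```

Here the first number of occurrences (of the target entity) is 1 and the second numbers of occurrences sum to 5, so the domain prior value of "Zhou Jielun" is 1/5.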
Fig. 8 shows a schematic structural diagram of a retrieval system according to another embodiment of the present invention.
As shown in fig. 8, a retrieval system 800 according to another embodiment of the present invention (equivalent to the retrieval system 700 of the embodiment shown in fig. 7) includes: an entity microblog set acquisition module 802, configured to collect microblog documents related to a target entity; an entity topic model estimation module 804 (equivalent to the second model creating unit 712 of the embodiment shown in fig. 7), configured to estimate the target entity topic model; and an adaptive query expansion module 806 (equivalent to the model expansion unit 706 of the embodiment shown in fig. 7), configured to merge the target entity topic model into the microblog document language model.
The several modules of retrieval system 800 are described in detail below:
1. The entity microblog set acquisition module 802 is specifically configured to identify a target entity in the query statement, establish an entity index, and select microblog documents related to the target entity.
2. The entity topic model estimation module 804 includes a knowledge base linking module 8042, a prior value calculating module 8044 (equivalent to the prior value determining unit 7124 of the embodiment shown in fig. 7) and a generative model constructing module 8046. The knowledge base linking module 8042 is configured to link a target entity to the Freebase knowledge base and obtain the target domain to which the target entity belongs in the Freebase knowledge base (the domains in Freebase may be regarded as the different sections of a popular newspaper, such as business, lifestyle, art, entertainment, politics and economy). The prior value calculating module 8044 is configured to obtain a plurality of keywords related to the target domain, the plurality of keywords including the target entity, generate a virtual document corresponding to the target domain from the plurality of keywords, and perform maximum likelihood estimation on the virtual document to generate a domain prior value. The generative model constructing module 8046 is configured to build an initial entity model corresponding to the target entity, a background language model and a domain language model, and to perform iterative computation over the microblog documents with the EM algorithm to obtain the target entity topic model.
3. The adaptive query expansion module 806 is configured to model the query statement to obtain an original query model, model each microblog document in the microblog document set to obtain a microblog document language model, expand the original query model with the target entity topic model to obtain an expanded query model, and calculate the KL distance between the expanded query model and the microblog document language model so as to obtain the target retrieval result from the calculation result.
The technical solution of the present invention will be further explained in detail below:
First, entity identification.
1. Identify all entities in the microblog documents with the entity recognition tool TwitterNLP.
2. Establish an entity index in which each entity corresponds to a list of microblog documents in time order.
3. Identify the target entity in the query statement, and obtain from the entity index the M most recently published microblog documents containing the target entity.
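The entity index of steps 2 and 3 can be sketched as an inverted index from entity to time-stamped document ids. This is only an illustration: the class name `EntityIndex`, its methods and the integer timestamps are assumptions, and the entities are assumed to have been extracted already (the entity recognition tool itself is not called here).

```python
from collections import defaultdict


class EntityIndex:
    """Maps each entity to the microblog documents that mention it, with timestamps."""

    def __init__(self):
        # entity -> list of (timestamp, doc_id) pairs
        self._postings = defaultdict(list)

    def add(self, doc_id, timestamp, entities):
        """Register one microblog document under every entity it contains."""
        for entity in set(entities):
            self._postings[entity].append((timestamp, doc_id))

    def latest(self, entity, m):
        """Return the ids of the M most recently published documents containing the entity."""
        postings = sorted(self._postings.get(entity, []), reverse=True)
        return [doc_id for _, doc_id in postings[:m]]


index = EntityIndex()
index.add(doc_id=1, timestamp=100, entities=["Zhou Jielun", "movie"])
index.add(doc_id=2, timestamp=200, entities=["movie"])
index.add(doc_id=3, timestamp=300, entities=["Zhou Jielun"])
index.add(doc_id=4, timestamp=50, entities=["Zhou Jielun", "concert"])
newest = index.latest("Zhou Jielun", m=2)
```

Sorting at query time keeps insertion simple; an implementation that receives documents in publication order could instead append and slice from the end of each list.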
Second, establishing the target entity topic model.
1. Link the target entity to the Freebase knowledge base (the target domain knowledge base) and read the entity information of the target entity in the Freebase knowledge base to obtain the target domain to which the target entity belongs (such as the music domain, the art domain or the book domain). In particular, if no entity information can be linked for a target entity, the target entity is considered to belong to any of the domains.
2. Calculate the domain prior value. Using the Freebase search interface, an attempt is made to link every entity in the entity index to the Freebase knowledge base, and a virtual document is formed from the attribute and type words under the different domains (that is, a plurality of keywords related to the target domain, including the target entity, are searched for in the target domain knowledge base connected to the corpus set database, and a virtual document corresponding to the target domain is generated from the plurality of keywords). Maximum likelihood estimation is performed on the virtual document using the following formula to generate the domain prior value:
$$p(w \mid d) = \frac{c(w, d)}{\sum_{w_2} c(w_2, d)}$$
wherein w represents the target entity, d represents the target domain to which the target entity belongs, $w_2$ represents each keyword of the plurality of keywords, $c(w, d)$ represents the first number of occurrences of w in the virtual document corresponding to the target domain d, $c(w_2, d)$ represents the second number of occurrences of each keyword of the plurality of keywords in the virtual document corresponding to the target domain, the sum runs over all of the keywords, and n represents the total number of the keywords.
3. Establish the target entity topic model. A domain language model is established from the virtual documents, a background language model is established from all entities in each microblog document in the microblog corpus set, and an initial entity model corresponding to the target entity is established, where the initial entity model may be similar to the background language model. The domain language model, the background language model and the initial entity model together form a hybrid model.
4. Perform model estimation with the EM algorithm. According to the hybrid model shown in fig. 5, the log-likelihood function of the returned set EF of M microblog documents can be expressed as:
wherein EF represents the M microblog documents retrieved above; i traverses all microblog documents in the microblog corpus set; w represents each entity among all entities in each microblog document in the microblog corpus set; $D_i$ represents the i-th microblog document in the microblog corpus set; k represents the number of target domains to which the target entity belongs; $c(w, D_i)$ is the number of occurrences of the word w in $D_i$; $\lambda_C$ represents a first predetermined parameter and $\lambda_E$ a second predetermined parameter, which control background noise and domain-related noise respectively; and $\gamma_d$ represents the weight of the target domain language model. The formula further involves the probability of w in the target entity model, the frequency of the word w in the background language model, and the frequency of the word w in the domain language model.
Maximum likelihood estimation is performed on the hybrid model using the EM algorithm, and the parameters are updated iteratively on the microblog corpus set EF, giving the following formulas:
wherein n represents the current iteration number; w represents the target entity; w' represents each entity among all entities of the microblog corpus set; d' represents each domain among all domains; $s^{(n)}(w)$ and $r^{(n)}(w)$ are intermediate variables introduced for ease of computation; and the updated quantities are the probability of w in the domain language model at the (n+1)-th iteration, the probability of w in the entity topic model at the (n+1)-th iteration, and the weight of the domain language model at the (n+1)-th iteration. In the summation subscripts, w and w' traverse all entities in the microblog corpus set, i traverses all microblog documents in the feedback microblog set, and d and d' traverse all domains; k represents the number of target domains to which the target entity E belongs, and λ represents a preset iteration parameter.
In addition, the update of the domain language model may use the domain prior value p(w | d) of the target entity. A conjugate prior (namely a Dirichlet prior) is defined on each unigram language model p(w | d), and all parameters are then estimated by maximum a posteriori (MAP) estimation; only a small change to the update formula of the domain language model is needed, and the MAP estimation is carried out through the following formula:
At this point, after the above formulas have been iterated for a number of rounds (for example, 100 rounds), the target entity topic model is obtained.
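To make the iteration concrete, the following Python sketch runs EM on a deliberately simplified hybrid model with only two components, an entity model and a fixed background language model. The domain language models, their weights and the MAP prior described above are omitted, so this is not the update scheme of the present solution but a reduced illustration of the same E-step/M-step pattern. The function name `estimate_entity_model`, the parameter `background_weight` and the toy data are assumptions.

```python
from collections import Counter


def estimate_entity_model(documents, background, background_weight=0.5, iterations=100):
    """EM for p(w) = (1 - background_weight) * p(w|entity) + background_weight * p(w|background).

    Only the entity model is re-estimated; the background model stays fixed.
    """
    counts = Counter(word for doc in documents for word in doc)
    vocab = list(counts)
    entity_model = {word: 1.0 / len(vocab) for word in vocab}  # uniform initial entity model
    for _ in range(iterations):
        # E-step: probability that an occurrence of each word came from the entity model.
        posterior = {}
        for word in vocab:
            from_entity = (1.0 - background_weight) * entity_model[word]
            # Words missing from the background model get a tiny floor (assumption).
            from_background = background_weight * background.get(word, 1e-9)
            posterior[word] = from_entity / (from_entity + from_background)
        # M-step: re-estimate the entity model from the expected counts.
        expected = {word: counts[word] * posterior[word] for word in vocab}
        total = sum(expected.values())
        entity_model = {word: expected[word] / total for word in vocab}
    return entity_model


# Hypothetical feedback documents and background model.
feedback_docs = [["the", "album", "the", "concert"], ["the", "album"]]
background_model = {"the": 0.5, "album": 0.01, "concert": 0.01}
entity_model = estimate_entity_model(feedback_docs, background_model)
```

Although "the" is the most frequent word in the feedback documents, the background component absorbs most of its occurrences, so the estimated entity model ranks "album" above it. This is the noise-control effect that the fixed background model is meant to provide.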
Third, adaptive query expansion.
1. When a query statement for searching the microblog documents in the microblog corpus set is received, an original query model corresponding to the query statement is established according to the query statement, and a microblog document language model is established according to each microblog document in the microblog document set.
2. Expand the original query model with the target entity topic model to obtain the expanded query model. The expanded query model is calculated according to the following formula:
$$p(w \mid \theta_{Q'}) = (1 - \alpha)\, p(w \mid \theta_{Q}) + \alpha\, p(w \mid \theta_{E})$$
wherein $\theta_{Q'}$ represents the expanded query model, $\theta_{Q}$ represents the original query model, $\theta_{E}$ represents the target entity topic model, $p(w \mid \theta_{Q'})$ represents the probability of the target entity in the expanded query model, $p(w \mid \theta_{Q})$ represents the probability of the target entity in the original query model, $p(w \mid \theta_{E})$ represents the probability of the target entity in the target entity topic model, and α represents the initial interpolation parameter, which controls the importance of the target entity topic model.
In the related art, the initial interpolation parameter α is set to a fixed value for all query statements. However, because the importance of the same target entity differs from one query statement to another, the initial interpolation parameter may be updated; α is updated according to the following formula to obtain α':
where w represents the target entity, E represents all entities in the target entity model, Q represents all entities in the query statement, $w_1$ represents any entity in the query statement, IDF(w) represents the inverse document frequency of the target entity in the microblog corpus set, and IDF($w_1$) represents the inverse document frequency of any entity of the query statement in the microblog corpus set.
Specifically, when a plurality of target entities are identified in the query statement, the final entity topic model is determined as the weighted average of the target entity topic models of the individual target entities, by the following formula:
$$p(w \mid \theta_{E}) = \frac{\sum_{i=1}^{n} IDF(E_i)\, p(w \mid \theta_{E_i})}{\sum_{i=1}^{n} IDF(E_i)}$$
wherein $\theta_{E}$ represents the final entity topic model, $p(w \mid \theta_{E})$ represents the probability of each target entity in the final entity topic model, n represents the number of target entities, $\theta_{E_i}$ represents the target entity topic model of each target entity, $IDF(E_i)$ represents the inverse document frequency of each target entity in the microblog corpus set, $p(w \mid \theta_{E_i})$ represents the probability of each target entity in the target entity topic model corresponding to that target entity, and $E_i$ represents the i-th target entity of the plurality of target entities.
3. Calculate the KL distance (that is, the similarity between the expanded query model and the microblog document language model). The similarity between the expanded query model and the microblog document language model is calculated by the following formula, and the target microblog documents whose similarity is greater than or equal to a preset similarity are taken as the target retrieval result:
$$Score(Q, D) = -\sum_{w \in V} p(w \mid \theta_{Q'}) \log \frac{p(w \mid \theta_{Q'})}{p(w \mid \theta_{D})}$$
wherein Score(Q, D) represents the similarity, V represents all entities in the microblog document language model, $\theta_{Q'}$ represents the expanded query model, $\theta_{D}$ represents the microblog document language model, $p(w \mid \theta_{Q'})$ represents the probability of the target entity in the expanded query model, and $p(w \mid \theta_{D})$ represents the probability of the target entity in the microblog document language model.
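A short Python sketch of this scoring step follows, under the assumption that the similarity is the negative KL divergence from the expanded query model to the microblog document language model, so that a larger score means a closer match. The function name `kl_score`, the threshold value and the toy models are hypothetical.

```python
import math


def kl_score(query_model, doc_model):
    """Negative KL divergence of the document model from the query model (0 is a perfect match)."""
    score = 0.0
    for word, p_query in query_model.items():
        if p_query == 0.0:
            continue  # terms with zero query probability contribute nothing
        p_doc = doc_model.get(word, 0.0)
        if p_doc == 0.0:
            return float("-inf")  # an unsmoothed document model cannot explain this word
        score -= p_query * math.log(p_query / p_doc)
    return score


expanded_query = {"Zhou Jielun": 0.5, "movie": 0.5}
doc_models = {
    "doc_a": {"Zhou Jielun": 0.5, "movie": 0.5},
    "doc_b": {"Zhou Jielun": 0.25, "movie": 0.25, "concert": 0.5},
    "doc_c": {"concert": 1.0},
}
scores = {name: kl_score(expanded_query, model) for name, model in doc_models.items()}

preset_similarity = -1.0  # hypothetical threshold
results = [name for name, s in scores.items() if s >= preset_similarity]
```

The infinite penalty for unseen words is why the document language model is normally smoothed before scoring.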
The invention is further described below with reference to an embodiment:
1) In the preprocessing stage, all entities contained in each microblog document in the microblog stream are identified with an entity recognition tool. For example, if the microblog document is "the new movie of Zhou Jielun is really too good" and the entity "Zhou Jielun" is identified, the microblog number (id) is stored in the corresponding entity entry in the entity index. For a target entity, the M most recently added microblog documents are obtained from the entity index to serve as the microblog corpus set.
2) First, for the target entity "Zhou Jielun", the Freebase search interface is used to try to link it to an object in the Freebase knowledge base and to obtain the target domains to which the object belongs, namely film, music, television, people, media and awards.
A hybrid model is then constructed, comprising the initial entity topic model corresponding to "Zhou Jielun", the background language model and six domain language models.
The M microblog documents are traversed with the domain language model, the background language model and the initial entity model corresponding to the target entity, and N iterations are performed to obtain the target entity topic model, wherein M ≥ 1, N ≥ 1, and M and N are positive integers.
3) Maximum likelihood modeling is performed on the query statement and on each microblog document. For example, if the query statement is "Zhou Jielun new movie", word segmentation yields "Zhou Jielun", "new" and "movie", and the original query model created by maximum likelihood estimation is p(Zhou Jielun) = 0.33, p(new) = 0.33 and p(movie) = 0.33. A microblog document language model is established from each microblog document in the same way as the original query model.
The target entity in the query statement is identified; for example, for the query statement "Zhou Jielun new movie" the target entity is identified as "Zhou Jielun".
The original query model is expanded with the target entity topic model of "Zhou Jielun" to obtain the expanded query model, and the initial interpolation parameter is calculated:
The original query model is expanded according to the foregoing linear interpolation formula. Since the query statement "Zhou Jielun new movie" contains only one target entity, "Zhou Jielun", the target entity topic model of that target entity can be used directly for the expansion.
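For illustration, the maximum likelihood query model of the example and its expansion by linear interpolation can be sketched in Python as below. The entity topic model values and the interpolation parameter 0.4 are invented for the sketch, and the helper names `mle_model` and `expand_query` are assumptions.

```python
from collections import Counter


def mle_model(tokens):
    """Maximum likelihood language model: relative frequency of each token."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def expand_query(original, entity_topic, alpha):
    """Linear interpolation: (1 - alpha) * original model + alpha * entity topic model."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocab = set(original) | set(entity_topic)
    return {
        word: (1.0 - alpha) * original.get(word, 0.0) + alpha * entity_topic.get(word, 0.0)
        for word in vocab
    }


# Segmented query statement from the example above.
original_query = mle_model(["Zhou Jielun", "new", "movie"])
# Hypothetical target entity topic model for "Zhou Jielun".
entity_topic = {"movie": 0.5, "premiere": 0.3, "soundtrack": 0.2}
expanded_query = expand_query(original_query, entity_topic, alpha=0.4)
```

The expansion adds words such as "premiere" that never appeared in the query statement, which is how documents that do not literally match the query can still be retrieved.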
The similarity between the expanded query model and the microblog document language model is calculated with the KL distance formula, where the microblog document language model is the maximum likelihood estimate of the microblog document with Dirichlet smoothing applied.
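Dirichlet smoothing is commonly written as p(w | D) = (c(w, D) + μ p(w | C)) / (|D| + μ), where p(w | C) is a collection language model and μ a smoothing parameter. The sketch below assumes that standard form; the present text does not spell the formula out, and the function name `dirichlet_smoothed_model`, the value of `mu` and the toy collection model are hypothetical.

```python
from collections import Counter


def dirichlet_smoothed_model(doc_tokens, collection_model, mu=2000.0):
    """Smooth the document's maximum likelihood estimate toward the collection model.

    Assumes every word of the document also appears in collection_model.
    """
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {
        word: (counts[word] + mu * p_collection) / (length + mu)
        for word, p_collection in collection_model.items()
    }


# Hypothetical collection language model over a four-word vocabulary.
collection_model = {"Zhou Jielun": 0.2, "new": 0.2, "movie": 0.3, "concert": 0.3}
doc_model = dirichlet_smoothed_model(["Zhou Jielun", "movie", "movie"], collection_model, mu=1.0)
```

After smoothing, every vocabulary word has non-zero probability in the document model, so the KL-based score stays finite even for words the microblog document does not contain.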
And determining a target retrieval result of the query statement according to the similarity.
The technical scheme of the invention has been explained in detail above with reference to the attached drawings. With this technical scheme, the user can accurately retrieve the target retrieval result from the microblog documents, so that the retrieval accuracy is improved and the retrieval robustness is effectively enhanced.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.