CN103593425B - Preference-based intelligent retrieval method and system - Google Patents

Info

Publication number
CN103593425B
Authority
CN
China
Prior art keywords
user
subject matter
vector
retrieval
matter preferences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310549069.5A
Other languages
Chinese (zh)
Other versions
CN103593425A (en
Inventor
李鹏
周育忠
王庆红
龚婷
陈传夫
王平
冉从敬
吴江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
CSG Electric Power Research Institute
Research Institute of Southern Power Grid Co Ltd
Original Assignee
Wuhan University WHU
Research Institute of Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU, Research Institute of Southern Power Grid Co Ltd filed Critical Wuhan University WHU
Priority to CN201310549069.5A priority Critical patent/CN103593425B/en
Publication of CN103593425A publication Critical patent/CN103593425A/en
Application granted granted Critical
Publication of CN103593425B publication Critical patent/CN103593425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention relates to the field of data retrieval and discloses a preference-based intelligent retrieval method and system. In the method, a subject preference model of a user is established on the basis of the subject classification of the data, user characteristics, and operation logs; query expansion is performed using the subject preference model and the user's retrieval input to obtain a preliminary retrieval result; the data are scored for subject preference using the subject preference model and the distribution of the data over the subjects, and the preliminary retrieval result is ranked in a personalized way on the basis of subject preference; secondary feedback retrieval is then performed on the ranked preliminary retrieval result using a combined model of relevance feedback and pseudo-relevance feedback to obtain the final retrieval result. The method and system determine the subject distribution of data resources through subject indexing, build retrieval vectors that better represent user needs through subject-based query expansion, relevance feedback, and related techniques, and provide the user with retrieval results that better meet the user's latent needs.

Description

Preference-based intelligent retrieval method and system
Technical field
The present invention relates to the field of data retrieval, and in particular to a preference-based intelligent retrieval method and system.
Background
As society becomes increasingly information-driven and IT equipment develops rapidly, the volume of stored information is growing exponentially. At the same time, people's demands on information acquisition keep rising, and using retrieval technology to quickly find the useful information they need is becoming more and more difficult. Traditional search engines retrieve by keyword; even when several keywords are combined, the number of results returned from the massive body of network information can still run into the millions, and finding the most needed information among them remains a heavy task for the user. The most critical problem in data retrieval today is therefore how to find, among the retrieval results, the information the user needs most.
In the prior art, a search engine or data retrieval system can rank retrieval results on the basis of certain statistical information, so that results with higher relevance are presented to the user first. Such statistics mainly include keyword frequency, degree of match, and click-through rate. They are computed over the fixed content of the data itself; although the processing load is large, the content is well defined and the approach is easy to implement. Some more advanced systems optimize further, for example by classifying the data or expanding the keywords on the basis of statistical features of text semantics, so that the top-ranked results are as relevant as possible to the query keywords. However, these approaches rely mainly on the descriptive terms in a single query request submitted by the user (a combination of keywords, time, retrieval scope, and similar constraints) and on the text of the data. The usable content of these two kinds of information is limited, and the data itself cannot reflect differences between users. Even after such optimization, the retrieval results can hardly reflect the differing needs of different users, so the retrieval efficiency, accuracy, and user satisfaction of existing approaches fall short of the ideal.
Summary of the invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is how to optimize retrieval for the differences between users.
To solve this technical problem, in one aspect the present invention provides a preference-based intelligent retrieval method, comprising the steps of:
S1, establishing a user subject preference model based on the subject classification of the data, user characteristics, and operation logs;
S2, performing query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result;
S3, scoring the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the preliminary retrieval result;
S4, performing secondary feedback retrieval on the ranked preliminary retrieval result using a combined model of relevance feedback and pseudo-relevance feedback to obtain a final retrieval result.
Preferably, in step S1, establishing the user subject preference model comprises the steps of:
establishing a subject vector space according to the subject classification;
determining a predefined subject preference vector of the user according to the user characteristics;
determining a historical subject preference vector of the user according to the operation logs;
weighting the predefined subject preference vector and the historical subject preference vector to obtain the user subject preference model.
Preferably, in step S2, performing the query expansion comprises the steps of:
calculating the probability distribution of each term in the data set corresponding to the query terms in the user's retrieval input;
calculating the probability distribution of each term in the data set corresponding to each subject term in the vector space of the user subject preference model;
measuring the difference between the two probability distributions, selecting the subject terms whose distributions differ least, and adding them to the retrieval vector with a certain weight.
Preferably, in step S3, the personalized ranking comprises the steps of:
calculating the vector similarity between each result in the preliminary retrieval result and the user subject preference model, to evaluate the score of each result on the subjects the user prefers;
calculating a quality score for each result;
obtaining a final ranking score for each result by weighting the vector similarity, the score on the subjects the user prefers, and the quality score, and ranking the results in the preliminary retrieval result by the final ranking score.
Preferably, in step S4, the secondary feedback retrieval comprises the steps of:
using the relevance feedback to determine the vector set of the relevant results in the preliminary retrieval result;
using the pseudo-relevance feedback to determine the vector set of the non-relevant results in the preliminary retrieval result;
combining the user subject preference model, the vector set of the relevant results, the vector set of the non-relevant results, and the original query vector to perform the feedback query.
In another aspect, the present invention also provides a preference-based intelligent retrieval system, comprising:
a user subject preference identification module, for establishing a user subject preference model based on the subject classification of the data, user characteristics, and operation logs;
a query expansion module, for performing query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result;
a retrieval ranking module, for scoring the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the preliminary retrieval result;
a feedback retrieval module, for performing secondary feedback retrieval on the ranked preliminary retrieval result using a combined model of relevance feedback and pseudo-relevance feedback to obtain a final retrieval result.
Preferably, the user subject preference identification module further comprises:
a subject vector space module, for establishing a subject vector space according to the subject classification;
a predefined preference module, for determining a predefined subject preference vector of the user according to the user characteristics;
a historical preference module, for determining a historical subject preference vector of the user according to the operation logs;
a preference model acquisition module, for weighting the predefined subject preference vector and the historical subject preference vector to obtain the user subject preference model.
Preferably, the query expansion module further comprises:
a query term distribution module, for calculating the probability distribution of each term in the data set corresponding to the query terms in the user's retrieval input;
a subject term distribution module, for calculating the probability distribution of each term in the data set corresponding to each subject term in the vector space of the user subject preference model;
an expansion module, for measuring the difference between the two probability distributions, selecting the subject terms whose distributions differ least, and adding them to the retrieval vector with a certain weight.
Preferably, the retrieval ranking module further comprises:
a subject scoring module, for calculating the vector similarity between each result in the preliminary retrieval result and the user subject preference model, to evaluate the score of each result on the subjects the user prefers;
a quality scoring module, for calculating a quality score for each result;
a ranking module, for obtaining a final ranking score for each result by weighting the vector similarity, the score on the subjects the user prefers, and the quality score, and ranking the results in the preliminary retrieval result by the final ranking score.
Preferably, the feedback retrieval module further comprises:
a relevance feedback module, for using the relevance feedback to determine the vector set of the relevant results in the preliminary retrieval result;
a pseudo-relevance feedback module, for using the pseudo-relevance feedback to determine the vector set of the non-relevant results in the preliminary retrieval result;
a feedback module, for combining the user subject preference model, the vector set of the relevant results, the vector set of the non-relevant results, and the original query vector to perform the feedback query.
The present invention provides a preference-based intelligent retrieval method and system. Subject indexing is used to determine the subject distribution of data resources; subject-based query expansion, relevance feedback, and related techniques are used to build a retrieval vector that better represents the user's needs; and an intelligent ranking model that incorporates the user's subject preferences then provides the user with retrieval results that better meet the user's latent needs. The algorithm and system of the present invention can identify the user's latent information needs, described in terms of a professional thesaurus, and therefore achieve a better retrieval effect.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the preference-based intelligent retrieval method in one embodiment of the present invention;
Fig. 2 is a schematic flow chart of the subject-based query expansion algorithm in a preferred embodiment of the present invention;
Fig. 3 is a schematic flow chart of the subject-aware relevance feedback algorithm in a preferred embodiment of the present invention;
Fig. 4 is a schematic diagram of the module structure of the preference-based intelligent retrieval system in a typical application scenario of the present invention.
Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are preferred ways of implementing the present invention; the description is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the present invention is defined by the claims, and all other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within that scope.
The prior art optimizes mainly the data being retrieved. At best, the retrieved data are precisely classified and expanded and then matched against the descriptive terms in a single query request submitted by the user. Although this greatly improves retrieval accuracy, it does not reflect differences between users: identical query requests always produce identical retrieval results, which clearly departs from the reality that different users have different needs.
In an embodiment of the present invention, the user's latent needs are obtained by observing and analyzing the user's retrieval behavior over a relatively long period. User needs are combined with data classification, and explicit and implicit relevance feedback are incorporated into retrieval optimization, so that the differences in user needs are reflected accurately and the overall efficiency and accuracy of data retrieval are improved.
Referring to Fig. 1, in one embodiment of the present invention the preference-based intelligent retrieval method comprises the steps of:
S1, establishing a user subject preference model based on the subject classification of the data, user characteristics, and operation logs;
S2, performing query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result;
S3, scoring the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the preliminary retrieval result;
S4, performing secondary feedback retrieval on the ranked preliminary retrieval result using a combined model of relevance feedback and pseudo-relevance feedback to obtain a final retrieval result.
The preferred forms of the above embodiment are explained further below. To highlight the technical principles and practical effect of the invention, the preferred embodiments that follow limit the retrieved data to scientific and technical information. A person skilled in the art will understand, however, that scientific and technical information is only one specific category of data, that the technical solution of the present invention can be applied directly to many kinds of digital information, and that the following preferred embodiments should not be regarded as limiting the invention.
Users have latent subject needs when acquiring data resources. For scientific and technical literature, users in different fields have markedly different needs for the same keyword, which makes these implicit subject needs even more pronounced. In a preferred embodiment of the present invention, step S1 uses a subject term classification table to map user needs and discover the user's preferences over the classification of literature resources, which provides a sound basis for intelligent retrieval. Subject preference is considered mainly from the following two aspects:
1. Predefinition of user subject preferences
Different users have different characteristics, many of which reflect latent needs. Some subject preferences can therefore be predefined from user characteristics (for example the user's region, functional information, or the scope of job responsibilities). Specifically, a user in a high-voltage testing post in the power industry has particular needs for literature on power transformers, circuit breakers, instrument transformers, and the like. Subject terms can thus be extracted from the documents of that post, combined with the subject terms of the job responsibilities, and mapped onto the standardized subject categories as the user's predefined preferences. More preferably, step S1 represents the user's subject preferences with a vector space model:
First, the subject distribution is analyzed and an N-dimensional subject vector space $[(k_1, w_1), (k_2, w_2), \ldots, (k_N, w_N)]$ is established, where $k_i$ is the $i$-th subject and $w_i$ is the user's preference on $k_i$, $i \in \{1, 2, \ldots, N\}$.
Then, subject terms are extracted from the user characteristics (such as job responsibility terms and post documents), their frequencies are counted, and their probability distribution is calculated as $p_{sub_i} = freq_{sub_i} / freq_{sub\_total}$, where $freq_{sub_i}$ is the frequency of subject term $sub_i$ and $freq_{sub\_total}$ is the total frequency of the set of subject terms.
Finally, after adjustment by a certain coefficient, $p_{sub_i}$ is used to characterize the user's preference on each subject term $sub_i$, which gives the predefined user subject preference vector $W_{pre} = (w'_1, w'_2, \ldots, w'_n)$, where $w'_i$, $i = 1, 2, \ldots, n$, is the user's predefined preference on subject $k_i$.
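The predefinition step can be sketched as follows. This is a minimal illustration, assuming the subject terms have already been extracted and mapped to the standardized subject vocabulary; the function name, the sample subjects, and the `scale` coefficient are hypothetical, since the text does not give the value of the adjustment coefficient.

```python
from collections import Counter

def predefined_preference(subject_terms, subjects, scale=1.0):
    """Build W_pre from subject terms extracted from user characteristics.

    subject_terms: subject terms found in job descriptions, post documents, etc.
    subjects:      ordered list of the subjects k_1..k_N of the vector space.
    scale:         hypothetical adjustment coefficient (value not given in the text).
    """
    counts = Counter(t for t in subject_terms if t in subjects)
    total = sum(counts.values())  # freq_sub_total
    if total == 0:
        return [0.0] * len(subjects)
    # p_sub_i = freq_sub_i / freq_sub_total, then scaled
    return [scale * counts[k] / total for k in subjects]

# Illustrative data only
subjects = ["transformer", "circuit breaker", "instrument transformer", "relay"]
terms = ["transformer", "transformer", "circuit breaker", "instrument transformer"]
w_pre = predefined_preference(terms, subjects)
```

With `scale=1.0` the vector is simply the relative frequency of each subject term, so its components sum to 1 whenever at least one term matches.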
2. Discovering user subject preferences from the user operation log
Retrieval is only one part of the user's overall information-seeking behavior. Related operations include clicking, downloading, and collecting (bookmarking) documents in the system, all of which are recorded in the system log. The user's subject preferences can therefore be mined from the large volume of operation log information, providing a foundation for intelligent retrieval. Step S1 of the above method also sets up a complete operation log collection mechanism and uses the operation log to determine the user's subject preferences.
Specifically, the logs are collected and analyzed to obtain the set of documents the user has operated on, $D_{op} = \{d_{op1}, d_{op2}, \ldots, d_{opN}\}$. For each document $d_i \in D_{op}$, the numbers of clicks, downloads, collections, and other operations by the user are counted, different operations are given different weights, and the user's weighted access frequency for $d_i$ is calculated. The subject indexing of the document gives the distribution of $d_i$ over the subject terms; combined with the access frequency of $d_i$, this yields the user's access frequency on each subject term. This is taken as the user's degree of subject preference and mapped into the subject vector space, which gives the user subject preference vector $W_{op} = (w_1, w_2, \ldots, w_n)$.
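The log-based step might look like the sketch below. The operation weights, the log record format `(doc_id, operation)`, and the final normalization to a unit sum are all assumptions made for illustration; the text only states that different operations receive different weights.

```python
from collections import defaultdict

# Hypothetical operation weights; the text does not specify values.
OP_WEIGHTS = {"click": 1.0, "download": 2.0, "collect": 3.0}

def log_preference(log, doc_subjects, subjects, op_weights=OP_WEIGHTS):
    """Derive W_op from an operation log.

    log:          iterable of (doc_id, operation) records.
    doc_subjects: doc_id -> {subject: share}, from the document's subject indexing.
    subjects:     ordered list of subjects k_1..k_N.
    """
    # Weighted access frequency per document
    access = defaultdict(float)
    for doc_id, op in log:
        access[doc_id] += op_weights.get(op, 0.0)
    # Spread each document's access frequency over its subjects
    per_subject = defaultdict(float)
    for doc_id, freq in access.items():
        for subject, share in doc_subjects.get(doc_id, {}).items():
            per_subject[subject] += freq * share
    total = sum(per_subject.values())
    # Normalization is an assumption, chosen so W_op is comparable with W_pre
    return [per_subject[k] / total if total else 0.0 for k in subjects]

log = [("d1", "click"), ("d1", "download"), ("d2", "collect")]
doc_subjects = {"d1": {"transformer": 1.0}, "d2": {"transformer": 0.5, "relay": 0.5}}
w_op = log_preference(log, doc_subjects, ["transformer", "relay"])
```

Because the log grows over time, a function like this would be re-run periodically so that the preference vector tracks recent behavior.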
Finally, the two subject preference vectors are weighted to determine the user's subject preference $W = \alpha_1 W_{pre} + \alpha_2 W_{op}$, where $\alpha_1$ and $\alpha_2$ are the weights of the two vectors, preset or adjusted according to the desired emphasis. Note that the preference obtained from log analysis changes over time and must be updated as the log is updated.
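The weighted combination is a plain linear blend of the two vectors. In the sketch below the default weights of 0.5 are placeholders, not values from the text.

```python
def combine_preferences(w_pre, w_op, alpha1=0.5, alpha2=0.5):
    """W = alpha1 * W_pre + alpha2 * W_op (alpha values here are placeholders)."""
    if len(w_pre) != len(w_op):
        raise ValueError("both vectors must live in the same subject vector space")
    return [alpha1 * a + alpha2 * b for a, b in zip(w_pre, w_op)]

w = combine_preferences([0.5, 0.25, 0.25, 0.0], [0.25, 0.25, 0.0, 0.5])
```

A new user with no log history could be handled by setting `alpha2` to 0 and raising it as log data accumulates; that policy is a suggestion rather than something the text prescribes.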
A query request is the direct expression of the user's query need and likewise contains a latent subject need. To some extent this subject need reflects the user's abstraction and summary of the required literature, and so better reflects what the user wants. Subject terms, as labels of literature resources, reflect the core content and classification of a document and better express its essence. Taking both aspects into account, step S2 of the present invention selects subject terms for query expansion, which greatly improves retrieval effectiveness; the algorithm flow is shown in Fig. 2.
If the user's retrieval input is itself a standardized subject term, related subject terms can be found through relations such as hypernyms and hyponyms in the subject classification table and used for query expansion. In many cases, however, the query entered by the user has no explicit association with the latent subject need. The association can then be established through the documents retrieved historically and the subject-indexed documents. As shown in Fig. 2, the basic idea is as follows:
Denote the set of documents corresponding to the user's retrieval request Q as $D_{query} = \{d_{q1}, d_{q2}, \ldots, d_{qN}\}$. Segmenting each document in $D_{query}$ into words gives a set of terms, denoted $T_{query} = \{t_{q1}, t_{q2}, \ldots, t_{qN}\}$. For each $t_{qi} \in T_{query}$, the probability $p_{qt_i} = freq_{t_{qi}} / freq_{total}$ is computed, which gives the probability distribution of the term set $T_{query}$ corresponding to $D_{query}$, denoted $F_{query} = (p_{qt_1}, p_{qt_2}, \ldots, p_{qt_N})$, where $freq_{t_{qi}}$ is the frequency of $t_{qi}$ and $freq_{total}$ is the sum of the frequencies of the terms in $T_{query}$.
For a subject term in the subject vector space, a set of documents can likewise be obtained through the subject indexing of documents, denoted $D_{subject} = \{d_{s1}, d_{s2}, \ldots, d_{sN}\}$. Similarly, a term set is obtained from this document set, and computing the corresponding term frequencies gives the probability distribution of the term set corresponding to $D_{subject}$, denoted $F_{subject} = (p_{st_1}, p_{st_2}, \ldots, p_{st_N})$.
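Both distributions can be computed with the same routine. The sketch below assumes the documents are already segmented into tokens and that both distributions are evaluated over one shared vocabulary so that they can be compared component by component; the shared vocabulary is an assumption, since the text does not say how the two term sets are aligned.

```python
from collections import Counter

def term_distribution(documents, vocabulary):
    """Relative term frequencies over a fixed vocabulary.

    documents:  list of token lists (word segmentation already applied).
    vocabulary: ordered list of terms that defines the vector components.
    """
    counts = Counter(tok for doc in documents for tok in doc)
    total = sum(counts[t] for t in vocabulary)  # freq_total
    return [counts[t] / total if total else 0.0 for t in vocabulary]

vocab = ["a", "b", "c"]
f_query = term_distribution([["a", "b"], ["a", "c"]], vocab)
f_subject = term_distribution([["a", "a"], ["b", "b"]], vocab)
```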
Once the two probability distributions are available, the subject terms most related to the query terms can be found by computing the similarity of the distributions, and then used for subject-term query expansion.
To compute the similarity of the probability distributions of the two document sets corresponding to the query terms and to a subject term, the Kullback-Leibler divergence (KL divergence, also called relative entropy) is preferably used.
In this way, $D_{KL}(F_{subject} \,\|\, F_{query})$ gives the difference of $F_{subject}$ relative to $F_{query}$, and the subject terms with the smaller differences are used to build the query expansion:
$$D_{KL}(F_{subject} \,\|\, F_{query}) = \sum_{i} p_{st_i} \log \frac{p_{st_i}}{p_{qt_i}}$$
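A direct implementation of the KL divergence is short. The sketch below uses the natural logarithm and the usual conventions that a zero component of the first distribution contributes nothing and that a zero in the second distribution where the first is positive gives an infinite divergence; the text does not state how zeros are handled, so those conventions are assumptions.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same terms."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf     # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

d = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

The infinite case is common with sparse term distributions, which is one practical reason to prefer a smoothed measure.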
To obtain a better query expansion effect, the distributions of the query request and of the subject terms over the document vectors of the system were studied further and the above calculation was refined accordingly: the Jensen-Shannon divergence is selected as a smoothed measure, and $D_{JS}(F_{subject} \,\|\, F_{query})$ is computed to measure the mutual difference between $F_{subject}$ and $F_{query}$:
$$D_{JS}(F_{subject} \,\|\, F_{query}) = \frac{1}{2} D_{KL}(F_{subject} \,\|\, R) + \frac{1}{2} D_{KL}(F_{query} \,\|\, R), \qquad R = \frac{1}{2}\left(F_{subject} + F_{query}\right)$$
After the subject terms with the smaller distribution differences are selected, they are added to the retrieval vector with a certain weight to build the expanded query vector and improve retrieval effectiveness.
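The selection step can be sketched with the Jensen-Shannon divergence as follows. The function names, the `top_k` cut-off, and the expansion weight of 0.3 are hypothetical; the text only says that the subject terms with smaller differences are added "with a certain weight".

```python
import math

def _kl(p, m):
    # m_i > 0 whenever p_i > 0 because m is the average of p and q
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and finite for any two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def select_expansion_terms(f_query, subject_dists, top_k=2, weight=0.3):
    """Pick the subject terms whose distributions are closest to the query's.

    subject_dists: subject term -> F_subject (same vocabulary order as f_query).
    Returns (subject term, weight) pairs to append to the retrieval vector.
    """
    ranked = sorted(subject_dists.items(),
                    key=lambda kv: js_divergence(kv[1], f_query))
    return [(term, weight) for term, _ in ranked[:top_k]]

f_query = [0.5, 0.5, 0.0]
subject_dists = {
    "transformer": [0.5, 0.5, 0.0],
    "relay": [0.0, 0.0, 1.0],
    "breaker": [0.4, 0.4, 0.2],
}
expansion = select_expansion_terms(f_query, subject_dists)
```

With the natural logarithm the divergence is bounded by ln 2, so a fixed threshold could be used in place of `top_k`.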
In step S3 of the above method, the core of personalized ranking is to weight the ranking by the user's subject preference on top of the document relevance ranking. The user's subject preference vector $W$ is obtained from the user subject preference model. For the set of retrieved documents, the subject distribution vector $V = (v_1, v_2, \ldots, v_n)$ of each document can be obtained from its subject indexing. The score of a retrieved document on the subjects the user prefers can then be evaluated by computing the vector similarity $sim(V, W)$ of $W$ and $V$: the higher $sim(V, W)$, the higher the document's preference score, where
$$sim(V, W) = \frac{\sum_{k=1}^{n} v_k \times w_k}{\sqrt{\left(\sum_{k=1}^{n} v_k^2\right)\left(\sum_{k=1}^{n} w_k^2\right)}}$$
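A sketch of the similarity computation, assuming it is the usual cosine similarity between the document's subject vector and the user's preference vector. Returning 0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math

def cosine_similarity(v, w):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

# A document indexed mostly under the user's preferred subject scores high.
score = cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
```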
Besides the weighting by user subject preference, document quality is also an important weighting indicator. Many factors bear on the evaluation of literature quality; here documents are awarded points mainly on four factors: how often the paper is cited, how often it is downloaded, the rank of the journal in which it is published, and whether it is a self-built resource. Self-built resources reflect the two ways in which the organization collects literature resources: purchasing from resource vendors and collecting in-house. Resources collected in-house by discipline have passed manual review and are therefore of higher quality. The weight of each factor is given in Table 1.
Factor    f_quote    f_download    f_periodical    f_self-built
Weight    0.5        0.1           0.2             0.2
Table 1. Weights of the literature quality evaluation factors
The score of each factor of a document is obtained by normalizing the relevant fields of the document metadata. Weighting these scores gives the document's quality score $G_{factor} = 0.45 f_{quote} + 0.15 f_{download} + 0.2 f_{periodical} + 0.2 f_{self\text{-}built}$.
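The quality score is a weighted sum of normalized factor scores. In the sketch below the weights are passed in as a parameter because the source gives 0.5 and 0.1 for the first two factors in Table 1 but 0.45 and 0.15 in the formula; the example uses the formula's values. The dictionary keys and the sample document are illustrative.

```python
def quality_score(factors, weights):
    """Weighted sum of factor scores, each assumed normalized to [0, 1]."""
    return sum(weights[name] * factors[name] for name in weights)

# Weights as they appear in the G_factor formula (Table 1 lists 0.5 and 0.1
# for the first two factors instead).
FORMULA_WEIGHTS = {"quote": 0.45, "download": 0.15, "periodical": 0.2, "self_built": 0.2}

doc = {"quote": 0.8, "download": 0.4, "periodical": 1.0, "self_built": 0.0}
g_factor = quality_score(doc, FORMULA_WEIGHTS)
```

Since both weight sets sum to 1, the score stays in [0, 1] whichever is used.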
By weighting the above two scores together with the document-query similarity score, the final ranking score of a retrieved document is calculated as $G_{sort} = \beta_1 G_{query} + \beta_2\, sim(V, W) + \beta_3 G_{factor}$, where $G_{query}$ is the score returned by LUCENE for the user's specific query, and $\beta_1$, $\beta_2$, and $\beta_3$ are the weights of the respective scores. Different weight settings are considered in the calculation and are determined according to the usage of the system and the bibliometric characteristics of the documents.
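A sketch of the final ranking step. The beta values are placeholders, and the code assumes the Lucene score has already been normalized to the same range as the other two scores; raw Lucene scores are not bounded, so some normalization of this kind would be needed in practice.

```python
def final_score(g_query, sim_vw, g_factor, betas=(0.6, 0.25, 0.15)):
    """G_sort = b1 * G_query + b2 * sim(V, W) + b3 * G_factor (betas are placeholders)."""
    b1, b2, b3 = betas
    return b1 * g_query + b2 * sim_vw + b3 * g_factor

def rank_results(results, betas=(0.6, 0.25, 0.15)):
    """Sort result records by descending final score."""
    return sorted(
        results,
        key=lambda r: final_score(r["g_query"], r["sim"], r["g_factor"], betas),
        reverse=True,
    )

results = [
    {"id": "a", "g_query": 0.9, "sim": 0.1, "g_factor": 0.2},  # strong text match only
    {"id": "b", "g_query": 0.7, "sim": 0.9, "g_factor": 0.8},  # matches user preference
]
ranked = rank_results(results)
```

In this example document "b" overtakes "a" despite a lower query score, which is the personalization effect the step is meant to produce.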
Relevance feedback, as a supplement to the retrieval request, can effectively improve retrieval accuracy. In step S4 of the above method, relevance feedback and pseudo-relevance feedback are combined, and the scopes of relevant and non-relevant documents are effectively delimited through the subject classification and the analysis of the user's operation log, so that feedback achieves a better effect. The specific flow of the relevance feedback algorithm is shown in Fig. 3:
After the initial retrieval, the user marks the retrieval results for relevance. From these marks, a relevant document vector set $D_r$ and a non-relevant document vector set $D_{nr}$ are built. With the relevant and non-relevant documents available, the relevance feedback retrieval vector can be built following the idea of the Rocchio algorithm:
$$\vec{q}_m = \gamma_1 \vec{q}_0 + \gamma_2 \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma_3 \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$
where $\vec{q}_0$ is the original query vector, $D_r$ and $D_{nr}$ are the known relevant and non-relevant document sets, and $\gamma_1$, $\gamma_2$, $\gamma_3$ are the respective weights.
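The classical Rocchio update above can be sketched in a few lines. The default gamma values are illustrative placeholders, not values given in the text, and an empty document set is treated as contributing a zero centroid.

```python
def rocchio(q0, relevant, non_relevant, g1=1.0, g2=0.75, g3=0.15):
    """Rocchio query update over dense vectors of equal length."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q0)
        return [sum(col) / len(docs) for col in zip(*docs)]

    c_r = centroid(relevant)        # mean of the relevant document vectors
    c_nr = centroid(non_relevant)   # mean of the non-relevant document vectors
    return [g1 * q + g2 * r - g3 * n for q, r, n in zip(q0, c_r, c_nr)]

q_m = rocchio(
    [1.0, 0.0],
    relevant=[[1.0, 1.0], [0.0, 1.0]],
    non_relevant=[[0.0, 2.0]],
    g1=1.0, g2=0.5, g3=0.25,
)
```

In practice negative components of the updated query are often clipped to zero; that step is omitted here to keep the sketch aligned with the formula.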
But under the use scenes of native system, directly use above-mentioned formula, relevant feedback effect cannot reach optimum.Consider to improve from following two aspects model: relevant documentation set D rand uncorrelated collection of document D nrdefine with filter, the document that feeds back vector combines to set up with subject matter preferences vector and feeds back query vector afterwards.
Consider that user is after primary retrieval, limited to the feedback labeling operation of document, need the angle from user search history and the distribution of theme interest, helping which defines is relevant documentation, and which is uncorrelated document.The correlativity of the document of user's Direct Mark and judgement is explicit relevant feedback, and this part is the basis of relevant feedback, in relevant feedback calculates, give higher weight.And in result for retrieval Top-N, the document that user does not mark, can by calculating the similarity of document subject matter vector and user preference theme vector, get similarity high add in relevant documentation, what similarity was low adds in uncorrelated document, this two-part document, when user's relevant feedback calculates, can be considered with the scoring of preference topic similarity as its weight l j.Like this, while alleviation user operation burden, the document sets needed for feedback searching is effectively obtained.
Determining D rand D nrdocument scope after, note for the set of relevant documentation vector, note for the vector set of uncorrelated document.Right get high frequency entry and word frequency thereof, set up document vector, be designated as wherein, freq tifor the word frequency in document.
After determining feedback document vector, further its wooden fork is heavily adjusted.The document weight that user directly marks composes 1, and other document calculates according to document subject matter vector and user's subject matter preferences vector similarity score.Thus feedback document is joined in feedback searching vector with corresponding weight.Also the subject matter preferences of user vector is joined in feedback vector with weight δ simultaneously.According to Using statistics analysis, δ get 0.2 ~ 0.3 between effect more excellent.In addition, due to uncorrelated document mainly system automatically to select from the document that user does not mark, uncertain high.For strengthening the stability of feedback searching, get the most incoherent documents representative D by Similarity measures nr, join in calculating.
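That is, only $\arg\max_{\vec{d} \in D_{nr}} \operatorname{sim}(\vec{d}, \vec{q}_0)$ from the non-relevant document set is used in the calculation.
Taking the above into account, the improved feedback query formula
$$\vec{q}_m = \vec{q}_0 + \sum_{\vec{d}_j \in D_r} l_j \vec{d}_j + \delta \cdot W - \arg\max_{\vec{d} \in D_{nr}} \operatorname{sim}(\vec{d}, \vec{q}_0)$$
is used to perform the query expansion for feedback retrieval.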
Here, $\vec{q}_0$ is the original query vector, $D_r$ and $D_{nr}$ are the known relevant and non-relevant document sets, $l_j$ is the weight of each relevant document, $W$ is the user's subject preference vector, and δ is the weight of $W$. The expanded query obtained from this formula is used for the second, feedback retrieval pass, improving retrieval precision and recall.
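The feedback query formula can be illustrated with a short sketch. It assumes that all vectors (query, documents, and the preference vector W) have already been mapped into one common space of equal dimension, and that the similarity measure is cosine similarity; neither assumption is stated in the text. The function name `feedback_query` and the default `delta=0.25` (chosen inside the 0.2 to 0.3 range reported above) are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def feedback_query(q0, relevant, weights, preference, non_relevant, delta=0.25):
    """Compute q_m = q_0 + sum_j l_j * d_j + delta * W - d_nr.

    relevant:     list of relevant document vectors d_j
    weights:      list of weights l_j, one per relevant document
    preference:   user subject preference vector W
    non_relevant: list of non-relevant document vectors; only the one most
                  similar to q0 (the arg max in the formula) is subtracted
    """
    qm = [float(x) for x in q0]
    for d, l in zip(relevant, weights):
        for k, value in enumerate(d):
            qm[k] += l * value
    for k, value in enumerate(preference):
        qm[k] += delta * value
    if non_relevant:
        d_nr = max(non_relevant, key=lambda d: cosine(d, q0))
        for k, value in enumerate(d_nr):
            qm[k] -= value
    return qm
```

Components of the expanded query can become negative after the subtraction; whether to clip them to zero is left open here because the text does not address it.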
A person of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiment can be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on. Accordingly, corresponding to the above method, the present invention also discloses a preference-based intelligent retrieval system, comprising:
A user subject preference identification module, configured to build a user subject preference model based on the subject classification of the data, user characteristics, and operation logs;
A query expansion module, which uses the user subject preference model and the user's retrieval input to perform query expansion and obtain a preliminary retrieval result;
A retrieval ranking module, which uses the user subject preference model and the distribution of the data over the subjects to score the data by subject preference, and performs personalized, subject-preference-based ranking of the preliminary retrieval result;
A feedback retrieval module, which uses a unified model of relevance feedback and pseudo-relevance feedback to perform a second, feedback retrieval pass on the ranked preliminary retrieval result and obtain the final retrieval result.
As a typical application scenario of the present invention, the above technical solution was used to build a subsystem of the China Southern Power Grid information center system. The intelligent retrieval system makes full use of the comprehensively collected user log information and the subject term classification table to mine users' needs and preferences in depth and, on that basis, supports personalized retrieval and improves retrieval accuracy and user satisfaction. The system uses Lucene 4.3 as the underlying retrieval technology and provides a unified retrieval entry point. A user subject preference identification module, a related-subject intelligent prompting and query expansion module, a subject-based relevance feedback module, and a subject-aware personalized retrieval ranking module are designed, together forming a personalized intelligent retrieval system. Referring to Fig. 4, the system modules are designed as follows:
(1) User subject preference identification module: the system analyzes the user operation logs, counts the number of operations such as clicks, downloads, and favorites per subject category, and computes an access heat score for each subject weighted by operation type; this score serves as the user's preference for the subject. The calculation involves log analysis over a large volume of data, which a single machine can hardly support. The system therefore uses the Hadoop platform and performs the log analysis as a distributed MapReduce computation.
(2) Query expansion module: when the user submits a retrieval request, the system uses the ICTCLAS word segmenter to segment the query statement. The relevance between the segmented query terms and the subject terms is computed with the Jensen-Shannon divergence measure, and the subject terms with high relevance are used for query expansion to build a new retrieval vector. These subject terms can also be shown to the user as prompts, helping the user express the retrieval need more precisely.
(3) Retrieval ranking module: the system provides several ranking interfaces. In the comprehensive ranking, the relevance between the document and the query terms is the basis of the ranking. The user's subject preferences are taken into account and expressed as a subject preference vector. The distance between the document's vector in subject space and the user's subject preference vector is computed as the document's weight score on user preference and is added to the overall weighted score. In addition, a quality score is computed for each document: the document's average citation count, download frequency, journal impact factor, and similar indicators are each normalized and multiplied by their corresponding weights to obtain the quality score, which is then added to the overall weighted score with a certain weight.
(4) Feedback retrieval module: for the first ranking result, the subject-based relevance feedback module lets the user mark which documents are relevant. Unmarked documents are collected from the result pages the user has browsed and serve as the initial non-relevant documents. Documents of uncertain relevance are then filtered out of this set according to the probability results of the user log analysis. With the relevant and non-relevant documents selected, feedback query expansion is performed with the algorithm described above and a second, feedback retrieval pass is carried out, focusing further on the results the user wants most.
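Modules (1) to (3) above can be sketched in a few lines each. The following is an illustrative single-machine sketch under stated assumptions, not the deployed Hadoop/Lucene system: the action weights in `ACTION_WEIGHTS`, the normalization of heat scores, the use of cosine similarity for the preference score, the `betas` values, and all function names are hypothetical. The heat computation mirrors what a MapReduce job would do (map each log record to a subject and weight, then sum per subject).

```python
import math
from collections import defaultdict

# (1) Subject heat from operation logs. The weights per action are hypothetical.
ACTION_WEIGHTS = {"click": 1.0, "download": 3.0, "favorite": 5.0}

def subject_heat(log_records, action_weights=ACTION_WEIGHTS):
    """log_records: iterable of (subject, action). Returns normalized heat per subject."""
    heat = defaultdict(float)
    for subject, action in log_records:
        heat[subject] += action_weights.get(action, 0.0)
    total = sum(heat.values())
    return {s: h / total for s, h in heat.items()} if total else {}

# (2) Query expansion: pick subject terms whose term distribution is closest
#     to the query's term distribution under Jensen-Shannon divergence.
def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def closest_subject_terms(query_dist, term_dists, top_k=2):
    """term_dists: dict subject_term -> probability distribution over the same vocabulary."""
    ranked = sorted(term_dists, key=lambda t: js_divergence(query_dist, term_dists[t]))
    return ranked[:top_k]

# (3) Ranking score: query relevance + subject preference + document quality.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranking_score(g_query, doc_subjects, preference, g_factor, betas=(0.6, 0.3, 0.1)):
    """g_query and g_factor are assumed to be pre-normalized to a comparable range."""
    b1, b2, b3 = betas
    return b1 * g_query + b2 * cosine(doc_subjects, preference) + b3 * g_factor
```

A smaller divergence means a closer match, so `closest_subject_terms` sorts in ascending order; with base-2 logarithms the Jensen-Shannon divergence stays between 0 and 1, which makes it straightforward to convert into an expansion weight.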
The invention provides a preference-based intelligent retrieval method and system. Subject indexing is used to determine the subject distribution of the data resources; subject-based query expansion, relevance feedback, and related techniques are used to build a retrieval vector that better represents the user's needs; and an intelligent ranking model that incorporates the user's subject preferences then provides retrieval results that better match the user's latent needs. The algorithm and system of the invention can identify the user's latent information needs, described in terms of a professional thesaurus, and therefore achieve better retrieval performance.
The above description illustrates and describes some preferred embodiments of the present invention. As stated earlier, however, the invention is not limited to the forms disclosed herein; these should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described herein, through the above teachings or through the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims (2)

1. A preference-based intelligent retrieval method, characterized in that the method comprises the steps of:
S1, building a user subject preference model based on the subject classification of the data, user characteristics, and operation logs; wherein a subject vector space is established according to the subject classification, a predefined subject preference vector of the user is determined according to the user characteristics, a historical subject preference vector of the user is determined according to the operation logs, and the predefined subject preference vector and the historical subject preference vector are combined by weighting to obtain the user subject preference model;
S2, using the user subject preference model and the user's retrieval input to perform query expansion and obtain a preliminary retrieval result; wherein performing the query expansion comprises: calculating the probability distribution of each term in the data set corresponding to the search terms in the user's retrieval input; calculating the probability distribution of each term in the data set corresponding to each subject term in the vector space of the user subject preference model; measuring the difference between these two probability distributions; and selecting the subject terms with a smaller distribution difference and adding them to the retrieval vector with a certain weight;
S3, using the user subject preference model and the distribution of the data over the subjects to score the data by subject preference, and performing personalized, subject-preference-based ranking of the preliminary retrieval result; wherein the ranking score of each document in the preliminary retrieval result is $G_{sort} = \beta_1 G_{query} + \beta_2 \operatorname{sim}(V, W) + \beta_3 G_{factor}$; $G_{query}$ is the score returned by Lucene for a specific user query; $\beta_1$, $\beta_2$, and $\beta_3$ are the weights of the respective scores; $G_{factor}$ is the weighted document quality score; and $\operatorname{sim}(V, W)$ is the vector similarity between the user's subject preference vector $W$ and the subject distribution vector $V$ of each document, where $n$ is the dimension of the vectors $W$ and $V$, and $v_k$ and $w_k$ denote the $k$-th elements of $V$ and $W$ respectively; the results in the preliminary retrieval result are then sorted according to the ranking score;
S4, using a unified model of relevance feedback and pseudo-relevance feedback to perform a second, feedback retrieval pass on the ranked preliminary retrieval result and obtain the final retrieval result; wherein the relevance feedback is used to determine the vector set of relevant results in the preliminary retrieval result; the pseudo-relevance feedback is used to determine the vector set of non-relevant results in the preliminary retrieval result; and the user subject preference model, the vector set of relevant results, the vector set of non-relevant results, and the original query vector are combined to perform the feedback query.
2. A preference-based intelligent retrieval system, characterized in that the system comprises:
A user subject preference identification module, configured to build a user subject preference model based on the subject classification of the data, user characteristics, and operation logs; wherein a subject vector space is established according to the subject classification, a predefined subject preference vector of the user is determined according to the user characteristics, a historical subject preference vector of the user is determined according to the operation logs, and the predefined subject preference vector and the historical subject preference vector are combined by weighting to obtain the user subject preference model;
A query expansion module, which uses the user subject preference model and the user's retrieval input to perform query expansion and obtain a preliminary retrieval result; wherein performing the query expansion comprises: calculating the probability distribution of each term in the data set corresponding to the search terms in the user's retrieval input; calculating the probability distribution of each term in the data set corresponding to each subject term in the vector space of the user subject preference model; measuring the difference between these two probability distributions; and selecting the subject terms with a smaller distribution difference and adding them to the retrieval vector with a certain weight;
A retrieval ranking module, which uses the user subject preference model and the distribution of the data over the subjects to score the data by subject preference, and performs personalized, subject-preference-based ranking of the preliminary retrieval result; wherein the ranking score of each document in the preliminary retrieval result is $G_{sort} = \beta_1 G_{query} + \beta_2 \operatorname{sim}(V, W) + \beta_3 G_{factor}$; $G_{query}$ is the score returned by Lucene for a specific user query; $\beta_1$, $\beta_2$, and $\beta_3$ are the weights of the respective scores; $G_{factor}$ is the weighted document quality score; and $\operatorname{sim}(V, W)$ is the vector similarity between the user's subject preference vector $W$ and the subject distribution vector $V$ of each document, where $n$ is the dimension of the vectors $W$ and $V$, and $v_k$ and $w_k$ denote the $k$-th elements of $V$ and $W$ respectively; the results in the preliminary retrieval result are then sorted according to the ranking score;
A feedback retrieval module, which uses a unified model of relevance feedback and pseudo-relevance feedback to perform a second, feedback retrieval pass on the ranked preliminary retrieval result and obtain the final retrieval result; wherein the relevance feedback is used to determine the vector set of relevant results in the preliminary retrieval result; the pseudo-relevance feedback is used to determine the vector set of non-relevant results in the preliminary retrieval result; and the user subject preference model, the vector set of relevant results, the vector set of non-relevant results, and the original query vector are combined to perform the feedback query.
CN201310549069.5A 2013-11-08 2013-11-08 Preference-based intelligent retrieval method and system Active CN103593425B (en)


Publications (2)

Publication Number Publication Date
CN103593425A CN103593425A (en) 2014-02-19
CN103593425B true CN103593425B (en) 2015-01-07

Family

ID=50083566



Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539930A (en) * 2009-04-21 2009-09-23 武汉大学 Search method of related feedback images


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐晓玲,何天云.基于主题偏好的个性化检索模型研究.《情报杂志》.2011,第30卷(第4期),第134-136页. *


Li et al. Research on hot news discovery model based on user interest and topic discovery

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant