CN103593425B

CN103593425B - Preference-based intelligent retrieval method and system

Info

Publication number: CN103593425B
Application number: CN201310549069.5A
Authority: CN
Inventors: 李鹏; 周育忠; 王庆红; 龚婷; 陈传夫; 王平; 冉从敬; 吴江
Original assignee: Wuhan University WHU; Research Institute of Southern Power Grid Co Ltd
Current assignee: Wuhan University WHU; CSG Electric Power Research Institute; Research Institute of Southern Power Grid Co Ltd
Priority date: 2013-11-08
Filing date: 2013-11-08
Publication date: 2015-01-07
Anticipated expiration: 2033-11-08
Also published as: CN103593425A

Abstract

The invention relates to the field of data retrieval and discloses a preference-based intelligent retrieval method and system. The method comprises: establishing a subject preference model of a user on the basis of the subject classification of the data, user characteristics and operation logs; performing query expansion using the user's subject preference model and the user's retrieval input to obtain a primary retrieval result; scoring the data for subject preference using the user's subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the primary retrieval result; and performing secondary feedback retrieval on the ranked primary retrieval result using a combined model of relevance feedback and pseudo relevance feedback to obtain a final retrieval result. The method and system determine the subject distribution of data resources through subject indexing, build retrieval vectors that better represent user requirements through subject-based query expansion, relevance feedback and related techniques, and provide the user with retrieval results that better meet the user's potential requirements.

Description

Preference-based intelligent retrieval method and system

Technical field

The present invention relates to the field of data retrieval, and in particular to a preference-based intelligent retrieval method and system.

Background

As society becomes increasingly information-driven and IT equipment develops rapidly, the volume of stored information is growing exponentially. At the same time, people demand more and more of information acquisition, and using retrieval technology to find the useful information they need quickly is becoming harder and harder. Traditional search engines retrieve by keyword, but even when several keywords are combined, a search over massive network information still returns millions of results, and finding the most needed information among them remains a heavy task for the user. The most critical problem in data retrieval today is therefore how to find, within the retrieval results, the information the user needs most.

In the prior art, a search engine or data retrieval system ranks retrieval results on the basis of certain statistics so that the more relevant results are presented to the user first. These statistics mainly include keyword frequency, degree of match and click-through rate. They are computed over the definite content of the data itself; although the processing volume is large, the content is well defined and the approach is relatively easy to implement. In addition, some more advanced systems optimize further, for example by classifying the data or expanding the keywords on the basis of statistical features of text semantics, so that the top-ranked results are as relevant as possible to the search keywords. However, these approaches rely mainly on the subject terms in a single query request submitted by the user (a combination of keywords, time, search scope and other conditions) and on the text of the data. Both kinds of information offer limited usable content, and the data by itself cannot reflect differences between users. Even with prior-art optimization, the retrieval results can hardly reflect the differing needs of different users in full, so the retrieval efficiency, precision and user satisfaction of existing approaches fall short of the ideal.

Summary of the invention

In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to optimize retrieval for the differences between users.

To solve this technical problem, in one aspect the present invention provides a preference-based intelligent retrieval method comprising the steps of:

S1, establishing a user subject preference model based on the subject classification of the data, user characteristics and operation logs;

S2, performing query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result;

S3, scoring the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the preliminary retrieval result;

S4, performing secondary feedback retrieval on the ranked preliminary retrieval result using a combined model of relevance feedback and pseudo relevance feedback to obtain the final retrieval result.

Preferably, in step S1, establishing the user subject preference model comprises the steps of:

establishing a subject vector space according to the subject classification;

determining a predefined subject preference vector of the user according to the user characteristics;

determining a historical subject preference vector of the user according to the operation logs;

weighting the predefined subject preference vector and the historical subject preference vector to obtain the user subject preference model.

Preferably, in step S2, performing the query expansion comprises the steps of:

calculating the probability distribution of the terms in the data set corresponding to the search terms in the user's retrieval input;

calculating the probability distribution of the terms in the data set corresponding to each subject term in the vector space of the user subject preference model;

measuring the mutual difference between the two probability distributions, selecting the subject terms whose distributions differ less, and adding them to the retrieval vector with a certain weight.

Preferably, in step S3, the personalized retrieval ranking comprises the steps of:

judging the score of each result in the preliminary retrieval result on the subjects preferred by the user by calculating the vector similarity between that result and the user subject preference model;

calculating a quality score of each result;

obtaining a final ranking score for each result by weighting the vector similarity, the score on the subjects preferred by the user and the quality score, and ranking the results in the preliminary retrieval result by the final ranking score.

Preferably, in step S4, the secondary feedback retrieval comprises the steps of:

determining the vector set of relevant results in the preliminary retrieval result using the relevance feedback;

determining the vector set of irrelevant results in the preliminary retrieval result using the pseudo relevance feedback;

combining the user subject preference model, the vector set of relevant results, the vector set of irrelevant results and the original query vector to perform the feedback query.

In another aspect, the present invention also provides a preference-based intelligent retrieval system comprising:

a user subject preference identification module, for establishing a user subject preference model based on the subject classification of the data, user characteristics and operation logs;

a query expansion module, for performing query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result;

a retrieval ranking module, for scoring the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and performing personalized, subject-preference-based ranking of the preliminary retrieval result;

a feedback retrieval module, for performing secondary feedback retrieval on the ranked preliminary retrieval result using a combined model of relevance feedback and pseudo relevance feedback to obtain the final retrieval result.

Preferably, the user subject preference identification module further comprises:

a subject vector space module, for establishing a subject vector space according to the subject classification;

a predefined preference module, for determining a predefined subject preference vector of the user according to the user characteristics;

a historical preference module, for determining a historical subject preference vector of the user according to the operation logs;

a preference model acquisition module, for weighting the predefined subject preference vector and the historical subject preference vector to obtain the user subject preference model.

Preferably, the query expansion module further comprises:

a search term distribution module, for calculating the probability distribution of the terms in the data set corresponding to the search terms in the user's retrieval input;

a subject term distribution module, for calculating the probability distribution of the terms in the data set corresponding to each subject term in the vector space of the user subject preference model;

an expansion module, for measuring the mutual difference between the two probability distributions, selecting the subject terms whose distributions differ less, and adding them to the retrieval vector with a certain weight.

Preferably, the retrieval ranking module further comprises:

a subject scoring module, for judging the score of each result in the preliminary retrieval result on the subjects preferred by the user by calculating the vector similarity between that result and the user subject preference model;

a quality scoring module, for calculating a quality score of each result;

a ranking module, for obtaining a final ranking score for each result by weighting the vector similarity, the score on the subjects preferred by the user and the quality score, and ranking the results in the preliminary retrieval result by the final ranking score.

Preferably, the feedback retrieval module further comprises:

a relevance feedback module, for determining the vector set of relevant results in the preliminary retrieval result using the relevance feedback;

a pseudo relevance feedback module, for determining the vector set of irrelevant results in the preliminary retrieval result using the pseudo relevance feedback;

a feedback module, for combining the user subject preference model, the vector set of relevant results, the vector set of irrelevant results and the original query vector to perform the feedback query.

The present invention provides a preference-based intelligent retrieval method and system that determine the subject distribution of data resources through subject indexing, build retrieval vectors that better represent user needs through subject-based query expansion, relevance feedback and related techniques, and, by combining an intelligent ranking model of the user's subject preference, provide the user with retrieval results that better meet the user's potential needs. The algorithm and system of the invention can identify the user's latent information needs as described by a professional thesaurus, and therefore achieve better retrieval results.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the preference-based intelligent retrieval method in one embodiment of the present invention;

Fig. 2 is a schematic flowchart of the subject-based query expansion algorithm in a preferred embodiment of the present invention;

Fig. 3 is a schematic flowchart of the subject-aware relevance feedback algorithm in a preferred embodiment of the present invention;

Fig. 4 is a schematic diagram of the module structure of the preference-based intelligent retrieval system in a typical application scenario of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are preferred modes of carrying out the invention; the description is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the claims, and all other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within that scope.

The prior art optimizes mainly the data being retrieved. At best it classifies and expands the retrieved data precisely and then matches the data against the subject terms in a single query request submitted by the user. Although this greatly improves retrieval precision, it does not reflect differences between users: identical query requests always yield identical results, which clearly departs from the reality that different users have different needs.

In the embodiments of the present invention, the user's potential needs are obtained by observing and analyzing the user's retrieval behavior over a longer period; user needs are combined with data classification, and explicit and implicit relevance feedback are integrated into retrieval optimization. This accurately reflects the differing needs of users and effectively improves the overall efficiency and precision of data retrieval.

Referring to Fig. 1, in one embodiment of the present invention the preference-based intelligent retrieval method comprises steps S1 to S4 as set out above.

The preferred forms of the above embodiment are explained further below. In the preferred embodiments that follow, to bring out the technical principles and practical effect of the invention, the data being retrieved is limited to scientific and technical information. Those skilled in the art will understand, however, that scientific and technical information is just one specific category of data, that the technical solution of the invention can plainly be applied directly to all kinds of digital information, and that the following preferred embodiments should not be regarded as limiting the invention.

Users have latent subject needs when acquiring data resources. For scientific and technical literature, users in different fields need markedly different things from the same keyword, which makes this implicit subject need all the more apparent. In a preferred embodiment of the present invention, step S1 uses a subject term category table to map user needs and to discover the user's preferences over the classification of document resources, providing a sound basis for intelligent retrieval. Subject preference is considered mainly from the following two aspects:

One, predefinition of the user's subject preference

Different users have different characteristics, many of which reflect the user's potential needs. Some of the user's subject preferences can therefore be predefined from user characteristics (such as the user's region, function or post responsibilities). For example, a user in a high-voltage testing post in the power industry has specific needs for document resources on power transformers, circuit breakers, instrument transformers and the like. Subject terms can thus be extracted from the documents of that post, combined with the subject terms of the post responsibilities, and mapped onto the standard subject categories as the user's predefined preference. More preferably, step S1 represents the user's subject preference with a vector space model:

First, the subject distribution is analyzed and an n-dimensional subject vector space [(k_1, w_1), (k_2, w_2), ..., (k_n, w_n)] is established, where k_i is the i-th subject, w_i is the user's preference for k_i, and i ∈ {1, 2, ..., n}.

Then, subject terms are extracted from the user characteristics (such as post responsibility subject terms and post documents), their frequencies are counted, and their probability distribution is calculated as p_{sub_i} = freq_{sub_i} / freq_{sub_total}, where freq_{sub_i} is the frequency of subject term sub_i and freq_{sub_total} is the total frequency of the subject term set.

Finally, after a certain systematic adjustment, p_{sub_i} is used to characterize the user's preference for each subject term sub_i, yielding the predefined user subject preference vector W_pre = (w'_1, w'_2, ..., w'_n), where w'_i, i = 1, 2, ..., n, is the user's predefined preference for subject k_i.
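The predefined preference calculation above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the subject terms have already been extracted from the user's post documents, and it treats the unspecified "adjustment" step as the identity. The function name `predefined_preference` and the sample data are hypothetical.

```python
from collections import Counter


def predefined_preference(extracted_terms, topics):
    """Return W_pre: one weight per subject, from subject-term frequencies.

    extracted_terms: subject terms taken from user characteristics (hypothetical input).
    topics: ordered list of the n subjects k_1..k_n of the subject vector space.
    """
    # Count only terms that map onto a subject of the vector space.
    counts = Counter(t for t in extracted_terms if t in topics)
    total = sum(counts.values())  # freq_sub_total
    if total == 0:
        return [0.0] * len(topics)
    # p_sub_i = freq_sub_i / freq_sub_total, used directly as w'_i (assumption).
    return [counts[k] / total for k in topics]


topics = ["power transformer", "circuit breaker", "instrument transformer"]
terms = ["power transformer", "circuit breaker", "power transformer", "relay"]
w_pre = predefined_preference(terms, topics)
print(w_pre)
```

Terms outside the subject vector space ("relay" in the sample) are ignored here; a real system would more likely map them to a standard subject through the category table.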

Two, discovering the user's subject preference from the user operation logs

Retrieval is only one part of a user's overall information-acquisition behavior. Related operations include clicking, downloading and collecting (bookmarking) documents in the system, all of which are recorded in the system log. The user's subject preference can therefore be mined from the large volume of the user's operation log information, providing a foundation for intelligent retrieval. Step S1 of the above method also establishes a complete operation log collection mechanism and uses the operation logs to determine the user's subject preference.

Specifically, the logs are collected and analyzed to obtain the set of documents the user has operated on, D_op = {d_op1, d_op2, ..., d_opN}. For each d_i ∈ D_op, the number of clicks, downloads, collections and other operations by the user is counted, different operations are given different weights, and the user's weighted access frequency for d_i is calculated. From the subject indexing of the documents, the distribution of d_i over the subject terms is obtained; combined with the access frequency of d_i, this gives the user's access frequency on each subject term, which is taken as the user's degree of subject preference and mapped into the subject vector space to obtain the user subject preference vector W_op = (w_1, w_2, ..., w_n).
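A minimal sketch of this log-based calculation is given below. The operation weights, the normalization to a unit sum, and all names (`OP_WEIGHTS`, `log_preference`, the sample log) are assumptions made for illustration; the text only states that different operations receive different weights.

```python
from collections import defaultdict

# Hypothetical operation weights; the source does not give concrete values.
OP_WEIGHTS = {"click": 1.0, "download": 2.0, "collect": 3.0}


def log_preference(log_entries, doc_topics, topics):
    """Return W_op from (doc_id, operation) log entries.

    doc_topics maps doc_id -> {subject: share}, i.e. the document's subject indexing.
    """
    # Weighted access frequency of each operated document d_i.
    doc_freq = defaultdict(float)
    for doc_id, op in log_entries:
        doc_freq[doc_id] += OP_WEIGHTS.get(op, 0.0)

    # Spread each document's access frequency over its subjects.
    topic_freq = defaultdict(float)
    for doc_id, freq in doc_freq.items():
        for topic, share in doc_topics.get(doc_id, {}).items():
            topic_freq[topic] += freq * share

    # Normalize over the subject vector space (an assumption, for comparability).
    total = sum(topic_freq[k] for k in topics)
    if total == 0:
        return [0.0] * len(topics)
    return [topic_freq[k] / total for k in topics]


log = [("d1", "click"), ("d1", "download"), ("d2", "collect")]
doc_topics = {"d1": {"A": 1.0}, "d2": {"A": 0.5, "B": 0.5}}
w_op = log_preference(log, doc_topics, ["A", "B"])
print(w_op)
```

In the sample, document d1 accumulates a weighted frequency of 3 (one click plus one download) and d2 also 3 (one collection), so subject A receives 4.5 and subject B 1.5 before normalization.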

Finally, the two subject preferences above are weighted to determine the user's subject preference W = α_1 W_pre + α_2 W_op, where α_1 and α_2 are the weights of the two vectors, preset or adjusted according to the desired emphasis. Note that the user preference obtained from log analysis changes over time and must be updated as the logs are updated.
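The weighted combination can be illustrated in a few lines. The default values of `alpha1` and `alpha2` below are placeholders, since the source leaves both weights to be preset or tuned; the function name is hypothetical.

```python
def combine_preferences(w_pre, w_op, alpha1=0.5, alpha2=0.5):
    """W = alpha1 * W_pre + alpha2 * W_op, computed element-wise."""
    if len(w_pre) != len(w_op):
        raise ValueError("preference vectors must have the same dimension")
    return [alpha1 * p + alpha2 * o for p, o in zip(w_pre, w_op)]


# Emphasize the log-based preference slightly more than the predefined one.
w = combine_preferences([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], alpha1=0.4, alpha2=0.6)
print(w)
```

Because W_op changes as new log entries arrive, a system built along these lines would recompute W periodically rather than store it once.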

The query request is the direct expression of the user's query need and also contains a latent subject need. To some extent this subject need reflects the user's abstraction and summary of the desired documents, and so better reflects what the user wants. Subject terms, as labels of document resources, reflect the core content and classification of a document and better express its essence. Considering both aspects, step S2 of the present invention selects subject terms for query expansion, which largely improves retrieval effectiveness; the algorithm flow is shown in Fig. 2.

If the user's retrieval input is itself a standard subject term, related subject terms can be found for query expansion through relations such as broader and narrower terms in the subject category table. Often, however, the query the user enters has no explicit association with the latent subject need. In that case the association can be established through the historically retrieved documents and the subject-indexed documents. As shown in Fig. 2, the basic idea is as follows:

Let the document set corresponding to the user's retrieval request Q be D_query = {d_q1, d_q2, ..., d_qN}. Segmenting each document in D_query into words yields a set of terms, denoted T_query = {t_q1, t_q2, ..., t_qN}. For each t_qi ∈ T_query, the probability p_{qt_i} = freq_{qt_i} / freq_{total} is calculated, giving the probability distribution of the set T_query corresponding to D_query, denoted F_{query} = (p_{qt_1}, p_{qt_2}, ..., p_{qt_N}), where freq_{qt_i} is the frequency of t_qi and freq_{total} is the sum of the frequencies of the terms in T_query.

For a subject term in the subject vector space, a document set can likewise be obtained through the subject indexing of the documents, denoted D_subject = {d_s1, d_s2, ..., d_sN}. Similarly, a term set is obtained from the document set, and calculating the corresponding term frequencies gives the probability distribution of the term set corresponding to D_subject, denoted

F_{subject} = (p_{st_1}, p_{st_2}, \ldots, p_{st_N}).
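Both distributions are relative term frequencies over a document set, which can be sketched as below. The sketch assumes the documents are already segmented into tokens and that F_query and F_subject are computed over one shared vocabulary so that their components line up; the function name and sample data are hypothetical.

```python
from collections import Counter


def term_distribution(tokenized_docs, vocabulary):
    """Return probabilities p_i = freq_i / freq_total over a fixed vocabulary."""
    counts = Counter(
        tok for doc in tokenized_docs for tok in doc if tok in vocabulary
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[t] / total for t in vocabulary]


vocabulary = ["transformer", "fault", "oil"]
query_docs = [["transformer", "fault"], ["transformer", "oil", "the"]]
f_query = term_distribution(query_docs, vocabulary)
print(f_query)
```

The same function applied to the documents indexed under a subject term gives F_subject for that subject term.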

With both probability distributions available, the subject terms most relevant to the search terms can be found by calculating the similarity of the distributions, and then used for subject term query expansion.

When calculating the similarity between the probability distributions of the two document groups corresponding to the search terms and a subject term, the Kullback-Leibler divergence (KL divergence for short, also called relative entropy) is preferably used.

D_{KL}(F_{subject} \| F_{query}) then gives the difference of F_{subject} relative to F_{query}, and the subject terms with the smaller differences are taken to build the query expansion:

D_{KL}(F_{subject} \| F_{query}) = \sum_{i} p_{st_i} \log \frac{p_{st_i}}{p_{qt_i}};
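As an illustration of the divergence above, the following sketch computes D_KL for two discrete distributions given as equal-length lists. The `eps` floor that guards against a zero denominator is an assumption added for numerical safety and is not part of the source formula.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


f_subject = [0.5, 0.25, 0.25]
f_query = [0.4, 0.4, 0.2]
print(kl_divergence(f_subject, f_query))
print(kl_divergence(f_query, f_subject))  # differs: KL divergence is not symmetric
```

The asymmetry, together with the blow-up when a query probability is zero, is one practical reason to prefer a smoothed, symmetric measure.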

To obtain a better query expansion effect, the distributions of query requests and subject terms over the system's document vectors were studied further and the above calculation was optimized accordingly: the Jensen-Shannon divergence is chosen for a smoothed calculation, and D_{JS}(F_{subject} \| F_{query}) is calculated to measure the mutual difference between F_{subject} and F_{query}:

D_{JS}(F_{subject} \| F_{query}) = \frac{1}{2} D_{KL}(F_{subject} \| R) + \frac{1}{2} D_{KL}(F_{query} \| R);

where

R = \frac{1}{2} (F_{subject} + F_{query}).

After the subject terms whose probability distributions differ less have been selected, they are added to the retrieval vector with a certain weight to build the expanded query vector and improve retrieval efficiency.
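The selection step can be sketched as follows: compute the Jensen-Shannon divergence between the query distribution and each subject term's distribution, then add the closest subject terms to the query vector. The `top_k` cutoff, the expansion `weight`, the dictionary representation of the query vector and the function names are all assumptions for illustration; the source only says subject terms with smaller differences are added "with a certain weight".

```python
import math


def kl(p, q):
    """D_KL(P || Q); safe here because q is the midpoint R, nonzero wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """D_JS = 1/2 D_KL(P || R) + 1/2 D_KL(Q || R), with R = (P + Q) / 2."""
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)


def expand_query(query_vector, f_query, subject_dists, top_k=2, weight=0.5):
    """Add the top_k subject terms whose distributions are closest to f_query."""
    ranked = sorted(
        subject_dists, key=lambda s: js_divergence(subject_dists[s], f_query)
    )
    expanded = dict(query_vector)
    for subject in ranked[:top_k]:
        expanded[subject] = expanded.get(subject, 0.0) + weight
    return expanded


f_query = [0.5, 0.5, 0.0]
subject_dists = {
    "power transformer": [0.5, 0.4, 0.1],
    "relay protection": [0.0, 0.1, 0.9],
}
expanded = expand_query({"transformer fault": 1.0}, f_query, subject_dists, top_k=1)
print(expanded)
```

Unlike the KL divergence, this measure is symmetric and stays finite when one distribution has zero entries, which is what makes it a smoothed alternative.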

In step S3 of the above method, weighting the ranking by the user's subject preference on top of document relevance ranking is the core of personalized retrieval ranking. The user's subject preference vector W is obtained from the user subject preference model. For the retrieved document set, the subject distribution vector V = (v_1, v_2, ..., v_n) of each document is obtained from its subject indexing. The score of a retrieved document on the subjects preferred by the user is then judged by calculating the vector similarity sim(V, W) of W and V: the higher the value of sim(V, W), the higher the document's preference score, where

sim(V, W) = \frac{\sum_{k=1}^{n} v_k \times w_k}{\sqrt{(\sum_{k=1}^{n} v_k^2)(\sum_{k=1}^{n} w_k^2)}}.
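The similarity above is the cosine of the angle between the two vectors. A minimal sketch follows, in which returning 0.0 for a zero vector is an assumption about how to handle documents with no subject indexing; the sample vectors are hypothetical.

```python
import math


def cosine_similarity(v, w):
    """sim(V, W) = (sum v_k * w_k) / sqrt((sum v_k^2) * (sum w_k^2))."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm else 0.0


user_pref = [0.6, 0.3, 0.1]   # W: user subject preference vector
doc_topics = [0.7, 0.2, 0.1]  # V: subject distribution of one retrieved document
print(cosine_similarity(doc_topics, user_pref))
```

Since both vectors have non-negative components, the score lies between 0 and 1, which makes it straightforward to combine with other normalized scores.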

Besides the user's subject preference, document quality is another important weighting indicator. Many factors bear on literature quality; here documents are scored on four of them: the citation factor of the paper, its download frequency, the rank of the journal in which it was published, and whether it is a self-built resource. The self-built factor reflects that the organization collects document resources in two ways, by purchasing them from vendors and by gathering them itself; resources gathered in-house by specialty have passed manual review and are therefore of higher quality. The weight of each factor is given in Table 1.

Factor	f_quote	f_download	f_periodical	f_self-built
Weight	0.45	0.15	0.2	0.2

Table 1: Literature quality evaluation factor weights

The score of each factor of a document is obtained by normalizing the relevant fields of the document metadata. After weighting, the quality score of the document is G_factor = 0.45 f_quote + 0.15 f_download + 0.2 f_periodical + 0.2 f_self-built.
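A sketch of the quality score, using the coefficients of the G_factor formula above, is given below. The source does not say how the metadata fields are normalized, so the max-scaling in `normalize`, the reference maxima and the sample document are assumptions; the function and dictionary names are hypothetical.

```python
# Coefficients taken from the G_factor formula above.
QUALITY_WEIGHTS = {"quote": 0.45, "download": 0.15, "periodical": 0.2, "self_built": 0.2}


def normalize(value, max_value):
    """Scale a raw metadata value into [0, 1]; max-scaling is an assumption."""
    if max_value <= 0:
        return 0.0
    return min(value / max_value, 1.0)


def quality_score(factors):
    """G_factor: weighted sum of already-normalized factor scores."""
    return sum(
        QUALITY_WEIGHTS[name] * factors.get(name, 0.0) for name in QUALITY_WEIGHTS
    )


doc = {
    "quote": normalize(30, 100),       # citation count against an assumed maximum
    "download": normalize(200, 1000),  # download count against an assumed maximum
    "periodical": 1.0,                 # top journal rank
    "self_built": 0.0,                 # not a self-built resource
}
g_factor = quality_score(doc)
print(g_factor)
```

Because the four weights sum to 1, G_factor stays in [0, 1] whenever each factor score does.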

Weighting the two scores above together with the document-query similarity score gives the final ranking score of a retrieved document, G_sort = β_1 G_query + β_2 sim(V, W) + β_3 G_factor, where G_query is the score returned by Lucene for the user's specific query, and β_1, β_2 and β_3 are the weights of the respective scores. The weight settings are determined from system usage and document metrics.
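The final ranking can be sketched as a weighted sum followed by a sort. The β values below are placeholders, since the source leaves them to be tuned, and the sketch assumes G_query has already been scaled to a range comparable with the other two scores (raw Lucene scores are not bounded to [0, 1]). All names and sample results are hypothetical.

```python
def final_score(g_query, topic_sim, g_factor, betas=(0.6, 0.25, 0.15)):
    """G_sort = beta1 * G_query + beta2 * sim(V, W) + beta3 * G_factor."""
    b1, b2, b3 = betas
    return b1 * g_query + b2 * topic_sim + b3 * g_factor


def rank(results, betas=(0.6, 0.25, 0.15)):
    """Sort result dicts (keys: id, g_query, sim, g_factor) by descending G_sort."""
    return sorted(
        results,
        key=lambda r: final_score(r["g_query"], r["sim"], r["g_factor"], betas),
        reverse=True,
    )


results = [
    {"id": "a", "g_query": 0.8, "sim": 0.1, "g_factor": 0.5},
    {"id": "b", "g_query": 0.7, "sim": 0.9, "g_factor": 0.5},
]
ranked = rank(results)
print([r["id"] for r in ranked])
```

In the sample, document "b" matches the query slightly less well than "a" but lies much closer to the user's preferred subjects, so it moves to the top.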

Relevance feedback, as a supplement to the retrieval request, can effectively improve retrieval accuracy. In step S4 of the above method, relevance feedback and pseudo relevance feedback are combined, and the scopes of relevant and irrelevant documents are effectively delimited through subject category classification and analysis of the user's operation logs, so that the feedback achieves a better effect. The algorithm flow of the relevance feedback is shown in Fig. 3:

After the primary retrieval, the user marks the retrieval results for relevance. From the user's marks, a relevant document vector set D_r and an irrelevant document vector set D_nr are established. Once the relevant and irrelevant documents are available, the relevance feedback retrieval vector can be built along the lines of the Rocchio algorithm:

\vec{q}_m = \gamma_1 \vec{q}_0 + \gamma_2 \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma_3 \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j;

where \vec{q}_0 is the original query vector, D_r and D_nr are the known relevant and irrelevant document sets, and γ_1, γ_2, γ_3 are the respective weights.
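The Rocchio update above can be sketched with plain lists as vectors. The default weights (1.0, 0.75, 0.15) are values commonly quoted in information retrieval textbooks, not values given in the source, and treating an empty document set as a zero centroid is an assumption.

```python
def centroid(vectors, dim):
    """Mean of a list of vectors; an empty list gives the zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(q0, relevant, non_relevant, g1=1.0, g2=0.75, g3=0.15):
    """q_m = g1 * q0 + g2 * centroid(D_r) - g3 * centroid(D_nr)."""
    dim = len(q0)
    c_rel = centroid(relevant, dim)
    c_non = centroid(non_relevant, dim)
    return [g1 * q0[i] + g2 * c_rel[i] - g3 * c_non[i] for i in range(dim)]


q_m = rocchio([1.0, 0.0], [[0.0, 1.0]], [[1.0, 1.0]], g1=1.0, g2=0.5, g3=0.5)
print(q_m)
```

The update pulls the query toward the centroid of the relevant documents and pushes it away from the centroid of the irrelevant ones.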

In the usage scenario of this system, however, applying the above formula directly does not give the best relevance feedback. The model is improved in two respects: how the relevant document set D_r and the irrelevant document set D_nr are defined and filtered, and how the feedback document vectors are combined with the subject preference vector to build the feedback query vector.

Because the user performs only a limited number of feedback marking operations after the primary retrieval, the user's retrieval history and subject interest distribution are used to help determine which documents are relevant and which are not. Documents whose relevance the user marks and judges directly constitute explicit relevance feedback; this part is the basis of relevance feedback and is given a higher weight in the calculation. For the documents in the top-N retrieval results that the user has not marked, the similarity between the document's subject vector and the user's preference subject vector is calculated; documents with high similarity are added to the relevant documents and those with low similarity to the irrelevant documents. When calculating the user's relevance feedback, these two groups of documents can take the preference-subject similarity score as their weight l_j. This eases the user's operating burden while effectively obtaining the document sets needed for feedback retrieval.

After the document scopes of D_r and D_nr are determined, D_r denotes the set of relevant document vectors and D_nr the set of irrelevant document vectors. For each document in these sets, its high-frequency terms and their frequencies are taken to build the document vector \vec{d}_j, whose components are the term frequencies freq_{t_i} in the document.

After the feedback document vectors are determined, their weights are adjusted further. Documents marked directly by the user are given weight 1; the weights of the other documents are calculated from the similarity score between the document subject vector and the user subject preference vector. The feedback documents are thus added to the feedback retrieval vector with their corresponding weights. The user's subject preference vector is also added to the feedback vector with weight δ; according to analysis of usage statistics, a δ between 0.2 and 0.3 works better. In addition, because the irrelevant documents are mainly selected automatically by the system from the documents the user has not marked, their uncertainty is high. To strengthen the stability of feedback retrieval, the most irrelevant document determined by similarity calculation is taken to represent D_nr in the calculation.

That is, only \arg\max_{\vec{d} \in D_{nr}} sim(\vec{d}, \vec{q}_0) from the irrelevant document set is used in the calculation.

Combining the above considerations gives the improved feedback query formula

\vec{q}_m = \vec{q}_0 + \sum_{\vec{d}_j \in D_r} l_j \vec{d}_j + \delta \cdot W - \arg\max_{\vec{d} \in D_{nr}} sim(\vec{d}, \vec{q}_0)

which is used to carry out the query expansion for feedback retrieval.

Here \vec{q}_0 is the original query vector, D_r and D_nr are the known relevant and irrelevant document sets, l_j is the weight of each relevant document, W is the user's subject preference vector and δ is the weight of W. The query expansion obtained from this formula is used for secondary feedback retrieval, improving retrieval precision and recall.
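The improved feedback formula can be sketched as follows, reading the formula literally: the single document of D_nr with the highest similarity to the original query is subtracted. The sketch assumes that the query, document and preference vectors are all expressed in one shared vector space, which the source does not spell out, and that the similarity is the cosine measure used for ranking. The names and sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def feedback_query(q0, relevant, non_relevant, w, delta=0.25):
    """q_m = q0 + sum(l_j * d_j) + delta * W - argmax_{d in D_nr} sim(d, q0).

    relevant: list of (l_j, d_j) pairs; l_j is 1 for documents the user marked.
    non_relevant: list of document vectors chosen automatically by the system.
    """
    q_m = list(q0)
    for l_j, d_j in relevant:
        q_m = [q + l_j * d for q, d in zip(q_m, d_j)]
    q_m = [q + delta * w_i for q, w_i in zip(q_m, w)]
    if non_relevant:
        # Single representative of D_nr, selected as written in the formula.
        representative = max(non_relevant, key=lambda d: cosine(d, q0))
        q_m = [q - r for q, r in zip(q_m, representative)]
    return q_m


q0 = [1.0, 0.0, 0.0]
relevant = [(1.0, [0.0, 1.0, 0.0]), (0.5, [0.0, 0.0, 2.0])]
non_relevant = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
q_m = feedback_query(q0, relevant, non_relevant, w=[0.0, 1.0, 0.0], delta=0.2)
print(q_m)
```

Subtracting one representative rather than the centroid of D_nr limits the influence of automatically selected, and therefore uncertain, irrelevant documents.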

Those of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments above can be carried out by hardware under the instruction of a program. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card or the like. Corresponding to the above method, the present invention therefore also discloses a preference-based intelligent retrieval system comprising the user subject preference identification module, query expansion module, retrieval ranking module and feedback retrieval module described above.

As the example of the present invention's typical apply scene, technique scheme is adopted to establish the subsystem of south electric network information center system, intelligent retrieval system makes full use of the user journal information of comprehensive collection, and descriptor category table, the Demand perference of degree of depth digging user, and as support, realize the demand of user individual retrieval, improve accuracy and the satisfaction of retrieval.System adopts Lucene4.3 as bottom retrieval technique, provides unified retrieval entrance.Designing user subject matter preferences identification module, related subject intelligent prompt and query expansion module, based on theme relevant feedback module, merge the personalized retrieval order module of theme, thus build individualized intelligent searching system.See Fig. 4, specifically from following some carry out system module design:

(1) User subject preference identification module: the system analyzes the user operation logs, counts operations such as clicks, downloads and favorites by subject category, and computes an access popularity score for each subject, weighted by operation type, as the user's preference for that subject. This computation involves log analysis over large data volumes, which a single machine can hardly support. The system therefore uses the Hadoop platform and performs the log analysis with distributed MapReduce computation.

(2) Query expansion module: when a user submits a retrieval request, the system uses the ICTCLAS word segmenter to segment the query statement. The relevance between the segmented query terms and the subject terms is measured with Jensen-Shannon divergence, and the subject terms with high relevance are used for query expansion to build a new retrieval vector. These subject terms can also be shown to the user as suggestions, helping the user express the retrieval need more precisely.

(3) Retrieval ranking module: the system provides several ranking interfaces. In the comprehensive ranking, the relevance between a document and the query terms is the basis of the ranking. The user's subject preference is taken into account and expressed as a subject preference vector. The distance between the document's vector over the subjects and the user's subject preference vector is computed as the document's user preference score and added to the overall weighted score. In addition, a quality score is computed for each document: the average citation count, the download frequency, the journal impact factor and similar indicators are each normalized and multiplied by their respective weights to give the document quality score, which is then added to the overall weighted score with a certain weight.

(4) Feedback retrieval module: for the first ranked result list, the subject-based relevance feedback module lets the user mark which documents are relevant. Unmarked documents are collected from the result pages the user has browsed and serve as the initial non-relevant documents. Then, according to the probability results of the user log analysis, documents of uncertain relevance are filtered out of this set. With the relevant and non-relevant documents thus selected, feedback query expansion is performed by the above algorithm and the secondary feedback retrieval is carried out, focusing further on the results the user most wants.
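Modules (1) to (3) can be sketched on a single machine as follows; this is a simplified stand-in, not the Hadoop/MapReduce and Lucene implementation described above. The operation weights in `ACTION_WEIGHTS`, the default `betas`, the use of cosine similarity for sim(V, W), and all function names are assumptions made for illustration. The combined score follows the form G_sort = β₁·G_query + β₂·sim(V, W) + β₃·G_factor stated in the claims.

```python
import math
from collections import defaultdict

# Hypothetical weights per operation type; the source gives no concrete values.
ACTION_WEIGHTS = {"click": 1.0, "download": 3.0, "favorite": 5.0}

def topic_preference(log_records):
    """Module (1): sum operation weights per subject from (subject, action) pairs."""
    scores = defaultdict(float)
    for subject, action in log_records:
        scores[subject] += ACTION_WEIGHTS.get(action, 0.0)
    return dict(scores)

def js_divergence(p, q):
    """Module (2): Jensen-Shannon divergence (log base 2) of two term distributions.

    p and q map terms to probabilities; a smaller value means the
    distributions are closer, i.e. the subject term is more relevant.
    """
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m; 0 * log(0) counts as 0.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def cosine(v, w):
    """Cosine similarity of two equal-length vectors (assumed form of sim(V, W))."""
    dot = sum(x * y for x, y in zip(v, w))
    norm = math.sqrt(sum(x * x for x in v)) * math.sqrt(sum(y * y for y in w))
    return dot / norm if norm > 0 else 0.0

def sort_score(g_query, v, w, g_factor, betas=(0.6, 0.3, 0.1)):
    """Module (3): G_sort = b1 * G_query + b2 * sim(V, W) + b3 * G_factor."""
    b1, b2, b3 = betas
    return b1 * g_query + b2 * cosine(v, w) + b3 * g_factor

# Illustrative usage with made-up log records.
logs = [("grid", "click"), ("grid", "download"), ("relay", "click")]
prefs = topic_preference(logs)  # {"grid": 4.0, "relay": 1.0}
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no terms in common), which makes it straightforward to choose a threshold for selecting subject terms.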

The above description shows and describes several preferred embodiments of the present invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein. These forms should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be changed within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims

1. A preference-based intelligent retrieval method, characterized in that the method comprises the steps of:

S1, establishing a user subject preference model based on the subject classification of data, user characteristics and operation logs; wherein a subject vector space is established according to the subject classification, a predefined subject preference vector of the user is determined according to the user characteristics, a historical subject preference vector of the user is determined according to the operation logs, and the predefined subject preference vector and the historical subject preference vector are weighted to obtain the user subject preference model;

S2, performing query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result; wherein the query expansion comprises: calculating the probability distribution of each lexical item in the data set corresponding to the query terms in the user's retrieval input; calculating the probability distribution of each lexical item in the data set corresponding to each subject term in the vector space of the user subject preference model; measuring the difference between these two probability distributions; and selecting the subject terms with smaller distribution differences and adding them to the retrieval vector with a certain weight;

S3, scoring the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and performing personalized retrieval ranking of the preliminary retrieval result based on subject preference; wherein the ranking score of a document in the preliminary retrieval result is: G_sort = β₁·G_query + β₂·sim(V, W) + β₃·G_factor; G_query is the score returned by Lucene for a specific user query, β₁, β₂ and β₃ are the weights of the respective scores, G_factor is the weighted document quality score, and sim(V, W) is the vector similarity between the user's subject preference vector W and the subject distribution vector V of each document, given by

where n is the dimension of the vectors W and V, and v_k and w_k denote the k-th elements of V and W respectively; each result in the preliminary retrieval result is then ranked according to the ranking score;

S4, performing secondary feedback retrieval on the ranked preliminary retrieval result using a unified model of relevance feedback and pseudo relevance feedback to obtain the final retrieval result; wherein the relevance feedback is used to determine the vector set of relevant results in the preliminary retrieval result; the pseudo relevance feedback is used to determine the vector set of non-relevant results in the preliminary retrieval result; and the user subject preference model, the vector set of relevant results, the vector set of non-relevant results and the original query vector are combined to perform the feedback query.

2. A preference-based intelligent retrieval system, characterized in that the system comprises:

a user subject preference identification module, configured to establish a user subject preference model based on the subject classification of data, user characteristics and operation logs; wherein a subject vector space is established according to the subject classification, a predefined subject preference vector of the user is determined according to the user characteristics, a historical subject preference vector of the user is determined according to the operation logs, and the predefined subject preference vector and the historical subject preference vector are weighted to obtain the user subject preference model;

a query expansion module, configured to perform query expansion using the user subject preference model and the user's retrieval input to obtain a preliminary retrieval result; wherein the query expansion comprises: calculating the probability distribution of each lexical item in the data set corresponding to the query terms in the user's retrieval input; calculating the probability distribution of each lexical item in the data set corresponding to each subject term in the vector space of the user subject preference model; measuring the difference between these two probability distributions; and selecting the subject terms with smaller distribution differences and adding them to the retrieval vector with a certain weight;

a retrieval ranking module, configured to score the data for subject preference using the user subject preference model and the distribution of the data over the subjects, and to perform personalized retrieval ranking of the preliminary retrieval result based on subject preference; wherein the ranking score of a document in the preliminary retrieval result is: G_sort = β₁·G_query + β₂·sim(V, W) + β₃·G_factor; G_query is the score returned by Lucene for a specific user query, β₁, β₂ and β₃ are the weights of the respective scores, G_factor is the weighted document quality score, and sim(V, W) is the vector similarity between the user's subject preference vector W and the subject distribution vector V of each document, given by

a feedback retrieval module, configured to perform secondary feedback retrieval on the ranked preliminary retrieval result using a unified model of relevance feedback and pseudo relevance feedback to obtain the final retrieval result; wherein the relevance feedback is used to determine the vector set of relevant results in the preliminary retrieval result; the pseudo relevance feedback is used to determine the vector set of non-relevant results in the preliminary retrieval result; and the user subject preference model, the vector set of relevant results, the vector set of non-relevant results and the original query vector are combined to perform the feedback query.