CN103020164A - Semantic search method based on multi-semantic analysis and personalized sequencing - Google Patents


Info

Publication number
CN103020164A
Authority
CN
China
Prior art keywords
document
vector
user
word
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210488572XA
Other languages
Chinese (zh)
Other versions
CN103020164B (en)
Inventor
马应龙
张潇澜
于潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN201210488572.XA
Publication of CN103020164A
Application granted
Publication of CN103020164B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic search method based on multi-semantic analysis and personalized ranking, and belongs to the field of information retrieval. The technical scheme comprises the following steps: first, web documents are acquired from the Internet by crawler and related technologies and classified with a support vector machine, a term-vector library is built by a multi-semantic analysis method, and the multi-classification results are written into an index to form an index library; second, based on the term-vector library, the search keywords input by a user are combined into a query vector, which is matched by class against the index library to obtain an initial ranking result; finally, the initial ranking result is optimized according to the user's personalized information and historical access information, and the optimized result is returned to the user. The method produces a semantically rich term-vector library and index library, and by exploiting the personalized information and historical access information the search results better meet the user's search demand and the user's search satisfaction is improved.

Description

Semantic retrieval method based on multi-semantic analysis and personalized ranking
Technical field
The invention belongs to the field of information retrieval, and in particular relates to a semantic retrieval method based on multi-semantic analysis and personalized ranking.
Background technology
A search engine is a system that, according to certain strategies and using specific computer programs, gathers information from the Internet, organizes and processes it, provides retrieval services for users, and presents the information relevant to a user's search. Search engines arose to cope with the rapid growth of the volume of information on the Internet, and today they have become an indispensable way for people to obtain information from the network. However, the current mainstream keyword-based search engines, such as Google, Baidu, Bing and Yahoo, share some thorny problems. The results a user retrieves commonly contain a large number of irrelevant links; because of the diversity of the user population, a single result set cannot satisfy each user's specific requirements; the search process does not consider the semantic relatedness between words; and the search results are not organized effectively in any particular way, so users have to waste time and effort browsing and selecting.
Semantic search is a novel way of searching that differs from keyword-based search. In general, semantic search no longer sticks to the keywords of the user's input statement itself, but can capture fairly accurately the potential intention behind the input, and can therefore return to the user more accurately the results that best meet the user's demand; compared with traditional search it offers higher retrieval precision. Ramesh Singh and Myungjin Lee attempted to reorganize search results in their research to improve the user's search experience. Lien-Fu Lai and Huanhuan Cao used hidden Markov trees and other models to compute the degree of correlation between different results, thereby increasing the coverage of the search results. Fang Liu, Jaime Teevan and others proposed methods that use the historical access information of various users to perform personalized search in order to improve search precision. All of these studies have made suitable improvements in semantic search, but the personalization based on classifying user queries imposes relatively harsh conditions and controls the increase in time consumption poorly; moreover, they do not consider that different kinds of user-related information should carry different weights in the process. Therefore, the way the final search results are ranked is still unsatisfactory.
Summary of the invention
To address the problems that existing information retrieval exhibits in retrieval precision and user search experience, the present invention proposes a semantic retrieval method based on multi-semantic analysis and personalized ranking.
A semantic retrieval method based on multi-semantic analysis and personalized ranking is characterized in that it specifically comprises the following steps:
Step 1: Use crawler technology to obtain web documents from the Internet and manually classify a portion of them as the training model. Construct a term-vector library with the multi-semantic analysis method MSA, represent the web documents as vectors, and train a support vector machine (SVM) classifier on the document vectors of the training model; new web pages are then classified with this model. The class information of every web page is written into the index library as an attribute.
Step 2: Based on the term-vector library formed in Step 1, build a term vector for each search keyword entered by the user and combine them into the final query vector; perform a class-matching query between the query vector and the index library to obtain the initial web search result.
Step 3: Optimize the ranking of the initial retrieval result according to the user's personal customization information and historical access information, and return the final retrieval result to the user.
In Step 1, the term-vector library is constructed based on the multi-semantic analysis method MSA, and the classification results of the web documents are written into the index to form the index library; this specifically comprises the following steps:
Step 11: Construct the concept space. In the present invention the space has m dimensions.
The basic dimensions of the concept space are a set of class labels that can represent the information of the whole corpus. In general, m class labels extracted directly from the corpus classification tags constitute the m dimensions of the vector; the semantic information of each word in a web document is then described by an m-dimensional vector, called its term vector.
Step 12: Determine the term-vector component values.
The words are extracted from the web documents of the training model, and the value of each term-vector component is determined by all documents of the training model. Each component of the term vector is computed as:
w(c_i, t_j) = \sum_{k=1}^{|D|} H(c_i, d_k) \cdot \frac{\log_2(1 + tf(d_k, t_j))}{\log_2(1 + length(d_k))}    (1)

where t_j denotes the j-th word in the term-vector library; w(c_i, t_j) is the relation between word t_j and the i-th dimension c_i of its term vector, i.e. the i-th component of the term vector of t_j; |D| is the number of training documents; tf(d_k, t_j) is the frequency of word t_j in document d_k; H(c_i, d_k) is a discriminant function whose value is 1 if document d_k belongs to the field described by dimension c_i and 0 otherwise; length(d_k) is the length of document d_k, i.e. the number of words obtained after word segmentation and noise removal, with repeated occurrences counted, so length(d_k) >= n; and k indexes the documents.
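The computation of equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation; the document representation (a class label plus a token list) and the function name are assumptions made for the example.

import math

def term_vector_component(c_i, t_j, training_docs):
    # w(c_i, t_j) per equation (1).
    # training_docs: list of dicts such as
    #   {"label": "IT", "tokens": ["search", "engine", "search", ...]},
    # where "tokens" holds the words left after segmentation and noise removal.
    total = 0.0
    for d in training_docs:
        if d["label"] != c_i or not d["tokens"]:
            continue                                  # H(c_i, d_k) = 0, or empty document
        tf = d["tokens"].count(t_j)                   # tf(d_k, t_j)
        length = len(d["tokens"])                     # length(d_k), repeats counted
        total += math.log2(1 + tf) / math.log2(1 + length)
    return total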
Step 13: Unit normalization of the term vectors and formation of the term-vector library.
Each term vector is normalized so that its component values lie in [0, 1], giving better generality. The normalized term vectors together form the term-vector library. The normalization formula is:
w'(c_i, t_j) = w(c_i, t_j) / \sum_{i=1}^{m} w(c_i, t_j)    (2)

where the normalized term vector is denoted \vec{t_j} and w'(c_i, t_j) is its i-th component; the term-vector library then consists of the vectors

\vec{t_j} = (w'(c_1, t_j), w'(c_2, t_j), ..., w'(c_m, t_j))^T    (3)
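Building on the previous sketch, the unit normalization of equations (2) and (3) can be illustrated as follows; term_vector_component and the list-based vector representation are the hypothetical helpers introduced above.

def normalized_term_vector(t_j, class_labels, training_docs):
    # Equations (2)-(3): unit-normalize the raw components of word t_j.
    raw = [term_vector_component(c, t_j, training_docs) for c in class_labels]
    total = sum(raw)
    # A word that never appears under any class keeps the all-zero vector.
    return [w / total for w in raw] if total > 0 else raw

The term-vector library is then simply a mapping from each word of the training corpus to its normalized vector.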
Step 14: Obtain the weight of each word in a document by the TFIDF method and normalize these weights. TFIDF has been popular for many years and is proven to be one of the effective weighting methods; it does not consider class information, and the weights depend only on the overall state of the corpus, so it is highly general and can be applied to weighting words in multi-class text representation. The TFIDF weight is computed as:
weight(t_g, d_k) = TFIDF(t_g, d_k) = tf(t_g, d_k) \times \lg\frac{|D|}{|D'|}    (4)

where t_g is the g-th word of document d_k; weight(t_g, d_k) is the weight of word t_g in document d_k; D is the set of training documents and d_k its k-th document; |D| is the number of training documents; D' is the set of documents containing word t_g, and |D'| is the number of documents in D'.
The weights are likewise normalized so that the weight of each word of the segmented document lies in [0, 1]; the normalized weight of a word in the segmented document is computed as:
weight'(t_g, d_k) = weight(t_g, d_k) / \sum_{j=1}^{n} weight(t_j, d_k)    (5)

where weight'(t_g, d_k) is the normalized weight of word t_g in document d_k and n is the number of distinct words (segmentation types) in the document.
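A sketch of equations (4) and (5), under the assumption that lg denotes the base-10 logarithm and using the same document representation as in the earlier example:

import math

def tfidf_weight(t_g, d_k, training_docs):
    # Equation (4): weight(t_g, d_k) = tf(t_g, d_k) * lg(|D| / |D'|).
    tf = d_k["tokens"].count(t_g)
    df = sum(1 for d in training_docs if t_g in d["tokens"])      # |D'|
    return tf * math.log10(len(training_docs) / df) if df else 0.0

def normalized_tfidf_weights(d_k, training_docs):
    # Equation (5): normalize the weights of the distinct words of d_k to sum to 1.
    raw = {t: tfidf_weight(t, d_k, training_docs) for t in set(d_k["tokens"])}
    total = sum(raw.values())
    return {t: (w / total if total > 0 else 0.0) for t, w in raw.items()}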
Step 15: Form the document vectors. Once the TFIDF weights are available, the document vector of the multi-semantic analysis (MSA) is formed. The i-th component of the document vector \vec{d_k} corresponding to document d_k is computed as:

wd(c_i, d_k) = \sum_{g=1}^{n} w'(c_i, t_g) \times weight'(t_g, d_k)    (6)
The document vector of document d_k is written as:

\vec{d_k} = weight'(t_1, d_k)\vec{t_1} + weight'(t_2, d_k)\vec{t_2} + ... + weight'(t_n, d_k)\vec{t_n} = \sum_{g=1}^{n} weight'(t_g, d_k)\,\vec{t_g}    (7)

where n is the number of distinct words in the document and \vec{t_g} is the vector of word t_g in the term-vector library.
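Combining the normalized term vectors and the normalized TFIDF weights, equations (6) and (7) amount to a weighted sum; the helper names and the dictionary-based term-vector library are assumptions carried over from the earlier sketches.

def document_vector(d_k, training_docs, class_labels, term_vectors):
    # Equations (6)-(7): weighted sum of the term vectors of the words of d_k.
    # term_vectors maps a word to its normalized m-dimensional term vector
    # (equation (3)); words missing from the library contribute nothing.
    weights = normalized_tfidf_weights(d_k, training_docs)        # equation (5)
    m = len(class_labels)
    vec = [0.0] * m
    for t, w in weights.items():
        tv = term_vectors.get(t)
        if tv is None:
            continue
        for i in range(m):
            vec[i] += w * tv[i]                                   # wd(c_i, d_k), equation (6)
    return vec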
In this document vector, each component value directly represents the degree of correlation between the document and the corresponding dimension (class), so the vector carries strong semantics and is the basis of the matching query. The document vectors are then classified with the support vector machine over the m pre-defined class labels, the trained model serves as the classification criterion for new web pages, and the class of every web page is written into the index library as an attribute.
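As an illustration of this classification step, a minimal sketch with scikit-learn follows; the library choice, the linear kernel and the field name "category" are assumptions for the example and are not specified in the patent.

from sklearn.svm import SVC

def train_page_classifier(train_vectors, train_labels):
    # Fit an SVM on the manually labelled MSA document vectors.
    clf = SVC(kernel="linear")
    clf.fit(train_vectors, train_labels)
    return clf

def make_index_entry(clf, doc_vector, fields):
    # Attach the predicted class to the page's fields before the page is indexed.
    entry = dict(fields)
    entry["category"] = clf.predict([doc_vector])[0]
    return entry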
In Step 2, the class-matching query between the query vector and the index library comprises the following steps:
Step 21: Based on the term-vector library, represent the search keywords entered by the user as vectors.
Denote the set of search keywords as KEY = {key_1, key_2, ..., key_n}. For each keyword key_i, extract its term vector from the established term-vector library to obtain its vector form \vec{T_i}; all keywords then form the query-vector set {\vec{T_1}, \vec{T_2}, ..., \vec{T_n}}. If key_i does not exist in the term-vector library, \vec{T_i} = \vec{0}.
Step 22: On the basis of Step 21, form the query vector of the search keywords in the m-dimensional space according to:

\vec{Q} = \sum_{i=1}^{n} \vec{T_i} = (\vec{T_1} + \vec{T_2} + ... + \vec{T_n}) = (\alpha_1, \alpha_2, ..., \alpha_m)^T    (8)

The three largest components of \vec{Q} are denoted \alpha_p, \alpha_q, \alpha_r, their corresponding dimension classes are denoted c_p, c_q, c_r, and the weight vector of these classes, formed from the three components, is recorded; this weight vector is used later in the user-profile matching. A matching query is performed in the index library based on the three classes {c_p, c_q, c_r}: the web pages belonging to these three classes are filtered out, and the basic Lucene scoring algorithm is applied to obtain the initial ranking result.
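A sketch of equation (8) and of the selection of the three dominant classes; the dictionary-based term-vector library is the same assumption as in the earlier examples.

def build_query_vector(keywords, term_vectors, m):
    # Equation (8): sum the term vectors of the keywords (missing word -> zero vector).
    q = [0.0] * m
    for key in keywords:
        for i, value in enumerate(term_vectors.get(key, [0.0] * m)):
            q[i] += value
    return q

def top_three_classes(q, class_labels):
    # Return the three classes with the largest components and those components.
    top = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:3]
    return [class_labels[i] for i in top], [q[i] for i in top]

The returned classes play the role of {c_p, c_q, c_r} that restrict the index query, and the returned components can serve as the class weight vector used later in the user-profile matching.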
The basic Lucene scoring formula is:

score(q, d) = coord(q, d) \cdot queryNorm(q) \cdot \sum_{t \in q} \bigl( tf(t, d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t, d) \bigr)

where q is the retrieval request;
tf(t, d) is the frequency of term t in document d;
idf(t) is the inverse document frequency of term t;
t.getBoost() is the weight of each term in the query statement, which allows certain terms to be made more important in the query;
norm(t, d) is a normalization factor combining three parameters: (1) the document boost, where a larger value means the document is more important; (2) the field boost, where a larger value means the field is more important; and (3) lengthNorm(field), which decreases as the field contains more terms (i.e. the document is longer) and increases as the document gets shorter;
coord(q, d) is a coordination factor based on how many of the query terms appear in document d;
queryNorm(q) is the normalization value of the query, computed from the sum of the squared weights of the query terms.
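For illustration, the scoring formula can be approximated by a standalone Python function; this is not the actual Lucene library, and it simplifies norm(t, d) to a single document boost times a length norm. Classic Lucene additionally dampens the raw term frequency with a square root, which is kept here.

import math

def lucene_like_score(query_terms, doc_tokens, idf, boosts=None, doc_boost=1.0):
    # Simplified approximation of Lucene's classic scoring formula.
    # idf: dict term -> idf(t); boosts: dict term -> t.getBoost().
    boosts = boosts or {}
    matched = [t for t in query_terms if t in doc_tokens]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)                          # coord(q, d)
    query_norm = 1.0 / math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in query_terms) or 1.0)
    length_norm = 1.0 / math.sqrt(len(doc_tokens))                   # lengthNorm(field)
    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_tokens.count(t))                          # classic Lucene uses sqrt(tf)
        total += tf * idf.get(t, 0.0) ** 2 * boosts.get(t, 1.0) * doc_boost * length_norm
    return coord * query_norm * total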
In Step 3, the initial ranking result is first optimized according to the user's personal customization information, which specifically comprises the following steps:
Step 301: Collect the three kinds of personal customization information with the highest query frequency for the user: the first customized information u, the second customized information v and the third customized information s, and set the weights of these three kinds of personal customization information to A, B and E.
Step 302: Query matching when the user's customization information is determined. In this case the class of every piece of the user's personal information is known, so the basic Lucene score of each document in the initial ranking result is modified as follows:
I. If u, v and s are all 0, the document score is unchanged;
II. If at least one of u, v and s is not 0, then:

newscore = score \cdot (1 + A \cdot u + B \cdot v + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where
u = 1 if the class of the web page matches the first customized information, and 0 otherwise;
v = 1 if the class of the web page matches the user's second customized information, and 0 otherwise;
s = 1 if the class of the web page matches the user's third customized information, and 0 otherwise;
topscore is the highest score among the result documents and lastscore the lowest score.
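The re-scoring of Step 302 is a direct formula; a minimal sketch follows, with a guard (an added assumption) for the degenerate case where all documents share the same score.

def personalized_rescore(score, u, v, s, A, B, E, topscore, lastscore):
    # Step 302: u, v, s are the 0/1 match flags; A, B, E their weights.
    if u == 0 and v == 0 and s == 0:
        return score                                  # case I: score unchanged
    spread = topscore - lastscore
    relative_gap = (topscore - score) / spread if spread else 0.0
    return score * (1 + A * u + B * v + E * s) * (1 + relative_gap)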
Step 303: Query matching when the user's customization information is fuzzy.
When the personal customization information entered by the user does not belong to the given default category range, the entered information is looked up in the term-vector library to find the corresponding classes, yielding new term vectors. In the present invention, the term-vector set corresponding to the user's first customized information has a corresponding weight vector and corresponding classes c_1, c_2, c_3, and the term-vector set corresponding to the user's second customized information has a corresponding weight vector and corresponding classes c_4, c_5, c_6. The new class set is denoted C = {c_1, c_2, c_3} ∪ {c_4, c_5, c_6}. For each web document, if document d_k belongs to class c_i and c_i ∈ C, the document score of the web page becomes:

newscore = score \cdot (1 + A \cdot wu_i + B \cdot wv_i + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where wu_i and wv_i are the components of the respective weight vectors in the dimension corresponding to class c_i; if a weight vector has no dimension corresponding to class c_i, the corresponding value is 0.
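The fuzzy case differs only in that the 0/1 flags u and v are replaced by the weight-vector components looked up for the page's class; a sketch under the assumption that the two weight vectors are given as class-to-component dictionaries:

def fuzzy_personalized_rescore(score, doc_class, wu, wv, A, B, E, s, topscore, lastscore):
    # Step 303: wu, wv map a class to the corresponding weight-vector component.
    wu_i = wu.get(doc_class, 0.0)
    wv_i = wv.get(doc_class, 0.0)
    if wu_i == 0.0 and wv_i == 0.0:
        return score                                  # treated as "class not in C" (simplification)
    spread = topscore - lastscore
    relative_gap = (topscore - score) / spread if spread else 0.0
    return score * (1 + A * wu_i + B * wv_i + E * s) * (1 + relative_gap)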
In Step 3, the initial ranking result is further optimized according to the user's historical access information. The matching analysis of the user's historical access information optimizes the initial ranking according to the user's history of access records. Because pages that have been visited repeatedly play a very important role in later searches, and the pages selected most often by all users strongly indicate the search tendency of an individual user, the method uses the user's historical access information to optimize the initial ranking and promotes the ranks of pages highly correlated with the user. The proposed web-page re-ranking algorithm comprises the following steps:
Step 311: If document d is a history item or a hot link, execute the following step; otherwise skip it.
Step 312: Let the initial rank be r; the new rank of d is then:

r' = \frac{r}{s' \cdot \log(2 + n_1) + h \cdot \log(2 + n_2)}    (9)

where
s' is 1 if the document is a history record and 0 otherwise;
h is 1 if the document is a hot link and 0 otherwise;
n_1 is the number of times the user has clicked this history item;
n_2 is the number of clicks on the hot link.
It follows that the minimum value of r' is 0.
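A sketch of equation (9); the base of the logarithm is not specified in the text, so the natural logarithm is assumed here.

import math

def history_rerank(r, in_history, is_hot_link, n1, n2):
    # Equation (9): promote history items and hot links (a smaller rank value is better).
    # n1 is the user's click count on the history item, n2 the click count of the hot link.
    s = 1 if in_history else 0
    h = 1 if is_hot_link else 0
    denom = s * math.log(2 + n1) + h * math.log(2 + n2)
    return r / denom if denom else r                  # step 311: untouched if neither flag is set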
The present invention first optimizes the existing algorithms by adding multi-semantic analysis, producing a term-vector library and an index library with richer semantic information. It then performs a semantic analysis of the user's search keywords based on the term-vector library, matches the query against the index library, and forms the initial ranking result. Finally, it combines the user's personal customization information and historical access information and uses semantic analysis to optimize the initial ranking, so that the retrieval results better match the user's tendency and the user's retrieval experience is improved.
Description of drawings
Fig. 1 is the algorithm flow chart of the semantic retrieval method based on multi-semantic analysis and personalized ranking provided by the invention;
Fig. 2 is the comparison chart of retrieval precision for the three retrieval methods (LB, YH and OURS) provided by the embodiment of the invention.
Embodiment
The preferred embodiment is described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is only exemplary and is not intended to limit the scope or application of the invention.
The detailed process of the present invention is described below through a specific embodiment:
Step 1: Corpus preparation
Web pages are obtained from the Internet with crawler technology. About 6,000 recent web pages are crawled from major websites such as Sina (sina.com) and Zhongguancun Online (zol.com); a portion is selected as the training set and classified with the SVM. According to the sources and actual content of these pages, a direct derivation approach is adopted and the class labels sport, agriculture, automobile, IT, food, lady, finance and normal are finally determined, where normal covers pages that do not belong to any of the other labels; the training model is characterized by these class labels. The test set is then classified with this training model. If the method were put to commercial use, the class labels could be set at each level according to the ODP (Open Directory Project) taxonomy, because real search engines draw on a vast and broad range of web-page sources.
Step 2: Selection of comparison methods
Two representative retrieval methods, Lucene and YD (Yahoo Directory), are selected in this embodiment to compare retrieval precision with the proposed method.
2.1 Lucene: an index of the crawled pages is built and queried with Lucene Searcher, serving as the first comparison test.
2.2 Yahoo Directory: an online English directory-based search site whose results all carry class labels. The training model built from the crawled pages is also used to classify the pages retrieved from Yahoo Directory, serving as the second comparison test.
2.3 To satisfy most users, for each test keyword the method crawls the top 30 returned results, classifies them with the established training model, and reorganizes them according to the present invention during search; this serves as the test of the retrieval effect of the invention.
Lucene and Yahoo Directory are methods of information organization, processing and retrieval that currently receive much attention in industry, which is why the present invention compares the relevant indicators against these two methods.
Step 3: Setting the experimental comparison targets
Statistics show that fewer than 0.1% of users examine result pages beyond the first 100 results, and more than 80% of users browse only the first 30 result pages. Because the present invention performs a certain amount of screening, and to give users more room for selection, in the comparison with Lucene search the method takes the first 200 web pages of the initial search result for ranking optimization.
In this embodiment 7 users are randomly selected for the test. To measure retrieval effectiveness, an evaluation standard is set: the accuracy rate R. For each query, the top 10 documents of the search result are taken, and the accuracy rate R of the query is defined as:

R = D_r / 10

where D_r is the number of documents relevant to the search keyword. Averaging over repeated queries gives the retrieval precision of the method.
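Reading the definition as R = D_r / 10 (precision over the top ten results), the evaluation reduces to a small average; the relevance flags are assumed to come from the human assessors of the experiment.

def accuracy_at_10(relevance_flags):
    # R for one query: fraction of the 10 returned documents judged relevant.
    return sum(relevance_flags[:10]) / 10.0

def retrieval_precision(per_query_flags):
    # Average R over all test queries, i.e. the retrieval precision of a method.
    return sum(accuracy_at_10(flags) for flags in per_query_flags) / len(per_query_flags)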
Step 4: Experimental results and analysis
The search accuracy and retrieval precision of the three methods are compared in Fig. 2, where:
Lucene: Lucene Searcher
YD: Yahoo Directory
OURS: the method proposed by the invention
As can be seen from the figure, the proportion of returned web documents that match the user's search tendency is higher for the present invention than for the two other methods; its accuracy and retrieval precision are significantly better than those of Lucene search, and although Yahoo Directory comes close to the present invention in retrieval precision, it is still slightly weaker. This shows that the proposed semantic retrieval method with multi-semantic analysis and optimized ranking improves retrieval accuracy and precision, has a clear advantage over current search engines, comes closer to the user's search tendency, and improves the user's retrieval experience.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A semantic retrieval method based on multi-semantic analysis and personalized ranking, characterized in that it specifically comprises the following steps:
Step 1: using crawler technology to obtain web documents from the Internet and manually classifying a portion of them as the training model; constructing a term-vector library with the multi-semantic analysis method MSA; representing the web documents as vectors; training a support vector machine (SVM) classifier on the document vectors of the training model, so that new web pages are classified with this model; and writing the class information of every web page into the index library as an attribute;
Step 2: based on the term-vector library formed in Step 1, constructing a term vector for each search keyword entered by the user to form the final query vector, and performing a class-matching query between the query vector and the index library to obtain the initial web search result;
Step 3: optimizing the ranking of the initial retrieval result according to the user's personal customization information and historical access information, and returning the final retrieval result to the user.
2. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that, in said Step 1, the term-vector library is constructed based on the multi-semantic analysis method MSA and the classification results of the web documents are written into the index to form the index library, specifically comprising the steps:
Step 11: constructing the concept space, the space having m dimensions;
the basic dimensions of the concept space are a set of class labels that can represent the information of the whole corpus; in general, m class labels extracted directly from the corpus classification tags constitute the m dimensions of the vector, and the semantic information of each word in a web document is then described by an m-dimensional vector, called its term vector;
Step 12: determining the term-vector component values:
the words are extracted from the web documents of the training model, and the value of each term-vector component is determined by all documents of the training model; each component of the term vector is computed as:

w(c_i, t_j) = \sum_{k=1}^{|D|} H(c_i, d_k) \cdot \frac{\log_2(1 + tf(d_k, t_j))}{\log_2(1 + length(d_k))}

where t_j denotes the j-th word in the term-vector library; w(c_i, t_j) is the relation between word t_j and the i-th dimension c_i of its term vector, i.e. the i-th component of the term vector of t_j; |D| is the number of training documents; tf(d_k, t_j) is the frequency of word t_j in document d_k; H(c_i, d_k) is a discriminant function whose value is 1 if document d_k belongs to the field described by dimension c_i and 0 otherwise; length(d_k) is the length of document d_k, i.e. the number of words obtained after word segmentation and noise removal, with repeated occurrences counted, so length(d_k) >= n; and k indexes the documents;
Step 13: unit normalization of the term vectors and formation of the term-vector library:
each term vector is normalized so that its component values lie in [0, 1], giving better generality; the normalized term vectors form the term-vector library; the normalization formula is:

w'(c_i, t_j) = w(c_i, t_j) / \sum_{i=1}^{m} w(c_i, t_j)

where the normalized term vector is denoted \vec{t_j} and w'(c_i, t_j) is its i-th component; the term-vector library then consists of the vectors

\vec{t_j} = (w'(c_1, t_j), w'(c_2, t_j), ..., w'(c_m, t_j))^T
Step 14: obtaining the weight of each word in a document by the TFIDF method and normalizing these weights; TFIDF has been popular for many years and is proven to be one of the effective weighting methods; it does not consider class information and the weights depend only on the overall state of the corpus, so it is highly general and can be applied to weighting words in multi-class text representation; the TFIDF weight is computed as:

weight(t_g, d_k) = TFIDF(t_g, d_k) = tf(t_g, d_k) \times \lg\frac{|D|}{|D'|}

where t_g is the g-th word of document d_k; weight(t_g, d_k) is the weight of word t_g in document d_k; D is the set of training documents and d_k its k-th document; |D| is the number of training documents; D' is the set of documents containing word t_g, and |D'| is the number of documents in D';
the weights are likewise normalized so that the weight of each word of the segmented document lies in [0, 1]; the normalized weight is computed as:

weight'(t_g, d_k) = weight(t_g, d_k) / \sum_{j=1}^{n} weight(t_j, d_k)

where weight'(t_g, d_k) is the normalized weight of word t_g in document d_k and n is the number of distinct words (segmentation types) in the document;
Step 15: forming the document vectors; once the TFIDF weights are available, the document vector of the multi-semantic analysis (MSA) is formed; the i-th component of the document vector \vec{d_k} corresponding to document d_k is computed as:

wd(c_i, d_k) = \sum_{g=1}^{n} w'(c_i, t_g) \times weight'(t_g, d_k)

and the document vector of document d_k is written as:

\vec{d_k} = weight'(t_1, d_k)\vec{t_1} + weight'(t_2, d_k)\vec{t_2} + ... + weight'(t_n, d_k)\vec{t_n} = \sum_{g=1}^{n} weight'(t_g, d_k)\,\vec{t_g}

where n is the number of distinct words in the document and \vec{t_g} is the vector of word t_g in the term-vector library;
in this document vector each component value directly represents the degree of correlation between the document and the corresponding dimension (class), so the vector carries strong semantics and is the basis of the matching query; the document vectors are then classified with the support vector machine over the m pre-defined class labels, the trained model serves as the classification criterion for new web pages, and the class of every web page is written into the index library as an attribute.
3. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that, in said Step 2, the matching analysis of the query vector and the index library comprises the substeps:
Step 21: based on the term-vector library, representing the search keywords entered by the user as vectors;
the set of search keywords is denoted KEY = {key_1, key_2, ..., key_n}; for each keyword key_i, its term vector is extracted from the established term-vector library to obtain its vector form \vec{T_i}; all keywords then form the query-vector set {\vec{T_1}, \vec{T_2}, ..., \vec{T_n}}; if key_i does not exist in the term-vector library, \vec{T_i} = \vec{0};
Step 22: on the basis of Step 21, forming the query vector of the search keywords in the m-dimensional space according to:

\vec{Q} = \sum_{i=1}^{n} \vec{T_i} = (\vec{T_1} + \vec{T_2} + ... + \vec{T_n}) = (\alpha_1, \alpha_2, ..., \alpha_m)^T

the three largest components of \vec{Q} are denoted \alpha_p, \alpha_q, \alpha_r, their corresponding dimension classes are denoted c_p, c_q, c_r, and the weight vector of these classes, formed from the three components, is recorded; this weight vector is used later in the user-profile matching; a matching query is performed in the index library based on the three classes {c_p, c_q, c_r}, the web pages belonging to these three classes are filtered out, and the basic Lucene scoring algorithm is applied to obtain the initial ranking result.
4. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 3, characterized in that the basic Lucene scoring algorithm is:

score(q, d) = coord(q, d) \cdot queryNorm(q) \cdot \sum_{t \in q} \bigl( tf(t, d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t, d) \bigr)

where q is the retrieval request;
tf(t, d) is the frequency of term t in document d;
idf(t) is the inverse document frequency of term t;
t.getBoost() is the weight of each term in the query statement;
norm(t, d) is a normalization factor;
coord(q, d) is a coordination factor;
queryNorm(q) is the normalization value of the query, computed from the sum of the squared weights of the query terms.
5. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that optimizing the initial ranking result according to the user's personal customization information specifically comprises the following steps:
Step 301: collecting the three kinds of personal customization information with the highest query frequency for the user: the first customized information u, the second customized information v and the third customized information s, and setting the weights of these three kinds of personal customization information to A, B and E;
Step 302: query matching when the user's customization information is determined; in this case the class of every piece of the user's personal information is known, so the basic Lucene score of each document in the initial ranking result is modified as follows:
I. if u, v and s are all 0, the document score is unchanged;
II. if at least one of u, v and s is not 0, then:

newscore = score \cdot (1 + A \cdot u + B \cdot v + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where
u = 1 if the class of the web page matches the first customized information, and 0 otherwise;
v = 1 if the class of the web page matches the user's second customized information, and 0 otherwise;
s = 1 if the class of the web page matches the user's third customized information, and 0 otherwise;
topscore is the highest score among the result documents and lastscore the lowest score;
Step 303: query matching when the user's customization information is fuzzy:
when the personal customization information entered by the user does not belong to the given default category range, the entered information is looked up in the term-vector library to find the corresponding classes, yielding new term vectors; the term-vector set corresponding to the user's first customized information has a corresponding weight vector and corresponding classes c_1, c_2, c_3, and the term-vector set corresponding to the user's second customized information has a corresponding weight vector and corresponding classes c_4, c_5, c_6; the new class set is denoted C = {c_1, c_2, c_3} ∪ {c_4, c_5, c_6}; for each web document, if document d_k belongs to class c_i and c_i ∈ C, the document score of the web page becomes:

newscore = score \cdot (1 + A \cdot wu_i + B \cdot wv_i + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where wu_i and wv_i are the components of the respective weight vectors in the dimension corresponding to class c_i; if a weight vector has no dimension corresponding to class c_i, the corresponding value is 0.
6. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that optimizing the initial ranking result according to the user's historical access information specifically comprises the following steps:
Step 311: if document d is a history item or a hot link, executing the following step, and otherwise skipping it;
Step 312: letting the initial rank be r, the new rank of d is:

r' = \frac{r}{s' \cdot \log(2 + n_1) + h \cdot \log(2 + n_2)}

where
s' is 1 if the document is a history record and 0 otherwise;
h is 1 if the document is a hot link and 0 otherwise;
n_1 is the number of times the user has clicked this history item;
n_2 is the number of clicks on the hot link;
it follows that the minimum value of r' is 0.
CN201210488572.XA 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing Expired - Fee Related CN103020164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210488572.XA CN103020164B (en) 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing

Publications (2)

Publication Number Publication Date
CN103020164A true CN103020164A (en) 2013-04-03
CN103020164B CN103020164B (en) 2015-06-10

Also Published As

Publication number Publication date
CN103020164B (en) 2015-06-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150610

Termination date: 20151126

CF01 Termination of patent right due to non-payment of annual fee