CN103020164A - Semantic search method based on multi-semantic analysis and personalized sequencing - Google Patents

Semantic search method based on multi-semantic analysis and personalized sequencing Download PDF

Info

Publication number
CN103020164A
CN103020164A (application CN201210488572.XA; granted as CN103020164B)
Authority
CN
China
Prior art keywords
document
vector
user
word
Prior art date
Application number
CN201210488572XA
Other languages
Chinese (zh)
Other versions
CN103020164B (en)
Inventor
马应龙
张潇澜
于潇
Original Assignee
华北电力大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华北电力大学 filed Critical 华北电力大学
Priority to CN201210488572.XA priority Critical patent/CN103020164B/en
Publication of CN103020164A publication Critical patent/CN103020164A/en
Application granted granted Critical
Publication of CN103020164B publication Critical patent/CN103020164B/en


Abstract

The invention discloses a semantic search method based on multi-semantic analysis and personalized ranking, and belongs to the field of information retrieval. The technical scheme comprises the following steps: first, web documents are acquired from the Internet by crawler and related technologies, the web documents are classified with a support vector machine, a term-vector library is built by a multi-semantic analysis method, and the multi-class results are written into an index to form an index library; second, based on the term-vector library, the search keywords input by a user are formed into a query vector, which is matched by class against the index library to obtain an initial ranking result; finally, the initial ranking result is optimized according to the user's personalized information and historical access information, and the optimized result is returned to the user. The method forms a term-vector library and an index library rich in semantics; through the personalized information and historical access information, the search results better meet the user's search demand, and the user's search satisfaction is improved.

Description

A semantic retrieval method based on multi-semantic analysis and personalized ranking

Technical field

The invention belongs to the field of information retrieval, and in particular relates to a semantic retrieval method based on multi-semantic analysis and personalized ranking.

Background technology

A search engine is a system that, according to certain strategies and using specific computer programs, gathers information from the Internet, organizes and processes it, provides retrieval services to users, and presents the information relevant to a user's query. Search engines arose to cope with the rapid growth of the volume of information on the Internet, and today they have become an indispensable means of obtaining information from the network. However, the current mainstream keyword-based search engines, such as Google, Baidu, Bing and Yahoo, share some thorny problems. Search results commonly contain a large number of irrelevant links; because of the diversity of the user population, a single result list cannot satisfy each user's particular needs; the search process does not consider the semantic relevance between words; and the results are not organized effectively in any way, so users must waste time and energy browsing and selecting.

Semantic search is a novel way of searching, distinct from keyword-based search. In general, semantic search no longer sticks to the literal keywords of the user's input query but tries to capture, fairly accurately, the latent intent behind the input, and can therefore return to the user the results that best match the user's demand; compared with traditional search it offers higher retrieval precision. Ramesh Singh and Myungjin Lee attempted in their research to reorganize search results to improve the user's search experience. Lien-Fu Lai and Huanhuan Cao used hidden Markov trees and other models to compute the relatedness between different results, thereby increasing the coverage of the search results. Fang Liu, Jaime Teevan and others proposed methods that exploit the historical access information of various users to perform personalized search and improve its precision. All of these studies improved some aspect of semantic search, but their conditions for personalization based on query classification are relatively harsh, and the growth of time consumption is poorly controlled; moreover, they do not give different weights to the different kinds of user-related information involved. The ranking of the final search results therefore remains unsatisfactory.

Summary of the invention

Aiming at the problems of existing information retrieval in retrieval precision and user search experience, the present invention proposes a semantic retrieval method based on multi-semantic analysis and personalized ranking.

A semantic retrieval method based on multi-semantic analysis and personalized ranking is characterized by comprising the following steps:

Step 1: use crawler technology to obtain web documents from the Internet, take a manually classified part of them as the training model, construct a term-vector library with the multi-semantic analysis method MSA, represent the web documents as vectors, feed the training model into a support vector machine (SVM) classifier to train on the document vectors, and classify new web pages with the trained SVM model; write the class information of all web pages into the index library as an attribute.

Step 2: based on the term-vector library formed in step 1, build a term vector for each search keyword input by the user to form the final query vector, perform a class-matching query between the query vector and the index library, and obtain the initial web search result.

Step 3: optimize the ranking of the initial retrieval result according to the user's personalized customization information and historical access information, and return the final retrieval result to the user.

In step 1, the process of constructing the term-vector library based on the multi-semantic analysis method MSA, writing the classification results of the web documents into the index and forming the index library specifically comprises the following steps:

Step 11: construct the concept space; in the present invention the space is set to m dimensions.

The basic dimensions of the concept space are a set of class labels that can represent the information of the whole corpus; generally, m class labels extracted directly from the classification tags of the corpus constitute the m dimensions of the vector. The semantic information of each word in a web document is then described by an m-dimensional vector, called a term vector.

Step 12: determination of the term-vector component values.

The words are extracted from the web documents of the training model, and the size of each component of a term vector is determined by all documents of the training model. The formula for each component of the term vector is:

w(c_i, t_j) = \sum_{k=1}^{|D|} H(c_i, d_k) \frac{\log_2(1 + tf(d_k, t_j))}{\log_2(1 + length(d_k))}    (1)

where t_j denotes the j-th word in the term-vector library; w(c_i, t_j) denotes the relation between word t_j and the i-th dimension c_i of its term vector, i.e. the i-th component of the term vector of t_j; |D| is the number of training documents; tf(d_k, t_j) is the frequency of word t_j in document d_k; H(c_i, d_k) is a discriminant function whose value is 1 if document d_k belongs to the field described by dimension c_i and 0 otherwise; length(d_k) is the length of document d_k, i.e. the number of words obtained after word segmentation and denoising, counting repeated occurrences, so that length(d_k) ≥ n; k indexes the documents.
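As a concrete illustration, formula (1) can be sketched in Python; the corpus layout (token lists keyed by document id, with a separate class-label map) is an assumption of this example, not part of the patent:

```python
import math

def term_component(docs, class_labels, term, target_class):
    """Sketch of Eq. (1): affinity w(c_i, t_j) of a term to one class,
    summed over the training documents that belong to that class."""
    total = 0.0
    for doc_id, tokens in docs.items():
        if class_labels[doc_id] != target_class:   # H(c_i, d_k) = 0
            continue
        tf = tokens.count(term)                    # tf(d_k, t_j)
        length = len(tokens)                       # length(d_k), repeats counted
        total += math.log2(1 + tf) / math.log2(1 + length)
    return total
```

A word occurring twice in a three-word document of the target class contributes log2(3)/log2(4) to its component for that class.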

Step 13: unit normalization of the term vectors and formation of the term-vector library.

The term vectors are unit-normalized so that each component falls in [0, 1], giving better generality. The normalized term vectors together form the term-vector library. The normalization formula is:

w'(c_i, t_j) = \frac{w(c_i, t_j)}{\sum_{i=1}^{m} w(c_i, t_j)}    (2)

where w'(c_i, t_j) is the i-th component of the normalized term vector. The term vector of word t_j in the library is then:

\vec{t_j} = \big(w'(c_1, t_j), w'(c_2, t_j), \ldots, w'(c_m, t_j)\big)^T    (3)
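The normalization of Eq. (2) amounts to dividing each raw component by the component sum; a minimal sketch (the all-zero guard is an added assumption, for a term absent from every class):

```python
def term_vector(raw_components):
    """Sketch of Eq. (2): divide each raw component w(c_i, t_j) by the
    component sum so every component lies in [0, 1] and they sum to 1."""
    total = sum(raw_components)
    if total == 0:
        return [0.0] * len(raw_components)   # term absent from every class
    return [w / total for w in raw_components]
```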

Step 14: obtain the weight of each word in a document by the TFIDF method and unit-normalize these weights. The TFIDF weighting method has been popular for many years and is proved to be one of the effective weighting methods; it does not consider class information, and the weights depend only on the overall condition of the corpus, so it is very general and can be applied to determining word weights in multi-class text representation. The TFIDF weight is computed as:

weight(t_g, d_k) = TFIDF(t_g, d_k) = tf(t_g, d_k) \times \lg\frac{|D|}{|D'|}    (4)

where t_g is the g-th segmented word of document d_k; weight(t_g, d_k) is the weight of word t_g in document d_k; D is the set of training documents and d_k its k-th document; |D| is the number of training documents; D' is the set of documents containing word t_g, and |D'| is the number of documents in D'.

The weights are likewise unit-normalized, so that the weight of each word of the segmented document falls in [0, 1]:

weight'(t_g, d_k) = \frac{weight(t_g, d_k)}{\sum_{g=1}^{n} weight(t_g, d_k)}    (5)

where weight'(t_g, d_k) is the normalized weight of word t_g in document d_k, and n is the number of distinct segmented words of the document.
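Eqs. (4) and (5) together can be sketched as follows; lg is read as log base 10, and the guard for a term contained in no corpus document is an assumption of the example:

```python
import math

def tfidf(term, doc_tokens, corpus):
    """Sketch of Eq. (4): tf(t_g, d_k) * lg(|D| / |D'|), where D' is
    the subset of corpus documents containing the term."""
    df = sum(1 for d in corpus if term in d)          # |D'|
    if df == 0:
        return 0.0                                    # guard added for the sketch
    return doc_tokens.count(term) * math.log10(len(corpus) / df)

def normalized_weights(doc_tokens, corpus):
    """Sketch of Eq. (5): each distinct word's TFIDF weight divided by
    the sum over the document's n distinct words, giving weights in [0, 1]."""
    raw = {t: tfidf(t, doc_tokens, corpus) for t in set(doc_tokens)}
    total = sum(raw.values())
    return {t: (w / total if total else 0.0) for t, w in raw.items()}
```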

Step 15: formation of the document vector. Once the weights are represented by the TFIDF method, the multi-semantic analysis (MSA) document vector is formed. The i-th component of the document vector \vec{d_k} corresponding to document d_k is computed as:

wd(c_i, d_k) = \sum_{g=1}^{n} w'(c_i, t_g) \times weight'(t_g, d_k)    (6)

The document vector of document d_k is written as:

\vec{d_k} = weight'(t_1, d_k) \times \vec{t_1} + weight'(t_2, d_k) \times \vec{t_2} + \ldots + weight'(t_n, d_k) \times \vec{t_n} = \sum_{g=1}^{n} weight'(t_g, d_k) \times \vec{t_g}    (7)

where n is the number of distinct segmented words of the document and \vec{t_g} is the vector form of t_g in the term-vector library.

In this document vector, each component directly represents the relevance of the document to the corresponding dimension (class); it is strongly semantic and is the basis of the matching query. Afterwards, using the m predefined class labels, the document vectors are classified with the support vector machine technique, which serves as the classification criterion for new web pages, and the class of every web page is written into the index library as an attribute.
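The construction of Eqs. (6)–(7) — the document vector as the weight'-weighted sum of the unit term vectors — can be sketched as:

```python
def document_vector(term_vectors, weights):
    """Sketch of Eq. (7): the document vector is the weight'-weighted
    sum of the unit term vectors of the document's n distinct words.
    `term_vectors` maps word -> m-dimensional list; `weights` maps
    word -> weight'(t_g, d_k)."""
    m = len(next(iter(term_vectors.values())))
    vec = [0.0] * m
    for term, w in weights.items():
        tv = term_vectors.get(term)
        if tv is None:                   # word missing from the library
            continue
        for i in range(m):
            vec[i] += w * tv[i]          # Eq. (6), component by component
    return vec
```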

In step 2, the class-matching query between the query vector and the index library comprises the following steps:

Step 21: based on the term-vector library, represent the search keywords input by the user as vectors.

Denote the set of search keywords as KEY = {key_1, key_2, ..., key_n}. The term vector of each word is extracted from the established term-vector library to build the vector form \vec{T_i} of each keyword key_i; all keywords then form the query-vector set {\vec{T_1}, \vec{T_2}, ..., \vec{T_n}}. When key_i does not exist in the term-vector library, \vec{T_i} is the zero vector.

Step 22: on the basis of step 21, form the query vector of the search keywords in the m-dimensional space:

\vec{Q} = \sum_{i=1}^{n} \vec{T_i} = (\vec{T_1} + \vec{T_2} + \ldots + \vec{T_n}) = (\alpha_1, \alpha_2, \ldots, \alpha_m)^T    (8)

The three largest components of the query vector \vec{Q} are denoted α_p, α_q, α_r; their corresponding dimension classes are denoted c_p, c_q, c_r; and the class weight vector is denoted (α_p, α_q, α_r). This weight vector is used later in the user-profile matching. Based on these three classes {c_p, c_q, c_r}, a matching query is performed in the index library, the web pages belonging to these three classes are filtered out, and the Lucene basic scoring algorithm is applied to obtain the initial ranking result.
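Steps 21–22 can be sketched as follows; returning the three largest class indices alongside the full query vector mirrors the selection of α_p, α_q, α_r above:

```python
def query_vector_and_classes(keywords, term_vectors, m):
    """Sketch of Eq. (8): sum the term vectors of the query keywords
    (a keyword absent from the library contributes the zero vector),
    then pick the three classes with the largest components."""
    q = [0.0] * m
    for kw in keywords:
        for i, value in enumerate(term_vectors.get(kw, [0.0] * m)):
            q[i] += value
    top3 = sorted(range(m), key=lambda i: q[i], reverse=True)[:3]
    return q, top3, [q[i] for i in top3]   # (alpha_p, alpha_q, alpha_r)
```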

The Lucene basic scoring formula is:

score(q, d) = coord(q, d) \cdot queryNorm(q) \cdot \sum_{t \in q} \big( tf(t, d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t, d) \big)

where q is the retrieval demand;

tf(t, d) is the frequency with which term t occurs in document d;

idf(t) is the inverse document frequency of term t;

t.getBoost() is the weight of each word in the query statement; a word can be set as more important in the query;

norm(t, d) is a normalization factor comprising three parameters: (1) document boost: the larger this value, the more important the document; (2) field boost: the larger this value, the more important the field; (3) lengthNorm(field): the more terms the field contains, i.e. the longer the document, the smaller this value; the shorter the document, the larger this value;

coord(q, d) is the coordination factor, computed from the number of query terms contained in document d;

queryNorm(q) is the normalization value of a query, computed from the sum of the squared weights of the query terms.
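For illustration only, a heavily simplified re-implementation of the scoring shape quoted above; the tf, idf and lengthNorm sub-formulas imitate Lucene's classic TFIDFSimilarity defaults, queryNorm and the document/field boosts are omitted, and the df+1 smoothing is an assumption, so the numbers will not match a real Lucene index:

```python
import math

def lucene_score(query_terms, doc, corpus, boosts=None):
    """Simplified sketch of the classic Lucene practical scoring function."""
    boosts = boosts or {}
    matched = [t for t in query_terms if t in doc]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)        # coord(q, d)
    total = 0.0
    for t in matched:
        tf = math.sqrt(doc.count(t))               # tf(t, d)
        df = sum(1 for d in corpus if t in d)
        idf = 1 + math.log(len(corpus) / (df + 1)) # idf(t), smoothed
        norm = 1 / math.sqrt(len(doc))             # lengthNorm only
        total += tf * idf ** 2 * boosts.get(t, 1.0) * norm
    return coord * total
```

A document matching more of the query terms scores higher through both the coord factor and the extra summands.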

In step 3, optimizing the initial ranking result according to the user's personalized customization information specifically comprises the following steps:

Step 301: collect the three kinds of personalized customization information with the highest user query frequency: the first customized information u, the second customized information v and the third customized information s, and set the weights of these three kinds of information to A, B and E;

Step 302: query matching when the user's customized information is determinate. In this case, because the class of each item of the user's personal information is determined, the Lucene basic score of each document in the initial ranking result is modified as follows:

I. if u, v and s are all 0, the document score is unchanged;

II. if at least one of u, v and s is not 0, then:

newscore = score \cdot (1 + A \cdot u + B \cdot v + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where

u = 1 if the class of the web page conforms to the first customized information, and 0 otherwise;

v = 1 if the class of the web page conforms to the user's second customized information, and 0 otherwise;

s = 1 if the class of the web page conforms to the user's third customized information, and 0 otherwise;

topscore is the highest score among the result documents and lastscore the lowest.
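The step-302 re-score can be sketched as follows; the weight values A, B, E are illustrative placeholders, since the patent does not fix them:

```python
def personalized_score(score, u, v, s, topscore, lastscore,
                       A=0.5, B=0.3, E=0.2):
    """Sketch of the step-302 re-score. u, v, s are the 0/1 match flags.
    The second factor grows as the initial score falls, so matching
    documents further down the list receive a larger relative boost."""
    if u == 0 and v == 0 and s == 0:
        return score                          # case I: score unchanged
    spread = topscore - lastscore
    rank_factor = 1 + (topscore - score) / spread if spread else 1.0
    return score * (1 + A * u + B * v + E * s) * rank_factor
```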

Step 303: query matching when the user's customized information is fuzzy.

When the personalized customization information input by the user does not belong to the given default class range, the input customized information is looked up in the term-vector library to obtain the corresponding new term vectors. In the present invention, the term-vector set corresponding to the user's first customized information has weight vector \vec{w_u} = (w_{u1}, w_{u2}, w_{u3}) over the corresponding classes c_1, c_2, c_3; the term-vector set corresponding to the user's second customized information has weight vector \vec{w_v} = (w_{v1}, w_{v2}, w_{v3}) over the corresponding classes c_4, c_5, c_6. The new class set is denoted C = {c_1, c_2, c_3} ∪ {c_4, c_5, c_6}. For each web document, if document d_k belongs to class c_i and c_i ∈ C, the document score of the web page becomes:

newscore = score \cdot (1 + A \cdot w_{ui} + B \cdot w_{vi} + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where w_{ui} and w_{vi} are the components of the weight vectors corresponding to class c_i; if a weight vector has no component corresponding to class c_i, that term is 0.

In step 3, the initial ranking result is also optimized according to the user's historical access information. The matching analysis of the user's historical access information optimizes the initial ranking result according to the user's history of access records. Pages visited many times in the access history play a very important role in later searches, and the pages selected most by all users are strongly indicative of the search tendency of an individual user; therefore, this method uses the user's historical access information to optimize the initial ranking result and promote the pages highly correlated with the user. The proposed page re-ranking algorithm comprises the following steps:

Step 311: if document d is a history record or a hot link, perform the following algorithm; otherwise skip this step;

Step 312: let the initial rank of d be r; the new rank of d is then:

r' = \frac{r}{s' \cdot \log(2 + n_1) + h \cdot \log(2 + n_2)}    (9)

where

s' is 1 if d is a history record and 0 otherwise;

h is 1 if d is a hot link and 0 otherwise;

n_1 is the number of times the user has clicked this history record;

n_2 is the number of clicks of the hot link.

As can be seen, the lower bound of r' is 0, i.e. heavily clicked pages move toward the top.
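Eq. (9) can be sketched as follows; the logarithm base is not specified in the text, so base 10 is an assumption of this example:

```python
import math

def rerank(r, is_history, n1, is_hot, n2):
    """Sketch of Eq. (9): divide the initial rank r by logarithmic
    click-count terms so that repeatedly visited or hot-linked pages
    move toward the top (a smaller rank value is a better position)."""
    s = 1 if is_history else 0                 # s'
    h = 1 if is_hot else 0
    denom = s * math.log10(2 + n1) + h * math.log10(2 + n2)
    return r / denom if denom else r           # step 311 guards denom > 0
```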

The present invention first optimizes existing algorithms by adding multi-semantic analysis, producing a term-vector library and an index library with richer semantic information. Then, based on the term-vector library, the search keywords input by the user are semantically analyzed and matched against the index library to form an initial ranking result. Finally, combining the user's personalized customization information and historical access information, semantic analysis is used to optimize the initial ranking result, so that the retrieval result better matches the user's tendency and the user's retrieval experience is improved.

Description of drawings

Fig. 1 is the algorithm flow chart of the semantic retrieval method based on multi-semantic analysis and personalized ranking provided by the invention;

Fig. 2 is the comparison chart of the retrieval precision of the three search methods (Lucene, YD and OURS) provided by the embodiment of the invention.

Embodiment

The preferred embodiment is described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is only exemplary and is not intended to limit the scope or application of the invention.

The detailed process of the invention is described below through a specific embodiment:

Step 1: corpus preparation

Web pages are obtained from the Internet with crawler technology. About 6,000 up-to-date web pages are crawled from major websites such as Sina (sina.com) and ZOL (zol.com); a part is selected as the training set and classified with SVM. According to the sources and actual conditions of these web pages, a direct derivation mode is adopted, and seven topical class labels are finally determined: sport, agriculture, automobile, IT, food, lady and finance, plus a class normal for pages belonging to none of the other labels; the training model is thus characterized by these class labels. The training model is then used to classify the test set. If this method were put into commercial use, the class labels of each level could be set according to the taxonomic hierarchy of the ODP (Open Directory Project), because a real search engine has a huge and broad source of web pages.

Step 2: selection of comparison methods

Two representative search methods, Lucene and YD (Yahoo Directory), are selected in the present embodiment to compare their retrieval precision against this method.

2.1: Lucene search. An index of these web pages is built with Lucene, and Lucene Searcher is used as the first comparison test.

2.2: Yahoo Directory is an online English website directory search whose results all carry class labels. The training model built from these web pages can also be used to classify the pages crawled from Yahoo Directory, which serves as the second comparison test.

2.3: in order to satisfy most users, for each tested keyword this method crawls the top 30 results returned, classifies them with the established training model, and reorganizes them according to the present invention during the search; this serves as the test of the search effect of the present invention.

Lucene and Yahoo Directory are methods of information organization, processing and retrieval currently receiving much attention in industry, which is why the present invention compares the relevant indexes with these two methods.

Step 3: setting the experimental comparison targets

Statistics show that, of the pages retrieved by search engines, fewer than 0.1% of users check beyond the first 100 result pages, and more than 80% of users browse only the first 30 result pages. Because the present invention performs a certain amount of screening, in order to give the user more room for selection, the first 200 web pages of the initial search result are chosen for optimized ranking in the comparison with the Lucene search.

In the present embodiment, 7 users are randomly selected for the evaluation. To measure the retrieval effect, an assessment standard is set: the accuracy rate R. For each query, the top 10 documents of the search result are taken, and the accuracy rate R of each query is defined as R = D_r / 10, where D_r is the number of documents relevant to the search keyword. Averaging over repeated queries gives the retrieval precision of the method.
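The accuracy rate of this evaluation step can be sketched as:

```python
def accuracy_R(results, relevant):
    """Sketch of the evaluation metric: R = D_r / 10, where D_r is the
    number of relevant documents among the top-10 results of a query."""
    return sum(1 for doc in results[:10] if doc in relevant) / 10
```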

Step 4: experimental results and analysis

The search accuracy and precision of the three methods are compared in Fig. 2, where:

Lucene: Lucene Searcher

YD: Yahoo Directory

OURS: the method proposed by the invention

As can be seen from the figure, the proportion of returned web documents meeting the user's search tendency is better for the present invention than for the other two methods; its accuracy and retrieval precision are significantly better than those of the Lucene search, and although the retrieval precision of Yahoo Directory is close to that of the present invention, it is still somewhat weaker. This shows that the proposed semantic retrieval method with multi-semantic analysis and optimized ranking can improve retrieval accuracy and precision, has obvious advantages over present search engines, can come closer to the user's search tendency, and improves the user's retrieval experience.

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could easily conceive within the technical scope disclosed by the present invention shall be encompassed within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A semantic retrieval method based on multi-semantic analysis and personalized ranking, characterized by comprising the following steps:
Step 1: use crawler technology to obtain web documents from the Internet, take a manually classified part of them as the training model, construct a term-vector library with the multi-semantic analysis method MSA, represent the web documents as vectors, feed the training model into a support vector machine (SVM) classifier to train on the document vectors, and classify new web pages with the trained SVM model; write the class information of all web pages into the index library as an attribute;
Step 2: based on the term-vector library formed in step 1, build a term vector for each search keyword input by the user to form the final query vector, perform a class-matching query between the query vector and the index library, and obtain the initial web search result;
Step 3: optimize the ranking of the initial retrieval result according to the user's personalized customization information and historical access information, and return the final retrieval result to the user.
2. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that, in said step 1, the process of constructing the term-vector library based on the multi-semantic analysis method MSA, writing the classification results of the web documents into the index and forming the index library specifically comprises the steps:
Step 11: construct the concept space; in the present invention the space is set to m dimensions;
The basic dimensions of the concept space are a set of class labels that can represent the information of the whole corpus; generally, m class labels extracted directly from the classification tags of the corpus constitute the m dimensions of the vector, and the semantic information of each word in a web document is then described by an m-dimensional vector, called a term vector;
Step 12: determination of the term-vector component values:
The words are extracted from the web documents of the training model, and the size of each component of a term vector is determined by all documents of the training model; the formula for each component of the term vector is:
w(c_i, t_j) = \sum_{k=1}^{|D|} H(c_i, d_k) \frac{\log_2(1 + tf(d_k, t_j))}{\log_2(1 + length(d_k))}
where t_j denotes the j-th word in the term-vector library; w(c_i, t_j) denotes the relation between word t_j and the i-th dimension c_i of its term vector, i.e. the i-th component of the term vector of t_j; |D| is the number of training documents; tf(d_k, t_j) is the frequency of word t_j in document d_k; H(c_i, d_k) is a discriminant function whose value is 1 if document d_k belongs to the field described by dimension c_i and 0 otherwise; length(d_k) is the length of document d_k, i.e. the number of words obtained after word segmentation and denoising, counting repeated occurrences, so that length(d_k) ≥ n; k indexes the documents;
Step 13: unit normalization of the term vectors and formation of the term-vector library:
The term vectors are unit-normalized so that each component falls in [0, 1], giving better generality; the normalized term vectors together form the term-vector library; the normalization formula is:
w'(c_i, t_j) = \frac{w(c_i, t_j)}{\sum_{i=1}^{m} w(c_i, t_j)}
where w'(c_i, t_j) is the i-th component of the normalized term vector; the term vector of word t_j in the library is then:
\vec{t_j} = \big(w'(c_1, t_j), w'(c_2, t_j), \ldots, w'(c_m, t_j)\big)^T
Step 14: obtain the weight of each word in a document by the TFIDF method and unit-normalize these weights; the TFIDF weighting method has been popular for many years and is proved to be one of the effective weighting methods; it does not consider class information, and the weights depend only on the overall condition of the corpus, so it is very general and can be applied to determining word weights in multi-class text representation; the TFIDF weight is computed as:
weight(t_g, d_k) = TFIDF(t_g, d_k) = tf(t_g, d_k) \times \lg\frac{|D|}{|D'|}
where t_g is the g-th segmented word of document d_k; weight(t_g, d_k) is the weight of word t_g in document d_k; D is the set of training documents and d_k its k-th document; |D| is the number of training documents; D' is the set of documents containing word t_g, and |D'| is the number of documents in D';
The weights are likewise unit-normalized, so that the weight of each word of the segmented document falls in [0, 1]:
weight'(t_g, d_k) = \frac{weight(t_g, d_k)}{\sum_{g=1}^{n} weight(t_g, d_k)}
where weight'(t_g, d_k) is the normalized weight of word t_g in document d_k, and n is the number of distinct segmented words of the document;
Step 15: formation of the document vector; once the weights are represented by the TFIDF method, the multi-semantic analysis (MSA) document vector is formed; the i-th component of the document vector \vec{d_k} corresponding to document d_k is computed as:
wd(c_i, d_k) = \sum_{g=1}^{n} w'(c_i, t_g) \times weight'(t_g, d_k)
The document vector of document d_k is written as:
\vec{d_k} = weight'(t_1, d_k) \times \vec{t_1} + weight'(t_2, d_k) \times \vec{t_2} + \ldots + weight'(t_n, d_k) \times \vec{t_n} = \sum_{g=1}^{n} weight'(t_g, d_k) \times \vec{t_g}
where n is the number of distinct segmented words of the document and \vec{t_g} is the vector form of t_g in the term-vector library;
In this document vector, each component directly represents the relevance of the document to the corresponding dimension (class); it is strongly semantic and is the basis of the matching query; afterwards, using the m predefined class labels, the document vectors are classified with the support vector machine technique, which serves as the classification criterion for new web pages, and the class of every web page is written into the index library as an attribute.
3. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that, in said step 2, the matching analysis of the query vector and the index library comprises the sub-steps:
Step 21: based on the term-vector library, represent the search keywords input by the user as vectors;
Denote the set of search keywords as KEY = {key_1, key_2, ..., key_n}; the term vector of each word is extracted from the established term-vector library to build the vector form \vec{T_i} of each keyword key_i; all keywords then form the query-vector set {\vec{T_1}, \vec{T_2}, ..., \vec{T_n}}; when key_i does not exist in the term-vector library, \vec{T_i} is the zero vector;
Step 22: on the basis of step 21, form the query vector of the search keywords in the m-dimensional space:
\vec{Q} = \sum_{i=1}^{n} \vec{T_i} = (\vec{T_1} + \vec{T_2} + \ldots + \vec{T_n}) = (\alpha_1, \alpha_2, \ldots, \alpha_m)^T
The three largest components of the query vector \vec{Q} are denoted α_p, α_q, α_r; their corresponding dimension classes are denoted c_p, c_q, c_r; and the class weight vector is denoted (α_p, α_q, α_r); this weight vector is used later in the user-profile matching; based on these three classes {c_p, c_q, c_r}, a matching query is performed in the index library, the web pages belonging to these three classes are filtered out, and the Lucene basic scoring algorithm is applied to obtain the initial ranking result.
4. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 3, characterized in that said basic Lucene ranking algorithm is given by the formula:
score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t, d) · idf(t)² · t.getBoost() · norm(t, d) )
Wherein, q is the retrieval query;
tf(t, d) is the term frequency of term t in document d;
idf(t) is the inverse document frequency of term t across the document collection;
t.getBoost() is the weight of term t in the query statement;
norm(t, d) is the normalization factor;
coord(q, d) is the coordination factor;
queryNorm(q) is the normalization value of the query, obtained after summing the squared weights of the individual query terms.
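For reference, the formula of claim 4 can be sketched in Python as below. The term statistics are passed in directly rather than read from a real index, and the tf, idf, and queryNorm definitions follow Lucene's classic TFIDFSimilarity, which the claim does not spell out:

```python
import math

def lucene_score(query, tf, df, n_docs, boost=None, norm=1.0):
    """Sketch of score(q,d) = coord * queryNorm * sum(tf * idf^2 * boost * norm)."""
    boost = boost or {t: 1.0 for t in query}
    # idf(t), as in classic Lucene: 1 + ln(N / (df + 1))
    idf = {t: 1.0 + math.log(n_docs / (df.get(t, 0) + 1)) for t in query}
    matched = [t for t in query if tf.get(t, 0) > 0]
    coord = len(matched) / len(query)                  # coord(q, d)
    query_norm = 1.0 / math.sqrt(sum((idf[t] * boost[t]) ** 2 for t in query))
    total = sum(math.sqrt(tf[t]) * idf[t] ** 2 * boost[t] * norm
                for t in matched)                      # classic tf = sqrt(freq)
    return coord * query_norm * total

# a document matching both query terms outscores one matching only the first
s_full = lucene_score(["a", "b"], {"a": 4, "b": 1}, {"a": 5, "b": 5}, 100)
s_part = lucene_score(["a", "b"], {"a": 4}, {"a": 5, "b": 5}, 100)
```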
5. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that optimizing the initial ranking result according to the user's personal customization information specifically comprises the following steps:
Step 301: collect the three kinds of personal customization information queried most frequently by the user: the first customized information u, the second customized information v, and the third customized information s, and set the weights of these three kinds of personal customization information to A, B, and E;
Step 302: query matching when the user's customization information is determined. At this point, since the category of each item of the user's personal information is determined, the basic Lucene score of each document in the initial ranking result is modified as follows:
I. If u, v, and s are all 0, the document score is unchanged;
II. If at least one of u, v, and s is not 0, then:
newscore = score · (1 + A·u + B·v + E·s) · (1 + (topscore − score) / (topscore − lastscore))
Wherein,
u = 1 indicates that the category of this web page matches the first customized information, and 0 indicates it does not;
v = 1 indicates that the category of this web page matches the user's second customized information, and 0 indicates it does not;
s = 1 indicates that the category of this web page matches the user's third customized information, and 0 indicates it does not;
topscore is the highest score among the result documents, and lastscore is the lowest score;
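Step 302 transcribes directly into Python; the weights and scores below are invented examples:

```python
# Rescoring when the customization information is determined: u, v, s are the
# 0/1 match indicators, A, B, E their weights. The final factor boosts
# low-scoring matching documents more strongly than high-scoring ones.
def personalized_score(score, u, v, s, A, B, E, topscore, lastscore):
    if u == 0 and v == 0 and s == 0:
        return score                       # case I: score unchanged
    lift = 1 + (topscore - score) / (topscore - lastscore)
    return score * (1 + A * u + B * v + E * s) * lift
```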
Step 303: query matching when the user's customization information is fuzzy:
When the personal customization information input by the user does not belong to the given default category range, the input customization information is looked up in the word-vector library to find its corresponding categories and obtain a corresponding new word vector. In the present invention, the word vector corresponding to the user's first customized information is denoted U, with corresponding weight vector W_u = (wu_1, wu_2, wu_3) and corresponding categories c_1, c_2, c_3; the word vector corresponding to the user's second customized information is denoted V, with corresponding weight vector W_v = (wv_1, wv_2, wv_3) and corresponding categories c_4, c_5, c_6. The new category set is denoted C = {c_1, c_2, c_3} ∪ {c_4, c_5, c_6}. For each web document, if document d_k belongs to category c_i and c_i ∈ C, the document score of this web page becomes:
newscore = score · (1 + A·wu_i + B·wv_i + E·s) · (1 + (topscore − score) / (topscore − lastscore))
Wherein, wu_i and wv_i are the components of the weight vectors corresponding to category c_i; if a weight vector has no component corresponding to category c_i, that component is taken as 0.
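Step 303 differs from step 302 only in replacing the binary indicators u and v with the weight-vector components of the document's category. A hedged sketch, with invented category names and weights:

```python
# Fuzzy rescoring: wu and wv map categories to the weight-vector components
# wu_i and wv_i; a category without a corresponding component contributes 0.
def fuzzy_personalized_score(score, category, wu, wv, s, A, B, E,
                             topscore, lastscore):
    wu_i = wu.get(category, 0.0)
    wv_i = wv.get(category, 0.0)
    if wu_i == 0.0 and wv_i == 0.0 and s == 0:
        return score                       # document outside the category set C
    lift = 1 + (topscore - score) / (topscore - lastscore)
    return score * (1 + A * wu_i + B * wv_i + E * s) * lift

wu = {"c1": 0.5, "c2": 0.25, "c3": 0.25}   # weights for categories of info 1
wv = {"c4": 0.5, "c5": 0.25, "c6": 0.25}   # weights for categories of info 2
```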
6. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that optimizing the initial ranking result according to the user's historical access information specifically comprises the following steps:
Step 311: if document d is a historical record or a hot link, execute the following algorithm; otherwise skip this step;
Step 312: let the initial rank of d be r; then the new rank of d is:
r' = r / ( s'·log(2 + n_1) + h·log(2 + n_2) )
Wherein,
s' is 1 if the document is a historical record, and 0 otherwise;
h is 1 if the document is a hot link, and 0 otherwise;
n_1 is the user's number of clicks on this historical record;
n_2 is the number of clicks on the hot link;
As can be seen, the minimum value of r' is 0.
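The re-ranking of claim 6 can be sketched as follows. This is a hypothetical reading: the flattened original formula is interpreted as a fraction, which is consistent with the statement that r' has minimum value 0:

```python
import math

# New rank r' = r / (s' * log(2 + n1) + h * log(2 + n2)); more clicks on a
# historical record or hot link shrink r', promoting the document.
def new_rank(r, s_hist, h, n1, n2):
    denom = s_hist * math.log(2 + n1) + h * math.log(2 + n2)
    if denom == 0:
        return r        # neither history nor hot link: step 311 skips this doc
    return r / denom
```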
CN201210488572.XA 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing CN103020164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210488572.XA CN103020164B (en) 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing


Publications (2)

Publication Number Publication Date
CN103020164A true CN103020164A (en) 2013-04-03
CN103020164B CN103020164B (en) 2015-06-10

Family

ID=47968768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210488572.XA CN103020164B (en) 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing

Country Status (1)

Country Link
CN (1) CN103020164B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593336A (en) * 2013-10-30 2014-02-19 中国运载火箭技术研究院 Knowledge pushing system and method based on semantic analysis
CN103646017A (en) * 2013-12-11 2014-03-19 南京大学 Acronym generating system for naming and working method thereof
CN104008169A (en) * 2014-05-30 2014-08-27 中国测绘科学研究院 Semanteme based geographical label content safe checking method and device
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word
CN104408036A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Correlated topic recognition method and device
CN104516897A (en) * 2013-09-29 2015-04-15 国际商业机器公司 Method and device for sorting application objects
CN105247517A (en) * 2013-04-23 2016-01-13 谷歌公司 Ranking signals in mixed corpora environments
WO2016082406A1 (en) * 2014-11-28 2016-06-02 华为技术有限公司 Method and apparatus for determining semantic matching degree
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN105893397A (en) * 2015-06-30 2016-08-24 北京爱奇艺科技有限公司 Video recommendation method and apparatus
CN106095983A (en) * 2016-06-20 2016-11-09 北京百度网讯科技有限公司 A kind of similarity based on personalized deep neural network determines method and device
CN106156071A (en) * 2015-03-31 2016-11-23 北京奇虎科技有限公司 Intranet Intranet searching method, device and server
CN106528595A (en) * 2016-09-23 2017-03-22 中国农业科学院农业信息研究所 Website homepage content based field information collection and association method
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106844343A (en) * 2017-01-20 2017-06-13 上海傲硕信息科技有限公司 Instruction results screening plant
CN106980634A (en) * 2016-01-18 2017-07-25 维布络有限公司 System and method for classifying and solving software production accident list
CN103744905B (en) * 2013-12-25 2018-03-30 新浪网技术(中国)有限公司 Method for judging rubbish mail and device
WO2018157625A1 (en) * 2017-02-28 2018-09-07 华为技术有限公司 Reinforcement learning-based method for learning to rank and server
CN109189910A (en) * 2018-09-18 2019-01-11 哈尔滨工程大学 A kind of label auto recommending method towards mobile application problem report

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106659A1 (en) * 2005-03-18 2007-05-10 Yunshan Lu Search engine that applies feedback from users to improve search results
CN101398839A (en) * 2008-10-23 2009-04-01 浙江大学 Personalized push method for vocal web page news
CN102495872A (en) * 2011-11-30 2012-06-13 中国科学技术大学 Method and device for conducting personalized news recommendation to mobile device users


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YINGLONG MA et al.: "Using multi-categorization semantic analysis and personalization for semantic search", http://arxiv.org *
MA Yinglong et al.: "A semantic retrieval method based on multi-categorization semantic analysis and personalization", Journal of Southeast University (Natural Science Edition) *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105247517A (en) * 2013-04-23 2016-01-13 谷歌公司 Ranking signals in mixed corpora environments
CN104516897A (en) * 2013-09-29 2015-04-15 国际商业机器公司 Method and device for sorting application objects
CN104516897B (en) * 2013-09-29 2018-03-02 国际商业机器公司 A kind of method and apparatus being ranked up for application
CN103593336A (en) * 2013-10-30 2014-02-19 中国运载火箭技术研究院 Knowledge pushing system and method based on semantic analysis
CN103593336B (en) * 2013-10-30 2017-05-10 中国运载火箭技术研究院 Knowledge pushing system and method based on semantic analysis
CN103646017B (en) * 2013-12-11 2017-01-04 南京大学 Acronym generating system for naming and working method thereof
CN103646017A (en) * 2013-12-11 2014-03-19 南京大学 Acronym generating system for naming and working method thereof
CN103744905B (en) * 2013-12-25 2018-03-30 新浪网技术(中国)有限公司 Method for judging rubbish mail and device
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word
CN104008169B (en) * 2014-05-30 2017-02-22 中国测绘科学研究院 Semanteme based geographical label content safe checking method and device
CN104008169A (en) * 2014-05-30 2014-08-27 中国测绘科学研究院 Semanteme based geographical label content safe checking method and device
US10467342B2 (en) 2014-11-28 2019-11-05 Huawei Technologies Co., Ltd. Method and apparatus for determining semantic matching degree
WO2016082406A1 (en) * 2014-11-28 2016-06-02 华为技术有限公司 Method and apparatus for determining semantic matching degree
CN104408036A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Correlated topic recognition method and device
CN106156071A (en) * 2015-03-31 2016-11-23 北京奇虎科技有限公司 Intranet Intranet searching method, device and server
CN105893397A (en) * 2015-06-30 2016-08-24 北京爱奇艺科技有限公司 Video recommendation method and apparatus
CN105893397B (en) * 2015-06-30 2019-03-15 北京爱奇艺科技有限公司 A kind of video recommendation method and device
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106980634A (en) * 2016-01-18 2017-07-25 维布络有限公司 System and method for classifying and solving software production accident list
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN105786782B (en) * 2016-03-25 2018-10-19 北京搜狗信息服务有限公司 A kind of training method and device of term vector
CN106095983B (en) * 2016-06-20 2019-11-26 北京百度网讯科技有限公司 A kind of similarity based on personalized deep neural network determines method and device
CN106095983A (en) * 2016-06-20 2016-11-09 北京百度网讯科技有限公司 A kind of similarity based on personalized deep neural network determines method and device
CN106528595B (en) * 2016-09-23 2019-08-06 中国农业科学院农业信息研究所 Realm information based on website homepage content is collected and correlating method
CN106528595A (en) * 2016-09-23 2017-03-22 中国农业科学院农业信息研究所 Website homepage content based field information collection and association method
CN106844343A (en) * 2017-01-20 2017-06-13 上海傲硕信息科技有限公司 Instruction results screening plant
CN106844343B (en) * 2017-01-20 2019-11-19 上海傲硕信息科技有限公司 Instruction results screening plant
WO2018157625A1 (en) * 2017-02-28 2018-09-07 华为技术有限公司 Reinforcement learning-based method for learning to rank and server
CN109189910B (en) * 2018-09-18 2019-09-10 哈尔滨工程大学 A kind of label auto recommending method towards mobile application problem report
CN109189910A (en) * 2018-09-18 2019-01-11 哈尔滨工程大学 A kind of label auto recommending method towards mobile application problem report

Also Published As

Publication number Publication date
CN103020164B (en) 2015-06-10

Similar Documents

Publication Publication Date Title
US10157233B2 (en) Search engine that applies feedback from users to improve search results
CN102831234B (en) Personalized news recommendation device and method based on news content and theme feature
US20170357723A1 (en) Systems for and methods of finding relevant documents by analyzing tags
Zhou et al. Improving search via personalized query expansion using social media
US8650198B2 (en) Systems and methods for facilitating the gathering of open source intelligence
JP5572596B2 (en) Personalize the ordering of place content in search results
He et al. Context-aware citation recommendation
Raman et al. Toward whole-session relevance: exploring intrinsic diversity in web search
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
CN102279851B (en) Intelligent navigation method, device and system
US8812559B2 (en) Methods and systems for creating an advertising database
CN103177090B (en) A kind of topic detection method and device based on big data
US8407229B2 (en) Systems and methods for aggregating search results
US9652537B2 (en) Identifying terms associated with queries
Chapelle et al. Intent-based diversification of web search results: metrics and algorithms
CN101551806B (en) Personalized website navigation method and system
CN103064945B (en) Based on the Situational searching method of body
Teevan et al. To personalize or not to personalize: modeling queries with variation in user intent
CN102831199B (en) Method and device for establishing interest model
Dou et al. Evaluating the effectiveness of personalized web search
CN1882943B (en) Systems and methods for search processing using superunits
JP5563836B2 (en) System and method for providing default hierarchy training for social indexing
US9576251B2 (en) Method and system for processing web activity data
CN103226578B (en) Towards the website identification of medical domain and the method for webpage disaggregated classification
CN101692223B (en) Refined Search space is inputted in response to user

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant
C14 Grant of patent or utility model
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150610

Termination date: 20151126

CF01 Termination of patent right due to non-payment of annual fee