CN103020164A - Semantic search method based on multi-semantic analysis and personalized sequencing - Google Patents


Info

Publication number
CN103020164A
Authority
CN
China
Prior art keywords
document
vector
user
word
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210488572XA
Other languages
Chinese (zh)
Other versions
CN103020164B (en)
Inventor
马应龙
张潇澜
于潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN201210488572.XA
Publication of CN103020164A
Application granted
Publication of CN103020164B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic search method based on multi-semantic analysis and personalized ranking, and belongs to the field of information retrieval. The technical scheme comprises the following steps: first, web documents are acquired from the Internet by crawler and related technologies and classified with a support vector machine, a term-vector library is built by a multi-semantic analysis method, and the multi-classification results are written into an index to form an index library; second, based on the term-vector library, the search keywords input by a user are combined into a query vector, which is matched by class against the index library to obtain an initial ranking result; finally, the initial ranking result is optimized according to the user's personalized information and historical access information, and the optimized result is returned to the user. The method produces a semantically rich term-vector library and index library, and by exploiting the personalized information and historical access information the search results better meet the user's search demand and the user's search satisfaction is improved.

Description

Semantic retrieval method based on multi-semantic analysis and personalized ranking
Technical field
The invention belongs to the field of information retrieval, and in particular relates to a semantic retrieval method based on multi-semantic analysis and personalized ranking.
Background technology
A search engine is a system that, according to certain strategies and using specific computer programs, gathers information from the Internet, organizes and processes it, provides retrieval services for users, and presents the information relevant to a user's search. Search engines arose to cope with the rapid growth of the volume of information on the Internet, and today they have become an indispensable way for people to obtain information from the network. However, the current mainstream keyword-based search engines, such as Google, Baidu, Bing and Yahoo, share some thorny problems. The results a user retrieves commonly contain a large number of irrelevant links; because of the diversity of the user population, a single result set cannot satisfy each user's specific requirements; the search process does not consider the semantic relatedness between words; and the search results are not organized effectively in any particular way, so users have to waste time and effort browsing and selecting.
Semantic search is a novel way of searching that differs from keyword-based search. In general, semantic search no longer sticks to the keywords of the user's input statement itself, but can capture fairly accurately the potential intention behind the input, and can therefore return to the user more accurately the results that best meet the user's demand; compared with traditional search it offers higher retrieval precision. Ramesh Singh and Myungjin Lee attempted to reorganize search results in their research to improve the user's search experience. Lien-Fu Lai and Huanhuan Cao used hidden Markov trees and other models to compute the degree of correlation between different results, thereby increasing the coverage of the search results. Fang Liu, Jaime Teevan and others proposed methods that use the historical access information of various users to perform personalized search in order to improve search precision. All of these studies have made suitable improvements in semantic search, but the personalization based on classifying user queries imposes relatively harsh conditions and controls the increase in time consumption poorly; moreover, they do not consider that different kinds of user-related information should carry different weights in the process. Therefore, the way the final search results are ranked is still unsatisfactory.
Summary of the invention
To address the problems that existing information retrieval exhibits in retrieval precision and user search experience, the present invention proposes a semantic retrieval method based on multi-semantic analysis and personalized ranking.
A semantic retrieval method based on multi-semantic analysis and personalized ranking is characterized in that it specifically comprises the following steps:
Step 1: Use crawler technology to obtain web documents from the Internet and manually classify a portion of them as the training model. Construct a term-vector library with the multi-semantic analysis method MSA, represent the web documents as vectors, and train a support vector machine (SVM) classifier on the document vectors of the training model; new web pages are then classified with this model. The class information of every web page is written into the index library as an attribute.
Step 2: Based on the term-vector library formed in Step 1, build a term vector for each search keyword entered by the user and combine them into the final query vector; perform a class-matching query between the query vector and the index library to obtain the initial web search result.
Step 3: Optimize the ranking of the initial retrieval result according to the user's personal customization information and historical access information, and return the final retrieval result to the user.
In Step 1, the term-vector library is constructed based on the multi-semantic analysis method MSA, and the classification results of the web documents are written into the index to form the index library; this specifically comprises the following steps:
Step 11: Construct the concept space. In the present invention the space has m dimensions.
The basic dimensions of the concept space are a set of class labels that can represent the information of the whole corpus. In general, m class labels extracted directly from the corpus classification tags constitute the m dimensions of the vector; the semantic information of each word in a web document is then described by an m-dimensional vector, called its term vector.
Step 12: Determine the term-vector component values.
The words are extracted from the web documents of the training model, and the value of each term-vector component is determined by all documents of the training model. Each component of the term vector is computed as:
w(c_i, t_j) = \sum_{k=1}^{|D|} H(c_i, d_k) \cdot \frac{\log_2(1 + tf(d_k, t_j))}{\log_2(1 + length(d_k))}    (1)

where t_j denotes the j-th word in the term-vector library; w(c_i, t_j) is the relation between word t_j and the i-th dimension c_i of its term vector, i.e. the i-th component of the term vector of t_j; |D| is the number of training documents; tf(d_k, t_j) is the frequency of word t_j in document d_k; H(c_i, d_k) is a discriminant function whose value is 1 if document d_k belongs to the field described by dimension c_i and 0 otherwise; length(d_k) is the length of document d_k, i.e. the number of words obtained after word segmentation and noise removal, with repeated occurrences counted, so length(d_k) >= n; and k indexes the documents.
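The computation of equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation; the document representation (a class label plus a token list) and the function name are assumptions made for the example.

import math

def term_vector_component(c_i, t_j, training_docs):
    # w(c_i, t_j) per equation (1).
    # training_docs: list of dicts such as
    #   {"label": "IT", "tokens": ["search", "engine", "search", ...]},
    # where "tokens" holds the words left after segmentation and noise removal.
    total = 0.0
    for d in training_docs:
        if d["label"] != c_i or not d["tokens"]:
            continue                                  # H(c_i, d_k) = 0, or empty document
        tf = d["tokens"].count(t_j)                   # tf(d_k, t_j)
        length = len(d["tokens"])                     # length(d_k), repeats counted
        total += math.log2(1 + tf) / math.log2(1 + length)
    return total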
Step 13: Unit normalization of the term vectors and formation of the term-vector library.
Each term vector is normalized so that its component values lie in [0, 1], giving better generality. The normalized term vectors together form the term-vector library. The normalization formula is:
w'(c_i, t_j) = w(c_i, t_j) / \sum_{i=1}^{m} w(c_i, t_j)    (2)

where the normalized term vector is denoted \vec{t_j} and w'(c_i, t_j) is its i-th component; the term-vector library then consists of the vectors

\vec{t_j} = (w'(c_1, t_j), w'(c_2, t_j), ..., w'(c_m, t_j))^T    (3)
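Building on the previous sketch, the unit normalization of equations (2) and (3) can be illustrated as follows; term_vector_component and the list-based vector representation are the hypothetical helpers introduced above.

def normalized_term_vector(t_j, class_labels, training_docs):
    # Equations (2)-(3): unit-normalize the raw components of word t_j.
    raw = [term_vector_component(c, t_j, training_docs) for c in class_labels]
    total = sum(raw)
    # A word that never appears under any class keeps the all-zero vector.
    return [w / total for w in raw] if total > 0 else raw

The term-vector library is then simply a mapping from each word of the training corpus to its normalized vector.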
Step 14: Obtain the weight of each word in a document by the TFIDF method and normalize these weights. TFIDF has been popular for many years and is proven to be one of the effective weighting methods; it does not consider class information, and the weights depend only on the overall state of the corpus, so it is highly general and can be applied to weighting words in multi-class text representation. The TFIDF weight is computed as:
weight(t_g, d_k) = TFIDF(t_g, d_k) = tf(t_g, d_k) \times \lg\frac{|D|}{|D'|}    (4)

where t_g is the g-th word of document d_k; weight(t_g, d_k) is the weight of word t_g in document d_k; D is the set of training documents and d_k its k-th document; |D| is the number of training documents; D' is the set of documents containing word t_g, and |D'| is the number of documents in D'.
The weights are likewise normalized so that the weight of each word of the segmented document lies in [0, 1]; the normalized weight of a word in the segmented document is computed as:
weight'(t_g, d_k) = weight(t_g, d_k) / \sum_{j=1}^{n} weight(t_j, d_k)    (5)

where weight'(t_g, d_k) is the normalized weight of word t_g in document d_k and n is the number of distinct words (segmentation types) in the document.
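A sketch of equations (4) and (5), under the assumption that lg denotes the base-10 logarithm and using the same document representation as in the earlier example:

import math

def tfidf_weight(t_g, d_k, training_docs):
    # Equation (4): weight(t_g, d_k) = tf(t_g, d_k) * lg(|D| / |D'|).
    tf = d_k["tokens"].count(t_g)
    df = sum(1 for d in training_docs if t_g in d["tokens"])      # |D'|
    return tf * math.log10(len(training_docs) / df) if df else 0.0

def normalized_tfidf_weights(d_k, training_docs):
    # Equation (5): normalize the weights of the distinct words of d_k to sum to 1.
    raw = {t: tfidf_weight(t, d_k, training_docs) for t in set(d_k["tokens"])}
    total = sum(raw.values())
    return {t: (w / total if total > 0 else 0.0) for t, w in raw.items()}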
Step 15: Form the document vectors. Once the TFIDF weights are available, the document vector of the multi-semantic analysis (MSA) is formed. The i-th component of the document vector \vec{d_k} corresponding to document d_k is computed as:

wd(c_i, d_k) = \sum_{g=1}^{n} w'(c_i, t_g) \times weight'(t_g, d_k)    (6)
The document vector of document d_k is written as:

\vec{d_k} = weight'(t_1, d_k)\vec{t_1} + weight'(t_2, d_k)\vec{t_2} + ... + weight'(t_n, d_k)\vec{t_n} = \sum_{g=1}^{n} weight'(t_g, d_k)\,\vec{t_g}    (7)

where n is the number of distinct words in the document and \vec{t_g} is the vector of word t_g in the term-vector library.
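Combining the normalized term vectors and the normalized TFIDF weights, equations (6) and (7) amount to a weighted sum; the helper names and the dictionary-based term-vector library are assumptions carried over from the earlier sketches.

def document_vector(d_k, training_docs, class_labels, term_vectors):
    # Equations (6)-(7): weighted sum of the term vectors of the words of d_k.
    # term_vectors maps a word to its normalized m-dimensional term vector
    # (equation (3)); words missing from the library contribute nothing.
    weights = normalized_tfidf_weights(d_k, training_docs)        # equation (5)
    m = len(class_labels)
    vec = [0.0] * m
    for t, w in weights.items():
        tv = term_vectors.get(t)
        if tv is None:
            continue
        for i in range(m):
            vec[i] += w * tv[i]                                   # wd(c_i, d_k), equation (6)
    return vec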
In this document vector, each component value directly represents the degree of correlation between the document and the corresponding dimension (class), so the vector carries strong semantics and is the basis of the matching query. The document vectors are then classified with the support vector machine over the m pre-defined class labels, the trained model serves as the classification criterion for new web pages, and the class of every web page is written into the index library as an attribute.
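As an illustration of this classification step, a minimal sketch with scikit-learn follows; the library choice, the linear kernel and the field name "category" are assumptions for the example and are not specified in the patent.

from sklearn.svm import SVC

def train_page_classifier(train_vectors, train_labels):
    # Fit an SVM on the manually labelled MSA document vectors.
    clf = SVC(kernel="linear")
    clf.fit(train_vectors, train_labels)
    return clf

def make_index_entry(clf, doc_vector, fields):
    # Attach the predicted class to the page's fields before the page is indexed.
    entry = dict(fields)
    entry["category"] = clf.predict([doc_vector])[0]
    return entry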
In Step 2, the class-matching query between the query vector and the index library comprises the following steps:
Step 21: Based on the term-vector library, represent the search keywords entered by the user as vectors.
Denote the set of search keywords as KEY = {key_1, key_2, ..., key_n}. For each keyword key_i, extract its term vector from the established term-vector library to obtain its vector form \vec{T_i}; all keywords then form the query-vector set {\vec{T_1}, \vec{T_2}, ..., \vec{T_n}}. If key_i does not exist in the term-vector library, \vec{T_i} = \vec{0}.
Step 22: On the basis of Step 21, form the query vector of the search keywords in the m-dimensional space according to:

\vec{Q} = \sum_{i=1}^{n} \vec{T_i} = (\vec{T_1} + \vec{T_2} + ... + \vec{T_n}) = (\alpha_1, \alpha_2, ..., \alpha_m)^T    (8)

The three largest components of \vec{Q} are denoted \alpha_p, \alpha_q, \alpha_r, their corresponding dimension classes are denoted c_p, c_q, c_r, and the weight vector of these classes, formed from the three components, is recorded; this weight vector is used later in the user-profile matching. A matching query is performed in the index library based on the three classes {c_p, c_q, c_r}: the web pages belonging to these three classes are filtered out, and the basic Lucene scoring algorithm is applied to obtain the initial ranking result.
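A sketch of equation (8) and of the selection of the three dominant classes; the dictionary-based term-vector library is the same assumption as in the earlier examples.

def build_query_vector(keywords, term_vectors, m):
    # Equation (8): sum the term vectors of the keywords (missing word -> zero vector).
    q = [0.0] * m
    for key in keywords:
        for i, value in enumerate(term_vectors.get(key, [0.0] * m)):
            q[i] += value
    return q

def top_three_classes(q, class_labels):
    # Return the three classes with the largest components and those components.
    top = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:3]
    return [class_labels[i] for i in top], [q[i] for i in top]

The returned classes play the role of {c_p, c_q, c_r} that restrict the index query, and the returned components can serve as the class weight vector used later in the user-profile matching.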
The basic Lucene scoring formula is:

score(q, d) = coord(q, d) \cdot queryNorm(q) \cdot \sum_{t \in q} \bigl( tf(t, d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t, d) \bigr)

where q is the retrieval request;
tf(t, d) is the frequency of term t in document d;
idf(t) is the inverse document frequency of term t;
t.getBoost() is the weight of each term in the query statement, which allows certain terms to be made more important in the query;
norm(t, d) is a normalization factor combining three parameters: (1) the document boost, where a larger value means the document is more important; (2) the field boost, where a larger value means the field is more important; and (3) lengthNorm(field), which decreases as the field contains more terms (i.e. the document is longer) and increases as the document gets shorter;
coord(q, d) is a coordination factor based on how many of the query terms appear in document d;
queryNorm(q) is the normalization value of the query, computed from the sum of the squared weights of the query terms.
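For illustration, the scoring formula can be approximated by a standalone Python function; this is not the actual Lucene library, and it simplifies norm(t, d) to a single document boost times a length norm. Classic Lucene additionally dampens the raw term frequency with a square root, which is kept here.

import math

def lucene_like_score(query_terms, doc_tokens, idf, boosts=None, doc_boost=1.0):
    # Simplified approximation of Lucene's classic scoring formula.
    # idf: dict term -> idf(t); boosts: dict term -> t.getBoost().
    boosts = boosts or {}
    matched = [t for t in query_terms if t in doc_tokens]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)                          # coord(q, d)
    query_norm = 1.0 / math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in query_terms) or 1.0)
    length_norm = 1.0 / math.sqrt(len(doc_tokens))                   # lengthNorm(field)
    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_tokens.count(t))                          # classic Lucene uses sqrt(tf)
        total += tf * idf.get(t, 0.0) ** 2 * boosts.get(t, 1.0) * doc_boost * length_norm
    return coord * query_norm * total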
In Step 3, the initial ranking result is first optimized according to the user's personal customization information, which specifically comprises the following steps:
Step 301: Collect the three kinds of personal customization information with the highest query frequency for the user: the first customized information u, the second customized information v and the third customized information s, and set the weights of these three kinds of personal customization information to A, B and E.
Step 302: Query matching when the user's customization information is determined. In this case the class of every piece of the user's personal information is known, so the basic Lucene score of each document in the initial ranking result is modified as follows:
I. If u, v and s are all 0, the document score is unchanged;
II. If at least one of u, v and s is not 0, then:

newscore = score \cdot (1 + A \cdot u + B \cdot v + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where
u = 1 if the class of the web page matches the first customized information, and 0 otherwise;
v = 1 if the class of the web page matches the user's second customized information, and 0 otherwise;
s = 1 if the class of the web page matches the user's third customized information, and 0 otherwise;
topscore is the highest score among the result documents and lastscore the lowest score.
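The re-scoring of Step 302 is a direct formula; a minimal sketch follows, with a guard (an added assumption) for the degenerate case where all documents share the same score.

def personalized_rescore(score, u, v, s, A, B, E, topscore, lastscore):
    # Step 302: u, v, s are the 0/1 match flags; A, B, E their weights.
    if u == 0 and v == 0 and s == 0:
        return score                                  # case I: score unchanged
    spread = topscore - lastscore
    relative_gap = (topscore - score) / spread if spread else 0.0
    return score * (1 + A * u + B * v + E * s) * (1 + relative_gap)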
Step 303: Query matching when the user's customization information is fuzzy.
When the personal customization information entered by the user does not belong to the given default category range, the entered information is looked up in the term-vector library to find the corresponding classes, yielding new term vectors. In the present invention, the term-vector set corresponding to the user's first customized information has a corresponding weight vector and corresponding classes c_1, c_2, c_3, and the term-vector set corresponding to the user's second customized information has a corresponding weight vector and corresponding classes c_4, c_5, c_6. The new class set is denoted C = {c_1, c_2, c_3} ∪ {c_4, c_5, c_6}. For each web document, if document d_k belongs to class c_i and c_i ∈ C, the document score of the web page becomes:

newscore = score \cdot (1 + A \cdot wu_i + B \cdot wv_i + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where wu_i and wv_i are the components of the respective weight vectors in the dimension corresponding to class c_i; if a weight vector has no dimension corresponding to class c_i, the corresponding value is 0.
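The fuzzy case differs only in that the 0/1 flags u and v are replaced by the weight-vector components looked up for the page's class; a sketch under the assumption that the two weight vectors are given as class-to-component dictionaries:

def fuzzy_personalized_rescore(score, doc_class, wu, wv, A, B, E, s, topscore, lastscore):
    # Step 303: wu, wv map a class to the corresponding weight-vector component.
    wu_i = wu.get(doc_class, 0.0)
    wv_i = wv.get(doc_class, 0.0)
    if wu_i == 0.0 and wv_i == 0.0:
        return score                                  # treated as "class not in C" (simplification)
    spread = topscore - lastscore
    relative_gap = (topscore - score) / spread if spread else 0.0
    return score * (1 + A * wu_i + B * wv_i + E * s) * (1 + relative_gap)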
In Step 3, the initial ranking result is further optimized according to the user's historical access information. The matching analysis of the user's historical access information optimizes the initial ranking according to the user's history of access records. Because pages that have been visited repeatedly play a very important role in later searches, and the pages selected most often by all users strongly indicate the search tendency of an individual user, the method uses the user's historical access information to optimize the initial ranking and promotes the ranks of pages highly correlated with the user. The proposed web-page re-ranking algorithm comprises the following steps:
Step 311: If document d is a history item or a hot link, execute the following step; otherwise skip it.
Step 312: Let the initial rank be r; the new rank of d is then:

r' = \frac{r}{s' \cdot \log(2 + n_1) + h \cdot \log(2 + n_2)}    (9)

where
s' is 1 if the document is a history record and 0 otherwise;
h is 1 if the document is a hot link and 0 otherwise;
n_1 is the number of times the user has clicked this history item;
n_2 is the number of clicks on the hot link.
It follows that the minimum value of r' is 0.
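A sketch of equation (9); the base of the logarithm is not specified in the text, so the natural logarithm is assumed here.

import math

def history_rerank(r, in_history, is_hot_link, n1, n2):
    # Equation (9): promote history items and hot links (a smaller rank value is better).
    # n1 is the user's click count on the history item, n2 the click count of the hot link.
    s = 1 if in_history else 0
    h = 1 if is_hot_link else 0
    denom = s * math.log(2 + n1) + h * math.log(2 + n2)
    return r / denom if denom else r                  # step 311: untouched if neither flag is set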
The present invention first optimizes the existing algorithms by adding multi-semantic analysis, producing a term-vector library and an index library with richer semantic information. It then performs a semantic analysis of the user's search keywords based on the term-vector library, matches the query against the index library, and forms the initial ranking result. Finally, it combines the user's personal customization information and historical access information and uses semantic analysis to optimize the initial ranking, so that the retrieval results better match the user's tendency and the user's retrieval experience is improved.
Description of drawings
Fig. 1 is the algorithm flow chart of the semantic retrieval method based on multi-semantic analysis and personalized ranking provided by the invention;
Fig. 2 is the comparison chart of retrieval precision for the three retrieval methods (LB, YH and OURS) provided by the embodiment of the invention.
Embodiment
The preferred embodiment is described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is only exemplary and is not intended to limit the scope or application of the invention.
The detailed process of the present invention is described below through a specific embodiment:
Step 1: Corpus preparation
Web pages are obtained from the Internet with crawler technology. About 6,000 recent web pages are crawled from major websites such as Sina (sina.com) and Zhongguancun Online (zol.com); a portion is selected as the training set and classified with the SVM. According to the sources and actual content of these pages, a direct derivation approach is adopted and the class labels sport, agriculture, automobile, IT, food, lady, finance and normal are finally determined, where normal covers pages that do not belong to any of the other labels; the training model is characterized by these class labels. The test set is then classified with this training model. If the method were put to commercial use, the class labels could be set at each level according to the ODP (Open Directory Project) taxonomy, because real search engines draw on a vast and broad range of web-page sources.
Step 2: Selection of comparison methods
Two representative retrieval methods, Lucene and YD (Yahoo Directory), are selected in this embodiment to compare retrieval precision with the proposed method.
2.1 Lucene: an index of the crawled pages is built and queried with Lucene Searcher, serving as the first comparison test.
2.2 Yahoo Directory: an online English directory-based search site whose results all carry class labels. The training model built from the crawled pages is also used to classify the pages retrieved from Yahoo Directory, serving as the second comparison test.
2.3 To satisfy most users, for each test keyword the method crawls the top 30 returned results, classifies them with the established training model, and reorganizes them according to the present invention during search; this serves as the test of the retrieval effect of the invention.
Lucene and Yahoo Directory are methods of information organization, processing and retrieval that currently receive much attention in industry, which is why the present invention compares the relevant indicators against these two methods.
Step 3: Setting the experimental comparison targets
Statistics show that fewer than 0.1% of users examine result pages beyond the first 100 results, and more than 80% of users browse only the first 30 result pages. Because the present invention performs a certain amount of screening, and to give users more room for selection, in the comparison with Lucene search the method takes the first 200 web pages of the initial search result for ranking optimization.
In this embodiment 7 users are randomly selected for the test. To measure retrieval effectiveness, an evaluation standard is set: the accuracy rate R. For each query, the top 10 documents of the search result are taken, and the accuracy rate R of the query is defined as:

R = D_r / 10

where D_r is the number of documents relevant to the search keyword. Averaging over repeated queries gives the retrieval precision of the method.
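Reading the definition as R = D_r / 10 (precision over the top ten results), the evaluation reduces to a small average; the relevance flags are assumed to come from the human assessors of the experiment.

def accuracy_at_10(relevance_flags):
    # R for one query: fraction of the 10 returned documents judged relevant.
    return sum(relevance_flags[:10]) / 10.0

def retrieval_precision(per_query_flags):
    # Average R over all test queries, i.e. the retrieval precision of a method.
    return sum(accuracy_at_10(flags) for flags in per_query_flags) / len(per_query_flags)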
Step 4: Experimental results and analysis
The search accuracy and retrieval precision of the three methods are compared in Fig. 2, where:
Lucene: Lucene Searcher
YD: Yahoo Directory
OURS: the method proposed by the invention
As can be seen from the figure, the proportion of returned web documents that match the user's search tendency is higher for the present invention than for the two other methods; its accuracy and retrieval precision are significantly better than those of Lucene search, and although Yahoo Directory comes close to the present invention in retrieval precision, it is still slightly weaker. This shows that the proposed semantic retrieval method with multi-semantic analysis and optimized ranking improves retrieval accuracy and precision, has a clear advantage over current search engines, comes closer to the user's search tendency, and improves the user's retrieval experience.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A semantic retrieval method based on multi-semantic analysis and personalized ranking, characterized in that it specifically comprises the following steps:
Step 1: using crawler technology to obtain web documents from the Internet and manually classifying a portion of them as the training model; constructing a term-vector library with the multi-semantic analysis method MSA; representing the web documents as vectors; training a support vector machine (SVM) classifier on the document vectors of the training model, so that new web pages are classified with this model; and writing the class information of every web page into the index library as an attribute;
Step 2: based on the term-vector library formed in Step 1, constructing a term vector for each search keyword entered by the user to form the final query vector, and performing a class-matching query between the query vector and the index library to obtain the initial web search result;
Step 3: optimizing the ranking of the initial retrieval result according to the user's personal customization information and historical access information, and returning the final retrieval result to the user.
2. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that, in said Step 1, the term-vector library is constructed based on the multi-semantic analysis method MSA and the classification results of the web documents are written into the index to form the index library, specifically comprising the steps:
Step 11: constructing the concept space, the space having m dimensions;
the basic dimensions of the concept space are a set of class labels that can represent the information of the whole corpus; in general, m class labels extracted directly from the corpus classification tags constitute the m dimensions of the vector, and the semantic information of each word in a web document is then described by an m-dimensional vector, called its term vector;
Step 12: determining the term-vector component values:
the words are extracted from the web documents of the training model, and the value of each term-vector component is determined by all documents of the training model; each component of the term vector is computed as:

w(c_i, t_j) = \sum_{k=1}^{|D|} H(c_i, d_k) \cdot \frac{\log_2(1 + tf(d_k, t_j))}{\log_2(1 + length(d_k))}

where t_j denotes the j-th word in the term-vector library; w(c_i, t_j) is the relation between word t_j and the i-th dimension c_i of its term vector, i.e. the i-th component of the term vector of t_j; |D| is the number of training documents; tf(d_k, t_j) is the frequency of word t_j in document d_k; H(c_i, d_k) is a discriminant function whose value is 1 if document d_k belongs to the field described by dimension c_i and 0 otherwise; length(d_k) is the length of document d_k, i.e. the number of words obtained after word segmentation and noise removal, with repeated occurrences counted, so length(d_k) >= n; and k indexes the documents;
Step 13: unit normalization of the term vectors and formation of the term-vector library:
each term vector is normalized so that its component values lie in [0, 1], giving better generality; the normalized term vectors form the term-vector library; the normalization formula is:

w'(c_i, t_j) = w(c_i, t_j) / \sum_{i=1}^{m} w(c_i, t_j)

where the normalized term vector is denoted \vec{t_j} and w'(c_i, t_j) is its i-th component; the term-vector library then consists of the vectors

\vec{t_j} = (w'(c_1, t_j), w'(c_2, t_j), ..., w'(c_m, t_j))^T
Step 14: obtaining the weight of each word in a document by the TFIDF method and normalizing these weights; TFIDF has been popular for many years and is proven to be one of the effective weighting methods; it does not consider class information and the weights depend only on the overall state of the corpus, so it is highly general and can be applied to weighting words in multi-class text representation; the TFIDF weight is computed as:

weight(t_g, d_k) = TFIDF(t_g, d_k) = tf(t_g, d_k) \times \lg\frac{|D|}{|D'|}

where t_g is the g-th word of document d_k; weight(t_g, d_k) is the weight of word t_g in document d_k; D is the set of training documents and d_k its k-th document; |D| is the number of training documents; D' is the set of documents containing word t_g, and |D'| is the number of documents in D';
the weights are likewise normalized so that the weight of each word of the segmented document lies in [0, 1]; the normalized weight is computed as:

weight'(t_g, d_k) = weight(t_g, d_k) / \sum_{j=1}^{n} weight(t_j, d_k)

where weight'(t_g, d_k) is the normalized weight of word t_g in document d_k and n is the number of distinct words (segmentation types) in the document;
Step 15: forming the document vectors; once the TFIDF weights are available, the document vector of the multi-semantic analysis (MSA) is formed; the i-th component of the document vector \vec{d_k} corresponding to document d_k is computed as:

wd(c_i, d_k) = \sum_{g=1}^{n} w'(c_i, t_g) \times weight'(t_g, d_k)

and the document vector of document d_k is written as:

\vec{d_k} = weight'(t_1, d_k)\vec{t_1} + weight'(t_2, d_k)\vec{t_2} + ... + weight'(t_n, d_k)\vec{t_n} = \sum_{g=1}^{n} weight'(t_g, d_k)\,\vec{t_g}

where n is the number of distinct words in the document and \vec{t_g} is the vector of word t_g in the term-vector library;
in this document vector each component value directly represents the degree of correlation between the document and the corresponding dimension (class), so the vector carries strong semantics and is the basis of the matching query; the document vectors are then classified with the support vector machine over the m pre-defined class labels, the trained model serves as the classification criterion for new web pages, and the class of every web page is written into the index library as an attribute.
3. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that, in said Step 2, the matching analysis of the query vector and the index library comprises the substeps:
Step 21: based on the term-vector library, representing the search keywords entered by the user as vectors;
the set of search keywords is denoted KEY = {key_1, key_2, ..., key_n}; for each keyword key_i, its term vector is extracted from the established term-vector library to obtain its vector form \vec{T_i}; all keywords then form the query-vector set {\vec{T_1}, \vec{T_2}, ..., \vec{T_n}}; if key_i does not exist in the term-vector library, \vec{T_i} = \vec{0};
Step 22: on the basis of Step 21, forming the query vector of the search keywords in the m-dimensional space according to:

\vec{Q} = \sum_{i=1}^{n} \vec{T_i} = (\vec{T_1} + \vec{T_2} + ... + \vec{T_n}) = (\alpha_1, \alpha_2, ..., \alpha_m)^T

the three largest components of \vec{Q} are denoted \alpha_p, \alpha_q, \alpha_r, their corresponding dimension classes are denoted c_p, c_q, c_r, and the weight vector of these classes, formed from the three components, is recorded; this weight vector is used later in the user-profile matching; a matching query is performed in the index library based on the three classes {c_p, c_q, c_r}, the web pages belonging to these three classes are filtered out, and the basic Lucene scoring algorithm is applied to obtain the initial ranking result.
4. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 3, characterized in that the basic Lucene scoring algorithm is:

score(q, d) = coord(q, d) \cdot queryNorm(q) \cdot \sum_{t \in q} \bigl( tf(t, d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t, d) \bigr)

where q is the retrieval request;
tf(t, d) is the frequency of term t in document d;
idf(t) is the inverse document frequency of term t;
t.getBoost() is the weight of each term in the query statement;
norm(t, d) is a normalization factor;
coord(q, d) is a coordination factor;
queryNorm(q) is the normalization value of the query, computed from the sum of the squared weights of the query terms.
5. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that optimizing the initial ranking result according to the user's personal customization information specifically comprises the following steps:
Step 301: collecting the three kinds of personal customization information with the highest query frequency for the user: the first customized information u, the second customized information v and the third customized information s, and setting the weights of these three kinds of personal customization information to A, B and E;
Step 302: query matching when the user's customization information is determined; in this case the class of every piece of the user's personal information is known, so the basic Lucene score of each document in the initial ranking result is modified as follows:
I. if u, v and s are all 0, the document score is unchanged;
II. if at least one of u, v and s is not 0, then:

newscore = score \cdot (1 + A \cdot u + B \cdot v + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where
u = 1 if the class of the web page matches the first customized information, and 0 otherwise;
v = 1 if the class of the web page matches the user's second customized information, and 0 otherwise;
s = 1 if the class of the web page matches the user's third customized information, and 0 otherwise;
topscore is the highest score among the result documents and lastscore the lowest score;
Step 303: query matching when the user's customization information is fuzzy:
when the personal customization information entered by the user does not belong to the given default category range, the entered information is looked up in the term-vector library to find the corresponding classes, yielding new term vectors; the term-vector set corresponding to the user's first customized information has a corresponding weight vector and corresponding classes c_1, c_2, c_3, and the term-vector set corresponding to the user's second customized information has a corresponding weight vector and corresponding classes c_4, c_5, c_6; the new class set is denoted C = {c_1, c_2, c_3} ∪ {c_4, c_5, c_6}; for each web document, if document d_k belongs to class c_i and c_i ∈ C, the document score of the web page becomes:

newscore = score \cdot (1 + A \cdot wu_i + B \cdot wv_i + E \cdot s) \cdot \left(1 + \frac{topscore - score}{topscore - lastscore}\right)

where wu_i and wv_i are the components of the respective weight vectors in the dimension corresponding to class c_i; if a weight vector has no dimension corresponding to class c_i, the corresponding value is 0.
6. The semantic retrieval method based on multi-semantic analysis and personalized ranking according to claim 1, characterized in that optimizing the initial ranking result according to the user's historical access information specifically comprises the following steps:
Step 311: if document d is a history item or a hot link, executing the following step, and otherwise skipping it;
Step 312: letting the initial rank be r, the new rank of d is:

r' = \frac{r}{s' \cdot \log(2 + n_1) + h \cdot \log(2 + n_2)}

where
s' is 1 if the document is a history record and 0 otherwise;
h is 1 if the document is a hot link and 0 otherwise;
n_1 is the number of times the user has clicked this history item;
n_2 is the number of clicks on the hot link;
it follows that the minimum value of r' is 0.
CN201210488572.XA 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing Expired - Fee Related CN103020164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210488572.XA CN103020164B (en) 2012-11-26 2012-11-26 Semantic search method based on multi-semantic analysis and personalized sequencing

Publications (2)

Publication Number Publication Date
CN103020164A true CN103020164A (en) 2013-04-03
CN103020164B CN103020164B (en) 2015-06-10

Also Published As

Publication number Publication date
CN103020164B (en) 2015-06-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150610

Termination date: 20151126

CF01 Termination of patent right due to non-payment of annual fee