CN104484380A - Personalized search method and personalized search device

Info

Publication number: CN104484380A
Application number: CN201410750996.8A
Authority: CN
Inventors: 张军; 吴先超
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2014-12-09
Filing date: 2014-12-09
Publication date: 2015-04-01

Abstract

An embodiment of the invention discloses a personalized search method and a personalized search device. The method includes: acquiring a plurality of search results corresponding to a query input by a user in a search engine; acquiring the personalized vocabulary set of the user type to which the user belongs; computing the similarity between the personalized terms in the personalized vocabulary set and each search result; and sorting the search results according to the computed similarities and displaying the sorted search results. With this technical scheme, reasonable personalized search results can be provided to users in a simple, efficient and secure manner.

Description

Personalized search method and device

Technical field

The embodiments of the present invention relate to the field of computer technology, and in particular to a personalized search method and device.

Background

At present, after receiving a query input by a user, a search engine searches a resource library according to the query to obtain a number of search results, sorts these results with a basic relevance algorithm, and displays them to the user. However, when different users input the same query, the meaning of that query is likely to differ from user to user.

For example, when a League of Legends fan and a smartphone enthusiast both enter the query "S4", the former most likely wants "League of Legends LOL 2014 professional league: S4 season world finals", whereas the latter wants "Samsung Galaxy S4 smartphone". Likewise, the query word "operation" may mean a student's homework or an engineering job, depending on the user's context; and the word "model" means something entirely different to an algorithm engineer than to a toy enthusiast.

Therefore, how to provide personalized search results for different users has become a very important problem. There are currently two mainstream approaches to this problem:

(1) After a query input by the user is received and a number of search results are obtained by searching the resource library according to the query, the search results are sorted based on their similarity to the query habits, click habits and the like found in the user's search history, and the sorted results are displayed as the user's personalized search results;

(2) Other types of behavior information about the user are collected (such as articles the user posts on social networking sites, or historical text the user enters on a client through an input method), so that, with a fuller understanding of the user, a computer algorithm can return more personalized search results to the user.

However, both approaches have drawbacks: the first needs to accumulate a sufficiently rich search history for each individual user; the second is more complex, does not favor the protection of user privacy, and is less secure.

Summary of the invention

The embodiments of the present invention provide a personalized search method and device that provide users with more reasonable personalized search results in a simple, efficient and secure manner.

In a first aspect, an embodiment of the present invention provides a personalized search method, which comprises:

obtaining multiple search results corresponding to a query input by a user in a search engine;

obtaining the personalized vocabulary set of the user type to which the user belongs;

computing the similarity between the personalized terms in the personalized vocabulary set and each search result;

sorting the multiple search results according to the computation results, and displaying the sorted search results.

In a second aspect, an embodiment of the present invention further provides a personalized search device, which comprises:

a search result acquiring unit, configured to obtain multiple search results corresponding to a query input by a user in a search engine;

a personalized vocabulary set acquiring unit, configured to obtain the personalized vocabulary set of the user type to which the user belongs;

a similarity computing unit, configured to compute the similarity between the personalized terms in the personalized vocabulary set and each search result;

a sorting and display unit, configured to sort the multiple search results according to the computation results and to display the sorted search results.

In the technical scheme provided by the embodiments of the present invention, the obtained search results are sorted using the similarity between each search result and the personalized terms of the user type to which the user belongs, thereby realizing personalized search for the user. The scheme neither needs to accumulate a rich search history for each individual user nor needs to collect the user's behavior information other than search behavior; it is simple to implement, efficient and secure.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a personalized search method provided by Embodiment one of the present invention;

Fig. 2 is a schematic flowchart of a personalized search method provided by Embodiment two of the present invention;

Fig. 3 is a schematic flowchart of a personalized search method provided by Embodiment three of the present invention;

Fig. 4 is a schematic flowchart of a personalized search method provided by Embodiment four of the present invention;

Fig. 5 is a schematic flowchart of a personalized search method provided by Embodiment five of the present invention;

Fig. 6 is a schematic structural diagram of a personalized search device provided by Embodiment six of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

Embodiment one

Fig. 1 is a schematic flowchart of a personalized search method provided by Embodiment one of the present invention. The method can be performed by a personalized search device. The device may be implemented in software and built into a terminal device with a search function as part of a search engine implementation. The terminal device may be a smartphone, tablet computer, notebook computer, desktop computer, personal digital assistant, and so on. Referring to Fig. 1, the personalized search method provided by this embodiment comprises the following operations:

Operation 110: obtain multiple search results corresponding to a query input by a user in a search engine.

In this embodiment, any search algorithm for obtaining search results may be used to obtain the multiple search results corresponding to the query input by the user in the search engine. For example, the search results may be obtained as follows: receive the query input by the user in the search engine; process the query with error correction, word segmentation, proper-noun recognition, syntactic analysis, synonym expansion and the like; then search the resource library according to the processing result to obtain multiple search results. A search result may be a web page title, or other text that describes the main content of a web page (such as a web page summary). A search result may also include both the web page title and the web page summary.

Operation 120: obtain the personalized vocabulary set of the user type to which the user belongs.

Because the users of a search engine come from all trades and professions and play different roles, these users can be divided into user types. A user type may be, for example, basketball fan, game enthusiast or engineer. Users of the same user type are more likely to share search needs, whereas the search needs of users of different user types differ considerably. This embodiment can therefore use the historical search behavior data of multiple users of the same user type to realize personalized search for an individual user of that type. Compared with using the sparse historical search behavior data of a single user in isolation, this embodiment achieves a collaborative effect and can therefore provide the user with an improved personalized search scheme.

To this end, in this embodiment the user type to which the user belongs, and the personalized vocabulary set of that user type, may be obtained in advance or determined in real time. The personalized vocabulary set consists of multiple personalized terms. A personalized term of a user type is a token that characterizes that user type; it may be extracted from the historical search behavior data of multiple users of that user type, or preset manually. For example, the personalized vocabulary set of the user type "game enthusiast" may include: "game", "mobile game", "Angry Birds", "Dou Dizhu", "Plants vs. Zombies", "Tiantian Kupao", and so on.

Specifically, which of the multiple preset user types the user belongs to can be identified automatically from the user's historical search behavior data. Alternatively, each user may be asked to set his or her user type in advance; a mapping table between users and user types is then built accordingly, and the user type of the current user is determined from this mapping table.

Operation 130: compute the similarity between the personalized terms in the personalized vocabulary set and each search result. That is, for each search result, the similarity between that search result and the personalized vocabulary set is computed, so that one similarity result is obtained per search result.

After the personalized vocabulary set of the user type to which the user belongs is obtained, it can be used to score each search result. Specifically, this is done by computing the similarity between the personalized terms in the personalized vocabulary set and each search result; each computation result is one score.

For example, the similarity GroupPersonalScore(T_i) between the personalized terms in the personalized vocabulary set and the i-th search result T_i among the multiple search results can be computed with the following formula:

GroupPersonalScore(T_i) = \alpha \times \frac{\Delta N(T_i)}{N(T_i)}

where \alpha is a preset parameter value, N(T_i) is the number of tokens in the i-th search result T_i, and \Delta N(T_i) is the number of tokens in the i-th search result T_i that are personalized terms in the personalized vocabulary set.

In the embodiments of the present invention, α may be 1, or may be the user's confidence under the user type to which the user belongs. The confidence is a factor that measures the probability that the user belongs to that user type.
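The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `group_personal_score`, the pre-tokenized input, and the choice to return 0 for an empty result are assumptions made for the example.

```python
def group_personal_score(result_tokens, personalized_vocab, alpha=1.0):
    """Return alpha * (tokens found in the vocabulary set) / (all tokens).

    result_tokens: tokens of one search result (word segmentation is
                   assumed to have been done already).
    personalized_vocab: set of personalized terms for the user type.
    alpha: preset parameter, e.g. 1 or the user's confidence.
    """
    if not result_tokens:
        # Assumption: an empty result gets score 0 (not specified in the text).
        return 0.0
    hits = sum(1 for tok in result_tokens if tok in personalized_vocab)
    return alpha * hits / len(result_tokens)


# Hypothetical data: 2 of the 4 tokens are personalized terms.
vocab = {"game", "league", "finals"}
tokens = ["league", "finals", "s4", "season"]
score = group_personal_score(tokens, vocab)  # 1.0 * 2 / 4 = 0.5
```

Passing the user's confidence as `alpha` scales the score down for users whose type assignment is uncertain.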

Operation 140: sort the multiple search results according to the computation results, and display the sorted search results.

After the similarity result (that is, the score) of each search result is obtained, the multiple search results can be sorted in descending order of similarity and then displayed. The search results may also be sorted by combining this score with scores that other algorithms assign to them.

For example, sorting the multiple search results according to the similarity results and displaying the sorted search results comprises:

for each of the multiple search results, performing the following operation: taking the similarity result of the search result as a third candidate score, and fusing it by weighting with a first candidate score of the search result determined by a set first scoring algorithm and a second candidate score of the search result determined by a set second scoring algorithm, to obtain the final score of the search result;

sorting the multiple search results in descending order of final score, and displaying the sorted search results.
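The weighted fusion and sorting steps above might look as follows. This is only a sketch: the text does not name the first and second scoring algorithms or the fusion weights, so the field names `score1`, `score2`, `personal` and the weights `(0.4, 0.4, 0.2)` are hypothetical.

```python
def rank_results(results, weights=(0.4, 0.4, 0.2)):
    """Fuse three candidate scores per result and sort in descending order.

    results: list of dicts with hypothetical keys
             "title", "score1", "score2" (scores from two other scoring
             algorithms) and "personal" (the similarity score).
    weights: hypothetical fusion weights for the three candidate scores.
    """
    w1, w2, w3 = weights
    scored = [
        (w1 * r["score1"] + w2 * r["score2"] + w3 * r["personal"], r["title"])
        for r in results
    ]
    # Highest final score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


results = [
    {"title": "A", "score1": 0.9, "score2": 0.8, "personal": 0.0},
    {"title": "B", "score1": 0.8, "score2": 0.8, "personal": 1.0},
]
ranked = rank_results(results)  # B (0.84) outranks A (0.68)
```

In this toy input, result B has slightly lower base relevance but matches the user type's vocabulary, so the fused score moves it to the top.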

In the technical scheme provided by this embodiment, the obtained search results are sorted using the similarity between each search result and the personalized terms of the user type to which the user belongs, thereby realizing personalized search for the user. The scheme neither needs to accumulate a rich search history for each individual user nor needs to collect the user's behavior information other than search behavior; it is simple to implement, efficient and secure.

Embodiment two

Fig. 2 is a schematic flowchart of a personalized search method provided by Embodiment two of the present invention. On the basis of Embodiment one, this embodiment adds, before the operation of obtaining the personalized vocabulary set of the user type to which the user belongs, an operation of identifying that user type and an operation of determining the personalized vocabulary set of the user type. Referring to Fig. 2, the personalized search method provided by this embodiment comprises the following operations:

Operation 210: identify the user type to which the user belongs according to the user's historical search behavior data.

Operation 220: determine the personalized vocabulary set of the user type according to the historical search behavior data of the particular group of that user type.

The user's historical search behavior data comprises historical query word data and historical click data. The historical query word data may be the query words, recorded by the search engine, that the user has entered in the past (or within a recent set period, for example the last 6 months); the historical click data may be the search results, recorded by the search engine, that the user has clicked in the past (or within a recent set period, for example the last 6 months).

In this embodiment, the user type can be identified from the user's historical search behavior data with a multi-class machine learning classifier trained on a pre-labeled document set. Specifically, a large number of document samples and the user type corresponding to each document sample are obtained in advance; features are extracted from each document sample; and a machine learning algorithm is then trained on the extracted features and the corresponding user types to generate a classifier for identifying user types.

To identify the user type of the current user, the current user's historical search behavior data is first merged into one document; features are then extracted from this document and fed to the classifier, which identifies the user type.

The user type to which the user belongs may also be identified in other ways. For example, the user's historical search behavior data is combined into one document, and the document is segmented into tokens; based on these tokens, the personalized vocabulary set of each preset user type is traversed to identify which tokens of the document each set contains; the user type whose personalized vocabulary set contains the most tokens of the document is determined to be the user type to which the user belongs.
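The vocabulary-matching alternative described in the last paragraph can be sketched as below. The function name `identify_user_type` and the sample data are hypothetical; counting distinct tokens (rather than occurrences) and resolving ties in favor of the first type are assumptions the text leaves open.

```python
def identify_user_type(history_tokens, vocab_by_type):
    """Pick the user type whose vocabulary set covers the most history tokens.

    history_tokens: tokens from the user's merged search-history document.
    vocab_by_type: dict mapping user type name -> set of personalized terms.
    """
    distinct = set(history_tokens)  # assumption: count each token once
    # max() returns the first type on ties (an assumption for this sketch).
    return max(vocab_by_type, key=lambda t: len(distinct & vocab_by_type[t]))


vocab_by_type = {
    "game enthusiast": {"game", "mobile game", "angry birds"},
    "basketball fan": {"nba", "basketball", "dunk"},
}
history = ["nba", "dunk", "game", "nba"]
user_type = identify_user_type(history, vocab_by_type)  # "basketball fan"
```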

After the user type to which the user belongs is identified, personalized search for an individual user of this type can be realized according to the historical search behavior data of the particular group of this user type, which addresses the sparsity of an individual user's historical search behavior data. The particular group consists of multiple users, among all users of the user type, who are representative of that type. The particular group may be preset manually or obtained automatically. For example, in application scenarios where the confidence of each user under his or her user type is available, the particular group may consist of the users whose confidence is greater than a set threshold.

For example, determining the personalized vocabulary set of the user type according to the historical search behavior data of the particular group of that user type comprises:

adding to the personalized vocabulary set, as personalized terms of the user type, the tokens in the historical search behavior data of the particular group that occur with a frequency exceeding a set frequency threshold and are not stop words.

In the above example, stop words (for example, very common words and punctuation marks) that may occur with very high frequency are not used as personalized terms. This ensures that the personalized terms counted for different user types are highly discriminative.
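A minimal sketch of this vocabulary-building rule is shown below. It assumes that "frequency" means the raw occurrence count over the particular group's merged, already segmented history; the function name, the stop-word list and the threshold value are all hypothetical.

```python
from collections import Counter


def build_personalized_vocab(group_tokens, stop_words, freq_threshold):
    """Keep tokens whose count exceeds the threshold and that are not stop words.

    group_tokens: tokens from the merged history of the particular group.
    stop_words: set of stop words to exclude.
    freq_threshold: hypothetical raw-count threshold.
    """
    counts = Counter(group_tokens)
    return {
        tok
        for tok, n in counts.items()
        if n > freq_threshold and tok not in stop_words
    }


# "the" is frequent but excluded; "weather" is too rare.
group_tokens = ["nba"] * 3 + ["the"] * 5 + ["dunk"] * 2 + ["weather"]
vocab = build_personalized_vocab(group_tokens, stop_words={"the"}, freq_threshold=1)
```

A relative frequency (count divided by total tokens) would work equally well here; only the threshold scale changes.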

Operation 230: obtain multiple search results corresponding to a query input by the user in a search engine.

Operation 240: obtain the determined personalized vocabulary set.

Operation 250: compute the similarity between the personalized terms in the personalized vocabulary set and each search result.

Operation 260: sort the multiple search results according to the computation results, and display the sorted search results.

It should be noted that, in this embodiment, operations 210 to 220 and operation 230 can be executed in either order. That is, operation 230 may be executed first, followed by operation 210 and then operation 220.

The technical scheme provided by this embodiment can automatically identify the user type to which the user belongs from the user's historical search behavior data, and automatically determine the personalized vocabulary set of the user type from the historical search behavior data of the particular group of that type. Compared with the scheme that determines the user type of the current user from a pre-built mapping table between users and user types, this embodiment does not require users to set their user types manually in advance, which makes personalized search more intelligent.

Embodiment three

Fig. 3 is a schematic flowchart of a personalized search method provided by Embodiment three of the present invention. On the basis of Embodiment two, this embodiment further optimizes the operation of identifying the user type to which the user belongs. Referring to Fig. 3, the personalized search method provided by this embodiment comprises the following operations:

Operation 310: extract features from the user's historical search behavior data, and feed the extracted features to the classifier used for user type identification, to obtain from the classifier the user's confidence under each preset user type.

The historical search behavior data comprises historical query word data and historical click data.

For example, extracting features from the user's historical search behavior data and feeding the extracted features to the classifier used for user type identification, to obtain from the classifier the user's confidence under each preset user type, comprises:

extracting features from the user's historical query word data, and feeding the extracted features to the classifier used for user type identification, to obtain from the classifier the user's first confidence under each preset user type;

extracting features from the user's historical click data, and feeding the extracted features to the classifier used for user type identification, to obtain from the classifier the user's second confidence under each preset user type.

Those of ordinary skill in the art will understand that the historical query word data and the historical click data may also be treated as a whole: features are extracted from the combined data and fed to the classifier used for user type identification, to obtain from the classifier the user's confidence under each preset user type.

Operation 320: determine the user type to which the user belongs according to the obtained confidences.

To predict the user type more accurately, in one implementation of this embodiment, determining the user type to which the user belongs according to the obtained confidences may specifically be:

weighting the user's first confidence and second confidence under each preset user type, to obtain the user's new confidence under each preset user type;

choosing the user type with the highest new confidence as the user type to which the user belongs.

For example, suppose 3 user types are preset. The user's first confidence under the first user type A is a1 and the second confidence is a2; the first confidence under the second user type B is b1 and the second confidence is b2; the first confidence under the third user type C is c1 and the second confidence is c2. Then the user's new confidence under user type A is β × a1 + (1 - β) × a2; under user type B it is β × b1 + (1 - β) × b2; and under user type C it is β × c1 + (1 - β) × c2, where β is a preset weighting coefficient.
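The weighting in this example can be written out directly. The sketch below uses hypothetical confidence values and β = 0.5; the function name `fuse_confidences` is not from the source.

```python
def fuse_confidences(first, second, beta=0.5):
    """New confidence per type: beta * first + (1 - beta) * second.

    first: dict of user type -> first confidence (from query word data).
    second: dict of user type -> second confidence (from click data).
    beta: preset weighting coefficient (0.5 here is a hypothetical value).
    """
    return {t: beta * first[t] + (1 - beta) * second[t] for t in first}


# Hypothetical confidences for the three user types A, B and C.
first = {"A": 0.6, "B": 0.3, "C": 0.1}
second = {"A": 0.2, "B": 0.7, "C": 0.1}
fused = fuse_confidences(first, second, beta=0.5)
best = max(fused, key=fused.get)  # type with the highest new confidence
```

With these numbers, type A leads on query words alone, but the click data shifts the fused result to type B.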

In another implementation of this embodiment, the user type may first be determined from the user's first confidences under the preset user types, and then determined again from the user's second confidences under the preset user types; when the two determined user types differ, the one with the higher confidence is chosen. Accordingly, determining the user type to which the user belongs according to the obtained confidences may specifically be:

finding the largest of the user's first confidences under the preset user types, as a first candidate value;

finding the largest of the user's second confidences under the preset user types, as a second candidate value;

choosing the user type corresponding to the larger of the first candidate value and the second candidate value as the user type to which the user belongs.
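The three steps above can be sketched as follows. The function name and the sample confidences are hypothetical, and preferring the first candidate when the two candidate values are equal is an assumption, since the text does not address ties.

```python
def pick_by_max_candidate(first, second):
    """Choose the type behind the larger of the two best confidences.

    first: dict of user type -> first confidence.
    second: dict of user type -> second confidence.
    """
    type1 = max(first, key=first.get)    # type of the first candidate value
    type2 = max(second, key=second.get)  # type of the second candidate value
    # Assumption: on a tie, keep the type from the first confidences.
    return type1 if first[type1] >= second[type2] else type2


first = {"A": 0.6, "B": 0.3, "C": 0.1}
second = {"A": 0.2, "B": 0.7, "C": 0.1}
chosen = pick_by_max_candidate(first, second)  # 0.7 (B) beats 0.6 (A)
```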

Operation 330: determine the personalized vocabulary set of the user type according to the historical search behavior data of the particular group of that user type.

Operation 340: obtain multiple search results corresponding to a query input by the user in a search engine.

Operation 350: obtain the determined personalized vocabulary set.

Operation 360: compute the similarity between the personalized terms in the personalized vocabulary set and each search result.

Operation 370: sort the multiple search results according to the computation results, and display the sorted search results.

This embodiment uses machine learning to identify the user type from the user's historical search behavior data. The accuracy of this identification is much higher than that of the scheme that identifies the user type by matching the historical search behavior data against the personalized vocabulary set of each preset user type.

Embodiment four

Fig. 4 is a schematic flowchart of a personalized search method provided by Embodiment four of the present invention. On the basis of Embodiment three, this embodiment adds operations for obtaining the classifier used for user type identification. Referring to Fig. 4, the personalized search method provided by this embodiment comprises the following operations:

Operation 400: obtain a training document set, which comprises multiple training documents and the user type of each training document.

In this embodiment, so that the user type can be identified from the user's historical search behavior data, a machine learning algorithm may be used to generate a classifier for identifying user types.

To this end, a training document set must first be obtained. Specifically, assume that N user types are preset (N is a natural number greater than 1): C_1, C_2, ..., C_N. For each user type, a group of documents typical of that user type can be prepared as training documents. For example, all training documents under the i-th user type C_i comprise:

\{D_i^1, D_i^2, \ldots, D_i^{m_i}\}

where D_i^1 denotes the first training document under the i-th user type C_i, D_i^2 denotes the second training document under the i-th user type C_i, ..., and D_i^{m_i} denotes the m_i-th training document under the i-th user type C_i.

Any training document consists of a token sequence w_1, w_2, ..., w_h. That is, a training document is made up of a sequence of tokens.

In this embodiment, the user types and their names can be set manually. The training documents of each user type may be a group of hand-picked documents, or may be collected with heuristic rules. For example, all documents under the football column of a news website can be collected with a configured web crawling algorithm and used as the training documents of the user type "football fan".

Operation 410: determine the feature value of each of multiple preset tokens in each training document.

In this embodiment, a dictionary-like vocabulary can be generated in advance. The vocabulary consists of the multiple preset tokens. These tokens are not tied to any specific user type and are general. For example, for each training document, the feature value of each preset token in the training document may be the TFIDF of that token in the training document. Specifically, the TFIDF of a token in a training document is defined as follows:

TF (term frequency) is the number of times the token occurs in the training document divided by the total number of tokens in that training document.

IDF (inverse document frequency) is the logarithm of the number of all training documents in the training document set divided by the number of training documents in which the token occurs.

TFIDF is the product of TF and IDF.
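The TF, IDF and TFIDF definitions above translate into the short sketch below. The function name and sample documents are hypothetical; the natural logarithm is assumed because the text does not state the base, and vocabulary tokens that occur in no training document are given a feature value of 0 to avoid division by zero.

```python
import math


def tfidf_features(docs, vocabulary):
    """Return one feature row per document, one TFIDF value per vocabulary token.

    docs: list of training documents, each a list of tokens.
    vocabulary: list of the preset tokens (fixes the feature order).
    """
    n_docs = len(docs)
    # Document frequency: number of training documents containing each token.
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    features = []
    for doc in docs:
        row = []
        for w in vocabulary:
            if df[w] == 0 or not doc:
                row.append(0.0)  # assumption: unseen token or empty doc -> 0
                continue
            tf = doc.count(w) / len(doc)
            idf = math.log(n_docs / df[w])  # natural log is an assumption
            row.append(tf * idf)
        features.append(row)
    return features


docs = [["football", "goal", "goal"], ["game", "goal"]]
vocabulary = ["football", "goal", "game"]
feats = tfidf_features(docs, vocabulary)
```

Note that "goal" appears in every document, so its IDF is log(1) = 0 and it contributes nothing to either feature row, which is the intended effect of the IDF term.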

In summary, assuming the vocabulary contains M tokens (w_1, w_2, ..., w_M), the feature values of these tokens in each training document, together with the user type of each training document, are as follows:

Training document 1: w_1: tfidf_{11}, w_2: tfidf_{12}, ..., w_M: tfidf_{1M}; user type: C_i;

Training document 2: w_1: tfidf_{21}, w_2: tfidf_{22}, ..., w_M: tfidf_{2M}; user type: C_i;

……

Operation 420: train the classifier used for user type identification based on a maximum entropy model and the determined feature values.

After the feature values are determined, the classifier used for user type identification can be trained from them based on a maximum entropy model, that is, with the training method used when training a maximum entropy model.

In this embodiment, assume there are N user types in total and the vocabulary contains M preset tokens. The following equation can then be built: W × X + B = Y, where:

W is a parameter matrix with N rows and M columns, whose elements are the weighting coefficients of the tokens under the different user types (for example, the element in row i and column j of W is the weighting coefficient of the j-th token under the i-th user type);

X is an M-dimensional input column vector;

B is an N-dimensional bias vector, whose elements are the bias coefficients of the different user types;

Y is an N-dimensional output column vector, whose elements are the probabilities of belonging to the different user types (for example, the j-th element of Y is the probability of belonging to the j-th user type).

Further, using the determined feature values, one obtains W × X_l + B = Y_l for l = 1, ..., L (where L is the number of training documents in the training document set), where:

X_l is the M-dimensional feature vector made up of the feature values of the M preset segmented words in the l-th training document; the j-th element of this vector is the feature value of the j-th preset segmented word in the l-th training document;

Y_l is the N-dimensional output column vector for the l-th training document; in Y_l, the element corresponding to the user type of the l-th training document is set to 1 and all other elements are set to 0.
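The construction of the target vectors Y_l described above can be sketched as follows. This is only an illustrative sketch: the function name `one_hot_target` and the use of 0-based user type indices are assumptions made for the example and do not appear in the original text.

```python
def one_hot_target(user_type_index, num_user_types):
    """Build Y_l: 1 at the position of the document's user type, 0 elsewhere.

    Assumption: user types are numbered 0 .. num_user_types - 1.
    """
    if not 0 <= user_type_index < num_user_types:
        raise ValueError("user type index out of range")
    return [1.0 if k == user_type_index else 0.0 for k in range(num_user_types)]


# Three training documents labeled with user types 2, 0 and 1 (N = 3).
targets = [one_hot_target(c, 3) for c in (2, 0, 1)]
```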

Accordingly, W and B can be obtained by training on X_l and Y_l under the maximum entropy model, which yields the classifier for identifying user types. Given an M-dimensional input column vector X_m made up of the feature values of the M preset segmented words in any document, this classifier computes W × X_m + B to obtain the corresponding N-dimensional output column vector Y_m, whose elements are the probabilities of belonging to the different user types.

As an example, to obtain W and B by training, the following cost function J(W, B) can first be constructed:

J(W, B) = -\sum_{i=1}^{L} \log \frac{\exp(W_{C_{real,i}} \times X_{i} + B_{C_{real,i}})}{\sum_{k=1}^{N} \exp(W_{k} \times X_{i} + B_{k})}

In this cost function, W and B are the parameters to be obtained by training, where:

L is the number of training documents in the training document set;

N is the number of user types;

X_i is the M-dimensional input column vector made up of the feature values of the M preset segmented words in the i-th training document of the training document set;

C_{real,i} is the user type recorded in the training document set for the i-th training document;

W_{C_{real,i}} is the row vector corresponding to C_{real,i} in the parameter matrix W, which has N rows and M columns;

B_{C_{real,i}} is the element corresponding to C_{real,i} in the N-dimensional bias vector B;

W_k is the k-th row vector of the parameter matrix W, which has N rows and M columns;

B_k is the k-th element of the N-dimensional bias vector B.
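To make the cost function concrete, the following sketch evaluates J(W, B) on a small data set in plain Python. The function names `scores` and `cost`, the 0-based user type labels, and the subtraction of the maximum score for numerical stability are assumptions of this sketch rather than details given in the original text.

```python
import math


def scores(W, B, x):
    """Compute W_k x X + B_k for every user type k."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]


def cost(W, B, X, labels):
    """J(W, B): summed negative log-probability of each document's recorded user type."""
    total = 0.0
    for x, c in zip(X, labels):
        s = scores(W, B, x)
        m = max(s)  # subtracting the maximum keeps exp() from overflowing
        log_norm = m + math.log(sum(math.exp(v - m) for v in s))
        total -= s[c] - log_norm
    return total


# With all parameters at zero, every user type is equally likely,
# so each of the two documents contributes log(2) to the cost.
W = [[0.0, 0.0], [0.0, 0.0]]
B = [0.0, 0.0]
X = [[1.0, 2.0], [3.0, 4.0]]
labels = [0, 1]
j_value = cost(W, B, X, labels)
```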

In this way, the above cost function can be optimized with the SGD (Stochastic Gradient Descent) algorithm to obtain the model parameters W and B of the classifier for identifying user types.

The idea of the SGD algorithm is as follows:

The gradient (the partial derivatives of the cost function with respect to the parameters W and B) is computed on a group of training samples (a mini-batch, whose size is called the mini-batch size), and the randomly initialized parameters W and B are updated iteratively. In each update, the gradient multiplied by a set learning rate is subtracted from W and B, until the iteration stopping condition is met. Using the parameters W and B obtained when the iteration stops, the classifier can identify the user type to which any user belongs.
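The update rule described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation of the present embodiment: the function names, the fixed number of epochs used as the stopping condition, the averaging of the gradient over each mini-batch, and the default hyperparameters are all assumptions of this sketch.

```python
import math
import random


def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]


def predict(W, B, x):
    """Return the index of the user type with the highest score."""
    s = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]
    return max(range(len(s)), key=s.__getitem__)


def train_sgd(X, labels, num_types, lr=0.5, batch_size=2, epochs=200, seed=0):
    rng = random.Random(seed)
    M = len(X[0])
    # Randomly initialize W and B with small values.
    W = [[rng.uniform(-0.01, 0.01) for _ in range(M)] for _ in range(num_types)]
    B = [rng.uniform(-0.01, 0.01) for _ in range(num_types)]
    order = list(range(len(X)))
    for _ in range(epochs):  # assumed stopping condition: fixed epoch count
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            gW = [[0.0] * M for _ in range(num_types)]
            gB = [0.0] * num_types
            for i in batch:
                s = [sum(w * v for w, v in zip(W[k], X[i])) + B[k]
                     for k in range(num_types)]
                p = softmax(s)
                for k in range(num_types):
                    # Gradient of the cost: predicted probability minus target.
                    err = p[k] - (1.0 if k == labels[i] else 0.0)
                    gB[k] += err
                    for j in range(M):
                        gW[k][j] += err * X[i][j]
            # Subtract learning rate times the (batch-averaged) gradient.
            for k in range(num_types):
                B[k] -= lr * gB[k] / len(batch)
                for j in range(M):
                    W[k][j] -= lr * gW[k][j] / len(batch)
    return W, B


X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W, B = train_sgd(X, labels, num_types=2)
predicted = [predict(W, B, x) for x in X]
```

In practice the stopping condition could instead be a tolerance on the change in the cost function; the fixed epoch count is used here only to keep the sketch short.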

Operation 430: calculate the feature value of each of the preset segmented words in the document formed from the historical search behavior data, as the feature extraction result for the user's historical search behavior data; and use the feature extraction result as the input of the classifier for identifying user types, obtaining from the classifier the user's confidence under each preset user type.

For a user who has historical search behavior data, the classifier obtained in the previous operation can be used to determine which user type the user belongs to, as well as the user's confidence under each preset user type.

As an example, all of the historical search behavior data can simply be combined and treated as one document Q. The TFIDF value of each of the M preset segmented words in document Q is then obtained with the same method used to calculate feature values in the previous step, where the TF value in document Q can be obtained directly and the IDF value can be calculated in combination with the multiple preset documents. In the present embodiment, besides directly using the probability value given by W_C × X_Q + B_C as the confidence of document Q under any user type C among the preset user types, the classifier for identifying user types can also obtain that confidence in other ways.

For example, the final confidence P(C|Q) of document Q under any user type C among the N preset user types can be calculated with the following formula:

P(C | Q) = \frac{\exp(W_{C} \times X_{Q} + B_{C})}{\sum_{k=1}^{N} \exp(W_{k} \times X_{Q} + B_{k})}

In the present embodiment:

X_Q is the M-dimensional input column vector made up of the feature values of the M preset segmented words in document Q;

W_C is the row vector corresponding to user type C in the trained parameter matrix W;

B_C is the element corresponding to user type C in the trained bias vector B;

W_k is the k-th row vector of the trained parameter matrix W;

B_k is the k-th element of the trained bias vector B.
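The confidence formula above can be illustrated with the following sketch, which computes P(C|Q) for every user type from trained parameters. The function name `confidences`, the example parameter values, and the choice of the highest-confidence type as the user's type are assumptions of this sketch.

```python
import math


def confidences(W, B, x_q):
    """Return P(C|Q) for every user type C, given the feature vector X_Q."""
    s = [sum(w * v for w, v in zip(row, x_q)) + b for row, b in zip(W, B)]
    m = max(s)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]


# Hypothetical trained parameters for N = 3 user types and M = 2 segmented words.
W = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
B = [0.0, 0.1, -0.1]
p = confidences(W, B, [0.8, 0.2])

# One possible decision rule: pick the user type with the highest confidence.
best_type = max(range(len(p)), key=p.__getitem__)
```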

Operation 440: determine the user type to which the user belongs according to the obtained confidences.

Operation 450: determine the personalized vocabulary set under the user type according to the historical search behavior data of the specific group under the user type.

Operation 460: acquire multiple search results corresponding to the query sentence input by the user in the search engine.

Operation 470: acquire the determined personalized vocabulary set.

Operation 480: calculate the similarity between the personalized vocabulary in the personalized vocabulary set and each search result.

Operation 490: sort the multiple search results according to the calculation results, and display the sorted search results.

Embodiment five

Fig. 5 is a schematic flowchart of a personalized search method provided by Embodiment five of the present invention. Building on all of the above embodiments, the present embodiment provides a preferred embodiment. Referring to Fig. 5, the personalized search method provided by Embodiment five comprises the following operations:

Operation 510: train a classifier for identifying user types.

Operation 520: use the classifier to identify the user type to which the user belongs, and obtain the user's confidence under that user type.

Using the classifier to identify the user type to which the user belongs, and obtaining the user's confidence under that user type, specifically comprises:

Performing feature extraction on the user's history click data, and using the feature extraction result as the input of the classifier for identifying user types, obtaining from the classifier the user's second confidence under each preset user type;

Operation 530: from all users under the user type to which the user belongs, select the users whose confidence is greater than a set threshold as the specific group under this user type.
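The selection in operation 530 can be sketched as a simple filter. The function name `select_specific_group`, the dictionary layout mapping a user identifier to that user's confidence under the user type, and the example values are assumptions made for illustration.

```python
def select_specific_group(user_confidence, threshold):
    """Return the users of one user type whose confidence exceeds the threshold."""
    return sorted(u for u, c in user_confidence.items() if c > threshold)


# Hypothetical confidences of three users under the same user type.
group = select_specific_group({"u1": 0.9, "u2": 0.4, "u3": 0.75}, threshold=0.7)
```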

Operation 540: determine the personalized vocabulary set according to the historical search behavior data of the specific group.

Specifically, segmented words that appear in the historical search behavior data of the specific group under the user type to which the user belongs with a frequency exceeding a set frequency threshold, and that are not stop words, are added to the personalized vocabulary set as the personalized vocabulary under this user type.
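A minimal sketch of operation 540 is given below. It assumes that the historical search behavior data has already been segmented into lists of words and that "frequency of occurrence" means a raw count; the function name `build_personalized_vocabulary` and the example data are likewise assumptions.

```python
from collections import Counter


def build_personalized_vocabulary(group_history, stop_words, frequency_threshold):
    """Keep non-stop-word segmented words whose count exceeds the threshold."""
    counts = Counter(word for record in group_history for word in record)
    return {
        word
        for word, n in counts.items()
        if n > frequency_threshold and word not in stop_words
    }


# Hypothetical segmented history records of the specific group.
history = [
    ["python", "the", "tutorial"],
    ["python", "the", "snake"],
    ["python", "tutorial"],
]
vocabulary = build_personalized_vocabulary(history, stop_words={"the"}, frequency_threshold=1)
```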

Operation 550: acquire multiple search results corresponding to the query sentence input by the user in the search engine.

Operation 560: calculate the similarity GroupPersonalScore(Ti) between the personalized vocabulary in the personalized vocabulary set and the i-th search result Ti among the multiple search results according to the following formula:

GroupPersonalScore(Ti) = α \times \frac{ΔN(Ti)}{N(Ti)}

where α is the user's confidence under the user type to which the user belongs, N(Ti) is the number of segmented words in the i-th search result Ti, and ΔN(Ti) is the number of segmented words in the i-th search result Ti that belong to the personalized vocabulary in the personalized vocabulary set.
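The formula of operation 560 can be sketched directly. The function name `group_personal_score`, the representation of a search result as a list of segmented words, and the choice to return 0 for an empty search result are assumptions of this sketch.

```python
def group_personal_score(result_tokens, personalized_vocabulary, alpha):
    """alpha * (words of Ti found in the personalized vocabulary) / (words in Ti)."""
    if not result_tokens:
        return 0.0  # assumption: an empty search result scores zero
    hits = sum(1 for token in result_tokens if token in personalized_vocabulary)
    return alpha * hits / len(result_tokens)


# Two of the four segmented words belong to the personalized vocabulary.
score = group_personal_score(
    ["python", "tutorial", "for", "beginners"],
    {"python", "tutorial"},
    alpha=0.8,
)
```

With α = 0.8, ΔN(Ti) = 2 and N(Ti) = 4, the score is 0.8 × 2 / 4 = 0.4.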

Operation 570: for each search result among the multiple search results, perform the following operation: take the similarity calculation result of the search result as a third candidate score, and perform weighted fusion of it with the first candidate score of the search result, determined by the set first scoring algorithm, and the second candidate score of the search result, determined by the set second scoring algorithm, to obtain the final score of the search result.

Here, the set first scoring algorithm determines the score of each search result by calculating the similarity between the search result and the user's historical search behavior data; the second scoring algorithm is any other algorithm for scoring the multiple search results besides the first scoring algorithm.

Operation 580: sort the search results in descending order of final score, and display the sorted search results.
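Operations 570 and 580 can be sketched as a weighted sum followed by a descending sort. The weights, the dictionary keys `first`, `second` and `group`, and the example scores are hypothetical; the original text does not specify how the fusion weights are chosen.

```python
def fuse_and_rank(results, weights):
    """Weighted fusion of three candidate scores, then sort by final score (high to low)."""
    w1, w2, w3 = weights
    scored = [
        (w1 * r["first"] + w2 * r["second"] + w3 * r["group"], r["id"])
        for r in results
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [result_id for _, result_id in scored]


# "group" holds the GroupPersonalScore (third candidate score) of each result.
results = [
    {"id": "a", "first": 0.2, "second": 0.5, "group": 0.1},
    {"id": "b", "first": 0.6, "second": 0.4, "group": 0.4},
    {"id": "c", "first": 0.3, "second": 0.3, "group": 0.5},
]
ranking = fuse_and_rank(results, weights=(0.4, 0.3, 0.3))
```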

Compared with schemes that realize personalized search using only the sparse historical search behavior of a single user in isolation, the technical scheme provided by the present embodiment can make full use of the historical search behavior of other users of the same user type, which provides a collaborative effect. Furthermore, the present embodiment can fuse multiple algorithms for scoring search results and sort the search results by the fused result, which brings the ranking closer to the user's search needs.

Embodiment six

Fig. 6 is a schematic structural diagram of a personalized search device provided by Embodiment six of the present invention. Referring to Fig. 6, the structure of the device is as follows:

Search result acquisition unit 610, configured to acquire multiple search results corresponding to the query sentence input by the user in the search engine;

Personalized vocabulary set acquisition unit 620, configured to acquire the personalized vocabulary set under the user type to which the user belongs;

Similarity calculation unit 630, configured to calculate the similarity between the personalized vocabulary in the personalized vocabulary set and each search result;

Sorting and display unit 640, configured to sort the multiple search results according to the calculation results and display the sorted search results.

As an example, the device further comprises:

User type identification unit 600, configured to identify the user type to which the user belongs according to the historical search behavior data of the user, before the personalized vocabulary set acquisition unit 620 acquires the personalized vocabulary set under the user type to which the user belongs;

Personalized vocabulary determination unit 605, configured to determine the personalized vocabulary set according to the historical search behavior data of the specific group under the user type.

As an example, the personalized vocabulary determination unit 605 is specifically configured to:

Add to the personalized vocabulary set, as personalized vocabulary, the segmented words that appear in the historical search behavior data of the specific group under the user type with a frequency exceeding a set frequency threshold and that are not stop words.

As an example, the user type identification unit 600 comprises:

Confidence generation subunit 6002, configured to perform feature extraction on the historical search behavior data of the user, and to use the feature extraction result as the input of the classifier for identifying user types, obtaining from the classifier the confidence of the user under each preset user type;

User type determination subunit 6004, configured to determine the user type to which the user belongs according to the obtained confidences.

As an example, the historical search behavior data comprises historical query word data and history click data;

The confidence generation subunit 6002 is specifically configured to:

Perform feature extraction on the historical query word data of the user, and use the feature extraction result as the input of the classifier for identifying user types, obtaining from the classifier the first confidence of the user under each preset user type;

Perform feature extraction on the history click data of the user, and use the feature extraction result as the input of the classifier for identifying user types, obtaining from the classifier the second confidence of the user under each preset user type.

As an example, the user type determination subunit 6004 is specifically configured to:

Weight the first confidence and the second confidence of the user under each preset user type, to obtain a new confidence of the user under each preset user type;

Select the user type with the highest new confidence as the user type to which the user belongs.
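The weighting and selection performed by the user type determination subunit can be sketched as follows. The function name `combine_confidences`, the use of a single weight whose complement is applied to the second confidence, and the example values are assumptions; the original text does not specify the weighting scheme.

```python
def combine_confidences(first, second, weight_first=0.5):
    """Weight per-type first and second confidences and pick the highest result."""
    combined = [
        weight_first * a + (1.0 - weight_first) * b
        for a, b in zip(first, second)
    ]
    best = max(range(len(combined)), key=combined.__getitem__)
    return combined, best


# First confidences come from query word data, second from click data (3 user types).
combined, best_type = combine_confidences([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```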

As an example, the user type identification unit 600 further comprises a classifier generation subunit 6001, configured to:

Before the confidence generation subunit 6002 performs feature extraction on the historical search behavior data of the user, acquire a training document set, wherein the training document set comprises multiple training documents and the user type of each training document;

Determine the feature value of each of the multiple preset segmented words in each training document;

Based on a maximum entropy model and the determined feature values, train the classifier for identifying user types;

The confidence generation subunit 6002 is specifically configured to:

Calculate the feature value of each of the preset segmented words in the document formed from the historical search behavior data, as the feature extraction result for the historical search behavior data of the user.

As an example, the similarity calculation unit 630 is specifically configured to:

Calculate the similarity GroupPersonalScore(Ti) between the personalized vocabulary in the personalized vocabulary set and the i-th search result Ti among the multiple search results according to the following formula:

GroupPersonalScore(Ti) = α \times \frac{ΔN(Ti)}{N(Ti)}

where α is a preset parameter value, N(Ti) is the number of segmented words in the i-th search result Ti, and ΔN(Ti) is the number of segmented words in the i-th search result Ti that belong to the personalized vocabulary in the personalized vocabulary set.

As an example, the sorting and display unit 640 is specifically configured to:

For each search result among the multiple search results, perform the following operation: take the similarity calculation result of the search result as a third candidate score, and perform weighted fusion of it with the first candidate score of the search result, determined by the set first scoring algorithm, and the second candidate score of the search result, determined by the set second scoring algorithm, to obtain the final score of the search result;

Sort the search results in descending order of final score, and display the sorted search results.

The above product can perform the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to performing the method.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; it can include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the appended claims.

Claims

1. A personalized search method, characterized by comprising:

acquiring the personalized vocabulary set under the user type to which the user belongs;

2. The personalized search method according to claim 1, characterized in that, before acquiring the personalized vocabulary set under the user type to which the user belongs, the method further comprises:

identifying the user type to which the user belongs according to the historical search behavior data of the user;

determining the personalized vocabulary set according to the historical search behavior data of the specific group under the user type.

3. The personalized search method according to claim 2, characterized in that determining the personalized vocabulary set according to the historical search behavior data of the specific group under the user type comprises:

4. The personalized search method according to claim 2, characterized in that identifying the user type to which the user belongs according to the historical search behavior data of the user comprises:

performing feature extraction on the historical search behavior data of the user, and using the feature extraction result as the input of the classifier for identifying user types, to obtain from the classifier the confidence of the user under each preset user type;

determining the user type to which the user belongs according to the obtained confidences.

5. The personalized search method according to claim 4, characterized in that the historical search behavior data comprises historical query word data and history click data;

performing feature extraction on the historical search behavior data of the user, and using the feature extraction result as the input of the classifier for identifying user types, to obtain from the classifier the confidence of the user under each preset user type, comprises:

6. The personalized search method according to claim 5, characterized in that determining the user type to which the user belongs according to the obtained confidences comprises:

7. The personalized search method according to claim 4, characterized in that, before performing feature extraction on the historical search behavior data of the user, the method further comprises:

acquiring a training document set, wherein the training document set comprises multiple training documents and the user type of each training document;

performing feature extraction on the historical search behavior data of the user comprises:

8. The personalized search method according to any one of claims 1-7, characterized in that calculating the similarity between the personalized vocabulary in the personalized vocabulary set and each search result comprises:

GroupPersonalScore(Ti) = α \times \frac{ΔN(Ti)}{N(Ti)}

9. The personalized search method according to any one of claims 1-7, characterized in that sorting the multiple search results according to the similarity calculation results and displaying the sorted search results comprises:

10. A personalized search device, characterized by comprising:

11. The personalized search device according to claim 10, characterized by further comprising:

a user type identification unit, configured to identify the user type to which the user belongs according to the historical search behavior data of the user, before the personalized vocabulary set acquisition unit acquires the personalized vocabulary set under the user type to which the user belongs;

a personalized vocabulary determination unit, configured to determine the personalized vocabulary set according to the historical search behavior data of the specific group under the user type.

12. The personalized search device according to claim 11, characterized in that the personalized vocabulary determination unit is specifically configured to:

13. The personalized search device according to claim 11, characterized in that the user type identification unit comprises:

a confidence generation subunit, configured to perform feature extraction on the historical search behavior data of the user, and to use the feature extraction result as the input of the classifier for identifying user types, obtaining from the classifier the confidence of the user under each preset user type;

a user type determination subunit, configured to determine the user type to which the user belongs according to the obtained confidences.

14. The personalized search device according to claim 13, characterized in that the historical search behavior data comprises historical query word data and history click data;

the confidence generation subunit is specifically configured to:

15. The personalized search device according to claim 14, characterized in that the user type determination subunit is specifically configured to:

16. The personalized search device according to claim 13, characterized in that the user type identification unit further comprises a classifier generation subunit, configured to:

before the confidence generation subunit performs feature extraction on the historical search behavior data of the user, acquire a training document set, wherein the training document set comprises multiple training documents and the user type of each training document;

the confidence generation subunit is specifically configured to:

17. The personalized search device according to any one of claims 10-16, characterized in that the similarity calculation unit is specifically configured to:

GroupPersonalScore(Ti) = α \times \frac{ΔN(Ti)}{N(Ti)}

18. The personalized search device according to any one of claims 10-16, characterized in that the sorting and display unit is specifically configured to: