CN102591966A

CN102591966A - Filtering method of search results in mobile environment

Info

Publication number: CN102591966A
Application number: CN2011104581556A
Authority: CN
Inventors: 金海; 赵峰; 袁平鹏; 严奉伟; 方飞; 谢海洋
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2011-12-31
Filing date: 2011-12-31
Publication date: 2012-07-18
Anticipated expiration: 2031-12-31
Also published as: CN102591966B

Abstract

The invention discloses a method for filtering search results in a mobile environment. The method comprises: segmenting users into fine-grained groups according to the characteristics of their historical position information; building a feature model of each user from the user's historical query records; and analyzing the users' historical call records to establish each user's social relationship network and to calculate the strength of the social relationships between users. At search time, the search results are first filtered by content using the established user feature model, then filtered collaboratively using the user-group information and the mined social-network information, and finally returned to the user. By mining user characteristics and filtering on that basis, the method personalizes the filtering of search results, removes a large number of irrelevant results, reduces the result set, and achieves personalized, precise search in a mobile environment.

Description

Method for filtering search results in a mobile environment

Technical field

The invention belongs to the field of information retrieval and specifically relates to a method for filtering search results in a mobile environment. The method is suitable for personalized search in mobile scenarios.

Background art

Over the past decade or so, search engine technology has developed rapidly. Traditional Internet search has matured in both technology and business model and has achieved great commercial success. In recent years, new technologies and applications represented by the mobile Internet have continued to emerge, and mobile search is one of the important mobile Internet applications.

Because mobile terminals are mobile and portable, and are limited in screen size, processing power, and available bandwidth, mobile search cannot simply copy the way existing Internet search is implemented. There are two main reasons.

(1) A traditional Internet search engine usually returns a large number of results, and in most cases more than half of them are irrelevant to the user. One main cause is that the search engine merely matches the search keywords and ignores other information, such as the user's background and personal preferences. Combined with the rapid growth of information on the Internet, this produces many "junk results" that users must screen themselves, which greatly increases their burden. In a mobile scenario, given the limits on screen and keyboard size, processing power, and available bandwidth, this situation is intolerable to the user: first, the many junk results waste valuable data traffic; second, paging through and screening results on a mobile terminal is very inconvenient. Mobile search must therefore be precise search that returns as few, and as accurate, results as possible.

(2) For the same search keyword, a traditional Internet search engine returns identical results to all users. Yet users differ in background knowledge, interests, and information needs, and the same keyword may mean different things to different people, in different fields, and at different times and places. What a user needs is often only a very small subset of the search results. The mobility, portability, and personal nature of mobile terminals let users obtain information anytime and anywhere, which strengthens the demand for personalized search. Mobile search is therefore a personalized search that depends on the user's personal characteristics (such as interests) and the user's context (such as time, place, and weather).

Mobile search therefore needs to deliver personalized, precise search. At present, domestic research on mobile search is still at an early stage, and the implementation techniques are still immature relative to existing Internet search technology. Earlier techniques include vertical search, such as mobile music search and novel search. The approach most commonly adopted at present combines existing Internet search technology with auxiliary techniques such as information filtering: the user is first modeled, the model is then used to filter the search results in a personalized way, and irrelevant results are removed, achieving personalized, precise search.

Common techniques for user feature modeling include the vector space model and the ontology model. The vector space model is widely used because its principle is simple and it is relatively easy to implement.

Commonly used information filtering techniques are content-based filtering and collaborative filtering. Content-based filtering extracts features from each result, calculates the similarity between the result and the filtering profile (the user model), and filters by a set threshold. Because it analyzes the content of the results, it usually filters well, but it is computationally expensive. Collaborative filtering rests on the idea that people of the same type usually share interests and preferences: a user's search results are filtered collaboratively using users whose interests are similar to those of the current user. This technique has been well developed and widely applied in e-commerce.

Summary of the invention

The purpose of the invention is to provide a method for filtering search results in a mobile environment. The method mines user data (historical position information, historical call records, and so on) to build a user feature model and a user social network, applies content-based filtering and collaborative filtering to the search results according to the user feature model and the user social network respectively, and filters out irrelevant results. It thereby achieves personalized, precise search in mobile scenarios, which is valuable for improving the mobile search user experience and user stickiness.

The method for filtering search results in a mobile environment provided by the invention comprises the following steps:

Step 1: For the initial result set to be filtered R_1, R_2, ..., R_Z of user U_i (i = 1, 2, ..., N), build a feature vector for each result in a d-dimensional vector space. The feature vector of R_r is expressed as f_{R_r} = {(q_1, v_1), (q_2, v_2), ..., (q_d, v_d)}, where v_a is the weight on dimension a. The weights v_a of f_{R_r} are calculated with the term frequency/inverse document frequency (TF/IDF) model: for each word q_a among q_1, q_2, ..., q_d, if q_a does not appear in R_r, its weight is 0; otherwise its weight is its TF/IDF value. TF is the number of times q_a appears in R_r. For IDF, the inverse document frequency, count the number z of results that contain the word.

The IDF value is log(Z/z), where Z is the number of initial results to be filtered, and the TF/IDF value is the product of TF and IDF; r = 1, 2, ..., Z and a = 1, 2, ..., d.

Step 2: Find the users similar to the current user U_i. Candidates are drawn from two user sets: the group G_g to which the user belongs, where g is the index of that group and ranges from 1 to m; and the set of users in the user's social network. Merge the two sets to obtain the set S, and denote a user in S as U_{is}. Calculate the similarity between U_i and each user U_{is} in S by Formula I, which uses the vector cosine of Formula II; the smaller the angle between the vectors, the larger the cosine and the greater the similarity, and vice versa. Here i is the user index and N is the number of users, i = 1, 2, ..., N; f_{U_i} and f_{U_{is}} are the feature vectors of U_i and U_{is}; and ψ(U_i, U_{is}) is the relationship strength between U_i and U_{is}: if U_{is} is in the social network of U_i, ψ(U_i, U_{is}) takes the corresponding value, otherwise it is zero. Select the top η users U_{i1}, U_{i2}, ..., U_{iη} in descending order of similarity; if S contains fewer than η users, select all users in S. η is a preset value.

\mathrm{sim}(U_i, U_{is}) = (1 + \psi(U_i, U_{is})) \cdot \cos(f_{U_i}, f_{U_{is}})

Formula I

\cos(f_{U_i}, f_{U_{is}}) = \frac{f_{U_i} \cdot f_{U_{is}}}{\| f_{U_i} \| \cdot \| f_{U_{is}} \|}

Formula II

Step 3, content-based filtering:

For each initial result to be filtered R_r, calculate its similarity to user U_i by Formula III, where f_{U_i} and f_{R_r} are the feature vectors of U_i and R_r. Filter by the preset threshold ζ: discard the initial results whose similarity is below ζ to obtain the intermediate result set R_r, r = 1, 2, ..., Z_ζ. The intermediate results keep their original order.

\mathrm{sim}(U_i, R_r) = \cos(f_{U_i}, f_{R_r})

Formula III

where

\cos(f_{U_i}, f_{R_r}) = \frac{f_{U_i} \cdot f_{R_r}}{\| f_{U_i} \| \cdot \| f_{R_r} \|}

Step 4: Apply collaborative filtering to the intermediate result set R_r, r = 1, 2, ..., Z_ζ. Using the η similar users U_{i1}, U_{i2}, ..., U_{iη} of user U_i, calculate for each intermediate result R_r the similarity sim'(U_i, R_r) by Formula IV, in which \cos(f_{U_{is}}, f_{U_i}) and \cos(f_{U_{is}}, f_{R_r}) are the similarities between U_{is} and U_i and between U_{is} and R_r, respectively.

\mathrm{sim}'(U_i, R_r) = \sum_{s=1}^{\eta} \left( \cos(f_{U_{is}}, f_{U_i}) \cdot \cos(f_{U_{is}}, f_{R_r}) \right)

Formula IV

\mathrm{Rank}_r = \theta \cdot r + (1 - \theta) \cdot \mathrm{sim}'(U_i, R_r)

Formula V

Filter by sim'(U_i, R_r) with the preset threshold ε: discard the intermediate results whose similarity is below ε to obtain the temporary result set R_r, r = 1, 2, ..., Z_ε, where r is the position of a result in the temporary result set, running 1, 2, ..., Z_ε. For each temporary result R_r, use the preset weighting coefficient θ and Formula V to calculate the weighted sum of its position r and sim'(U_i, R_r) as the final rank Rank_r. Re-sort the temporary result set by this rank to obtain the final results, return them to the user, and end the filtering process.

The method for filtering search results in a mobile environment provided by the invention combines data mining methods (classification, clustering), a content-based filtering algorithm, and collaborative filtering. Specifically, the invention has the following effects and advantages:

(1) High accuracy. On top of traditional content-based filtering, the invention innovatively analyzes the user's social network information and applies collaborative filtering, which considerably improves accuracy.

(2) Strong adaptability. The invention takes into account the diversity of mobile user groups and individuals and adapts well to the personalized needs of different user groups and individuals.

(3) High extensibility. Besides mobile search, the filtering method provided by the invention can be used in other mobile Internet applications such as targeted advertising, and the user feature modeling method can also be applied to customer relationship management (CRM) and the like.

Description of the drawings

Fig. 1 is the overall flowchart of the method of the invention;

Fig. 2 is a schematic diagram of mobile users' historical position change frequency;

Fig. 3 is a flowchart of clustering mobile users by position;

Fig. 4 is a structural diagram of a mobile user's social network;

Fig. 5 is a detailed flowchart of filtering mobile search results.

Detailed description of the embodiments

The invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the method for filtering search results in a mobile environment provided by the invention begins with a filtering preprocessing stage, which mainly comprises user segmentation, building the user feature model, and building the user's social network, corresponding to steps (1) to (3) below. It is followed by the result filtering stage, corresponding to step (4) below. The specific processing steps are as follows:

1. The filtering preprocessing stage comprises steps (1) to (3).

(1) User segmentation. Data mining methods are applied to the user data set provided by existing telecom operators to segment the users. The data set contains a large amount of user data, such as historical position information, historical call records, historical query and browsing records, and historical service data. The invention segments users mainly by their historical position information. The specific steps are as follows:

(a) Divide the users according to their historical position change frequency. A user's historical position information records the historical positions L and the corresponding times T. A position L is recorded in the data set as longitude and latitude, for example (30.2332, 114.3243), and a time T is recorded as a time point. Given the longitude and latitude of two adjacent historical positions of a user, their distance is easily calculated with the longitude-latitude distance formula, formula (1). Let the first position L_1 have longitude and latitude (lon_1, lat_1) and the second position L_2 have (lon_2, lat_2). Taking the 0-degree meridian as reference, east longitude is positive and west longitude is negative; a north latitude enters the calculation as (90° - lat) and a south latitude as (90° + lat). The distance between the two points is then given by formula (1), where R denotes the radius of the Earth.

C = \sin(lat_1) \cdot \sin(lat_2) \cdot \cos(lon_1 - lon_2) + \cos(lat_1) \cdot \cos(lat_2)

\mathrm{Dis}(L_1, L_2) = R \cdot \arccos(C) \cdot \frac{\pi}{180} \qquad (1)
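As a rough illustration of formula (1), the sketch below computes the distance between two positions in Python. The function name `distance_km` and the mean Earth radius of 6371 km are assumptions (the source only names the radius R). Latitudes are taken as signed values (south negative), so the single expression 90 - lat covers both the north and south cases in the text. Because `math.acos` returns radians, the π/180 factor of formula (1), which applies when the arccosine is expressed in degrees, is not needed here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the source only calls it R


def distance_km(lon1, lat1, lon2, lat2):
    """Great-circle distance following formula (1); inputs are signed degrees."""
    # Latitudes enter as colatitudes (90 - lat), as the text prescribes.
    colat1 = math.radians(90.0 - lat1)
    colat2 = math.radians(90.0 - lat2)
    dlon = math.radians(lon1 - lon2)
    c = (math.sin(colat1) * math.sin(colat2) * math.cos(dlon)
         + math.cos(colat1) * math.cos(colat2))
    c = max(-1.0, min(1.0, c))  # guard against rounding slightly outside [-1, 1]
    return EARTH_RADIUS_KM * math.acos(c)  # acos is in radians already
```

Two points on the equator that are 90 degrees of longitude apart come out as a quarter of the Earth's circumference under this sketch.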

For each user U_i (i = 1, 2, ..., N), calculate the cumulative historical position change frequency F_i over the most recent period ΔT (for example, one month), where N is the number of users.

F_i = \frac{1}{\Delta T} \sum_{k=2}^{M} \left| \frac{\mathrm{Dis}(L_k, L_{k-1})}{T_k - T_{k-1}} \right| \qquad (2)

In formula (2), (L_1, T_1), (L_2, T_2), ..., (L_M, T_M) are the historical positions and times of user U_i within the most recent period ΔT; (L_{k-1}, T_{k-1}) and (L_k, T_k) are two adjacent historical positions and their times; and Dis(L_k, L_{k-1}) and T_k - T_{k-1} are the distance and the time difference between them. M is the number of historical positions of the user, and k is the index of a historical position.
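The position change frequency of formula (2) can be sketched as follows. The function name `change_frequency`, the `(position, time)` track layout, and the planar `math.dist` default are assumptions made to keep the example self-contained; in the method itself the distance would be the longitude-latitude distance of formula (1). Skipping records with identical timestamps is also an assumption, since the source does not say how to treat them.

```python
import math


def change_frequency(track, delta_t, dist=math.dist):
    """Formula (2): sum of |distance / time difference| over adjacent records, divided by delta_t.

    track is a list of (position, time) pairs sorted by time.
    """
    total = 0.0
    for (p_prev, t_prev), (p_cur, t_cur) in zip(track, track[1:]):
        if t_cur == t_prev:
            continue  # assumption: ignore duplicate timestamps to avoid dividing by zero
        total += abs(dist(p_cur, p_prev) / (t_cur - t_prev))
    return total / delta_t
```

A user who never moves gets F = 0, and a user with many long, fast moves in the same period gets a larger F.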

Collect the F values of all users to obtain the overall range Ω of F, and divide Ω into sub-intervals Ω_1, Ω_2, ..., Ω_n, where n is the number of user groups. Each sub-interval characterizes a different user group by F, and each user is assigned to the sub-interval that contains the user's F. As shown in Fig. 2, user A has a high F and may be a business person who travels frequently; user B has a low F and probably stays at a fixed position for long periods, for example a university student. In this way the users are preliminarily divided by position change frequency F into the groups Ω_1, Ω_2, ..., Ω_n. Ω may be divided into equal parts, or according to a division criterion preset by the system.
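For the equal-division option mentioned above, a minimal sketch might look like this. The function name `assign_subinterval` and the choice to place the maximum F in the last sub-interval are assumptions; a system-defined criterion would replace the equal-width arithmetic.

```python
def assign_subinterval(f_values, n):
    """Map each F value to the index (0..n-1) of its equal-width sub-interval of the range."""
    lo, hi = min(f_values), max(f_values)
    width = (hi - lo) / n
    groups = []
    for f in f_values:
        if width == 0:
            idx = 0  # all users share the same F, so there is a single group
        else:
            idx = min(int((f - lo) / width), n - 1)  # clamp the maximum into the last bin
        groups.append(idx)
    return groups
```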

(b) Next, cluster the users in each Ω_j (j = 1, 2, ..., n, where j is the group index) by historical position, so that users at nearby positions fall into one cluster. Research shows that users who are geographically close share similar characteristics to some extent. The k-means clustering algorithm is applied to the users in each Ω_j as follows:

(b1) First calculate, for each user U_i (i = 1, 2, ..., N, where i is the user index), the center O_i of the user's historical positions within ΔT. Users are clustered by O_i.

(b2) Randomly select k users from Ω_j. Each selected user U_q (q = 1, 2, ..., k) represents an initial user cluster C_q, and its O_q is the initial center of that cluster.

(b3) For each remaining user in Ω_j, calculate the distance (by the longitude-latitude distance formula) between the user and the center O_q of each cluster C_q, and assign the user to the nearest cluster.

(b4) Recalculate the center O_q of each cluster and replace the old center. Calculate the criterion function E_j by formula (3). If E_j has converged, clustering ends; otherwise return to step (b3).

E_j = \sum_{q=1}^{k} \sum_{U \in \Omega_j} \mathrm{Dis}(U, C_q), \quad j = 1, 2, \ldots, n \qquad (3)

In formula (3), Dis(U, C_q) is the distance between a user in Ω_j and the center O_q of cluster C_q.

Clustering yields compact user clusters. On top of the division into Ω_1, Ω_2, ..., Ω_n, the users are thus further divided into smaller groups G_1, G_2, ..., G_m, which completes user segmentation.
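Steps (b2) to (b4) can be sketched as a small k-means loop. This is a simplified illustration under stated assumptions, not the patented implementation: user centers are treated as planar points compared with `math.dist` instead of the longitude-latitude distance, the criterion sums each user's distance to the center of the cluster it is assigned to (one common reading of formula (3)), and the names `kmeans`, `max_iter`, `tol`, and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster user centers (tuples) into k clusters; returns (clusters, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # (b2) k randomly chosen users as initial centers
    prev_e = None
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        e = 0.0
        for p in points:  # (b3) assign every user to the nearest center
            d = [math.dist(p, c) for c in centers]
            q = d.index(min(d))
            clusters[q].append(p)
            e += d[q]  # running value of the criterion function
        # (b4) recompute centers; an empty cluster keeps its previous center
        centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[q]
            for q, c in enumerate(clusters)
        ]
        if prev_e is not None and abs(prev_e - e) < tol:
            break  # the criterion has stopped changing
        prev_e = e
    return clusters, centers
```

With two well-separated pairs of points and k = 2, the loop settles on one cluster per pair regardless of which users are drawn as initial centers.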

(2) Build the user feature model. A user's historical query records characterize the user's interests well. By analyzing those records, the user is modeled with the vector space model, as follows:

(a) Collect all historical query records of all users within ΔT and extract the d distinct words q_1, q_2, ..., q_d, which serve as the d dimensions of the vector space. The feature vector of a user is expressed as f_{U_i} = {(q_1, v_1), (q_2, v_2), ..., (q_d, v_d)} (i = 1, 2, ..., N), where v_a (a = 1, 2, ..., d) is the weight of dimension a.

(b) Using the TF/IDF (term frequency/inverse document frequency) model, calculate for each user U_i (i = 1, 2, ..., N) the weight of each dimension of the feature vector. For each word q_a (a = 1, 2, ..., d) among q_1, q_2, ..., q_d: if it does not appear in the user's historical query records, its weight v_a is 0; otherwise v_a is its TF/IDF value. TF, the term frequency, is here the number of times the word appears in the user's historical query records. For IDF, the inverse document frequency, count the number D of users whose historical query records contain the word; the IDF value is log(N/D), where N is the total number of users, and the TF/IDF value is the product of TF and IDF.
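The weighting in steps (a) and (b) can be illustrated with a short Python sketch. The function name `tfidf_vectors` and the representation of each user's query history as a list of already-segmented words are assumptions, as is the natural logarithm (the source does not state the base).

```python
import math
from collections import Counter


def tfidf_vectors(histories):
    """histories: one list of query words per user. Returns (vocabulary, vectors)."""
    vocab = sorted({w for h in histories for w in h})  # the d distinct words q_1..q_d
    n_users = len(histories)
    # D: number of users whose history contains each word
    users_with_word = Counter(w for h in histories for w in set(h))
    vectors = []
    for h in histories:
        tf = Counter(h)  # a missing word has count 0, which yields weight 0
        vectors.append([tf[w] * math.log(n_users / users_with_word[w]) for w in vocab])
    return vocab, vectors
```

Note that a word appearing in every user's history gets IDF log(N/N) = 0, so it contributes nothing to any feature vector.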

(3) Mine the user's social network information by analyzing the historical call records. For each user U_i (i = 1, 2, ..., N), the social network is a star topology centered on that user, as shown in Fig. 4: the central node B represents the user, and the outer nodes A, C, D, E, F, G, and so on represent users who have call records with B. The weight ψ of an edge represents the relationship strength between two users, and the main task of this step is to estimate ψ.

The historical call record data contains the call records between all users, including the ID numbers of both parties, the call start time, the call end time, and so on. For each user U_i (i = 1, 2, ..., N), analyze the call records within ΔT. For each user u_x (x = 1, 2, ..., e, where e is the number of users who have call records with U_i), analyze the total number of calls α with U_i within ΔT, the total call duration β, and the call regularity γ. Considering these factors together gives a rough estimate of the relationship strength ψ_{ix} between U_i and u_x.

The total number of calls α and the total call duration β are easy to obtain, but both are aggregate statistics. On their own they give only a coarse estimate of the relationship strength between users and ignore important details, such as whether the calls are evenly distributed over time, and whether that evenness is global or only local. The call regularity γ is therefore introduced as an additional factor characterizing the relationship strength between U_i and u_x. It is obtained by analyzing the time distribution of all calls within ΔT using the idea of variance, as in formulas (4) to (7): t_h (h = 1, 2, ..., α) is the start time of each call, Δt_h is the time difference between two adjacent calls, and S_t is the variance of these differences. γ is inversely proportional to S_t, as shown in formula (7): a small variance means the calls in the period are more regular and γ is correspondingly larger, and vice versa.

\Delta t_h = t_h - t_{h-1}, \quad h = 2, 3, \ldots, \alpha \qquad (4)

\overline{\Delta t} = \frac{1}{\alpha - 1} \sum_{h=2}^{\alpha} \Delta t_h \qquad (5)

S_t = \frac{1}{\alpha - 1} \sum_{h=2}^{\alpha} \left( \overline{\Delta t} - \Delta t_h \right)^2 \qquad (6)

\gamma = \frac{1}{S_t} \qquad (7)
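Formulas (4) to (7) translate into a few lines of Python. The function name `call_regularity` is hypothetical, and two edge cases that the source does not address are handled by assumption: fewer than two calls gives a regularity of 0, and perfectly even spacing (zero variance) gives infinity, which a later normalization step would need to cap.

```python
def call_regularity(start_times):
    """Regularity gamma = 1 / variance of the gaps between consecutive call start times."""
    times = sorted(start_times)
    gaps = [b - a for a, b in zip(times, times[1:])]  # formula (4): alpha - 1 gaps
    if not gaps:
        return 0.0  # assumption: a single call carries no regularity information
    mean_gap = sum(gaps) / len(gaps)  # formula (5)
    variance = sum((mean_gap - g) ** 2 for g in gaps) / len(gaps)  # formula (6)
    if variance == 0:
        return float("inf")  # assumption: perfectly even spacing
    return 1.0 / variance  # formula (7)
```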

Normalize the calculated α, β, and γ to values between 0 and 1. The value of ψ_{ix} (i = 1, 2, ..., N; x = 1, 2, ..., e) is then calculated by formula (8) as a weighted combination of α, β, and γ. In formula (8), 0 ≤ λ_1 ≤ 1, 0 ≤ λ_2 ≤ 1, 0 ≤ λ_3 ≤ 1, and λ_1 + λ_2 + λ_3 = 1; by default each weight takes the equal value 1/3.

\psi_{ix} = \lambda_1 \cdot \alpha + \lambda_2 \cdot \beta + \lambda_3 \cdot \gamma, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1 \qquad (8)

The analysis and calculation in this step yield the social network information of each user U_i (i = 1, 2, ..., N), comprising the associated users u_x (x = 1, 2, ..., e) and the relationship strengths ψ_{ix}.
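The weighted combination of formula (8) can be sketched as below. The source only says that α, β, and γ are normalized to the range 0 to 1; min-max normalization across one user's contacts is an assumption here, as are the function names `min_max` and `relationship_strengths`.

```python
def min_max(values):
    """Scale values to [0, 1]; assumed normalization, not specified by the source."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # assumption: identical values carry no signal
    return [(v - lo) / (hi - lo) for v in values]


def relationship_strengths(alphas, betas, gammas, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Formula (8) for every contact of one user; the lists are aligned by contact."""
    l1, l2, l3 = weights  # lambda_1..lambda_3, expected to sum to 1
    a, b, g = min_max(alphas), min_max(betas), min_max(gammas)
    return [l1 * x + l2 * y + l3 * z for x, y, z in zip(a, b, g)]
```

Because the weights sum to 1 and each normalized factor lies in [0, 1], every ψ produced this way also lies in [0, 1].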

(4) Search result filtering. Steps (1) to (3) are preparation that serves the filtering in this step. The user feature model built in step (2) is used for content-based filtering of the search results, and the user segmentation from step (1) together with the social network information mined in step (3) is used for collaborative filtering.

This step first applies content-based filtering to the search results and then collaborative filtering, in order to personalize and reduce the result set.

User U_i (i = 1, 2, ..., N) submits a search Q. The request is first processed by an existing Internet search engine, which returns an initial result set for Q. This set is usually large, so the first φ results are selected for filtering; if there are fewer than φ results, the whole initial result set is selected. The selected results form the result set to be filtered R_1, R_2, ..., R_Z, where φ is an empirical value preset by the system, for example 300, and Z is the number of results to be filtered. The filtering flow is shown in Fig. 5 and proceeds as follows:

(a) Build feature vectors for the results to be filtered R_1, R_2, ..., R_Z in the d-dimensional space established in step (2). The feature vector of R_r (r = 1, 2, ..., Z) is expressed as f_{R_r} = {(q_1, v_1), (q_2, v_2), ..., (q_d, v_d)}, where v_a (a = 1, 2, ..., d) is the weight of dimension a. The weights are calculated with the same TF/IDF (term frequency/inverse document frequency) model as in step (2): for each word q_a among q_1, q_2, ..., q_d, if it does not appear in R_r, its weight is 0; otherwise its weight is its TF/IDF value. TF is the number of times the word appears in R_r. For IDF, count the number z of results that contain the word; the IDF value is log(Z/z), where Z is the total number of results, and the TF/IDF value is the product of TF and IDF.

(b) Next, find the users similar to the current user U_i (i = 1, 2, ..., N). Candidates come from two user sets: the group G_g from step (1) to which the user belongs, where g is the group index and ranges from 1 to m; and the set of users in the social network established in step (3). Merge these two sets (they may share users) to obtain the set S, and select several similar users from S.

\mathrm{sim}(U_i, U_{is}) = (1 + \psi(U_i, U_{is})) \cdot \cos(f_{U_i}, f_{U_{is}}) \qquad (9)

\cos(f_{U_i}, f_{U_{is}}) = \frac{f_{U_i} \cdot f_{U_{is}}}{\| f_{U_i} \| \cdot \| f_{U_{is}} \|} \qquad (10)

In formula (10), \| \cdot \| denotes the norm of a vector.

The similarity between U_i and each user U_{is} in S is calculated by formula (9), using the vector cosine of formula (10); the smaller the angle between the vectors, the larger the cosine and the greater the similarity, and vice versa. f_{U_i} and f_{U_{is}} are the feature vectors of U_i and U_{is}, and ψ(U_i, U_{is}) is the relationship strength between U_i and U_{is}: if U_{is} is in the social network of U_i, ψ(U_i, U_{is}) takes the corresponding value, otherwise it is zero. Select the top η users U_{i1}, U_{i2}, ..., U_{iη} in descending order of similarity; if S contains fewer than η users, select all users in S. η is an empirical value preset by the system; its default may be 10.
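The selection of similar users by formulas (9) and (10) can be sketched as follows. The function names `cosine` and `top_similar_users`, the dictionary layout of the candidate set S, and the convention that a zero vector has cosine 0 are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Formula (10): cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty profile is similar to nothing
    return dot / (norm_u * norm_v)


def top_similar_users(target_vec, candidates, psi, eta=10):
    """candidates: user id -> feature vector (the set S); psi: user id -> strength.

    Users outside the social network are absent from psi and count as 0.
    """
    scored = [
        ((1.0 + psi.get(uid, 0.0)) * cosine(target_vec, vec), uid)  # formula (9)
        for uid, vec in candidates.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [uid for _, uid in scored[:eta]]  # fewer than eta means everyone in S
```

The factor (1 + ψ) means a contact with a somewhat lower content similarity can still outrank a stranger with a higher one.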

(c) Then filter the results. Filtering has two stages, content-based filtering and collaborative filtering:

(c1) Content-based filtering comes first. For each initial result to be filtered R_r (r = 1, 2, ..., Z) from (a), calculate its similarity to user U_i (i = 1, 2, ..., N) with the vector cosine, in the same way as in formula (10), as shown in formula (11), where f_{U_i} and f_{R_r} are the feature vectors of U_i and R_r. Filter by the threshold ζ: discard the results whose similarity is below ζ to obtain the intermediate result set R_r (r = 1, 2, ..., Z_ζ). The intermediate results keep their original order. The threshold ζ is an empirical value preset by the system, with 0 ≤ ζ ≤ 1; its default may be 0.65.

\mathrm{sim}(U_i, R_r) = \cos(f_{U_i}, f_{R_r}) \qquad (11)
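A minimal sketch of the content-based stage (c1), assuming results arrive as `(result_id, feature_vector)` pairs in their original order; the names `cosine` and `content_filter` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_filter(user_vec, results, zeta=0.65):
    """Keep results whose similarity to the user reaches zeta, preserving order."""
    return [(rid, vec) for rid, vec in results if cosine(user_vec, vec) >= zeta]
```

Because the list comprehension walks the results in sequence, the surviving results stay in the order the search engine returned them, as the text requires.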

(c2) Next, apply collaborative filtering to the intermediate result set R_r (r = 1, 2, ..., Z_ζ). Collaborative filtering rests on the idea that similar users usually have similar interests, so the users similar to the current user are used to make collaborative recommendations for the current user. Using the η similar users U_{i1}, U_{i2}, ..., U_{iη} of user U_i found in step (b), calculate for each intermediate result R_r (r = 1, 2, ..., Z_ζ) the similarity sim'(U_i, R_r) by formula (12). In that formula, which uses the vector cosine of formula (10), \cos(f_{U_{is}}, f_{U_i}) and \cos(f_{U_{is}}, f_{R_r}) are the similarities between U_{is} and U_i and between U_{is} and R_r, respectively.

\mathrm{sim}'(U_i, R_r) = \sum_{s=1}^{\eta} \left( \cos(f_{U_{is}}, f_{U_i}) \cdot \cos(f_{U_{is}}, f_{R_r}) \right) \qquad (12)

\mathrm{Rank}_r = \theta \cdot r + (1 - \theta) \cdot \mathrm{sim}'(U_i, R_r) \qquad (13)

Filter by sim'(U_i, R_r) with the threshold ε: discard the results whose similarity is below ε to obtain the temporary result set R_r (r = 1, 2, ..., Z_ε), where r is the position of a result in the temporary result set, running 1, 2, ..., Z_ε. For each R_r (r = 1, 2, ..., Z_ε), calculate the weighted sum of its position r and sim'(U_i, R_r) with the weighting coefficient θ as the final rank Rank_r, as shown in formula (13). Re-sort R_r (r = 1, 2, ..., Z_ε) by this rank to obtain the final results, return them to the user, and end the filtering process. The threshold ε and the weighting coefficient θ are empirical values preset by the system, with 0 ≤ ε ≤ 1 and 0 ≤ θ ≤ 1; the default of ε may be 0.85 and the default of θ may be 0.5.
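The collaborative stage, formulas (12) and (13), might be sketched like this. The names `cosine` and `collaborative_filter` and the input layout are assumptions. The source does not state whether the final list is sorted by ascending or descending Rank, so this sketch only computes the rank value of each surviving result and leaves the sort direction to the caller.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def collaborative_filter(user_vec, neighbor_vecs, results, epsilon=0.85, theta=0.5):
    """results: (result_id, vector) pairs in order. Returns (result_id, rank) pairs."""
    kept = []
    for rid, vec in results:
        # formula (12): sum over the similar users of two cosine similarities
        score = sum(cosine(n, user_vec) * cosine(n, vec) for n in neighbor_vecs)
        if score >= epsilon:
            kept.append((rid, score))
    # formula (13): r is the 1-based position inside the temporary result set
    return [
        (rid, theta * r + (1 - theta) * score)
        for r, (rid, score) in enumerate(kept, start=1)
    ]
```

Since sim' sums over up to η users, it can exceed 1, which is worth keeping in mind when choosing ε within the stated range of 0 to 1.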

The invention is not limited to the embodiment described above. Those of ordinary skill in the art may implement the invention in various other ways based on this disclosure; therefore, any design that adopts the structure and ideas of the invention with simple changes or modifications falls within the scope of protection of the invention.

Claims

1. A method for filtering search results in a mobile environment, comprising the following steps:

Step 1: For the initial result set to be filtered R_1, R_2, ..., R_Z of user U_i (i = 1, 2, ..., N), build a feature vector for each result in a d-dimensional vector space. The feature vector of R_r is expressed as f_{R_r} = {(q_1, v_1), (q_2, v_2), ..., (q_d, v_d)}, where v_a is the weight on dimension a. The weights v_a of f_{R_r} are calculated with the term frequency/inverse document frequency (TF/IDF) model: for each word q_a among q_1, q_2, ..., q_d, if q_a does not appear in R_r, its weight is 0; otherwise its weight is its TF/IDF value. TF is the number of times q_a appears in R_r. For IDF, the inverse document frequency, count the number z of results that contain the word;

\mathrm{sim}(U_i, U_{is}) = (1 + \psi(U_i, U_{is})) \cdot \cos(f_{U_i}, f_{U_{is}})

Formula I

\cos(f_{U_i}, f_{U_{is}}) = \frac{f_{U_i} \cdot f_{U_{is}}}{\| f_{U_i} \| \cdot \| f_{U_{is}} \|}

Formula II

Step 3, content-based filtering:

\mathrm{sim}(U_i, R_r) = \cos(f_{U_i}, f_{R_r})

Formula III

where

\cos(f_{U_i}, f_{R_r}) = \frac{f_{U_i} \cdot f_{R_r}}{\| f_{U_i} \| \cdot \| f_{R_r} \|}

\cos(f_{U_{is}}, f_{U_i}) and \cos(f_{U_{is}}, f_{R_r}) are the similarities between U_{is} and U_i and between U_{is} and R_r, respectively;

\mathrm{sim}'(U_i, R_r) = \sum_{s=1}^{\eta} \left( \cos(f_{U_{is}}, f_{U_i}) \cdot \cos(f_{U_{is}}, f_{R_r}) \right)

Formula IV

\mathrm{Rank}_r = \theta \cdot r + (1 - \theta) \cdot \mathrm{sim}'(U_i, R_r)

Formula V

2. the Search Results filter method under the mobile scene according to claim 1 is characterized in that: the initial results collection in the 1st step obtains in the following manner:

For user U _iSubmit to and once search for Q; Searching request is at first handled by existing internet search engine; Existing internet search engine returns an initial results collection to search Q, and the preceding φ bar result who chooses in this result set filters, if not enough φ bar; Then choose whole initial results collection, as result set R to be filtered ₁, R ₂..., R _Z, φ is preestablished by system, and Z is a result's to be filtered number.

3. The search result filtering method in a mobile environment according to claim 1, characterized in that the feature vector of a result to be filtered in step 1 is obtained as follows:

All historical query records of all users within the time period ΔT are collected, and d distinct words q_1, q_2, ..., q_d are obtained from them as the d dimensions of the vector space; the feature vector of a user is expressed as f_Ui = {(q_1, v_1), (q_2, v_2), ..., (q_d, v_d)}, i = 1, 2, ..., N, where v_a, a = 1, 2, ..., d, denotes the weight of each dimension.
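A minimal Python sketch of such user feature vectors is given below for illustration. It weights each word with the TF/IDF rule stated in step 7.1 (IDF = log(N/D)), and it assumes that TF is the raw number of occurrences of the word in the user's query history and that the queries have already been split into words; the function name and the input layout are hypothetical.

```python
import math
from collections import Counter


def user_feature_vectors(query_logs):
    """Build sparse TF/IDF vectors from {user_id: [word, word, ...]}.

    Returns {user_id: {word: TF * IDF}}, with IDF = log(N / D), where N is
    the number of users and D the number of users whose history contains
    the word. Words a user never queried are simply absent (weight 0).
    """
    n_users = len(query_logs)
    term_counts = {u: Counter(words) for u, words in query_logs.items()}

    # D for every word: in how many users' histories it occurs.
    users_with_word = Counter()
    for counts in term_counts.values():
        users_with_word.update(counts.keys())

    vectors = {}
    for u, counts in term_counts.items():
        vectors[u] = {
            q: tf * math.log(n_users / users_with_word[q])
            for q, tf in counts.items()
        }
    return vectors
```

Note that, under this rule, a word queried by every user receives IDF = log(1) = 0 and therefore does not help to distinguish users.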

4. The search result filtering method in a mobile environment according to claim 1, characterized in that, in step 2, the similar users are obtained as follows:

Step 4.1: To find the similar users of the current user U_i, the group G_g to which the user belongs is merged with the set of users in the user's social network to obtain the set S, where g is the index of the group to which the user belongs, ranging from 1 to m, and m is the number of groups;

Step 4.2: Formula VI is used to calculate the similarity Sim(U_i, U_Is) between U_i and each user U_Is in the set S, where f_Ui and f_UIs denote the feature vectors of U_i and U_Is respectively, and ψ(U_i, U_Is) denotes the degree of relationship between U_i and U_Is; if U_Is is in the social network of U_i, ψ(U_i, U_Is) takes the corresponding value, otherwise it is zero; the top η users U_I1, U_I2, ..., U_Iη are selected in descending order of similarity, or all users in S if there are fewer than η; η is a preset value;

Sim (U_{i}, U_{Is}) = (1 + ψ (U_{i}, U_{Is})) \cdot Cos (f_{U_{i}}, f_{U_{Is}})

Formula VI

Wherein,

Cos (f_{U_{i}}, f_{U_{Is}}) = \frac{f_{U_{i}} \cdot f_{U_{Is}}}{| | f_{U_{i}} | | \cdot | | f_{U_{Is}} | |} .
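By way of example, steps 4.1 and 4.2 can be sketched in Python as follows. The sketch assumes sparse dictionary feature vectors, a dictionary `psi` holding the degree of relationship for pairs of users that are linked in the social network, and that the current user is excluded from its own candidate set; all names are hypothetical.

```python
import math


def cosine(f_a, f_b):
    """Cosine similarity of two sparse vectors stored as {word: weight} dicts."""
    dot = sum(w * f_b.get(q, 0.0) for q, w in f_a.items())
    norm_a = math.sqrt(sum(w * w for w in f_a.values()))
    norm_b = math.sqrt(sum(w * w for w in f_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def top_similar_users(user, group, social_network, features, psi, eta):
    """Return up to eta users of S = group ∪ social_network, ordered by
    Formula VI: Sim = (1 + psi) * Cos(f_Ui, f_UIs), highest first."""
    candidates = (set(group) | set(social_network)) - {user}  # assumption: skip self
    scored = []
    for other in candidates:
        rel = psi.get((user, other), 0.0)  # zero when not in the social network
        sim = (1.0 + rel) * cosine(features[user], features[other])
        scored.append((sim, other))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:eta]]
```

The factor (1 + ψ) only boosts users with whom U_i has a call relationship; a user outside the social network is ranked by cosine similarity alone.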

5. The search result filtering method in a mobile environment according to claim 4, characterized in that, in step 4.1, the group G_g to which the user belongs is obtained as follows:

Step 5.1: The users are divided according to the frequency of change of their historical positions. The historical position information of a user records the historical positions L of the user and the corresponding times T; the positions L are recorded in the data set as longitude and latitude, and the times T are recorded as time points; given the longitude and latitude of two adjacent historical positions of a user, the distance between them is calculated with the longitude-latitude distance formula;

For each user U_i, the cumulative historical position change frequency F_i within the most recent period ΔT is calculated according to Formula VII:

F_{i} = \frac{1}{ΔT} Σ_{k = 2}^{M} | \frac{Dis (L_{k}, L_{k - 1})}{T_{k} - T_{k - 1}} |

Formula VII

where (L_1, T_1), (L_2, T_2), ..., (L_M, T_M) are the historical position records of user U_i within the most recent period ΔT; (L_{k-1}, T_{k-1}) and (L_k, T_k) are two adjacent historical positions of the user and their times; Dis(L_k, L_{k-1}) and T_k - T_{k-1} are respectively the distance and the time difference between two adjacent historical positions; M denotes the number of historical positions of the current user, and k denotes the index of a historical position;

Step 5.2: The cumulative historical position change frequencies F of all users are collected to obtain the overall range Ω of F; Ω is divided into several sub-intervals Ω_1, Ω_2, ..., Ω_n, where n denotes the number of user groups; these sub-intervals characterize different user groups by F, and each user is assigned to the sub-interval corresponding to its F, so that the users are divided into the different groups Ω_1, Ω_2, ..., Ω_n;

Step 5.3: The users in each Ω_j, j = 1, 2, ..., n, are clustered by their historical position information, users whose positions are close to one another being grouped into one class, so that the users are further divided into the smaller groups G_1, G_2, ..., G_m, where j denotes the index of the group.
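Steps 5.1 and 5.2 can be illustrated with the following Python sketch. It makes several assumptions that the claim does not fix: the haversine formula is used as the longitude-latitude distance formula, positions are given as (latitude, longitude) in degrees, the records are sorted by strictly increasing time, and Ω is split into n sub-intervals of equal width. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def change_frequency(track, delta_t):
    """Formula VII: sum of |Dis(L_k, L_{k-1}) / (T_k - T_{k-1})| over
    adjacent records, divided by the window length delta_t.

    track: [((lat, lon), t), ...] sorted by strictly increasing time t.
    """
    total = 0.0
    for (prev_pos, prev_t), (pos, t) in zip(track, track[1:]):
        total += abs(haversine_km(pos, prev_pos) / (t - prev_t))
    return total / delta_t


def frequency_bin(f, f_min, f_max, n):
    """Index (0 .. n-1) of the sub-interval of [f_min, f_max] containing f.
    Assumption: equal-width sub-intervals."""
    if f_max == f_min:
        return 0
    return min(int((f - f_min) / (f_max - f_min) * n), n - 1)
```

A user who never moves within ΔT obtains F = 0 and falls into the lowest sub-interval, while a frequently travelling user falls into a higher one.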

6. The search result filtering method in a mobile environment according to claim 5, characterized in that step 5.3 clusters the users in each Ω_j with the k-means clustering algorithm, as follows:

(b1) First, the center O_i of the historical positions of each user U_i within the most recent period ΔT is calculated, and the users are clustered according to the centers O_i, where i denotes the index of the user;

(b2) k users are selected at random from Ω_j; each selected user U_q represents an initial user cluster C_q, and its center O_q represents the initial center of that user cluster, q = 1, 2, ..., k;

(b3) For each remaining user in Ω_j, the distance between the user and the center O_q of each user cluster C_q is calculated, and the user is assigned to the nearest user cluster;

(b4) The new center O_q of each user cluster is then recalculated and replaces the old center; the value of the criterion function E_j is calculated according to Formula VIII; if the value of E_j has converged, the clustering process ends; otherwise, the process returns to step (b3);

E_{j} = Σ_{q = 1}^{k} \underset{U ∈ Ω_{j}}{Σ} Dis (U, C_{q}), j = 1, 2, ..., n

Formula VIII

In Formula VIII, Dis(U, C_q) denotes the distance between a user in Ω_j and the center O_q of user cluster C_q;

(b5) The clustering yields compact user clusters; in this way, on the basis of the division into Ω_1, Ω_2, ..., Ω_n, the users are further divided into the smaller groups G_1, G_2, ..., G_m, which achieves the fine-grained division of users.
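Steps (b2) to (b4) can be sketched in Python as below, for illustration only. The sketch assumes that the centers O_i have already been computed and projected to planar (x, y) coordinates so that the Euclidean distance can be used, that every user (including the k seed users) is assigned in step (b3), and that the criterion is evaluated as the sum of the distances of each user to the center of its own cluster, which is the usual k-means reading of Formula VIII. The function name, the tolerance, and the iteration limit are hypothetical.

```python
import math
import random


def kmeans_users(centers, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster users by their position centers.

    centers: {user: (x, y)}, the center O_i of each user's recent positions.
    Returns {user: cluster index in 0 .. k-1}.
    """
    rng = random.Random(seed)
    users = sorted(centers)
    # (b2) pick k users at random; their centers are the initial cluster centers
    centroids = [centers[u] for u in rng.sample(users, k)]
    assign = {}
    prev_e = None
    for _ in range(max_iter):
        # (b3) assign every user to the nearest cluster center
        assign = {
            u: min(range(k), key=lambda q: math.dist(centers[u], centroids[q]))
            for u in users
        }
        # (b4) recompute each cluster center as the mean of its members
        for q in range(k):
            members = [centers[u] for u in users if assign[u] == q]
            if members:
                centroids[q] = tuple(sum(c) / len(members) for c in zip(*members))
        # criterion E_j: total distance of users to their cluster center
        e = sum(math.dist(centers[u], centroids[assign[u]]) for u in users)
        if prev_e is not None and abs(prev_e - e) <= tol:
            break  # E_j has converged
        prev_e = e
    return assign
```

Applied once per sub-interval Ω_j, the resulting clusters play the role of the smaller groups G_1, G_2, ..., G_m.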

7. The search result filtering method in a mobile environment according to claim 4, characterized in that, in step 4.1, the user social network is built as follows:

Step 7.1: The term frequency/inverse document frequency (TF/IDF) model is used to calculate, for each user U_i, the weight of each dimension of its feature vector: for each word q_a among q_1, q_2, ..., q_d, if it does not appear in the historical query records of the user, its weight v_a is 0; otherwise its weight is its TF/IDF value, where TF is the term frequency and IDF, the inverse document frequency, is obtained by counting the number D of users in whose historical query records the word appears; the IDF value is log(N/D), where N is the total number of users, and the TF/IDF value is the product of TF and IDF;

Step 7.2: For each user U_i, the call records within the most recent period ΔT are analyzed; for each user u_x that has call records with U_i, the total number of calls α, the total call duration β, and the call regularity γ between u_x and U_i within ΔT are determined, and Formula IX is used to calculate the degree of relationship ψ_ix between U_i and u_x:

ψ_{ix} = λ_{1} α + λ_{2} β + λ_{3} γ

Formula IX

In the formula, 0 ≤ λ_1 ≤ 1, 0 ≤ λ_2 ≤ 1, 0 ≤ λ_3 ≤ 1, and λ_1 + λ_2 + λ_3 = 1;

γ = \frac{1}{S_{t}}

S_{t} = \frac{1}{α - 1} Σ_{h = 2}^{α} {(\overline{Δt} - {Δt}_{h})}^{2}

{Δt}_{h} = t_{h} - t_{h - 1}, h = 2, 3, ..., α

\overline{Δt} = \frac{1}{α - 1} Σ_{h = 2}^{α} {Δt}_{h} .
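For illustration, the degree of relationship of Formula IX can be sketched in Python as follows. The sketch assumes that t_h is the start time of the h-th call between the two users, that γ is capped at a hypothetical value `gamma_max` (including the case S_t = 0, where 1/S_t is undefined), and that γ is 0 when fewer than two calls exist; the default weights λ are arbitrary example values that sum to 1, and α and β are used unnormalized, exactly as written in the formula.

```python
def call_regularity(call_times, gamma_max=1.0):
    """gamma = 1 / S_t, where S_t is the variance of the intervals
    Δt_h = t_h - t_{h-1} between consecutive calls (h = 2 .. alpha)."""
    gaps = [b - a for a, b in zip(call_times, call_times[1:])]
    if not gaps:
        return 0.0  # assumption: no regularity can be measured with < 2 calls
    mean_gap = sum(gaps) / len(gaps)  # mean interval, 1/(alpha-1) * sum of Δt_h
    s_t = sum((mean_gap - g) ** 2 for g in gaps) / len(gaps)
    if s_t == 0.0:
        return gamma_max  # assumption: perfectly regular calls get the cap
    return min(1.0 / s_t, gamma_max)


def relationship_degree(call_times, total_duration,
                        lambdas=(0.4, 0.3, 0.3), gamma_max=1.0):
    """Formula IX: psi = lambda1 * alpha + lambda2 * beta + lambda3 * gamma."""
    alpha = len(call_times)   # total number of calls within ΔT
    beta = total_duration     # total call duration within ΔT
    gamma = call_regularity(call_times, gamma_max)
    l1, l2, l3 = lambdas
    return l1 * alpha + l2 * beta + l3 * gamma
```

The more evenly spaced the calls are, the smaller S_t becomes and the larger γ, so regular contact raises ψ_ix in addition to the number and duration of calls.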