CN102591966A

CN102591966A - Filtering method of search results in mobile environment

Info

Publication number: CN102591966A
Application number: CN2011104581556A
Authority: CN
Inventors: 金海; 赵峰; 袁平鹏; 严奉伟; 方飞; 谢海洋
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2011-12-31
Filing date: 2011-12-31
Publication date: 2012-07-18
Anticipated expiration: 2031-12-31
Also published as: CN102591966B

Abstract

The invention discloses a method for filtering search results in a mobile environment. The method comprises: subdividing users into groups according to the characteristics of their historical location information; building a feature model of each user from the user's historical query records; and analyzing the users' historical call records to build each user's social network and compute the strength of the relationships between users. At search time, the search results are first filtered by content using the user feature model, then filtered collaboratively using the user-group information and the mined social network information, and finally returned to the user. By mining user characteristics and filtering information in this way, search results can be filtered in a personalized manner, large numbers of irrelevant results can be removed, the result set can be reduced, and personalized, precise search in the mobile environment can be achieved.

Description

A method for filtering search results in mobile scenarios

Technical field

The invention belongs to the field of information retrieval and in particular relates to a method for filtering search results in a mobile scenario. The method is suitable for personalized search in mobile scenarios.

Background

Over the past decade or so, search engine technology has developed rapidly. Traditional Internet search has matured, from technical implementation to business model, and has been highly successful. In recent years, new technologies and applications represented by the mobile Internet have continued to emerge, and mobile search is one of the mobile Internet's important applications.

Because of the mobility and portability of mobile terminals, and their limitations in screen size, processing power, and available bandwidth, mobile search cannot simply copy existing Internet search implementations. There are two main reasons:

(1) A traditional Internet search engine usually returns a large number of results, and in most cases more than half of them are irrelevant to the user. A major cause is that the search engine only performs simple matching on the search keywords and does not consider other information, such as the user's context or personal preferences. Combined with the rapid growth of information on the Internet, this produces many "garbage results" that users must sift through themselves, which greatly increases their burden. In a mobile scenario, given the limited screen and keyboard size, processing power, and available bandwidth of the terminal, this is unacceptable to users: the garbage results waste valuable data traffic, and paging through and screening results on a mobile terminal is very inconvenient. Mobile search must therefore be precise, returning as few results as possible, all of them accurate.

(2) For a given search keyword, a traditional Internet search engine returns the same results to all users. However, users differ in background knowledge, interests, and information needs. The same keyword may mean different things to different people, in different fields, and at different times and places, and what a user needs is often only a small subset of all the search results. The mobility, portability, and personal nature of mobile terminals let users obtain information anytime and anywhere, which makes the demand for personalized search stronger. Mobile search is therefore a personalized search that depends on the user's personal characteristics (such as interests) and context (such as time, location, and weather).

Mobile search therefore needs to be personalized and precise. At present, domestic research on mobile search is still at an early stage, and its implementation techniques are less mature than those of existing Internet search. Earlier techniques include vertical search, such as mobile music search and novel search. The approach most commonly adopted today combines existing Internet search technology with auxiliary techniques such as information filtering: a feature model of the user is built first, and the model is then used to filter the search results in a personalized way, removing irrelevant results to achieve personalized, precise search.

Common techniques for user feature modeling are the vector space model and the ontology model. The vector space model is the more widely used because its principle is simple and it is easy to implement.

Common information filtering techniques are content-based filtering and collaborative filtering. Content-based filtering extracts features from each result, computes the similarity between the result and a filtering template (the user model), and filters according to a set threshold. Because it analyzes the content of the results, it usually filters well, but it is computationally expensive. Collaborative filtering rests on the idea that people of the same type usually share the same interests and preferences: the search results of a user are filtered with the help of other users whose interests are similar to those of the current user. This technique has been well developed and widely applied in e-commerce.

Summary of the invention

The purpose of the invention is to provide a method for filtering search results in a mobile scenario. The method builds a user feature model and a user social network by mining user data (historical location information, historical call records, and so on), and uses them to apply content-based filtering and collaborative filtering, respectively, to the search results. Irrelevant results are filtered out, achieving personalized, precise search in mobile scenarios, which is valuable for improving the user experience and user stickiness of mobile search.

The method for filtering search results in a mobile scenario provided by the invention comprises the following steps:

Step 1: For the initial result set R₁, R₂, ..., R_Z to be filtered for user U_i, i = 1, 2, ..., N, build a feature vector for each result in the d-dimensional vector space. The feature vector of R_r is written f_Rr = {(q₁, v₁), (q₂, v₂), ..., (q_d, v_d)}, where v_a is the weight on dimension a. The weights v_a of f_Rr are computed with the term frequency/inverse document frequency (TF/IDF) model: for each word q_a among q₁, q₂, ..., q_d, if the word does not appear in R_r its weight is 0; otherwise its weight is its TF/IDF value. TF is the number of times the word appears in R_r. IDF is the inverse document frequency, obtained by counting the number z of results that contain the word.

Here the IDF value is log(Z/z), Z is the number of initial results to be filtered, the TF/IDF value is the product of TF and IDF, r = 1, 2, ..., Z, and a = 1, 2, ..., d.
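As a worked illustration of the weight calculation, with made-up numbers and assuming the natural logarithm (the text does not state the base): suppose Z = 4 results are to be filtered, the word q_a occurs TF = 3 times in R_r, and z = 2 of the results contain it. Then

$v_{a} = TF \cdot \log(Z/z) = 3 \cdot \ln(4/2) = 3 \ln 2 \approx 2.08$

A word contained in every result (z = Z) has IDF = log 1 = 0 and therefore weight 0, whatever its TF.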

Step 2: Find users similar to the current user U_i. Candidates are drawn from two user sets: the group G_g to which the user belongs, where g is the index of that group and ranges from 1 to m, and the set of users in the user's social network. The two sets are merged into a set S, whose members are denoted U_is. The similarity between U_i and each user U_is in S is computed with Formula I, which uses the vector cosine defined in Formula II: the smaller the angle between the vectors, the larger the cosine and the greater the similarity, and vice versa. Here i is the index of the user, N is the number of users, i = 1, 2, ..., N; f_Ui and f_Uis are the feature vectors of U_i and U_is; and ψ(U_i, U_is) is the degree of relationship between U_i and U_is. If U_is is in the social network of U_i, ψ(U_i, U_is) takes the corresponding value; otherwise it is zero. The top η users U_i1, U_i2, ..., U_iη are selected in descending order of similarity; if S contains fewer than η users, all users in S are selected. η is a preset value.

$sim(U_{i}, U_{is}) = (1 + \psi(U_{i}, U_{is})) \cdot \cos(f_{U_{i}}, f_{U_{is}})$ Formula I

$\cos(f_{U_{i}}, f_{U_{is}}) = \frac{f_{U_{i}} \cdot f_{U_{is}}}{\|f_{U_{i}}\| \cdot \|f_{U_{is}}\|}$ Formula II
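Formulas I and II can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names `cosine` and `user_similarity` are hypothetical, feature vectors are assumed to be equal-length lists of weights, and a zero vector is assumed to have similarity 0 with everything.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (Formula II)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def user_similarity(f_ui, f_uis, psi=0.0):
    """Formula I: cosine similarity boosted by the relationship degree psi.

    psi is 0 when the other user is not in the current user's social network.
    """
    return (1.0 + psi) * cosine(f_ui, f_uis)
```

Because ψ only scales the cosine, two users with no shared query words stay at similarity 0 however strong their social tie is.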

Step 3: Content-based filtering:

For each initial result R_r to be filtered, compute its similarity to user U_i with Formula III, where f_Ui and f_Rr are the feature vectors of U_i and R_r. Filter by the preset threshold ζ: initial results whose similarity is below ζ are discarded, giving the intermediate result set R_r, r = 1, 2, ..., Z_ζ. The intermediate results keep their original order.

$sim(U_{i}, R_{r}) = \cos(f_{U_{i}}, f_{R_{r}})$ Formula III

where $\cos(f_{U_{i}}, f_{R_{r}}) = \frac{f_{U_{i}} \cdot f_{R_{r}}}{\|f_{U_{i}}\| \cdot \|f_{R_{r}}\|}$
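The content-based step can be sketched as below. The name `content_filter` and the representation of results as `(result_id, feature_vector)` pairs are assumptions made for illustration; results are kept when their similarity is at least ζ, which matches discarding those below the threshold.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0  # assumption: zero vectors score 0


def content_filter(f_ui, results, zeta):
    """Keep the results whose similarity to the user (Formula III) is >= zeta.

    results is a list of (result_id, feature_vector) pairs. The list
    comprehension preserves the original order of the search engine.
    """
    return [(rid, vec) for rid, vec in results if cosine(f_ui, vec) >= zeta]
```

A higher ζ trades recall for a shorter result list, which is the behaviour the background section asks of mobile search.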

Step 4: Perform collaborative filtering on the intermediate result set R_r, r = 1, 2, ..., Z_ζ. Using the η users U_i1, U_i2, ..., U_iη most similar to user U_i, compute for each intermediate result R_r the similarity sim′(U_i, R_r) according to Formula IV, in which $\cos(f_{U_{is}}, f_{U_{i}})$ and $\cos(f_{U_{is}}, f_{R_{r}})$ are the similarities between U_is and U_i and between U_is and R_r, respectively:

$sim'(U_{i}, R_{r}) = \sum_{s=1}^{\eta} \left( \cos(f_{U_{is}}, f_{U_{i}}) \cdot \cos(f_{U_{is}}, f_{R_{r}}) \right)$ Formula IV

Rank_r = θ·r + (1-θ)·sim′(U_i, R_r) Formula V

Collaborative filtering is then applied to sim′(U_i, R_r) with the preset threshold ε: intermediate results whose similarity is below ε are discarded, giving the temporary result set R_r, r = 1, 2, ..., Z_ε, where r is the position of the result in the temporary result set, running 1, 2, ..., Z_ε. For each temporary result R_r, Formula V computes the weighted sum of its position r and sim′(U_i, R_r) with the preset weighting coefficient θ, giving the final ranking value Rank_r. The temporary result set is reordered by this ranking to obtain the final results, which are returned to the user, and the filtering process ends.
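Formulas IV and V can be sketched as follows, taking the cosines as precomputed inputs. The function names and the input layout are hypothetical. The text does not say whether the final list is sorted by ascending or descending Rank_r, so the sketch exposes the direction as a `descending` flag; the ascending default is an assumption.

```python
def collaborative_scores(neighbor_user_cos, neighbor_result_cos):
    """Formula IV: sim'(U_i, R_r) for every intermediate result r.

    neighbor_user_cos[s]      : cosine between neighbor s and the current user
    neighbor_result_cos[s][r] : cosine between neighbor s and result r
    """
    n_results = len(neighbor_result_cos[0]) if neighbor_result_cos else 0
    return [
        sum(u * row[r] for u, row in zip(neighbor_user_cos, neighbor_result_cos))
        for r in range(n_results)
    ]


def filter_and_rank(result_ids, scores, epsilon, theta, descending=False):
    """Drop results with sim' < epsilon, then reorder by Formula V."""
    kept = [(rid, s) for rid, s in zip(result_ids, scores) if s >= epsilon]
    # r is the 1-based position within the temporary result set.
    ranked = [
        (theta * r + (1 - theta) * s, rid)
        for r, (rid, s) in enumerate(kept, start=1)
    ]
    # Assumption: sort direction is not specified in the text.
    ranked.sort(key=lambda pair: pair[0], reverse=descending)
    return [rid for _, rid in ranked]
```

With θ = 1 the original search-engine order is preserved (for the ascending default), and with θ = 0 the order depends only on sim′.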

The method provided by the invention combines data mining methods (classification and clustering) with a content-based filtering algorithm and a collaborative filtering algorithm. Specifically, the invention has the following effects and advantages:

(1) High accuracy. The invention innovatively analyzes the user's social network information and applies collaborative filtering on top of traditional content-based filtering, which greatly improves accuracy.

(2) Strong adaptability. The invention takes into account the diversity of mobile user groups and individuals, and adapts well to their personalized needs.

(3) High extensibility. Besides mobile search, the filtering method can be used in other mobile Internet applications, such as targeted advertising, and the user feature modeling method can also be applied to customer relationship management (CRM) and similar fields.

Description of the drawings

Figure 1 is the overall flowchart of the method of the invention;

Figure 2 is a schematic diagram of the frequency of change of mobile users' historical locations;

Figure 3 is the flowchart for clustering mobile users by location;

Figure 4 is a structural diagram of a mobile user's social network;

Figure 5 is the detailed flowchart for filtering mobile search results.

Detailed description of the embodiments

The invention is described in detail below with reference to the accompanying drawings.

As shown in Figure 1, the method for filtering search results in a mobile scenario provided by the invention begins with a filtering preprocessing stage, which mainly comprises user segmentation, construction of the user feature model, and construction of the user social network, corresponding to steps (1) to (3) below. It is followed by the result filtering stage, corresponding to step (4) below. The specific processing steps are as follows:

1. Filtering preprocessing stage, comprising steps (1) to (3) below.

(1) User segmentation. Users are segmented by data mining. The user data sets provided by existing telecom operators contain a large amount of user data, such as historical location information, historical call records, historical query and browsing records, and historical service data. The invention segments users mainly by their historical location information, in the following steps:

(a) Divide the users according to how frequently their historical location changes. A user's historical location information records the historical location L and the corresponding time T. The location L is recorded in the data set as latitude and longitude, for example (30.2332, 114.3243), and the time T is recorded as a point in time. Given the latitude and longitude of two adjacent historical locations of a user, their distance is easily computed with the latitude-longitude distance formula (formula (1)). Let the first location L₁ have longitude and latitude (lon₁, lat₁) and the second location L₂ have (lon₂, lat₂). Relative to the 0° meridian, east longitude is positive and west longitude is negative; a north latitude enters the calculation as (90° - lat) and a south latitude as (90° + lat). Formula (1) then gives the distance between the two points, with R taken to be the radius of the Earth.

C = sin(lat₁)·sin(lat₂)·cos(lon₁ - lon₂) + cos(lat₁)·cos(lat₂)

$Dis(L_{1}, L_{2}) = R \cdot \arccos(C) \cdot \frac{\pi}{180}$ (1)
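A minimal Python sketch of formula (1) follows. The function name `distance_km` and the value 6371 km for R are assumptions; the text names R without giving a value. Inputs are taken as signed degrees (north and east positive), so the single conversion 90° - lat reproduces both the (90° - lat) rule for north latitudes and the (90° + lat) rule for south latitudes.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius; the text only names R


def distance_km(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points, following formula (1).

    Longitudes and latitudes are signed degrees: east and north are positive.
    """
    # Latitude is converted to the angle measured from the north pole.
    colat1 = math.radians(90.0 - lat1)
    colat2 = math.radians(90.0 - lat2)
    dlon = math.radians(lon1 - lon2)
    c = (math.sin(colat1) * math.sin(colat2) * math.cos(dlon)
         + math.cos(colat1) * math.cos(colat2))
    c = max(-1.0, min(1.0, c))  # guard against rounding just outside [-1, 1]
    # math.acos returns radians, so the pi/180 factor of formula (1), which
    # converts an arccosine expressed in degrees, is already accounted for.
    return radius * math.acos(c)
```

For example, `distance_km(114.3243, 30.2332, 114.3243, 30.2332)` is 0 because both points coincide, and two points a quarter of the equator apart are separated by R·π/2.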

For each user U_i (i = 1, 2, ..., N), compute the cumulative frequency of historical location change F_i (i = 1, 2, ..., N) over the most recent period ΔT (for example, one month), where N is the number of users.

$F_{i} = \frac{1}{\Delta T} \sum_{k=2}^{M} \left| \frac{Dis(L_{k}, L_{k-1})}{T_{k} - T_{k-1}} \right|$ (2)

In formula (2), (L₁, T₁), (L₂, T₂), ..., (L_M, T_M) are the historical location records of user U_i (i = 1, 2, ..., N) within the most recent period ΔT; (L_{k-1}, T_{k-1}) and (L_k, T_k) are two adjacent historical location and time records of the user; and Dis(L_k, L_{k-1}) and T_k - T_{k-1} are the distance and the time difference between the two adjacent records. M is the number of historical locations of the current user and k is the index of a historical location.
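Formula (2) can be sketched as below. The function name `movement_frequency` is hypothetical, the distance function is passed in as a parameter so that any implementation of formula (1) can be used, and skipping records with identical timestamps is an assumption made to avoid division by zero.

```python
def movement_frequency(track, delta_t, dist):
    """Formula (2): cumulative location-change frequency F_i of one user.

    track   : list of (location, time) records sorted by time
    delta_t : length of the observation window, in the same unit as the times
    dist    : callable returning the distance between two locations
    """
    total = 0.0
    # Pair each record with its predecessor: (L_{k-1}, T_{k-1}), (L_k, T_k).
    for (loc_prev, t_prev), (loc_cur, t_cur) in zip(track, track[1:]):
        elapsed = t_cur - t_prev
        if elapsed == 0:
            continue  # assumption: ignore duplicate timestamps
        total += abs(dist(loc_cur, loc_prev) / elapsed)
    return total / delta_t
```

A user with a single record in the window has no adjacent pairs and therefore F = 0.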

Collect the F values of all users to obtain the overall range Ω of F, and divide Ω into sub-intervals Ω₁, Ω₂, ..., Ω_n, where n is the number of user groups. Each sub-interval characterizes a different user group by its F, and each user is assigned to the sub-interval containing his or her F. As shown in Figure 2, user A has a high F and may be a business person who travels frequently, whereas user B has a low F and probably stays at one fixed location for long periods, for example a university student. In this way the users are given a preliminary division, by location change frequency F, into the groups Ω₁, Ω₂, ..., Ω_n. Ω may be divided into equal sub-intervals, or according to a division standard preset by the system.
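The equal-division option can be sketched as follows; a system-defined division standard would replace the width calculation. The function name `assign_frequency_groups` and the zero-based group indices are assumptions.

```python
def assign_frequency_groups(freqs, n):
    """Map each F value to the index (0 .. n-1) of its equal-width sub-interval.

    freqs must be non-empty; n is the number of sub-intervals.
    """
    lo, hi = min(freqs), max(freqs)
    if hi == lo:
        return [0] * len(freqs)  # every user falls into the same interval
    width = (hi - lo) / n
    # The maximum F would compute to index n, so it is clamped into the
    # last sub-interval.
    return [min(int((f - lo) / width), n - 1) for f in freqs]
```

Equal-width intervals are simple but sensitive to outliers: one very mobile user stretches Ω and can leave most users in the lowest interval.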

(b) Next, cluster the users in each Ω_j (j = 1, 2, ..., n, where j is the index of the group) by their historical location information, so that users whose locations are close fall into the same class. Relevant studies indicate that geographically close users have, to some extent, similar characteristics. The users in each Ω_j (j = 1, 2, ..., n) are clustered with the k-means algorithm, as follows:

(b1) First compute the center position O_i of the historical locations of each user U_i (i = 1, 2, ..., N) within ΔT, where i is the index of the user; the users are clustered according to O_i.

(b2) Randomly select k users from Ω_j (j = 1, 2, ..., n). Each selected user U_q (q = 1, 2, ..., k) represents an initial user cluster C_q (q = 1, 2, ..., k), and its O_q (q = 1, 2, ..., k) is the initial center of that cluster.

(b3) For each remaining user in Ω_j (j = 1, 2, ..., n), compute the distance (with the latitude-longitude distance formula) to the center O_q (q = 1, 2, ..., k) of each user cluster C_q (q = 1, 2, ..., k), and assign the user to the nearest cluster.

(b4) Recompute the center O_q (q = 1, 2, ..., k) of each user cluster and replace the old center with it. Compute the value of the criterion function E_j according to formula (3). If the value of E_j has converged, the clustering ends; otherwise, return to step (b3).

$E_{j} = \sum_{q=1}^{k} \sum_{U \in \Omega_{j}} Dis(U, C_{q})$, (j = 1, 2, ..., n) (3)

In formula (3), Dis(U, C_q) is the distance between a user in Ω_j (j = 1, 2, ..., n) and the center O_q (q = 1, 2, ..., k) of user cluster C_q (q = 1, 2, ..., k).

Clustering yields compact user clusters. On top of the division into Ω₁, Ω₂, ..., Ω_n, the users are thus further divided into smaller groups G₁, G₂, ..., G_m, which completes the user segmentation.
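Steps (b2) to (b4) can be sketched as a small k-means routine. Several choices here are assumptions rather than details from the text: the name `kmeans` and its parameters, the Euclidean default distance (`math.dist`; the method itself uses the latitude-longitude distance), the coordinate-wise mean as the new center, and summing E_j over each user's own cluster, which is the usual k-means reading of formula (3). The seed users are assigned along with everyone else, which gives the same result because each seed is at distance 0 from its own center.

```python
import math
import random


def kmeans(points, k, dist=math.dist, initial=None, tol=1e-9, max_iter=100, seed=None):
    """Cluster user center positions O_i; returns (labels, centers)."""
    if initial is None:
        # (b2) pick k users at random as the initial cluster centers
        initial = random.Random(seed).sample(range(len(points)), k)
    centers = [tuple(points[i]) for i in initial]
    labels, prev_e = [], None
    for _ in range(max_iter):
        # (b3) assign every user to the nearest cluster center
        labels = [min(range(k), key=lambda q: dist(p, centers[q])) for p in points]
        # (b4) recompute each center as the mean of its members
        for q in range(k):
            members = [p for p, lab in zip(points, labels) if lab == q]
            if members:  # an empty cluster keeps its previous center
                centers[q] = tuple(sum(c) / len(members) for c in zip(*members))
        # criterion E_j: total distance from each user to its cluster center
        e = sum(dist(p, centers[lab]) for p, lab in zip(points, labels))
        if prev_e is not None and abs(prev_e - e) < tol:
            break  # E_j has converged
        prev_e = e
    return labels, centers
```

Passing `initial` fixes the starting users, which makes a run reproducible; leaving it as `None` follows the random selection described in step (b2).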

(2) Build the user feature model. A user's historical query records characterize the user's interests well. By analyzing these records, a feature model of the user is built with the vector space model, in the following steps:

(a) Collect all historical query records of all users within ΔT and extract the d distinct words q₁, q₂, ..., q_d they contain, which serve as the d dimensions of the vector space. The feature vector of a user is written f_Ui = {(q₁, v₁), (q₂, v₂), ..., (q_d, v_d)} (i = 1, 2, ..., N), where v_a (a = 1, 2, ..., d) is the weight of dimension a.

(b) Using the TF/IDF (term frequency/inverse document frequency) model, compute the weight of each dimension of the feature vector of each user U_i (i = 1, 2, ..., N). For each word q_a (a = 1, 2, ..., d) among q₁, q₂, ..., q_d, if the word does not appear in the user's historical query records, its weight v_a (a = 1, 2, ..., d) is 0; otherwise the weight is the word's TF/IDF value. TF is the term frequency, here the number of times the word appears in the user's historical query records. IDF is the inverse document frequency: count the number D of users whose historical query records contain the word; the IDF value is then log(N/D), where N is the total number of users. The TF/IDF value is the product of TF and IDF.
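Steps (a) and (b) can be sketched as follows. The function name `user_feature_vectors` and the input layout are hypothetical, queries are assumed to be already split into words, the vocabulary is sorted only to make the dimension order deterministic, and the natural logarithm is assumed because the text does not give the base.

```python
import math
from collections import Counter


def user_feature_vectors(query_logs):
    """Build TF/IDF feature vectors from per-user query words.

    query_logs : dict mapping a user id to the list of words in that user's
                 historical queries within the window
    Returns (vocabulary, vectors); vectors[user] is aligned with vocabulary.
    """
    n_users = len(query_logs)
    # TF: how often each word occurs in each user's query history.
    tf = {user: Counter(words) for user, words in query_logs.items()}
    # (a) the d distinct words form the dimensions of the vector space
    vocabulary = sorted(set().union(*tf.values()))
    # D: number of users whose history contains the word
    users_with = Counter(word for counts in tf.values() for word in counts)
    idf = {w: math.log(n_users / users_with[w]) for w in vocabulary}
    # (b) weight = TF * IDF; a Counter returns 0 for absent words
    vectors = {
        user: [counts[w] * idf[w] for w in vocabulary]
        for user, counts in tf.items()
    }
    return vocabulary, vectors
```

A word that every user has searched for gets IDF = log(N/N) = 0, so it contributes nothing to any user's vector.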

(3) Mine the user's social network information by analyzing the users' historical call records. For each user U_i (i = 1, 2, ..., N), the social network takes the form of a star topology centered on that user, as shown in Figure 4. The central node B represents the user; the outer nodes A, C, D, E, F, G, and so on represent users who have call records with B; and the edge weight ψ represents the degree of relationship between the users. The main task of this step is to estimate the value of ψ.

The historical call record data contain the call records between all users, including the ID numbers of both parties, the call start time, the call end time, and so on. For each user U_i (i = 1, 2, ..., N), analyze the call records within ΔT. For each user u_x (x = 1, 2, ..., e, where e is the number of users with whom U_i has call records), determine the total number of calls α, the total call duration β, and the call regularity γ between u_x and U_i within ΔT. Taken together, these factors allow the degree of relationship ψ_ix between U_i (i = 1, 2, ..., N) and u_x (x = 1, 2, ..., e) to be roughly inferred.

The total number of calls α and the total call duration β are easy to obtain, but both are aggregate statistics. On their own they give only a rough overall estimate of the degree of relationship between users and ignore important details, such as whether the call events are evenly distributed over time, and whether any evenness is global or only local. The call regularity γ is therefore introduced as a further factor characterizing the degree of relationship between U_i (i = 1, 2, ..., N) and u_x (x = 1, 2, ..., e). It is obtained by statistically analyzing the time distribution of all call events within ΔT, borrowing the idea of variance, as in formulas (4) to (7): t_h (h = 1, 2, ..., α) is the start time of each call, Δt_h is the time difference between two adjacent call records, and S_t is the variance of these differences. γ is inversely proportional to S_t, as shown in formula (7): a small variance means that the calls in the period are relatively regular, so γ is correspondingly large, and vice versa.

Δt_h = t_h - t_{h-1}, (h = 2, 3, ..., α) (4)

$\overline{\Delta t} = \frac{1}{\alpha - 1} \sum_{h=2}^{\alpha} \Delta t_{h}$ (5)

$S_{t} = \frac{1}{\alpha - 1} \sum_{h=2}^{\alpha} \left( \overline{\Delta t} - \Delta t_{h} \right)^{2}$ (6)

$\gamma = \frac{1}{S_{t}}$ (7)

The computed α, β, and γ are normalized to values between 0 and 1. The value of ψ_ix (i = 1, 2, ..., N; x = 1, 2, ..., e) is then computed with formula (8) as a weighted combination of α, β, and γ. In formula (8), 0 ≤ λ₁ ≤ 1, 0 ≤ λ₂ ≤ 1, 0 ≤ λ₃ ≤ 1, and λ₁ + λ₂ + λ₃ = 1; by default each weight is 1/3.

ψ_ix = λ₁·α + λ₂·β + λ₃·γ, (λ₁ + λ₂ + λ₃ = 1) (8)

Through the analysis and calculation in this step, the social network information of each user U_i (i = 1, 2, ..., N) is obtained, including the degree of relationship ψ_ix with each user u_x (x = 1, 2, ..., e) with whom U_i has been in contact.
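The call regularity and relationship degree can be sketched as below. The function names are hypothetical. The text does not say how α, β, and γ are normalized, so `relationship_degree` assumes its inputs are already in [0, 1]; the handling of fewer than two calls and of zero variance are likewise assumptions.

```python
def call_regularity(start_times):
    """gamma = 1 / S_t per formulas (4) to (7); start_times sorted ascending."""
    # Formula (4): gaps between adjacent call start times (alpha - 1 of them).
    gaps = [b - a for a, b in zip(start_times, start_times[1:])]
    if not gaps:
        return 0.0  # assumption: fewer than two calls shows no regularity
    mean_gap = sum(gaps) / len(gaps)                               # formula (5)
    variance = sum((mean_gap - g) ** 2 for g in gaps) / len(gaps)  # formula (6)
    if variance == 0:
        # assumption: perfectly regular calls; cap this value when normalizing
        return float("inf")
    return 1.0 / variance                                          # formula (7)


def relationship_degree(alpha, beta, gamma, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Formula (8), applied to alpha, beta, gamma already normalized to [0, 1]."""
    l1, l2, l3 = weights
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return l1 * alpha + l2 * beta + l3 * gamma
```

Because γ is unbounded before normalization, the choice of normalization (for example, dividing by the largest finite γ among a user's contacts) noticeably affects ψ and is left open here.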

(4)搜索结果过滤，前面步骤(1)至步骤(3)都是准备阶段，是为了该步骤的搜索结果过滤服务的，步骤(2)建立的用户特征模型是用来对搜索结果进行基于内容的过滤，步骤(1)所做的用户细分和步骤(3)挖掘的用户社交网络信息是用来对搜索结果进行协同过滤。(4) Search result filtering. Steps (1) to (3) above are all preparatory stages for the search result filtering service of this step. The user characteristic model established in step (2) is used to perform search results based on Content filtering, user segmentation in step (1) and user social network information mined in step (3) are used for collaborative filtering of search results.

该步骤对搜索结果先进行基于内容的过滤，然后进行协同过滤。以达到个性化和精简搜索结果的目的。In this step, content-based filtering is first performed on search results, and then collaborative filtering is performed. To personalize and refine search results.

用户U_i，(i＝1，2，...，N)提交一次搜索Q，搜索请求首先由现有互联网搜索引擎来处理，现有互联网搜索引擎对搜索Q返回一个初始结果集，该结果集通常较大，选取该结果集里的前φ条结果来进行过滤，若不足φ条，则选取全部初始结果集，作为待过滤结果集R₁，R₂，...，R_Z，φ为一个经验值，由系统预先设定，如设定为300，Z为待过滤结果的个数。结果的过滤流程如图5所示，步骤如下：User U _i , (i=1, 2, ..., N) submits a search Q, the search request is first processed by the existing Internet search engine, and the existing Internet search engine returns an initial result set to the search Q, the result The set is usually large, select the first φ results in the result set to filter, if there are less than φ, select all the initial result sets as the result sets to be filtered R ₁ , R ₂ ,..., R _Z , φ is an empirical value, preset by the system, for example, it is set to 300, and Z is the number of results to be filtered. The result filtering process is shown in Figure 5, and the steps are as follows:

(a)对待过滤结果集R₁，R₂，...，R_Z，建立特征向量，采用步骤(2)中建立的d维向量空间对这些结果建立特征向量，R_r(r＝1，2，...，Z)的特征向量表示为f_Rr＝{q₁，v₁)，(q₂，v₂)，...，(q_d，v_d)}，(r＝1，2，...，Z)，v_a，(a＝1，2，...，d)代表各个维上的权值。同样采用步骤(2)中用到的TF/IDF(词频/逆文档频率)模型来计算f_Rr，(r＝1，2，...，Z)在每一维上的权值v_a，(a＝1，2，...，d)，对q₁，q₂，...q_d中的每一个词q_a，(a＝1，2，...，d)，如果其没有出现在R_r，(r＝1，2，...，Z)中，则其权值为0，否则为其TF/IDF值，TF为其在R_r，(r＝1，2，...，Z)中出现的次数，IDF即逆文档频率，统计那些包含该词的结果个数z，IDF值即log(Z/z)，Z是所有结果数，TF/IDF值为TF与IDF的乘积。(a) To filter the result sets R ₁ , R ₂ , ..., R _Z , establish feature vectors, use the d-dimensional vector space established in step (2) to establish feature vectors for these results, R _r (r=1, 2, ..., Z) is expressed as f _Rr = {q ₁ , v ₁ ), (q ₂ , v ₂ ), ..., (q _d , v _d )}, (r=1, 2, . . . , Z), v _a , (a=1, 2, . . . , d) represent weights on each dimension. Also use the TF/IDF (term frequency/inverse document frequency) model used in step (2) to calculate f _Rr , the weight v _a of (r=1, 2, ..., Z) on each dimension, (a=1,2,...,d), for each word q _a in q ₁ , q ₂ ,...q _d , (a=1,2,...,d), if its If it does not appear in R _r , (r=1, 2, ..., Z), its weight is 0, otherwise it is its TF/IDF value, and TF is its value in R _r , (r=1, 2, ..., Z), IDF is the inverse document frequency, and counts the number of results containing the word z, the IDF value is log(Z/z), Z is the number of all results, and the TF/IDF value is TF Multiplied by IDF.

(b) Next, find the users similar to the current user U_i (i = 1, 2, ..., N). Candidates are drawn from two user sets: the group G_g to which the user belongs from step (1), where g is the index of that group and ranges from 1 to m, and the set of users in the user's social network established in step (3). The two sets, which may share users, are merged into a set S, and several similar users are selected from S.

$\mathrm{sim}(U_i, U_{is}) = (1 + \psi(U_i, U_{is})) \cdot \cos(f_{U_i}, f_{U_{is}}) \qquad (9)$

$\cos(f_{U_i}, f_{U_{is}}) = \frac{f_{U_i} \cdot f_{U_{is}}}{\| f_{U_i} \| \cdot \| f_{U_{is}} \|} \qquad (10)$

In formula (10), || || denotes the norm (modulus) of a vector.

(5) The similarity between U_i (i = 1, 2, ..., N) and each user U_is in the set S is computed as in formula (9), using the vector cosine of formula (10): the smaller the angle between the vectors, the larger the cosine and the greater the similarity, and vice versa. f_Ui and f_Uis are the feature vectors of U_i and U_is, and ψ(U_i, U_is) is the degree of relationship between U_i and U_is: if U_is is in U_i's social network, ψ(U_i, U_is) takes the corresponding value; otherwise it is zero. The top η users U_i1, U_i2, ..., U_iη are selected in descending order of similarity; if S contains fewer than η users, all users in S are selected. η is an empirical value preset by the system; its default may be 10.
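As an illustration of step (b), the following Python sketch scores every candidate in S with formula (9) and keeps the top η. The names `cosine` and `top_similar_users`, the dictionary-based inputs, the removal of the current user from S, and the score of 0 for an all-zero vector are assumptions of this sketch, not details given in the text.

```python
import math

def cosine(u, v):
    """Formula (10): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0

def top_similar_users(current, group, social, features, psi, eta=10):
    """Pick the eta users most similar to `current` from S = group ∪ social.

    features: dict user_id -> feature vector.
    psi: dict user_id -> relation degree with `current`; users outside the
         social network are simply absent and count as 0.
    """
    # merging as a set removes users that appear in both sources;
    # dropping the current user from S is an assumption of this sketch
    candidates = (set(group) | set(social)) - {current}
    scored = [
        (uid, (1 + psi.get(uid, 0.0)) * cosine(features[current], features[uid]))
        for uid in candidates
    ]  # formula (9)
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:eta]  # fewer than eta candidates: all of S is returned
```

Because ψ only ever increases the score, a socially connected user can outrank a user whose query profile is closer but who has no call history with the current user.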

(c) The results are then filtered in two stages, a content-based filtering stage and a collaborative filtering stage:

(c1) Content-based filtering comes first. For each initial result R_r (r = 1, 2, ..., Z) from (a), its similarity to user U_i (i = 1, 2, ..., N) is computed with the vector cosine of formula (10), as shown in formula (11), where f_Ui and f_Rr are the feature vectors of U_i and R_r. Results whose similarity is below the threshold ζ are filtered out, yielding the intermediate result set R_r (r = 1, 2, ..., Z_ζ), which keeps the original order of the results. The threshold ζ is an empirical value preset by the system, with 0 ≤ ζ ≤ 1; its default may be set to 0.65.

$\mathrm{sim}(U_i, R_r) = \cos(f_{U_i}, f_{R_r}) \qquad (11)$
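The content-based stage amounts to a threshold test on formula (11). A minimal Python sketch follows; the helper names `cosine` and `content_filter` and the choice to return indices rather than the results themselves are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Vector cosine of formula (10)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0

def content_filter(user_vec, result_vecs, zeta=0.65):
    """Return the indices of results whose similarity to the user is not
    below zeta, in their original order (formula (11))."""
    return [r for r, vec in enumerate(result_vecs) if cosine(user_vec, vec) >= zeta]
```

Returning indices keeps the original ordering explicit, which the text requires for the intermediate result set.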

(c2) Next, collaborative filtering is applied to the intermediate result set R_r (r = 1, 2, ..., Z_ζ). Collaborative filtering rests on the idea that similar users usually have similar interests, so the users similar to the current user are used to make collaborative recommendations for the current user. Using the η most similar users U_i1, U_i2, ..., U_iη of user U_i (i = 1, 2, ..., N) obtained in step (b), the similarity sim′(U_i, R_r) is computed for each intermediate result R_r (r = 1, 2, ..., Z_ζ) according to formula (12). In that formula, cos(f_Uis, f_Ui) and cos(f_Uis, f_Rr), both computed with the vector cosine of formula (10), are the similarity between U_is and U_i and between U_is and R_r, respectively.

$\mathrm{sim}'(U_i, R_r) = \sum_{s=1}^{\eta} \left( \cos(f_{U_{is}}, f_{U_i}) \cdot \cos(f_{U_{is}}, f_{R_r}) \right) \qquad (12)$

$\mathrm{Rank}_r = \theta \cdot r + (1 - \theta) \cdot \mathrm{sim}'(U_i, R_r) \qquad (13)$

Collaborative filtering is then performed on sim′(U_i, R_r) with the threshold ε: results whose similarity is below ε are filtered out, yielding the temporary result set R_r (r = 1, 2, ..., Z_ε), where r is the position of a result in the temporary result set, running 1, 2, ..., Z_ε. For each R_r (r = 1, 2, ..., Z_ε), the weighted sum of its position r and sim′(U_i, R_r) is computed with the weighting coefficient θ and taken as the final rank Rank_r, as shown in formula (13). The results R_r (r = 1, 2, ..., Z_ε) are re-sorted by this rank to give the final results, which are returned to the user, and the filtering process ends. The threshold ε and the weighting coefficient θ are both empirical values preset by the system, with 0 ≤ ε ≤ 1 and 0 ≤ θ ≤ 1; the default value of ε may be set to 0.85 and that of θ to 0.5.
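The collaborative stage can be sketched in Python as follows. The function names and the tuple-based return value are assumptions for illustration. The text does not say whether the final re-sort by Rank_r is ascending or descending, so the sketch stops at computing Rank_r and leaves the ordering to the caller.

```python
import math

def cosine(u, v):
    """Vector cosine of formula (10)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0

def collaborative_filter(user_vec, neighbor_vecs, result_vecs, epsilon=0.85, theta=0.5):
    """Apply formula (12), drop results below epsilon, score with formula (13).

    Returns (original_index, sim_prime, rank) tuples in the surviving order.
    """
    # cos(f_Uis, f_Ui) depends only on the neighbor, so compute it once
    neighbor_sims = [cosine(n, user_vec) for n in neighbor_vecs]
    kept = []
    for idx, rvec in enumerate(result_vecs):
        sim_prime = sum(w * cosine(n, rvec) for w, n in zip(neighbor_sims, neighbor_vecs))
        if sim_prime >= epsilon:
            kept.append((idx, sim_prime))
    # r is the 1-based position within the temporary result set
    return [(idx, s, theta * r + (1 - theta) * s) for r, (idx, s) in enumerate(kept, start=1)]
```

Since formula (12) is a sum over η neighbors, sim′ is not bounded by 1, which is worth keeping in mind when choosing ε within its stated range of 0 to 1.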

The present invention is not limited to the specific embodiments described above. Based on the content disclosed herein, a person of ordinary skill in the art can implement the invention in various other specific embodiments. Therefore, any design that adopts the design structure and ideas of the present invention with simple variations or modifications falls within the scope of protection of the present invention.

Claims

1. A method for filtering search results under a mobile scene, the method comprising the steps of:

Step 1: For the initial result set R₁, R₂, ..., R_Z to be filtered for user U_i, i = 1, 2, ..., N, build feature vectors. The feature vector of R_r is expressed as f_Rr = {(q₁, v₁), (q₂, v₂), ..., (q_d, v_d)}, where v_a is the weight on each dimension. The weight v_a of f_Rr on each dimension is computed with the term frequency/inverse document frequency (TF/IDF) model: for each word q_a among q₁, q₂, ..., q_d, if it does not appear in R_r, its weight is 0; otherwise its weight is its TF/IDF value, where TF is the number of times the word appears in R_r and IDF is the inverse document frequency, for which the number z of results containing the word is counted;

where the IDF value is log(Z/z), Z is the number of initial results to be filtered, the TF/IDF value is the product of TF and IDF, r = 1, 2, ..., Z, and a = 1, 2, ..., d;

Step 2: Find the users similar to the current user U_i, selecting from the following two user sets: the group G_g to which the user belongs, where g is the index of that group and ranges from 1 to m, and the set of users in the user's social network. Merge the two sets into a set S and denote a user in this set as U_is. Compute the similarity between user U_i and each user U_is in S as shown in Formula I, using the vector cosine of Formula II; the smaller the angle between the vectors, the larger the cosine and the greater the similarity, and vice versa. Here i is the index of the user, N is the number of users, i = 1, 2, ..., N, f_Ui and f_Uis are the feature vectors of U_i and U_is, and ψ(U_i, U_is) is the degree of relationship between U_i and U_is: if U_is is in U_i's social network, ψ(U_i, U_is) takes the corresponding value; otherwise it is zero. Select the top η users U_i1, U_i2, ..., U_iη in descending order of similarity; if there are fewer than η, select all users in S; η is a preset value;

$\mathrm{sim}(U_i, U_{is}) = (1 + \psi(U_i, U_{is})) \cdot \cos(f_{U_i}, f_{U_{is}})$

Formula I

$\cos(f_{U_i}, f_{U_{is}}) = \frac{f_{U_i} \cdot f_{U_{is}}}{\| f_{U_i} \| \cdot \| f_{U_{is}} \|}$

Formula II

Step 3: Content-based filtering:

For each initial result R_r to be filtered, compute its similarity to user U_i with Formula III, where f_Ui and f_Rr are the feature vectors of U_i and R_r; filter by the preset threshold ζ, removing the initial results whose similarity is less than ζ, to obtain the intermediate result set R_r, r = 1, 2, ..., Z_ζ, in which the results keep their original order;

$\mathrm{sim}(U_i, R_r) = \cos(f_{U_i}, f_{R_r})$

Formula III

where

$\cos(f_{U_i}, f_{R_r}) = \frac{f_{U_i} \cdot f_{R_r}}{\| f_{U_i} \| \cdot \| f_{R_r} \|}$

Step 4: Perform collaborative filtering on the intermediate result set R_r, r = 1, 2, ..., Z_ζ: using the η most similar users U_i1, U_i2, ..., U_iη of user U_i, compute for each intermediate result R_r the similarity sim′(U_i, R_r) according to Formula IV, in which cos(f_Uis, f_Ui) and cos(f_Uis, f_Rr) are the similarity between U_is and U_i and between U_is and R_r, respectively;

$\mathrm{sim}'(U_i, R_r) = \sum_{s=1}^{\eta} \left( \cos(f_{U_{is}}, f_{U_i}) \cdot \cos(f_{U_{is}}, f_{R_r}) \right)$

Formula IV

$\mathrm{Rank}_r = \theta \cdot r + (1 - \theta) \cdot \mathrm{sim}'(U_i, R_r)$

Formula V

Perform collaborative filtering on sim′(U_i, R_r) with the preset threshold ε: filter out the intermediate results whose similarity is less than ε to obtain the temporary result set R_r, r = 1, 2, ..., Z_ε, where r is the position of a result in the temporary result set, running 1, 2, ..., Z_ε; for each temporary result R_r, use Formula V with the preset weighting coefficient θ to compute the weighted sum of its position r and sim′(U_i, R_r) as the final rank Rank_r; re-sort the temporary result set R_r by this rank to obtain the final results, return them to the user, and end the filtering process.

2. The method for filtering search results under a mobile scene according to claim 1, characterized in that the initial result set in Step 1 is obtained as follows:

When user U_i submits a search Q, the search request is first processed by an existing Internet search engine, which returns an initial result set for Q; the first φ results of that set are selected for filtering, and if there are fewer than φ results, the entire initial result set is selected, giving the results to be filtered R₁, R₂, ..., R_Z, where φ is preset by the system and Z is the number of results to be filtered.

3. The method for filtering search results under a mobile scene according to claim 1, characterized in that Step 1 obtains the feature vectors of the results to be filtered as follows:

Collect all historical query records of all users within the period ΔT to obtain d distinct words q₁, q₂, ..., q_d, which serve as the d dimensions of the vector space; a user's feature vector is expressed as f_Ui = {(q₁, v₁), (q₂, v₂), ..., (q_d, v_d)}, i = 1, 2, ..., N, where v_a, a = 1, 2, ..., d, is the weight on each dimension.

4. The search result filtering method under the mobile scene according to claim 1, characterized in that: in the second step, the most similar user is obtained in the following manner:

Step 4.1: Find the users similar to the current user U_i: merge the group G_g to which the user belongs with the set of users in the user's social network to obtain a set S, where g is the index of the group to which the user belongs and ranges from 1 to m, and m is the number of groups;

Step 4.2: Use Formula VI to compute the similarity sim(U_i, U_is) between U_i and each user U_is in the set S, where f_Ui and f_Uis are the feature vectors of U_i and U_is, and ψ(U_i, U_is) is the degree of relationship between U_i and U_is: if U_is is in U_i's social network, ψ(U_i, U_is) takes the corresponding value; otherwise it is zero; select the top η users U_i1, U_i2, ..., U_iη in descending order of similarity;

$\mathrm{sim}(U_i, U_{is}) = (1 + \psi(U_i, U_{is})) \cdot \cos(f_{U_i}, f_{U_{is}})$

Formula VI

where

$\cos(f_{U_i}, f_{U_{is}}) = \frac{f_{U_i} \cdot f_{U_{is}}}{\| f_{U_i} \| \cdot \| f_{U_{is}} \|}$.

5. The method for filtering search results under a mobile scene according to claim 4, characterized in that in Step 4.1 the group G_g to which the user belongs is obtained as follows:

Step 5.1: Divide the users according to how frequently their historical locations change. A user's historical location record contains the historical locations L and the corresponding times T; each location L is recorded in the data set as latitude and longitude, and each time T as a time point. Given the latitude and longitude of two adjacent historical locations of a user, the distance between them is calculated with the latitude/longitude distance formula;

For each user U_i, calculate the cumulative change frequency F_i of its historical locations within the most recent period ΔT according to Formula VII:

$F_i = \frac{1}{\Delta T} \sum_{k=2}^{M} \left| \frac{\mathrm{dis}(L_k, L_{k-1})}{T_k - T_{k-1}} \right|$

Formula VII

(L₁, T₁), (L₂, T₂), ..., (L_M, T_M) are the historical location records of user U_i within the most recent period ΔT; (L_k-1, T_k-1) and (L_k, T_k) are two adjacent historical location and time records of the user; dis(L_k, L_k-1) and T_k − T_k-1 are the distance and the time difference between the two adjacent historical locations, respectively; M is the number of historical locations of the current user, and k is the index of a historical location;

Step 5.2: Collect the cumulative change frequency F of the historical locations of all users to obtain the overall range Ω of F, and divide Ω into several sub-intervals Ω₁, Ω₂, ..., Ω_n, where n is the number of user groups. Each sub-interval represents, by its values of F, a different user group; each user is assigned to the sub-interval that contains its F, so that the users are divided into the groups Ω₁, Ω₂, ..., Ω_n;

Step 5.3: Cluster the users in each Ω_j, j = 1, 2, ..., n, according to their historical location information, so that users with nearby locations fall into the same group; the users are thereby further divided into smaller groups G₁, G₂, ..., G_m, where j is the index of the group.

6. The method for filtering search results under a mobile scene according to claim 5, characterized in that Step 5.3 uses the k-means clustering algorithm to cluster the users in each Ω_j, with the following steps:

(b1) First compute the center position O_i of the historical locations of each user U_i within the most recent period ΔT, and cluster the users by their center positions O_i, where i is the index of the user;

(b2) Randomly select k users from Ω_j; each selected user U_q represents an initial user cluster C_q, and its center position O_q is the initial center of that cluster, q = 1, 2, ..., k;

(b3) For each remaining user in Ω_j, compute the distance to the center position O_q of each user cluster C_q, and assign the user to the nearest cluster;

(b4) Recompute the center position O_q of each user cluster and replace the old center with it; compute the value of the criterion function E_j according to Formula VIII; if the value of E_j has converged, the clustering ends; otherwise, return to step (b3);

$E_j = \sum_{q=1}^{k} \sum_{u \in \Omega_j} \mathrm{dis}(u, C_q), \quad j = 1, 2, \ldots, n$

Formula VIII

In Formula VIII, dis(u, C_q) is the distance between a user u in Ω_j and the center position O_q of the user cluster C_q;

(b5) The clustering yields compact user clusters, so that, on top of the division into Ω₁, Ω₂, ..., Ω_n, the users are further divided into smaller groups G₁, G₂, ..., G_m, achieving fine-grained user segmentation.

7. The search result filtering method under the mobile scene according to claim 4, characterized in that: in the 4.1 step, the user social network is constructed in the following manner:

Step 7.1: For each user U_i, compute the weight of each dimension of its feature vector with the term frequency/inverse document frequency (TF/IDF) model: for each word q_a among q₁, q₂, ..., q_d, if it does not appear in the user's historical query records, its weight v_a is 0; otherwise its weight is its TF/IDF value, where TF is the term frequency and IDF is the inverse document frequency; with D the number of users whose historical query records contain the word and N the total number of users, the IDF value is log(N/D), and the TF/IDF value is the product of TF and IDF;

Step 7.2: Analyze the call records of each user U_i within the most recent period ΔT to obtain, for each user u_x who had calls with U_i within ΔT, the total number of calls α, the total call duration β, and the call regularity γ, and calculate the degree of relationship ψ_ix between U_i and u_x with Formula IX;

$\psi_{ix} = \lambda_1 \cdot \alpha + \lambda_2 \cdot \beta + \lambda_3 \cdot \gamma$

Formula IX

where 0 ≤ λ₁ ≤ 1, 0 ≤ λ₂ ≤ 1, 0 ≤ λ₃ ≤ 1, and λ₁ + λ₂ + λ₃ = 1,

$\gamma = \frac{1}{S_t}$

$S_t = \frac{1}{\alpha - 1} \sum_{h=2}^{\alpha} \left( \overline{\Delta t} - \Delta t_h \right)^2$

$\Delta t_h = t_h - t_{h-1}, \quad h = 2, 3, \ldots, \alpha$

$\overline{\Delta t} = \frac{1}{\alpha - 1} \sum_{h=2}^{\alpha} \Delta t_h$.