CN106055594A - Information providing method based on user interests - Google Patents

Information providing method based on user interests

Publication number
CN106055594A
Authority
CN
China
Application number
CN201610346247.8A
Other languages
Chinese (zh)
Inventor
董政
吴文杰
陈露
李学生
Original Assignee
成都陌云科技有限公司
Application filed by 成都陌云科技有限公司
Priority to CN201610346247.8A
Publication of CN106055594A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The present invention provides an information providing method based on user interests. The method comprises performing data conversion and sampling on an input data set, obtaining initial retrieval results through data retrieval, and reordering the initial retrieval results on the basis of a relevance measure between the results and the query. In this method, a distributed retrieval system uniformly collects and manages the data sets, the retrieval results are further optimized on the basis of user feedback and evaluation, and users' customized requirements are satisfied efficiently.

Description

Information providing method based on user interests

Technical Field

[0001] The present invention relates to data push, and in particular to an information providing method based on user interests.

Background

[0002] In today's information age, with the continuous development of Internet technology and social informatization, the amount of information is growing explosively, and the Internet is continually influencing and changing people's daily lives. However, as online information becomes increasingly complex, how to efficiently find the information that meets one's needs in such a vast ocean of information has become a topic of growing concern. Distributed retrieval systems can help people find the information they need more precisely, but in some application areas, such as movie, music, and social network search, users generally cannot formulate good retrieval requests. By studying the user's history, the user's social information, and the attribute information of the data in the corresponding domain, the user's information or the domain data resources can be modeled, and data resources of potential interest can be recommended to the user in a reliable way. However, existing distributed retrieval systems vary in efficiency and user satisfaction, and they lack a common interface for handling heterogeneous data input.

Summary of the Invention

[0003] To solve the above problems in the prior art, the present invention proposes an information providing method based on user interests, comprising:

[0004] performing data conversion and sampling on an input data set;

[0005] obtaining initial retrieval results through data retrieval;

[0006] reordering the initial retrieval results based on a relevance measure between the results and the query.

[0007] Preferably, performing data conversion and sampling on the input data set further comprises:

[0008] After a data file is input into the distributed retrieval system through data aggregation and saved to the database, certain fields are first filtered out according to the user's needs. The processed data is then constructed into a rating matrix, which is saved to the database once construction is complete. If the data set before organization is not private to another user, the original data set can be found through the forward reference stored in the organized data set. When sampling data, the size of the data is read first and a Boolean matrix is constructed with all initial values false; a sampling method is then selected. The corresponding training set is computed by taking the bitwise AND of the Boolean matrix with the corresponding data set, and the test set is computed by bitwise negating the training set. The resulting training set and test set tables can then be used to perform retrieval.

[0009] Preferably, reordering the retrieval results based on the relevance measure between the results and the query further comprises:

[0010] First, the retrieval results are represented quantitatively: each retrieval result d_i is represented as a vector whose dimension is the size of the set of words that appear at least once in the retrieval result text, and whose value in each dimension is the weight of the corresponding word in that result, expressed by the inverse term frequency index. The relevance score Score between a result and the query is evaluated with the following formula:

Figure CN106055594AD00031

[0012] Figure CN106055594AD00041 denotes the weight of word t in retrieval result d_i;

[0013] Figure CN106055594AD00042 denotes the weight of word t in the query Q;

[0014] l(d_i) is the length of result d_i; tf(t|d_i) is the frequency with which word t occurs in result d_i; tf(t|Q) is the frequency with which word t occurs in the query Q; df(t|C) is the frequency of word t in the entire result set C; and k1, k2, b are preset tuning parameters;

[0015] Finally, the initial retrieval results are sorted from highest to lowest by their final score Score.

[0016] Compared with the prior art, the present invention has the following advantages:

[0017] The present invention proposes an information providing method based on user interests in which a distributed retrieval system uniformly collects and manages data sets and further optimizes retrieval results based on user feedback and evaluation, meeting users' personalized needs more efficiently.

Brief Description of the Drawings

[0018] FIG. 1 is a flowchart of an information providing method based on user interests according to an embodiment of the present invention.

Detailed Description

[0019] A detailed description of one or more embodiments of the present invention is provided below together with drawings that illustrate the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for purposes of example, and the invention may be practiced according to the claims without some or all of them.

[0020] One aspect of the present invention provides an information providing method based on user interests. FIG. 1 is a flowchart of the information providing method based on user interests according to an embodiment of the present invention.

[0021] In a distributed retrieval system, the present invention uniformly manages and stores the retrieval input data sets, performs data conversion on them, and evaluates results according to the feedback obtained. The evaluation unit of the distributed retrieval system includes a data management module, a retrieval execution module, and a presentation module.

[0022] The data management module receives data input, unifies formats, and performs feature analysis and sampling of data sets. After a data file is input into the system, the data aggregation submodule of the data management module converts it into a data resource that the system can recognize, and the data organization submodule then processes it into data that the system can compute on. Data organization includes unifying the format of input data from text files, database files, and log files, and converting it into a two-dimensional matrix or a multidimensional list so that subsequent data operations can proceed. When the retrieval execution module requests data, it includes the format of the requested data in the corresponding request parameter, and the data transmission submodule processes the sampled data according to that parameter.
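The conversion of heterogeneous records into a two-dimensional matrix can be illustrated with a minimal sketch. The record layout and the field names `user`, `item`, and `rating`, as well as the function name `build_rating_matrix`, are assumptions made for this example and are not specified in the source.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Record = Dict[str, str]


def build_rating_matrix(
    records: Iterable[Record],
    user_field: str = "user",      # hypothetical field name
    item_field: str = "item",      # hypothetical field name
    rating_field: str = "rating",  # hypothetical field name
) -> Tuple[List[str], List[str], List[List[Optional[float]]]]:
    """Convert parsed records into a users x items matrix.

    Fields other than the three named ones are dropped, which stands in
    for the field-filtering step. Cells without a rating stay None.
    """
    triples = [
        (r[user_field], r[item_field], float(r[rating_field])) for r in records
    ]
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    u_idx = {u: n for n, u in enumerate(users)}
    i_idx = {i: n for n, i in enumerate(items)}
    matrix: List[List[Optional[float]]] = [[None] * len(items) for _ in users]
    for u, i, value in triples:
        matrix[u_idx[u]][i_idx[i]] = value
    return users, items, matrix
```

Rows parsed from a text file, a database table, or a log file can all be passed in as dictionaries, so the same function serves every input type once parsing is done.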

[0023] Data sets are stored on different servers according to each server's storage status. When the retrieval execution module requests data from the data management module, the data management module first performs a cache lookup using a client-side hashing strategy. On a cache hit, the data set is taken directly from the cache; on a miss, the data is requested from the database.

[0024] When the data management module accesses a cache server, the key used to request the data set is first mapped by a predetermined algorithm to one of the cache servers, and the corresponding data value is then fetched from that server. To keep the hit rate as high as possible, the following strategy is adopted. A ring hash queue is used: the object to be looked up is mapped to a 32-bit key, and the value space from 0 to 2^32 − 1 is joined end to end into a ring. Caches and objects are mapped into the same value space by the same hash algorithm. Starting from the object's key value and moving clockwise around the ring, the object is stored in the first cache encountered. When a cache is removed, the objects lying counterclockwise between this cache and the next cache are traversed; when a cache is added, the objects in the interval found counterclockwise from the new cache's mapped position to the next cache are deleted from the next cache in the clockwise direction and mapped to the new cache.
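The ring strategy described above is commonly known as consistent hashing, and a minimal sketch follows. The source only says "a predetermined algorithm", so the use of MD5 here is an assumption, and the names `HashRing`, `ring_hash`, `add_cache`, `remove_cache`, and `lookup` are hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List

RING_SIZE = 2 ** 32  # value space 0 .. 2^32 - 1, joined into a ring


def ring_hash(key: str) -> int:
    """Map a string to a 32-bit ring position (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class HashRing:
    def __init__(self) -> None:
        self._positions: List[int] = []    # sorted cache positions
        self._owners: Dict[int, str] = {}  # ring position -> cache name

    def add_cache(self, name: str) -> None:
        pos = ring_hash(name)
        if pos in self._owners:
            raise ValueError("ring position already occupied")
        bisect.insort(self._positions, pos)
        self._owners[pos] = name

    def remove_cache(self, name: str) -> None:
        pos = ring_hash(name)
        self._positions.remove(pos)
        del self._owners[pos]

    def lookup(self, object_key: str) -> str:
        """Walk clockwise from the object's position to the first cache."""
        if not self._positions:
            raise LookupError("no cache servers on the ring")
        pos = ring_hash(object_key)
        idx = bisect.bisect_left(self._positions, pos)
        if idx == len(self._positions):  # wrap around past 2^32 - 1
            idx = 0
        return self._owners[self._positions[idx]]
```

With this layout, adding a cache only takes over the objects in the arc just before its position, and removing it hands those objects back to the next cache clockwise, so most keys keep their cache and the hit rate stays high.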

[0025] Because user-supplied data sets take many forms, the system creates a data set template; each time a kind of data set is input, a data set is instantiated from it and configured with different parameters. Because different algorithms require different data sets, they use data sets in different formats. Organizing the data set format includes: identifying redundant input fields or information and filtering them out; saving the information of each field of the input data set according to the user's configuration file; and setting a sparsity threshold for the data set, so that if the input data set falls below the threshold, the users below that threshold can be filtered out according to the user's input parameters.
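The sparsity filtering step might look like the sketch below. The source does not define how sparsity is measured, so this example assumes it is the number of ratings a user has; the name `filter_sparse_users` and the parameter `min_ratings` are hypothetical.

```python
from typing import Dict

Ratings = Dict[str, Dict[str, float]]  # user -> {item: rating}


def filter_sparse_users(ratings: Ratings, min_ratings: int) -> Ratings:
    """Keep only users with at least min_ratings rated items.

    Assumption: per-user rating count stands in for the sparsity measure.
    The input is not modified; a filtered copy is returned.
    """
    return {
        user: dict(items)
        for user, items in ratings.items()
        if len(items) >= min_ratings
    }
```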

[0026] After a data file is input into the distributed retrieval system through data aggregation and saved to the database, the data can go directly to the data organization submodule, which first filters out certain fields according to the user's needs. The processed data is then constructed into a rating matrix, which is saved to the database once construction is complete. If the data set before organization is not private to another user, the original data set can be found through the forward reference stored in the organized data set.

[0027] In the data management module, the data sampling submodule can sample either when the data set is processed or when the algorithm configuration is complete. The former is done inside the data management module: the user chooses data set sampling, selects a data set, and then selects the sampling method. If the operation succeeds, the sampled data set is stored and the original data set is left unchanged; the new sampled data set carries a tag field that points to the original data set, together with the sampling method and other information. In the latter, the algorithm requests data after it has been configured. On receiving the specific sampling requirements, such as the data set name, the sampling method, and other information, the data management module checks whether the message from the retrieval execution module allows the sampling operation to be completed. If so, it samples the data, backs up the sampled data set in the local database, and sends the sampled data set to the requesting execution end. A single algorithm run may involve several data transfers. Because algorithms take a relatively long time to run, they are executed with distributed processing, and for efficiency the data management module sends data to the different corresponding execution ends of the retrieval execution module. On every data transfer, the execution module checks whether the sampling method it requires already exists in the database; if so, the data is fetched, and if not, the request is resent.

[0028] When sampling data, the size of the data is first read into the data sampling submodule, and the system constructs a Boolean matrix with all initial values false. A sampling method is then selected. With single sampling, the corresponding training set and test set are each generated only once; with repeated sampling in a loop, several are generated. Depending on the sampling method, some values of this matrix are set to true and the others remain false. This Boolean matrix is called the mask table of the training set. The training set is computed from the mask table by taking its bitwise AND with the corresponding data set, and the test set is computed in the same way after bitwise negating the mask table of the training set. The resulting training set and test set tables are sent to the retrieval execution module, which uses the training set to predict the ratings of the data items whose value is True in the test set table.
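The mask-table procedure can be sketched as follows. Uniform random sampling with a fixed ratio is only one possible sampling method and is an assumption here; the function names and the `train_ratio` and `seed` parameters are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[Optional[float]]]
Mask = List[List[bool]]


def make_train_mask(rows: int, cols: int, train_ratio: float, seed: int) -> Mask:
    """Start from an all-False matrix and set sampled cells to True."""
    rng = random.Random(seed)
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if rng.random() < train_ratio:
                mask[r][c] = True
    return mask


def invert_mask(mask: Mask) -> Mask:
    """Element-wise negation: the test mask is the complement."""
    return [[not cell for cell in row] for row in mask]


def apply_mask(data: Matrix, mask: Mask) -> Matrix:
    """Element-wise AND: keep a cell where the mask is True, else None."""
    return [
        [value if keep else None for value, keep in zip(d_row, m_row)]
        for d_row, m_row in zip(data, mask)
    ]


def split(data: Matrix, train_ratio: float = 0.8, seed: int = 0) -> Tuple[Matrix, Matrix]:
    rows = len(data)
    cols = len(data[0]) if data else 0
    train_mask = make_train_mask(rows, cols, train_ratio, seed)
    return apply_mask(data, train_mask), apply_mask(data, invert_mask(train_mask))
```

Because the test mask is the exact complement of the training mask, every cell of the data set ends up in exactly one of the two tables. Repeated sampling amounts to calling `split` with different seeds.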

[0029] Retrieval results are evaluated on the test set, whose content is the set of items the user is interested in. Because the test set was saved locally at sampling time, when the algorithm finishes executing and returns its result, the system first extracts the corresponding sequence code from the communication message, retrieves the corresponding test set from the database using this sequence code, and compares it with the returned result to obtain the evaluation. The retrieval execution module maintains a table of algorithm configuration summaries keyed by algorithm type, and sends back the non-key information once the algorithm has finished executing. The evaluation output is produced in combination with the parameters returned after the algorithm has finished.

[0030] When the retrieval execution module returns data, it attaches the sequence code agreed upon by both sides and the algorithm's execution result, together with the parameters needed to configure the algorithm as recorded in the algorithm execution type table. Once these are returned locally, the results are evaluated and presented so that the user can give feedback and modify the parameters.

[0031] After the user provides relevance feedback, the retrieval results are reordered. Specifically, the reordering combines the retrieval result score with the difference in similarity distance between the relevant and irrelevant results in the user feedback.

[0032] Before the relevance between retrieval results is measured, they must first be represented quantitatively. Each retrieval result d_i is represented as a vector whose dimension is the size of the set of words that appear at least once in the text, and whose value in each dimension is the weight of the corresponding word in that result, expressed by the inverse term frequency index. The relevance score between a result and the query is then evaluated with the following formula:

Figure CN106055594AD00061

[0036] where W(t|d_i) is the weight of word t in d_i;

[0037] W(t|Q) is the weight of word t in the query Q;

[0038] l(d_i) is the length of result d_i;

[0039] tf(t|d_i) is the frequency with which word t occurs in result d_i;

[0040] tf(t|Q) is the frequency with which word t occurs in the query Q;

[0041] df(t|C) is the frequency of word t in the entire result set C;

[0042] k1, k2, b are preset tuning parameters.

[0043] Finally, the initial retrieval results are reordered according to their final scores, that is, sorted from highest to lowest Score.

[0044] In the following embodiment, the present invention uses an alternative result ranking method, which includes a domain representation of the retrieval results and ranking of the retrieval results based on similarity computation.

[0045] First, the user's search terms are submitted to the distributed retrieval system and its retrieval results are obtained. The title, description, and URL of each retrieval result are extracted and segmented into words, and useless words are removed according to a stop word list. The weight of each word in the title and in the description is computed with the inverse term frequency algorithm, and the weights are then merged. The sub-domain of each word is checked: if two words belong to the same sub-domain, their weights are added to give the weight of that sub-domain, which yields the sub-domain vector of the retrieval result. The main domain of each sub-domain is then checked, and sub-domains with the same main domain are merged in the same way, which yields the main-domain vector of the retrieval result. Performing these steps on the result set of the distributed retrieval system yields its domain vector table.
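The two merging steps (words into sub-domains, sub-domains into main domains) can be sketched as one aggregation applied twice. The word weights are taken as given, since the source does not spell out the weighting formula. The mapping tables, the function names, and the choice to skip words with no known sub-domain are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


def aggregate(weights: Dict[str, float], mapping: Dict[str, str]) -> Dict[str, float]:
    """Sum the weights of all keys that map to the same target.

    Assumption: keys missing from the mapping are skipped.
    """
    totals: Dict[str, float] = defaultdict(float)
    for key, weight in weights.items():
        target = mapping.get(key)
        if target is not None:
            totals[target] += weight
    return dict(totals)


def domain_vectors(
    word_weights: Dict[str, float],
    word_to_sub: Dict[str, str],   # hypothetical word -> sub-domain table
    sub_to_main: Dict[str, str],   # hypothetical sub-domain -> main-domain table
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (sub-domain vector, main-domain vector) for one result."""
    sub_vector = aggregate(word_weights, word_to_sub)
    main_vector = aggregate(sub_vector, sub_to_main)
    return sub_vector, main_vector
```

Running `domain_vectors` over every result in the result set gives the domain vector table described above.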

[0046] Let UF be the user's main interest vector and US the user's sub-domain interest vector. The similarity between the user's interests and each result is computed in turn. Let DF be the main-domain vector of a retrieval result in the retrieval set and DS the sub-domain vector of that retrieval result.

[0047] Compute the boundary difference between the sub-domain sets of the user interests and the retrieval result:

[0048] $B_L = DS - (US \cap DS)$

[0049] Compute the similarity between the sub-domain sets of the user interests and the retrieval result:

Figure CN106055594AD00071

[0051] Figure CN106055594AD00072 is the sum of the products of the weights of the sub-domains present in both the retrieval result and the user interests; num(B_L) and num(DS) are the numbers of elements in B_L and DS, respectively.

[0052] Compute the boundary difference between the main-domain sets of the user interests and the retrieval result:

[0053] $B_U = DF - (UF \cap DF)$

[0054] Compute the similarity between the main-domain sets of the user interests and the retrieval result:

Figure CN106055594AD00073

[0056] Figure CN106055594AD00074 is the sum of the products of the weights of the main domains present in both the retrieval result and the user interests; num(B_U) and num(DF) are the numbers of elements in B_U and DF, respectively;

[0057] Finally, compute the total similarity between the retrieval result and the user interests:

[0058] $Sim = G \times Sim_L(US, DS) + (1 - G) \times Sim_U(UF, DF)$

[0059] where G is the weighting value of the sub-domain set similarity.

[0060] Following these steps, the total similarity Sim is computed in turn for each result returned by the distributed retrieval system, giving each retrieval result a new weight. The results are then sorted from largest to smallest, yielding the new result order.
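The parts of this ranking that the text states explicitly can be sketched as below: the boundary difference, the sum of weight products over shared domains, the weighted combination into Sim, and the final sort. The formulas for the two set similarities are only available as figure images, so the sketch takes those values as inputs instead of guessing them. All function names are hypothetical, and restricting G to the range 0 to 1 is an assumption.

```python
from typing import Dict, List, Set, Tuple

Vector = Dict[str, float]  # domain name -> weight


def boundary_difference(result_vec: Vector, user_vec: Vector) -> Set[str]:
    """B = D - (U intersect D): result domains absent from the user's interests."""
    d, u = set(result_vec), set(user_vec)
    return d - (u & d)


def shared_weight_sum(result_vec: Vector, user_vec: Vector) -> float:
    """Sum of weight products over domains present in both vectors."""
    shared = result_vec.keys() & user_vec.keys()
    return sum(result_vec[k] * user_vec[k] for k in shared)


def total_similarity(sim_sub: float, sim_main: float, g: float) -> float:
    """Sim = G * Sim_L + (1 - G) * Sim_U, with G assumed to lie in [0, 1]."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must be between 0 and 1")
    return g * sim_sub + (1.0 - g) * sim_main


def rank_by_similarity(scored: List[Tuple[str, float]]) -> List[str]:
    """Order result identifiers by total similarity, largest first."""
    return [rid for rid, _ in sorted(scored, key=lambda p: p[1], reverse=True)]
```

A larger G gives the fine-grained sub-domain match more influence on the final order, while a smaller G favors the coarser main-domain match.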

[0061] For the vector representation of user interests above, the present invention obtains local browsing records for interest analysis. First, the titles and descriptions of the retrieval results the user has visited are obtained and segmented into words, and useless words are removed according to the stop word list. All words of all retrieval results in the browsing records are checked against a feature word list, and the number of feature words appearing in each sub-domain is counted, giving the vector {(hs_1, c_1), (hs_2, c_2), …, (hs_n, c_n)}, where hs_i is the i-th sub-domain and c_i is the number of feature words that appeared in the i-th sub-domain. The weight of each sub-domain is computed as

Figure CN106055594AD00075

which finally gives a sub-domain interest vector HS = {(hs_1, hsw_1), (hs_2, hsw_2), …, (hs_n, hsw_n)}. The sub-domain interest vector is merged with the interest domains selected by the user to generate the main-domain interest vector.

[0062] In summary, the present invention proposes an information providing method based on user interests in which a distributed retrieval system uniformly collects and manages data sets and further optimizes retrieval results based on user feedback and evaluation, meeting users' personalized needs more efficiently.

[0063] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network of multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is therefore not limited to any specific combination of hardware and software.

[0064] It should be understood that the specific embodiments described above are used only to illustrate or explain the principles of the present invention and do not limit it. Any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the appended claims, or within equivalents of such scope and boundaries.

Claims (3)

1. An information providing method based on user interests, characterized by comprising: performing data conversion and sampling on an input data set; obtaining initial retrieval results through data retrieval; and reordering the initial retrieval results based on a relevance measure between the results and the query.
2. The method according to claim 1, characterized in that performing data conversion and sampling on the input data set further comprises: after a data file is input into the distributed retrieval system through data aggregation and saved to the database, first filtering out certain fields according to the user's needs; then constructing the processed data into a rating matrix and saving it to the database once construction is complete; if the data set before organization is not private to another user, finding the original data set through the forward reference stored in the organized data set; when sampling data, first reading the size of the data and constructing a Boolean matrix with all initial values false, then selecting a sampling method; computing the corresponding training set by taking the bitwise AND of the Boolean matrix with the corresponding data set, and computing the test set by bitwise negating the training set; the resulting training set and test set tables can then be used to perform retrieval.
3. The method according to claim 2, characterized in that reordering the retrieval results based on the relevance measure between the results and the query further comprises: first representing the retrieval results quantitatively, that is, representing each retrieval result d_i as a vector whose dimension is the size of the set of words that appear at least once in the retrieval result text and whose value in each dimension is the weight of the corresponding word in that result, expressed by the inverse term frequency index; and evaluating the relevance score Score between a result and the query with the following formula:
Figure CN106055594AC00021
where l(d_i) is the length of result d_i, tf(t|d_i) is the frequency with which word t occurs in result d_i, tf(t|Q) is the frequency with which word t occurs in the query Q, df(t|C) is the frequency of word t in the entire result set C, and k1, k2, b are preset tuning parameters; and finally sorting the initial retrieval results from highest to lowest by their final score Score.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610346247.8A CN106055594A (en) 2016-05-23 2016-05-23 Information providing method based on user interests

Publications (1)

Publication Number Publication Date
CN106055594A (en) 2016-10-26

Family

ID=57174306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610346247.8A CN106055594A (en) 2016-05-23 2016-05-23 Information providing method based on user interests

Country Status (1)

Country Link
CN (1) CN106055594A (en)

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination