CN106055594A - Information providing method based on user interests

Info

Publication number: CN106055594A
Application number: CN201610346247.8A
Authority: CN
Inventors: 董政; 吴文杰; 陈露; 李学生
Original assignee: Chengdu Mo Yun Science And Technology Ltd
Current assignee: Chengdu Mo Yun Science And Technology Ltd
Priority date: 2016-05-23
Filing date: 2016-05-23
Publication date: 2016-10-26

Abstract

The present invention provides an information providing method based on user interests. The method comprises performing data conversion and sampling on an input data set, obtaining initial retrieval results through data retrieval, and reordering the initial retrieval results on the basis of a relevance measure between the results and the query. With this method, data sets are uniformly collected and managed by a distributed retrieval system, the retrieval results are further optimized on the basis of user feedback and evaluation, and users' customized requirements are satisfied more efficiently.

Description

Information providing method based on user interest

Technical field

The present invention relates to data pushing, and in particular to an information providing method based on user interest.

Background art

In today's information age, with the development of Internet technology and the informatization of society, the volume of information is growing explosively, and the Internet continually affects and changes everyday life. As online information becomes ever more voluminous and complex, how to efficiently find information that meets one's needs in this vast ocean of information has become an increasingly important problem. Distributed retrieval systems can help people find the information they need more accurately, but in some applications, such as film, music, and social network search, users often cannot formulate a good search request. In these cases, the user's history, the user's social information, and the attribute information of the corresponding domain data are studied; the user's information or the domain data resources are modeled; and data of potential interest is recommended to the user from reliable sources. However, existing distributed retrieval systems vary in efficiency and in user satisfaction, and they lack a general interface for handling heterogeneous input data.

Summary of the invention

To solve the above problems in the prior art, the present invention proposes an information providing method based on user interest, comprising:

performing data conversion and sampling on an input data set;

obtaining initial retrieval results through data retrieval; and

reordering the initial retrieval results based on a relevance measure between the results and the query.

Preferably, performing data conversion and sampling on the input data set further comprises:

inputting a data file into the distributed retrieval system through data summarization and saving it to a database; first cleaning certain fields according to the user's requirements; next constructing the processed data into a rating matrix and saving it to the database after construction; if the data set before collation is not privately owned by another user, saving with the processed data set a forward reference through which the original data set can be found; and, when performing data sampling, first reading the size of the data and constructing a Boolean matrix whose initial values are all false, then selecting a sampling mode, computing the corresponding training set by a bitwise AND of the Boolean matrix with the corresponding data set, and computing the test set by bitwise negating the training set, whereupon the generated training set and test set table can be used to perform retrieval.

Preferably, reordering the retrieval results based on the relevance measure between the results and the query further comprises:

first quantifying the retrieval results, that is, representing each retrieval result d_i as a vector whose dimension is the size of the set of words that occur at least once in the retrieval result text, the value of each dimension being the weight of the corresponding word in that result, expressed by the inverse word frequency index; and evaluating the relevance score between a result and the query with the following formula:

score(d_i, Q) = \sum_{t \in d_i \cap Q} W(t \mid d_i) \cdot W(t \mid Q)

where

W(t \mid d_i) = \frac{(k_1 + 1) \cdot tf(t \mid d_i)}{k_1 \left[ (1 - b) + b \, \frac{l(d_i)}{tf(t \mid d_i)} \right]} \log(df(t \mid C))

is the weight of word t in retrieval result d_i;

W(t \mid Q) = \frac{(k_2 + 1) \, tf(t \mid Q)}{k_2 + tf(t \mid Q)}

is the weight of word t in the query Q;

l(d_i) is the length of result d_i, tf(t | d_i) is the frequency with which word t occurs in result d_i, tf(t | Q) is the frequency with which word t occurs in the query Q, df(t | C) is the frequency of word t in the whole result set C, and k₁, k₂, and b are preset tuning parameters;

and finally sorting the initial retrieval results from high to low according to their final scores.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes an information providing method based on user interest in which a distributed retrieval system collects and manages data sets uniformly and the retrieval results are further optimized based on user feedback and evaluation, so that users' personalized needs are met more efficiently.

Brief description of the drawings

Fig. 1 is a flow chart of the information providing method based on user interest according to an embodiment of the present invention.

Detailed description

A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawings that illustrate the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.

One aspect of the present invention provides an information providing method based on user interest. Fig. 1 is a flow chart of the information providing method based on user interest according to an embodiment of the present invention.

The present invention manages and stores retrieval input data sets uniformly in a distributed retrieval system, performs data conversion on them, and evaluates results according to the feedback obtained. The evaluation unit of the distributed retrieval system comprises a data management module, a retrieval execution module, and a presentation module.

The data management module receives data input, unifies formats, and performs feature analysis and sampling of data sets. After a data file is input to the system, the data summarization submodule of the data management module converts it into a data resource the system can recognize, and the data collation submodule then processes it into data the system can compute on. Data collation unifies the format of input data from text files, database files, and log files, converting it into a two-dimensional matrix or a multidimensional list so that subsequent data operations can proceed. When the retrieval execution module requests data, it includes the format of the requested data in the request parameters, and the data transmission module then processes the sampled data according to this parameter.

Data sets are stored on different servers according to each server's storage condition. When the retrieval execution module requests data from the data management module, the data management module first performs a cache lookup using a client-side hashing strategy. On a cache hit, the data set is taken directly from the cache; on a miss, the relevant data is requested from the database.

When the data management module accesses a cache server, the key of the requested data set is first mapped to one of the cache servers by a predetermined algorithm, and the corresponding data value is then fetched from that server. To keep the hit rate as high as possible, the following strategy is adopted. A ring-shaped hash queue is used: each object to be looked up is mapped to a 32-bit key, that is, into the value space from 0 to 2^32-1, which is joined end to end to form a ring. Caches and objects are mapped into the same value space by the same hashing algorithm. Starting from the object's key value, the ring is traversed clockwise until a cache is encountered, and the object is stored in that cache. When a cache is removed, the objects between it and the next cache counterclockwise are traversed and remapped to the next cache clockwise. When a cache is added, the objects between its mapped position and the next cache counterclockwise are found, removed from the next cache clockwise, and mapped to the new cache.
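The ring lookup described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `HashRing`, the helper `hash32`, and the choice of MD5 truncated to 32 bits as the "predetermined algorithm" are all assumptions made for the example.

```python
import bisect
import hashlib


def hash32(key: str) -> int:
    """Map a string into the ring space 0 .. 2**32 - 1 (truncated MD5 is an assumed choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Caches sit on a ring; an object belongs to the first cache clockwise from its key."""

    def __init__(self, caches=()):
        self._points = []  # sorted ring positions of the caches
        self._owner = {}   # ring position -> cache name
        for name in caches:
            self.add(name)

    def add(self, name: str) -> None:
        point = hash32(name)
        bisect.insort(self._points, point)
        self._owner[point] = name

    def remove(self, name: str) -> None:
        point = hash32(name)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("no caches on the ring")
        # First cache at or after the key's position, wrapping from 2**32 - 1 back to 0.
        i = bisect.bisect_left(self._points, hash32(key)) % len(self._points)
        return self._owner[self._points[i]]
```

With this layout, removing a cache only moves the objects that cache held, and adding a cache only takes objects from its clockwise neighbor, which is the property the paragraph above relies on to keep the hit rate high.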

Because the data sets that users input take many forms, the system creates a data set template; each time a type of data set is input, a data set is instantiated and configured with different parameters. Different algorithms require different data sets and therefore use data sets in different formats. Data set format collation includes: identifying redundant input fields or information and filtering them out; saving each field of the input data set according to the user's configuration file; and setting a sparsity threshold for the data set so that, if the input data set falls below the threshold, users below the threshold can be filtered out according to the user's input parameters.

After a data file has been input into the distributed retrieval system through data summarization and saved to the database, the data can enter the data collation submodule directly. The data collation submodule first cleans certain fields according to the user's requirements. Next, the processed data is constructed into a rating matrix, which is saved to the database after construction. If the data set before collation is not privately owned by another user, a forward reference is saved with the processed data set so that the original data set can be found.
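The cleaning and rating-matrix construction steps can be sketched as below. The record layout (dictionaries with `user`, `item`, and `rating` keys), the function name `build_rating_matrix`, and the use of 0 to mark a missing rating are assumptions made for this example; the source does not specify them.

```python
import numpy as np


def build_rating_matrix(records, drop_fields=()):
    """Clean unwanted fields, then arrange (user, item, rating) records as a 2-D matrix."""
    # Field cleaning: drop the fields the user does not need.
    cleaned = [{k: v for k, v in r.items() if k not in drop_fields} for r in records]
    users = sorted({r["user"] for r in cleaned})
    items = sorted({r["item"] for r in cleaned})
    row = {u: i for i, u in enumerate(users)}
    col = {it: j for j, it in enumerate(items)}
    matrix = np.zeros((len(users), len(items)))  # 0 marks "no rating" (an assumption)
    for r in cleaned:
        matrix[row[r["user"]], col[r["item"]]] = r["rating"]
    return matrix, users, items
```

Returning the row and column labels alongside the matrix keeps the mapping back to users and items available when the matrix is later saved or sampled.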

In the data management module, the data sampling submodule can sample either when the data set is processed or when algorithm configuration is completed. The former is carried out inside the data management module: when the user chooses to sample a data set, the user selects the data set and then the corresponding sampling mode. If the operation succeeds, the sampled data set is stored and the original data set is left unchanged; the new sampled data set carries a tag field identifying the original data set, together with the sampling mode and other information. In the latter, data is requested after algorithm configuration. On receiving the specific sampling requirements, such as the data set name, the sampling mode, and other information, the data side checks whether the sampling operation in the message sent by the retrieval execution module can be completed. If so, it performs the sampling, backs up the sampled data set in the local database, and sends the sampled data set to the requesting execution end. A single algorithm run may involve many data transfers. Because algorithms run for a relatively long time, they are executed in a distributed manner; for efficient execution, the data management module sends data to the corresponding execution ends of the retrieval execution module. For every data transfer requested, a check is made whether data for the required sampling mode already exists in the database; if so, the data is retrieved, and if not, the request is resent.

When data sampling is performed, the size of the data is first read into the data sampling submodule, and the system constructs a Boolean matrix whose initial values are all false. A sampling mode is then selected. With single sampling, the corresponding training set and test set are each generated only once; with repeated sampling, multiple sets are generated. Depending on the sampling mode, some values of the matrix are set to true while the others remain false. This Boolean matrix is called the mask table of the training set. The training set is computed by a bitwise AND of the mask table with the corresponding data set; likewise, the test set is computed by bitwise negating the mask table of the training set. The generated training set and test set table can then be sent to the retrieval execution module, which predicts, from the training set, the ratings of the data items whose value in the test set table is true.
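The mask-table mechanism can be illustrated with NumPy. This is a sketch under assumptions: uniform random sampling stands in for whichever sampling mode is selected, the names `sample_mask` and `split` and the `train_fraction` parameter are hypothetical, and 0 is assumed to mean "no rating" so that masking a numeric matrix plays the role of the bitwise AND.

```python
import numpy as np


def sample_mask(shape, train_fraction=0.8, seed=0):
    """Build the training-set mask table: all False at first, sampled cells set to True."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(shape, dtype=bool)  # initial values are all False
    n_train = int(round(train_fraction * mask.size))
    chosen = rng.choice(mask.size, size=n_train, replace=False)
    mask.flat[chosen] = True
    return mask


def split(ratings, mask):
    """Training set = data AND mask; test set = data AND (NOT mask)."""
    train = np.where(mask, ratings, 0)
    test = np.where(~mask, ratings, 0)
    return train, test
```

For repeated sampling, calling `sample_mask` with a different `seed` on each round yields a fresh mask table while the original rating matrix stays untouched.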

The retrieval results are evaluated against the test set, whose content is the set of items the user is interested in. Because the test set was saved locally during data sampling, when an algorithm finishes and returns its result, the system first extracts the corresponding sequence code from the communication message, retrieves the matching test set from the database according to that code, and compares it with the returned result to obtain the evaluation. The retrieval execution module keeps a table of algorithm configuration summaries keyed by algorithm type and sends the non-key information of the relevant entry when the algorithm finishes. The evaluation of the result is output in combination with the parameters transmitted after the algorithm finishes.

When the retrieval execution module returns data, it attaches the sequence code agreed by both sides, the returned algorithm execution result, and the algorithm configuration parameters carried in the execution type table. The returned result is evaluated and presented locally so that the user can give feedback and revise the parameters.

After the user provides relevance feedback, the retrieval results are reordered. Specifically, the results are reordered by combining the retrieval result scores with the similarity distance difference between the relevant and irrelevant results in the user feedback.

Before the relevance between retrieval results is measured, the results must first be quantified: each retrieval result d_i is represented as a vector whose dimension is the size of the set of words that occur at least once in the text, and the value of each dimension is the weight of the corresponding word in that result, expressed by the inverse word frequency index. The relevance score between a result and the query is then evaluated with the following formulas:

score(d_i, Q) = \sum_{t \in d_i \cap Q} W(t \mid d_i) \cdot W(t \mid Q)

W(t \mid d_i) = \frac{(k_1 + 1) \cdot tf(t \mid d_i)}{k_1 \left[ (1 - b) + b \, \frac{l(d_i)}{tf(t \mid d_i)} \right]} \log(df(t \mid C))

W(t \mid Q) = \frac{(k_2 + 1) \, tf(t \mid Q)}{k_2 + tf(t \mid Q)}

where W(t | d_i) is the weight of word t in d_i;

W(t | Q) is the weight of word t in the query Q;

l(d_i) is the length of result d_i;

tf(t | d_i) is the frequency with which word t occurs in result d_i;

tf(t | Q) is the frequency with which word t occurs in the query Q;

df(t | C) is the frequency of word t in the whole result set C;

k₁, k₂, and b are preset tuning parameters.

Finally, the initial retrieval results are reordered according to their final scores, that is, sorted from high to low by score.
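The scoring and reordering steps can be sketched as follows. The formulas are implemented exactly as written above, which differs from the textbook BM25 weighting they resemble. Several points are assumptions made for the example: results and queries are lists of already-segmented words, df(t | C) is taken to be the number of results in C that contain t, and the default values of `k1`, `b`, and `k2` are placeholders, since the source only calls them preset tuning parameters.

```python
import math
from collections import Counter


def doc_weight(t, doc, df, k1=1.2, b=0.75):
    """W(t | d_i) as written in the text; only called for words that occur in doc."""
    tf = Counter(doc)[t]
    norm = k1 * ((1 - b) + b * len(doc) / tf)
    return (k1 + 1) * tf / norm * math.log(df[t])


def query_weight(t, query, k2=100.0):
    """W(t | Q) as written in the text."""
    tf = Counter(query)[t]
    return (k2 + 1) * tf / (k2 + tf)


def score(doc, query, df, k1=1.2, b=0.75, k2=100.0):
    """Sum of W(t | d_i) * W(t | Q) over the words shared by the result and the query."""
    shared = set(doc) & set(query)
    return sum(doc_weight(t, doc, df, k1, b) * query_weight(t, query, k2) for t in shared)


def rerank(docs, query, **params):
    """Sort results from highest to lowest score."""
    # Assumed reading of df(t | C): number of results that contain t.
    df = Counter(t for d in docs for t in set(d))
    return sorted(docs, key=lambda d: score(d, query, df, **params), reverse=True)
```

Because the sum runs only over words present in both the result and the query, tf(t | d_i) is never zero in the denominator of `doc_weight`.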

The following embodiment of the present invention uses an alternative result ordering method, comprising a field representation of the retrieval results and ordering of the retrieval results based on similarity calculation.

First, the user's search terms are submitted to the distributed retrieval system and its retrieval results are obtained. The title, description, and URL of each retrieval result are extracted and segmented into words, and useless words are deleted according to a stop-word list. The weight of each word in the title and in the description is computed with the inverse word frequency algorithm, and the two are merged. The subfield to which each word belongs is then checked; if two words belong to the same subfield, their weights are added to give the weight of that subfield, finally yielding the subfield vector of the retrieval result. The primary field to which each subfield belongs is checked, and subfields with the same primary field are merged in the same way, finally yielding the primary field vector of the retrieval result. Performing these steps on the whole result set of the distributed retrieval system yields its field vector table.
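The two merging steps, words into subfields and subfields into primary fields, share the same shape and can be sketched with one helper. The function name `aggregate`, the dictionary-based taxonomy, and the decision to drop words that belong to no subfield are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate(weights, parent):
    """Add up the weights of entries that share the same parent category.

    weights: mapping of entry -> weight
    parent:  mapping of entry -> category (hypothetical taxonomy)
    Entries without a known parent are dropped, which is an assumption.
    """
    merged = defaultdict(float)
    for entry, weight in weights.items():
        if entry in parent:
            merged[parent[entry]] += weight
    return dict(merged)


# Hypothetical word weights and taxonomy for one retrieval result.
word_weights = {"guitar": 0.5, "piano": 0.25, "goal": 0.125}
word_to_subfield = {"guitar": "instruments", "piano": "instruments", "goal": "football"}
subfield_to_primary = {"instruments": "music", "football": "sports"}

subfield_vector = aggregate(word_weights, word_to_subfield)
primary_vector = aggregate(subfield_vector, subfield_to_primary)
```

Here `subfield_vector` holds one summed weight per subfield and `primary_vector` one per primary field; repeating this for every result gives the field vector table.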

Let UF be the user's primary interest vector and US the user's subdivided interest vector; the similarity between the user's interest and each result is calculated in turn. Let DF be the primary field vector of a retrieval result in the result set, and DS the subfield vector of that retrieval result.

Calculate the boundary difference between the subfield sets of the user interest and the retrieval result:

B_L = DS - (US ∩ DS)

Calculate the similarity between the subfield sets of the user interest and the retrieval result:

Sim_L(US, DS) = \left( 1 - \frac{num(B_L)}{num(DS)} \right) \times \sum_{i \in US \cap DS} (dsw_i \times usw_i)

where \sum_{i \in US \cap DS} (dsw_i \times usw_i) is the sum of the weight products of the subfields present in both this retrieval result and the user interest, and num(B_L) and num(DS) are the numbers of elements of B_L and DS, respectively.

Calculate the boundary difference between the primary field sets of the user interest and the retrieval result:

B_U = DF - (UF ∩ DF)

Calculate the similarity between the primary field sets of the user interest and the retrieval result:

Sim_U(UF, DF) = \left( 1 - \frac{num(B_U)}{num(DF)} \right) \times \sum_{i \in UF \cap DF} (dfw_i \times ufw_i)

where \sum_{i \in UF \cap DF} (dfw_i \times ufw_i) is the sum of the weight products of the primary fields present in both this retrieval result and the user interest, and num(B_U) and num(DF) are the numbers of elements of B_U and DF, respectively.

Finally, calculate the total similarity between this retrieval result and the user interest:

Sim = ζ × Sim_L(US, DS) + (1 - ζ) × Sim_U(UF, DF)

where ζ is the weight of the subfield set similarity.

Following this step, the total similarity Sim is calculated in turn for each result returned by the distributed retrieval system, giving each retrieval result a new weight; the results are then sorted from largest to smallest to obtain the new result order.
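The similarity calculation and the final ordering can be sketched as below. Vectors are represented as dictionaries from field name to weight, which is an assumption, as are the function names, the default `zeta=0.5`, and the choice to return 0 for a result with an empty field vector.

```python
def field_similarity(user_vec, result_vec):
    """(1 - num(B) / num(D)) * sum of weight products over the shared fields."""
    if not result_vec:
        return 0.0  # assumed convention for a result with no fields
    shared = set(user_vec) & set(result_vec)
    boundary = set(result_vec) - shared  # B = D - (U ∩ D)
    overlap = sum(result_vec[f] * user_vec[f] for f in shared)
    return (1 - len(boundary) / len(result_vec)) * overlap


def total_similarity(us, ds, uf, df, zeta=0.5):
    """Sim = zeta * Sim_L(US, DS) + (1 - zeta) * Sim_U(UF, DF)."""
    return zeta * field_similarity(us, ds) + (1 - zeta) * field_similarity(uf, df)


def rerank(results, us, uf, zeta=0.5):
    """results: list of (result_id, DS, DF); returns ids ordered by Sim, largest first."""
    scored = [(total_similarity(us, ds, uf, df, zeta), rid) for rid, ds, df in results]
    return [rid for _, rid in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

For example, with US = {rock: 0.5, jazz: 0.5}, DS = {rock: 0.5, football: 0.5}, UF = {music: 1.0}, and DF = {music: 0.5, sports: 0.5}, the subfield similarity is 0.125, the primary field similarity is 0.25, and the total with ζ = 0.5 is 0.1875.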

For the vector representation of the user interest described above, the present invention analyzes interest from locally obtained browsing records. First, the titles and descriptions of the retrieval results the user has visited are obtained and segmented into words, and useless words are deleted according to a stop-word list. The words of all retrieval results in the browsing record are compared against a feature vocabulary, and the number of feature words occurring in each subfield is counted, giving the vector {(hs₁, c₁), (hs₂, c₂), ..., (hs_n, c_n)}, where hs_i is the i-th subfield and c_i is the number of feature words that occurred in the i-th subfield. The weight hsw_i of each subfield is then calculated [formula not reproduced in this text], finally giving the subdivided interest vector HS = {(hs₁, hsw₁), (hs₂, hsw₂), ..., (hs_n, hsw_n)}. After the subdivided interest vector is merged with the interest fields selected by the user, the primary field interest vector is generated.
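The counting step can be sketched as follows. Two things here are assumptions and not taken from the source: whitespace splitting stands in for a real word segmenter, and the weight of each subfield is taken as its share of all feature-word occurrences, purely as a stand-in because the weight formula is not available in this text. The names `subfield_counts` and `interest_vector` are hypothetical.

```python
from collections import Counter


def subfield_counts(visited_texts, feature_words, stop_words=frozenset()):
    """Count, per subfield, the feature words found in visited titles and descriptions.

    feature_words maps a feature word to its subfield (hypothetical vocabulary).
    """
    counts = Counter()
    for text in visited_texts:
        for word in text.lower().split():  # whitespace split stands in for segmentation
            if word in stop_words:
                continue
            field = feature_words.get(word)
            if field is not None:
                counts[field] += 1
    return counts


def interest_vector(counts):
    """Stand-in weighting (an assumption): each subfield's share of all counted words."""
    total = sum(counts.values())
    return {field: c / total for field, c in counts.items()} if total else {}
```

Any other weighting can be substituted for `interest_vector` without changing the counting step, which is the part the paragraph above specifies.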

In summary, the present invention proposes an information providing method based on user interest in which a distributed retrieval system collects and manages data sets uniformly and the retrieval results are further optimized based on user feedback and evaluation, so that users' personalized needs are met more efficiently.

Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is therefore not restricted to any specific combination of hardware and software.

It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain its principles and do not limit the invention. Any modification, equivalent substitution, or improvement made without departing from the spirit and scope of the present invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. An information providing method based on user interest, characterized by comprising:

performing data conversion and sampling on an input data set;

obtaining initial retrieval results through data retrieval; and

reordering the initial retrieval results based on a relevance measure between the results and the query.

2. The method according to claim 1, characterized in that performing data conversion and sampling on the input data set further comprises:

inputting a data file into the distributed retrieval system through data summarization and saving it to a database; first cleaning certain fields according to the user's requirements; next constructing the processed data into a rating matrix and saving it to the database after construction; if the data set before collation is not privately owned by another user, saving with the processed data set a forward reference through which the original data set can be found; and, when performing data sampling, first reading the size of the data and constructing a Boolean matrix whose initial values are all false, then selecting a sampling mode, computing the corresponding training set by a bitwise AND of the Boolean matrix with the corresponding data set, and computing the test set by bitwise negating the training set, whereupon the generated training set and test set table can be used to perform retrieval.

3. The method according to claim 2, characterized in that reordering the retrieval results based on the relevance measure between the results and the query further comprises:

first quantifying the retrieval results, that is, representing each retrieval result d_i as a vector whose dimension is the size of the set of words that occur at least once in the retrieval result text, the value of each dimension being the weight of the corresponding word in that result, expressed by the inverse word frequency index; and evaluating the relevance score between a result and the query with the following formula:

score(d_i, Q) = \sum_{t \in d_i \cap Q} W(t \mid d_i) \cdot W(t \mid Q)

where

W(t \mid d_i) = \frac{(k_1 + 1) \cdot tf(t \mid d_i)}{k_1 \left[ (1 - b) + b \, \frac{l(d_i)}{tf(t \mid d_i)} \right]} \log(df(t \mid C))

is the weight of word t in retrieval result d_i;

W(t \mid Q) = \frac{(k_2 + 1) \, tf(t \mid Q)}{k_2 + tf(t \mid Q)}

is the weight of word t in the query Q;

l(d_i) is the length of result d_i, tf(t | d_i) is the frequency with which word t occurs in result d_i, tf(t | Q) is the frequency with which word t occurs in the query Q, df(t | C) is the frequency of word t in the whole result set C, and k₁, k₂, and b are preset tuning parameters;