CN106055594A - Information providing method based on user interests - Google Patents
- Publication number
- CN106055594A (application CN201610346247.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- result
- retrieval
- word
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides an information providing method based on user interests. The method comprises performing data conversion and sampling on an input data set, obtaining initial retrieval results through data retrieval, and reordering the initial retrieval results based on a relevance measure between the results and the query. In this method, a distributed retrieval system collects and manages data sets in a unified way, the retrieval results are further optimized based on user feedback and evaluation, and users' personalized requirements are met more efficiently.
Description
Technical field
The present invention relates to data pushing, and in particular to an information providing method based on user interests.
Background
In today's information age, with the development of Internet technology and of information technology across society, the volume of information is growing explosively, and the Internet is continually influencing and changing the way people live. As the information on the network becomes ever more voluminous and complex, how people can efficiently find the information that meets their needs in this vast ocean of information has become a problem of growing concern. Although distributed retrieval systems can help people find the information they need more accurately, in some applications, such as film, music, and social network search, users often cannot formulate a good search request. By studying the user's history, the user's social information, and the attribute information of the corresponding domain data, and by modeling the user's information or the domain data resources, data sources of potential interest to the user can be recommended to the user in a reliable manner. However, existing distributed retrieval systems vary in working efficiency and user satisfaction, and they lack a general interface for handling heterogeneous input data.
Summary of the invention
To solve the above problems in the prior art, the present invention proposes an information providing method based on user interests, comprising:
performing data conversion and sampling on an input data set;
obtaining initial retrieval results through data retrieval; and
reordering the initial retrieval results based on a relevance measure between the results and the query.
Preferably, performing data conversion and sampling on the input data set further comprises:
inputting data files into the distributed retrieval system through data summarization and saving them to a database; first cleaning certain fields according to the user's requirements; next constructing the processed data into a rating matrix, which is saved to the database after construction; if the data set before this data compilation is not private to another user, saving a forward reference with the processed data set so that the original data set can be found; when performing data sampling, first reading the size of the data and constructing a Boolean matrix whose initial values are all false, then selecting a sampling mode; when computing the corresponding training set, performing a bitwise AND of the Boolean matrix with the corresponding data set; when computing the test set, bitwise negating the training-set matrix; the generated training set and test set table can then be used to perform retrieval.
Preferably, reordering the retrieval results based on the relevance measure between the results and the query further comprises:
first quantizing the retrieval results, that is, representing each retrieval result $d_i$ as a vector whose dimension is the size of the set of words that occur at least once in the text of the retrieval results, the value in each dimension being the weight of the corresponding word in that result, expressed by the inverse word frequency index; and evaluating the relevance score Score between a result and the query by a formula in which:
$w(t \mid d_i)$ denotes the weight of word t in retrieval result $d_i$;
$w(t \mid Q)$ denotes the weight of word t in the query Q;
$l(d_i)$ is the length of result $d_i$; $tf(t \mid d_i)$ is the frequency with which word t occurs in result $d_i$; $tf(t \mid Q)$ is the frequency with which word t occurs in the query Q; $df(t \mid C)$ is the frequency of word t in the whole result set C; and $k_1$, $k_2$, and $b$ are preset tuning parameters;
finally, according to the final score Score of each result, sorting the initial retrieval results from high to low by score.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes an information providing method based on user interests in which a distributed retrieval system collects and manages data sets in a unified way and further optimizes the retrieval results based on user feedback and evaluation, meeting users' personalized requirements more efficiently.
Brief description of the drawings
Fig. 1 is a flow chart of the information providing method based on user interests according to an embodiment of the present invention.
Detailed description of the invention
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these details.
One aspect of the present invention provides an information providing method based on user interests. Fig. 1 is a flow chart of the information providing method based on user interests according to an embodiment of the present invention.
In the present invention, the data sets input for retrieval are managed and stored in a unified way in a distributed retrieval system and undergo data conversion, and the results are evaluated according to the feedback obtained. The evaluation unit of the distributed retrieval system comprises a data management module, a retrieval execution module, and a presentation module.
The data management module receives data input, unifies formats, and performs feature analysis and sampling of data sets. After a data file is input into the system, the data summarization submodule of the data management module converts it into a data resource that the system can recognize, and the data compilation submodule then processes it into data that the system can compute on. Data compilation unifies the format of input data from text files, database files, and log files, converting it into a two-dimensional matrix or a multidimensional list so that subsequent data operations can proceed. When the retrieval execution module requests data, it includes the format of the requested data in the corresponding request parameters, and the data transmission module then processes the sampled data according to this parameter.
Data sets are stored on different servers according to the storage status of each server. When the retrieval execution module requests data from the data management module, the data management module first performs a cache lookup using a client-side hashing strategy. On a cache hit, the data set is taken directly from the cache; on a miss, the relevant data is requested from the database.
When the data management module accesses a cache server, the key of the requested data set is first mapped by a predetermined algorithm to one of the cache servers, and the corresponding data value is then fetched from that server. To keep the hit rate as high as possible, the following strategy is adopted. A ring-shaped hash queue is used: each object is mapped to a 32-bit key, that is, to a value in the numerical space from 0 to $2^{32}-1$, and the two ends of this space are joined to form a ring. Caches and objects are mapped into the same numerical space by the same hashing algorithm. Starting from the key value of an object, the ring is searched clockwise until a cache is encountered, and the object is stored in that cache. When a cache is removed, only the objects between that cache and the next cache counterclockwise are affected. When a cache is added, the objects lying between the position to which the new cache maps and the next cache counterclockwise are deleted from the next cache clockwise and remapped to the new cache.
Because the data sets that users input take many forms, the system creates a data set template; each time a kind of data set is input, a data set is instantiated from the template and configured with different parameters. Because different algorithms require different data sets, different algorithms use data sets in different formats. Collating the data set format includes: identifying redundant input fields or information and filtering them out; saving each field of the input data set according to the user's configuration file; and setting a sparsity threshold for the data set, so that if the input data set falls below the threshold, the users below this threshold can be filtered out according to the user's input parameters.
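The format collation steps above can be sketched as follows. This is an illustration under stated assumptions: records are plain dictionaries, the field names `user`, `item`, and `rating` are hypothetical, and the sparsity threshold is interpreted as a minimum number of records per user.

```python
from collections import Counter


def clean_records(records, keep_fields, user_field="user", min_ratings=2):
    """Drop redundant fields, then filter out users below the sparsity threshold.

    keep_fields plays the role of the user's configuration file and must
    include user_field; min_ratings is an assumed form of the threshold.
    """
    # Keep only the configured fields; everything else counts as redundant.
    trimmed = [{f: r[f] for f in keep_fields if f in r} for r in records]
    # Count how many records each user contributes.
    counts = Counter(r[user_field] for r in trimmed)
    return [r for r in trimmed if counts[r[user_field]] >= min_ratings]


raw = [
    {"user": "u1", "item": "i1", "rating": 4, "debug": "x"},
    {"user": "u1", "item": "i2", "rating": 5, "debug": "y"},
    {"user": "u2", "item": "i1", "rating": 3, "debug": "z"},
]
cleaned = clean_records(raw, ["user", "item", "rating"], min_ratings=2)
```

Here `cleaned` keeps the two records of `u1` without the `debug` field, and the single record of `u2` is filtered out.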
After a data file has been input into the distributed retrieval system through data summarization and saved to the database, the data can go directly to the data compilation submodule. The data compilation submodule first cleans certain fields according to the user's requirements. It then constructs the processed data into a rating matrix, which is saved to the database after construction. If the data set before this data compilation is not private to another user, a forward reference is saved with the processed data set so that the original data set can be found.
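As a rough illustration of the rating matrix construction, the sketch below turns cleaned records into a dense user-by-item matrix and carries a reference back to the original data set. The record layout, the use of 0.0 for missing ratings, and the `source_ref` field are assumptions made for the example.

```python
def build_rating_matrix(records, source_ref=None):
    """Build a dense user x item rating matrix from (user, item, rating) records.

    source_ref is a hypothetical forward reference to the original data set;
    0.0 marks a missing rating (an assumption of this sketch).
    """
    users = sorted({r["user"] for r in records})
    items = sorted({r["item"] for r in records})
    row = {u: i for i, u in enumerate(users)}
    col = {it: j for j, it in enumerate(items)}
    matrix = [[0.0] * len(items) for _ in users]
    for r in records:
        matrix[row[r["user"]]][col[r["item"]]] = float(r["rating"])
    return {"users": users, "items": items, "matrix": matrix,
            "source_ref": source_ref}


ratings = build_rating_matrix(
    [
        {"user": "u1", "item": "i1", "rating": 4},
        {"user": "u1", "item": "i2", "rating": 5},
        {"user": "u2", "item": "i1", "rating": 3},
    ],
    source_ref="dataset-raw-001",
)
```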
In the data management module, the data sampling submodule can sample either when the data set is processed or when algorithm configuration is complete. The former is done inside the data management module: when the user chooses to sample a data set, the user selects the data set and then the corresponding sampling mode. If the operation completes successfully, the sampled data set is stored and the original data set is left unchanged; the new sampled data set carries a tag field that identifies the original data set, together with the corresponding sampling mode and other information. In the latter, data is requested after the algorithm has been configured. On receiving the specific sampling requirements, such as the data set name, the sampling mode, and other information, the data side checks whether the message sent by the retrieval execution module allows the data sampling operation to be completed. If so, it performs the sampling, backs up the sampled data set in the local database, and then sends the sampled data set to the requesting executor. A single algorithm run may involve several data transfers. Because algorithms run for a relatively long time, they are executed in a distributed manner, and for efficient execution the data management module sends the data to the corresponding executors in the retrieval execution module. For every data transfer that requests sampled data, a check is made as to whether data for the required sampling mode already exists in the database; if so, the data is taken out, and if not, the request is resent.
When data sampling is performed, the data sampling submodule first reads the size of the data, and the system constructs a Boolean matrix whose initial values are all false. A sampling mode is then selected. For a single sampling, the corresponding training set and test set are each generated only once; for repeated sampling in a loop, several are generated. Depending on the sampling mode, some values of this matrix are set to true and the rest remain false. This Boolean matrix is called the mask table of the training set. Using the mask table, the corresponding training set is computed by a bitwise AND of the mask with the corresponding data set; the test set is computed in the same way after bitwise negation of the training-set mask table. The resulting training set and test set table can be sent to the retrieval execution module, which uses the training set to predict the ratings of the data items whose value in the test set table is True.
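The mask table mechanism can be sketched in a few lines. The example below assumes a uniform random sampling mode with a fixed seed and treats the "bitwise AND" on numeric ratings as keeping a cell where the mask is true and blanking it to 0.0 otherwise; the function names are illustrative.

```python
import random


def make_train_mask(n_rows, n_cols, train_fraction=0.8, seed=0):
    """Start from an all-False Boolean matrix and mark sampled cells True."""
    mask = [[False] * n_cols for _ in range(n_rows)]
    rng = random.Random(seed)  # assumed sampling mode: uniform random cells
    for row in mask:
        for j in range(n_cols):
            if rng.random() < train_fraction:
                row[j] = True
    return mask


def invert_mask(mask):
    """Negate the training mask element-wise to obtain the test mask."""
    return [[not v for v in row] for row in mask]


def apply_mask(matrix, mask, missing=0.0):
    """Keep a cell where the mask is True, otherwise blank it out."""
    return [[v if keep else missing for v, keep in zip(m_row, k_row)]
            for m_row, k_row in zip(matrix, mask)]


data = [[4.0, 5.0, 1.0], [3.0, 2.0, 5.0]]
train_mask = make_train_mask(2, 3, train_fraction=0.5, seed=42)
test_mask = invert_mask(train_mask)
train_set = apply_mask(data, train_mask)
test_set = apply_mask(data, test_mask)
```

Because the test mask is the exact negation of the training mask, every cell of `data` lands in exactly one of the two sets, so only the mask needs to be stored to reproduce the split.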
Retrieval results are evaluated against the test set, whose content is the set of items the user is interested in. Because the test set was saved locally at data sampling time, when the algorithm finishes and returns its result, the system first extracts the corresponding sequence code from the communication message, uses it to retrieve the corresponding test set from the database, and compares that test set with the returned result to obtain the evaluation result. The retrieval execution module keeps a table of algorithm configuration summaries whose primary key is the algorithm type; after the algorithm finishes, it sends the non-primary-key information from this table. The evaluation of the result is output in combination with the parameters sent after the algorithm finishes.
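The text does not name the metric used when the returned result is compared with the stored test set. As one possible reading, the sketch below looks up the test set by sequence code and reports precision and recall; the metric choice, the function name, and the in-memory `test_sets` dictionary standing in for the database are all assumptions.

```python
def evaluate_result(sequence_code, returned_items, test_sets):
    """Compare returned items with the test set stored under sequence_code.

    Precision and recall are an assumed choice of evaluation metric.
    """
    relevant = set(test_sets[sequence_code])  # items the user is interested in
    returned = set(returned_items)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}


test_sets = {"run-001": ["a", "c", "e"]}
report = evaluate_result("run-001", ["a", "b", "c", "d"], test_sets)
```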
When the retrieval execution module returns data, it attaches the sequence code agreed by both sides, the algorithm's execution result, and the configuration parameters recorded for the algorithm in the algorithm type table. Once passed back, the result is evaluated and presented locally so that the user can give feedback and revise the parameters.
After the user provides relevance feedback, the retrieval results are reordered. Specifically, the reordering combines the retrieval result scores with the difference in similarity between the results marked relevant and those marked irrelevant in the user feedback.
Before the relevance of the retrieval results is measured, the results must first be quantized: each retrieval result $d_i$ is represented as a vector whose dimension is the size of the set of words that occur at least once in the text, and the value in each dimension is the weight of the corresponding word in that result, expressed by the inverse word frequency index. The relevance score between a result and the query is then evaluated by a formula in which:
$w(t \mid d_i)$ is the weight of word t in $d_i$;
$w(t \mid Q)$ is the weight of word t in the query Q;
$l(d_i)$ is the length of result $d_i$;
$tf(t \mid d_i)$ is the frequency with which word t occurs in result $d_i$;
$tf(t \mid Q)$ is the frequency with which word t occurs in the query Q;
$df(t \mid C)$ is the frequency of word t in the whole result set C;
$k_1$, $k_2$, and $b$ are preset tuning parameters.
Finally, the initial retrieval results are reordered according to their final scores, that is, sorted from high to low by the Score of each result.
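The scoring formula itself is not reproduced in this text, so the sketch below substitutes a standard Okapi BM25-style score that uses the same quantities (term frequencies in the result and the query, document frequency in the result set, result length, and the parameters k1, k2, b). The exact form of the weights, the smoothed inverse document frequency, the use of the average result length, and the default parameter values are assumptions and should not be read as the patent's formula.

```python
import math
from collections import Counter


def bm25_scores(results, query, k1=1.2, k2=100.0, b=0.75):
    """Score tokenized results against a tokenized query (BM25-style stand-in).

    Score(d, Q) = sum over query words t of w(t|d) * w(t|Q), with
    w(t|d) = idf(t) * tf(t|d) * (k1 + 1) / (tf(t|d) + k1 * (1 - b + b * l(d) / avg_l))
    w(t|Q) = tf(t|Q) * (k2 + 1) / (tf(t|Q) + k2)
    """
    docs = [Counter(tokens) for tokens in results]
    n = len(docs)
    avg_len = sum(len(tokens) for tokens in results) / n if n else 0.0
    q_tf = Counter(query)
    # df(t|C): number of results in the result set that contain word t.
    df = {t: sum(1 for d in docs if t in d) for t in q_tf}
    scores = []
    for tokens, tf in zip(results, docs):
        norm = k1 * (1 - b + b * len(tokens) / avg_len) if avg_len else k1
        score = 0.0
        for t, qf in q_tf.items():
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            w_doc = idf * tf[t] * (k1 + 1) / (tf[t] + norm)
            w_query = qf * (k2 + 1) / (qf + k2)
            score += w_doc * w_query
        scores.append(score)
    return scores


def rerank(results, query, **params):
    """Return result indices sorted from highest to lowest score."""
    scores = bm25_scores(results, query, **params)
    return sorted(range(len(results)), key=lambda i: scores[i], reverse=True)


results = [
    ["user", "interest", "retrieval"],
    ["weather", "today"],
    ["user", "user", "interest", "model", "retrieval", "ranking"],
]
order = rerank(results, ["user", "interest"])
```

In this example the second result shares no word with the query, so it scores zero and is placed last.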
The following embodiment of the present invention uses an optional result ordering method, comprising a domain representation of the retrieval results and a ranking of the retrieval results based on similarity calculation.
First, the user's search terms are submitted to the distributed retrieval system and its retrieval results are obtained. The title, description, and URL of each retrieval result are extracted and segmented into words, and useless words are deleted according to a stop-word list. The weight of each word in the result's title and description is computed with the inverse word frequency algorithm, and the weights are then merged. The subdivided domain to which each word belongs is checked; if two words belong to the same subdivided domain, their weights are added to give the weight of that subdivided domain, finally yielding the subdivided domain vector of the retrieval result. The primary domain to which each subdivided domain belongs is then checked, and identical ones are merged in the same way, finally yielding the primary domain vector of the retrieval result. These steps are performed on the whole result set of the distributed retrieval system to obtain its domain vector table.
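The two merging steps can be illustrated with one helper applied twice. The sketch assumes the word weights have already been computed, and the lookup tables `word_to_sub` and `sub_to_primary` are hypothetical examples of the domain assignments the text refers to.

```python
from collections import defaultdict


def aggregate(weights, mapping):
    """Sum the weights of all keys that map to the same parent domain."""
    merged = defaultdict(float)
    for key, weight in weights.items():
        parent = mapping.get(key)
        if parent is not None:  # words without a known domain are skipped
            merged[parent] += weight
    return dict(merged)


# Hypothetical word weights for one retrieval result and hypothetical mappings.
word_weights = {"guitar": 0.5, "piano": 0.25, "goal": 0.25}
word_to_sub = {"guitar": "instruments", "piano": "instruments", "goal": "football"}
sub_to_primary = {"instruments": "music", "football": "sports"}

subdivided_vector = aggregate(word_weights, word_to_sub)
primary_vector = aggregate(subdivided_vector, sub_to_primary)
```

Running the same aggregation over every result in the result set yields one row of the domain vector table per result.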
Let UF be the user's primary interest vector and US the user's subdivided interest vector; the similarity between the user's interests and each result is computed in turn. Let DF be the primary domain vector of a retrieval result in the result set and DS the subdivided domain vector of that result.
The boundary difference between the subdivided domain sets of the user's interests and the retrieval result is:
$B_L = DS - (US \cap DS)$
The similarity $Sim_L(US, DS)$ between the subdivided domain sets of the user's interests and the retrieval result is then calculated from the sum of the products of the weights of the subdivided domains that are present in both the retrieval result and the user's interests, and from $num(B_L)$ and $num(DS)$, the numbers of elements in $B_L$ and DS respectively.
The boundary difference between the primary domain sets of the user's interests and the retrieval result is:
$B_U = DF - (UF \cap DF)$
The similarity $Sim_U(UF, DF)$ between the primary domain sets of the user's interests and the retrieval result is calculated from the sum of the products of the weights of the primary domains that are present in both the retrieval result and the user's interests, and from $num(B_U)$ and $num(DF)$, the numbers of elements in $B_U$ and DF respectively.
Finally, the total similarity between the retrieval result and the user's interests is:
$Sim = \zeta \times Sim_L(US, DS) + (1 - \zeta) \times Sim_U(UF, DF)$
where $\zeta$ is the weight given to the similarity of the subdivided domain sets.
Following this step, the total similarity Sim is calculated in turn for each result returned by the distributed retrieval system, giving each retrieval result a new weight; the results are then sorted from largest to smallest to obtain the new result order.
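The boundary difference and the weighted combination of the two similarities are given in the text, but the per-level similarity formulas are not reproduced. The sketch below therefore uses an assumed form for them: the sum of weight products over shared domains, scaled by the share of the result's domains that fall outside the boundary difference. Vectors are represented as dictionaries from domain name to weight, and all names are illustrative.

```python
def boundary_difference(result_vec, user_vec):
    """B = D - (U ∩ D): domains of the result that the user has no interest in."""
    result_domains = set(result_vec)
    return result_domains - (set(user_vec) & result_domains)


def set_similarity(user_vec, result_vec):
    """ASSUMED form: overlap weight-product sum scaled by 1 - num(B) / num(D)."""
    if not result_vec:
        return 0.0
    shared = set(user_vec) & set(result_vec)
    overlap = sum(user_vec[d] * result_vec[d] for d in shared)
    diff = boundary_difference(result_vec, user_vec)
    return overlap * (1 - len(diff) / len(result_vec))


def total_similarity(us, ds, uf, df, zeta=0.5):
    """Sim = zeta * Sim_L(US, DS) + (1 - zeta) * Sim_U(UF, DF)."""
    return zeta * set_similarity(us, ds) + (1 - zeta) * set_similarity(uf, df)


def rerank_by_interest(results, us, uf, zeta=0.5):
    """results: list of (result_id, DS, DF); returns ids sorted by Sim, descending."""
    scored = [(total_similarity(us, ds, uf, df, zeta), rid) for rid, ds, df in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rid for _, rid in scored]


us = {"instruments": 0.5, "football": 0.5}  # user's subdivided interests
uf = {"music": 0.5, "sports": 0.5}          # user's primary interests
ds = {"instruments": 0.5, "cooking": 0.5}   # one result's subdivided domains
df = {"music": 0.5, "food": 0.5}            # the same result's primary domains
sim = total_similarity(us, ds, uf, df, zeta=0.5)
```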
For the vector representation of user interests described above, the present invention analyzes interests from locally obtained browsing records. First, the titles and descriptions of the retrieval results that the user has visited are obtained and segmented into words, and useless words are deleted according to the stop-word list. All words of all retrieval results in the browsing records are then checked against a feature vocabulary, and the number of feature words occurring in each subdivided domain is counted, giving the vector $\{(hs_1, c_1), (hs_2, c_2), \ldots, (hs_n, c_n)\}$, where $hs_i$ is the i-th subdivided domain and $c_i$ is the number of feature words that occurred in it. The weight $hsw_i$ of each subdivided domain is then calculated, finally giving the subdivided interest vector $HS = \{(hs_1, hsw_1), (hs_2, hsw_2), \ldots, (hs_n, hsw_n)\}$. The subdivided interest vector is merged with the interest domains selected by the user to generate the primary domain interest vector.
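A small sketch of the interest analysis follows. The formula for the weight of each subdivided domain is not reproduced in the text, so the example assumes the simplest normalization, each count divided by the total count; the feature vocabulary and the browsing records are made up for illustration.

```python
from collections import Counter


def subdivided_interest_vector(browsed_tokens, feature_words):
    """Count feature words per subdivided domain and turn counts into weights.

    browsed_tokens: one token list per visited result (stop words may remain;
    only words found in feature_words are counted).
    feature_words: hypothetical feature vocabulary, word -> subdivided domain.
    Weight form hsw_i = c_i / sum(c_j) is an ASSUMPTION of this sketch.
    """
    counts = Counter()
    for tokens in browsed_tokens:
        for word in tokens:
            domain = feature_words.get(word)
            if domain is not None:
                counts[domain] += 1
    total = sum(counts.values())
    return {domain: c / total for domain, c in counts.items()} if total else {}


feature_words = {"guitar": "instruments", "concert": "live music", "goal": "football"}
browsed = [["guitar", "concert", "the"], ["goal", "guitar"]]
interest = subdivided_interest_vector(browsed, feature_words)
```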
In summary, the present invention proposes an information providing method based on user interests in which a distributed retrieval system collects and manages data sets in a unified way and further optimizes the retrieval results based on user feedback and evaluation, meeting users' personalized requirements more efficiently.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the invention and do not limit the invention. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the invention shall fall within its scope of protection. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.
Claims (3)
1. An information providing method based on user interests, characterized by comprising:
performing data conversion and sampling on an input data set;
obtaining initial retrieval results through data retrieval; and
reordering the initial retrieval results based on a relevance measure between the results and the query.
2. The method according to claim 1, characterized in that performing data conversion and sampling on the input data set further comprises:
inputting data files into the distributed retrieval system through data summarization and saving them to a database; first cleaning certain fields according to the user's requirements; next constructing the processed data into a rating matrix, which is saved to the database after construction; if the data set before this data compilation is not private to another user, saving a forward reference with the processed data set so that the original data set can be found; when performing data sampling, first reading the size of the data and constructing a Boolean matrix whose initial values are all false, then selecting a sampling mode; when computing the corresponding training set, performing a bitwise AND of the Boolean matrix with the corresponding data set; when computing the test set, bitwise negating the training-set matrix; the generated training set and test set table can then be used to perform retrieval.
3. The method according to claim 2, characterized in that reordering the retrieval results based on the relevance measure between the results and the query further comprises:
first quantizing the retrieval results, that is, representing each retrieval result $d_i$ as a vector whose dimension is the size of the set of words that occur at least once in the text of the retrieval results, the value in each dimension being the weight of the corresponding word in that result, expressed by the inverse word frequency index; and evaluating the relevance score Score between a result and the query by a formula in which:
$w(t \mid d_i)$ denotes the weight of word t in retrieval result $d_i$;
$w(t \mid Q)$ denotes the weight of word t in the query Q;
$l(d_i)$ is the length of result $d_i$; $tf(t \mid d_i)$ is the frequency with which word t occurs in result $d_i$; $tf(t \mid Q)$ is the frequency with which word t occurs in the query Q; $df(t \mid C)$ is the frequency of word t in the whole result set C; and $k_1$, $k_2$, and $b$ are preset tuning parameters;
finally, according to the final score Score of each result, sorting the initial retrieval results from high to low by score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610346247.8A CN106055594A (en) | 2016-05-23 | 2016-05-23 | Information providing method based on user interests |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106055594A true CN106055594A (en) | 2016-10-26 |
Family
ID=57174306
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610346247.8A Pending CN106055594A (en) | 2016-05-23 | 2016-05-23 | Information providing method based on user interests |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106055594A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807149A (en) * | 2019-10-11 | 2020-02-18 | 卓尔智联(武汉)研究院有限公司 | Retrieval method, retrieval device and storage medium |
CN114238772A (en) * | 2021-12-24 | 2022-03-25 | 韩效遥 | Intelligent network map recommendation system with content self-adaptive perception |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101520785A (en) * | 2008-02-29 | 2009-09-02 | 富士通株式会社 | Information retrieval method and system therefor |
US7765178B1 (en) * | 2004-10-06 | 2010-07-27 | Shopzilla, Inc. | Search ranking estimation |
CN102819575A (en) * | 2012-07-20 | 2012-12-12 | 南京大学 | Personalized search method for Web service recommendation |
-
2016
- 2016-05-23 CN CN201610346247.8A patent/CN106055594A/en active Pending
Non-Patent Citations (3)
Title |
---|
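施振兴 (Shi Zhenxing): "Research and Implementation of an Evaluation Framework for a Comprehensive Recommender System Simulation Platform", China Master's Theses Full-text Database, Information Science and Technology * |
沈林 (Shen Lin): "Research on a Personalized Search Engine Based on Fuzzy Rough Sets", China Master's Theses Full-text Database, Information Science and Technology * |
陈建荣 (Chen Jianrong): "Research on Intelligent Query Expansion Technology Based on User Feedback", China Master's Theses Full-text Database, Information Science and Technology * |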
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807149A (en) * | 2019-10-11 | 2020-02-18 | 卓尔智联(武汉)研究院有限公司 | Retrieval method, retrieval device and storage medium |
CN110807149B (en) * | 2019-10-11 | 2023-07-14 | 卓尔智联(武汉)研究院有限公司 | Retrieval method, device and storage medium |
CN114238772A (en) * | 2021-12-24 | 2022-03-25 | 韩效遥 | Intelligent network map recommendation system with content self-adaptive perception |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10515424B2 (en) | Machine learned query generation on inverted indices | |
Zhang et al. | Unibench: A benchmark for multi-model database management systems | |
Cillo et al. | Niche tourism destinations’ online reputation management and competitiveness in big data era: Evidence from three Italian cases | |
CN104412265B (en) | Update for promoting the search of application searches to index | |
CN102859516B (en) | Generating improved document classification data using historical search results | |
CN105320719B (en) | A kind of crowd based on item label and graphics relationship raises website item recommended method | |
CN110532479A (en) | A kind of information recommendation method, device and equipment | |
CN107193967A (en) | A kind of multi-source heterogeneous industry field big data handles full link solution | |
CN103177066B (en) | Analysis and expression interpersonal relationships | |
Rakesh et al. | Probabilistic social sequential model for tour recommendation | |
Jiang et al. | Towards intelligent geospatial data discovery: a machine learning framework for search ranking | |
Ruiz et al. | Facilitating document annotation using content and querying value | |
CN110795613B (en) | Commodity searching method, device and system and electronic equipment | |
US8700624B1 (en) | Collaborative search apps platform for web search | |
Ye et al. | Crowdsourcing-enhanced missing values imputation based on Bayesian network | |
CN106055594A (en) | Information providing method based on user interests | |
Suresh Kumar et al. | Multi-ontology based points of interests (MO-POIS) and parallel fuzzy clustering (PFC) algorithm for travel sequence recommendation with mobile communication on big social media | |
Huang et al. | KG2Rec: LSH-CF recommendation method based on knowledge graph for cloud services | |
Saravanan et al. | Realizing social-media-based analytics for smart agriculture | |
Atta | The effect of usability and information quality on decision support information system (DSS) | |
Grover et al. | Latency reduction via decision tree based query construction | |
Alzua-Sorzabal et al. | Using MWD: A business intelligence system for tourism destination web | |
Cheng et al. | Extensions of GAP‐tree and its implementation based on a non‐topological data model | |
Hussan et al. | An optimized user behavior prediction model using genetic algorithm on mobile web structure | |
CN106021509A (en) | Object pushing method in big data environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161026 |