CN105159932B

CN105159932B - Data retrieval engine and ranking system and method

Info

Publication number: CN105159932B
Application number: CN201510478159.9A
Authority: CN
Inventors: 李文超; 金泰木; 王腾飞; 张士存; 段浩伟; 曹志伟; 柳少华; 孙华; 董丽; 王振中; 林霖
Original assignee: CRRC Qingdao Sifang Co Ltd
Current assignee: CRRC Qingdao Sifang Co Ltd
Priority date: 2015-08-07
Filing date: 2015-08-07
Publication date: 2019-06-21
Anticipated expiration: 2035-08-07
Also published as: CN105159932A

Abstract

The present invention relates to a data retrieval engine and ranking system and method. The system includes a user management module for managing user information; a database for storing documents by category and responding to user requests; and a relevance calculation module for computing and ranking search results. The relevance calculation module includes a user behavior statistics submodule, which tracks user preference keywords and users' commenting and usage behavior on documents, and a ranking relevance calculation submodule, which computes the ranking relevance and orders the search results by it. For the data set matched by the search term, the invention further computes the query relevance RC, the demand relevance DC and the data quality relevance QC, so that the retrieved data set reflects user preferences and users' commenting and usage behavior on documents. This improves the relevance of search results to user needs and saves users' query time.

Description

Data retrieval engine and ranking system and method

Technical field

The present invention relates to a search engine, and in particular to a data retrieval engine and ranking system and method.

Background art

Most existing search engines retrieve data from keyword input and rank the search results by keyword match degree. Different users who enter the same keyword usually get the same results, because the ranking takes neither users' individual needs nor data quality into account. Finding the information one needs among a large volume of returned results can cost the user considerable time and effort.

Summary of the invention

The main object of the present invention is to solve the above problems and shortcomings by providing a data retrieval engine and ranking system that uses user behavior information to improve data relevance and retrieval accuracy.

Another main object of the invention is to provide a data retrieval and ranking method.

To achieve the above objects, the technical solution of the invention is as follows:

A data retrieval engine and ranking system, comprising:

a user management module for managing user information;

a database for storing documents by category and responding to user requests;

a relevance calculation module for computing and ranking search results;

wherein the relevance calculation module includes a user behavior statistics submodule for tracking user preference keywords and users' commenting and usage behavior on documents;

and a ranking relevance calculation submodule for computing the ranking relevance and ordering the search results by the ranking relevance.

Further, the ranking relevance is determined by the query relevance RC, the demand relevance DC and the data quality relevance QC.

Further, the query relevance RC is calculated using the TF-IDF method,

where i is the search term;

TFi(d) is the frequency with which search term i occurs in document d;

N is the total number of documents;

DF is the number of documents containing search term i.

Further, the demand relevance DC is obtained by adding the preference keyword similarity KeySim and the behavior preference category similarity ClassSim;

the preference keyword similarity is the cosine similarity between the document subject vector DSV, formed from each document's index, and the user preference vector UPV, formed from the user preference keyword table,

DSV(a₁,w₁；a₂,w₂；...；a_m,w_m)

UPV(b₁,w₁；b₂,w₂；...；b_n,w_n)

the behavior preference category similarity is determined by the user's usage behavior on documents,

where df(t_a), df(t_b), df(t_c) and df(t_d) are the numbers of times the user has browsed, downloaded, favorited and recommended documents in the category to which the document belongs.

Further, the data quality relevance QC is obtained by adding the user-to-document usage factor, the user-to-document comment count factor and the user-to-document rating factor,

where FD_i and FD are, respectively, the number of times the document has been downloaded and the download count of the most-downloaded document among all documents; FL_i and FL are, respectively, the number of times the document has been browsed and the browse count of the most-browsed document among all documents; FF_i and FF are, respectively, the number of times the document has been favorited and the favorite count of the most-favorited document among all documents; FR_i and FR are, respectively, the number of times the document has been recommended and the recommend count of the most-recommended document among all documents; CM_i and CM are, respectively, the number of times the document has been commented on and the comment count of the most-commented document among all documents; and Score is the document rating factor.

Further, the document rating factor Score is obtained by weighting the users' ratings e_j of the document by the user rating ability US,

where j is the rating number;

the user rating ability US is obtained by summing the weighted static factor USS, dynamic factor UDS and professional domain factor MMS,

Further, the static factor USS is calculated from the conversion values SS of the user's age, professional title and educational background,

Further, the dynamic factor UDS is determined by the activity level AD, which is based on the user's monthly number of system logins,

where fr_i is the user's number of logins per month.

Further, if the user's professional domain is the same as the domain of the document being rated, the professional domain factor MMS applies, i.e. γ is 0.1;

if the user's professional domain differs from the domain of the document being rated, the professional domain factor MMS does not apply, i.e. γ is 0.

Another technical solution of the invention is:

A data retrieval and ranking method, comprising the following steps:

Step 1: the user enters a search term i;

Step 2: the search term i is matched against the documents in the database to obtain a matched data set;

Step 3: the ranking relevance RankC of each document in the matched data set is calculated from the query relevance RC, the demand relevance DC and the data quality relevance QC,

RankC = log(RC + QC + DC) (9);

Step 4: the search results are ranked by the ranking relevance RankC.

In summary, the data retrieval engine and ranking system and method of the present invention further compute the query relevance RC, the demand relevance DC and the data quality relevance QC for the data set matched by the search term, so that the retrieved data set reflects user preferences and users' commenting and usage behavior on documents. This improves the relevance of search results to user needs and saves users' query time.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is a block diagram of the system of the present invention;

Fig. 3 is a schematic diagram of the user rating ability US of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments:

As shown in Fig. 2, a data retrieval engine and ranking system includes a user management module, a database and a relevance calculation module.

The user management module manages user information. It mainly records data such as the user's age, professional title, educational background, technical field and user preference keyword table.

The database mainly stores document data, manages documents by category, and responds to users' usage requests for all kinds of documents.

The relevance calculation module mainly computes and ranks search results. It includes a user behavior statistics submodule and a ranking relevance calculation submodule.

The user behavior statistics submodule mainly records users' usage behavior on documents (browsing, downloading, favoriting, recommending), users' comments on and ratings of documents, user login counts, and so on.

The ranking relevance calculation submodule mainly computes the ranking relevance and orders the search results by the ranking relevance.

The ranking relevance RankC is determined by the query relevance RC, the demand relevance DC and the data quality relevance QC, with the following formula:

RankC = log(RC + QC + DC) (9)

After the user enters a search term i, the system matches it against the documents in the database to obtain a matched data set. The query relevance RC is calculated using the TF-IDF method, with the following formula:

TFi(d) is the frequency with which search term i occurs in document d, N is the total number of documents, and DF is the number of documents containing search term i. The function rests on the following assumption: the words most useful for distinguishing documents are those that occur frequently in a given document but rarely in the other documents of the collection.
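The formula itself is not reproduced in this text, so the following is only a minimal sketch that assumes the classic form TF × log(N / DF). The function name `query_relevance`, the natural logarithm, and the zero fallback for unseen terms are illustrative assumptions, not details taken from the patent.

```python
import math

def query_relevance(tf: float, n_docs: int, df: int) -> float:
    """RC for one search term in one document.

    Assumes the textbook TF * log(N / DF) form; the patent's exact
    variant (log base, smoothing) is not given in this text.
    """
    if df <= 0 or n_docs <= 0:
        return 0.0  # term occurs in no document: no relevance signal
    return tf * math.log(n_docs / df)

# A term found in every document carries no weight,
# while a term found in few documents is weighted strongly.
common = query_relevance(tf=3, n_docs=1000, df=1000)
rare = query_relevance(tf=3, n_docs=1000, df=10)
```

This matches the stated assumption: a term that is frequent in one document but rare across the collection scores highest.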

The demand relevance DC is obtained by adding the preference keyword similarity KeySim and the behavior preference category similarity ClassSim.

The preference keyword similarity KeySim is the cosine similarity between the document subject vector DSV, formed from each document's index, and the user preference vector UPV, formed from the user preference keyword table. The user preference keyword table consists of the user-defined interest tag words saved in the user management module of the data retrieval and ranking system. The formula is as follows:

DSV(a₁,w₁；a₂,w₂；...；a_m,w_m)

UPV(b₁,w₁；b₂,w₂；...；b_n,w_n)
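As a hedged illustration of the cosine similarity between DSV and UPV, the sketch below represents each vector as a mapping from keyword to weight. That representation, the function name `key_sim`, and the zero result for empty vectors are assumptions made for the example; the patent's own KeySim formula is not reproduced in this text.

```python
import math

def key_sim(dsv: dict, upv: dict) -> float:
    """Cosine similarity between two keyword -> weight mappings."""
    # Only keywords present in both vectors contribute to the dot product.
    dot = sum(w * upv[k] for k, w in dsv.items() if k in upv)
    norm_d = math.sqrt(sum(w * w for w in dsv.values()))
    norm_u = math.sqrt(sum(w * w for w in upv.values()))
    if norm_d == 0 or norm_u == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_d * norm_u)

# Hypothetical document index and user interest tags.
doc_vector = {"bogie": 0.8, "braking": 0.6}
user_vector = {"bogie": 1.0, "welding": 0.5}
similarity = key_sim(doc_vector, user_vector)
```

Documents that share no keyword with the user's tags score 0, and the score grows as the shared keywords carry more of the weight in both vectors.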

The behavior preference category similarity ClassSim is determined by the user's usage behavior on documents, which includes browsing, downloading, favoriting and recommending. The usage behavior records come from the user behavior statistics submodule in the relevance calculation module of the system. The formula is as follows:

The data quality relevance QC is obtained by adding the user-to-document usage factor, the user-to-document comment count factor and the user-to-document rating factor, with the following formula:

From the above definitions it follows that [formula not reproduced in source]. Similarly, [formula not reproduced in source].

The user-to-document usage factor combines download and browse behavior, assigning a weight α to the arithmetic mean of the two, and combines favorite and recommend behavior, assigning a weight 1 - α to the arithmetic mean of the two. In general, 0.1 ≤ α ≤ 0.4. That is, a user's download and browse behavior influences data quality less than favorite and recommend behavior does. The usage behavior records come from the user behavior statistics submodule in the relevance calculation module of the data retrieval and ranking system.
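The weighting described above can be sketched as follows. The sketch assumes that each behavior is first normalized by the corresponding maximum over all documents (FD_i / FD and so on), which is one reading of the symbol definitions; the function name `usage_factor`, the default α of 0.3, and the zero fallback for an empty maximum are illustrative assumptions.

```python
def usage_factor(fd_i, fd, fl_i, fl, ff_i, ff, fr_i, fr, alpha=0.3):
    """User-to-document usage factor (assumed form, not the patent's formula).

    alpha weights the mean of the download and browse ratios;
    1 - alpha weights the mean of the favorite and recommend ratios.
    """
    def ratio(count, maximum):
        # Normalize by the most-used document; guard against an empty corpus.
        return count / maximum if maximum > 0 else 0.0

    passive = (ratio(fd_i, fd) + ratio(fl_i, fl)) / 2   # download, browse
    active = (ratio(ff_i, ff) + ratio(fr_i, fr)) / 2    # favorite, recommend
    return alpha * passive + (1 - alpha) * active

# A document that is often favorited outranks one that is only often downloaded.
favorited = usage_factor(0, 10, 0, 10, 10, 10, 10, 10)
downloaded = usage_factor(10, 10, 10, 10, 0, 10, 0, 10)
```

With α in the stated range of 0.1 to 0.4, favorite and recommend behavior always carries more than half of the factor.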

From the above definitions it follows that [formula not reproduced in source].

The user-to-document comment count factor assigns a separate weight β to comment behavior. In general, 1 ≤ β ≤ 2. The comment count records come from the user behavior statistics submodule in the relevance calculation module of the data retrieval and ranking system.

The document rating factor Score is obtained by weighting the users' ratings e_j of the document by the user rating ability US, with the following formula:

where j is the rating number and 1 ≤ e_j ≤ 5. The user rating ability US serves as the weight: the higher the weight, the more credible the user's evaluation of the data.
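Because the Score formula is not reproduced in this text, the sketch below shows only one plausible reading: a weighted mean of the ratings e_j, with each rater's US as the weight. The normalization by the sum of weights, the function name `document_score`, and the zero result for an unrated document are all assumptions; the patent's formula may differ, for example by not normalizing.

```python
def document_score(ratings):
    """Weighted mean of ratings; `ratings` is a list of (us, e_j) pairs.

    us  : rating ability of the user who gave the rating (the weight)
    e_j : that user's rating of the document, expected in [1, 5]
    """
    total_weight = sum(us for us, _ in ratings)
    if total_weight == 0:
        return 0.0  # no ratings, or no rater with any weight
    return sum(us * e for us, e in ratings) / total_weight

# A rating of 5 from a highly credible user outweighs
# a rating of 1 from a user with little credibility.
score = document_score([(0.9, 5), (0.1, 1)])
```

Under this reading, two users with equal US contribute equally, and a more credible user pulls the score toward their own rating.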

The user rating ability US affects the credibility of the user's ratings of data. US is calculated by analyzing the core factors associated with rating ability, such as age, professional title and professional domain. It is obtained by summing the weighted static factor USS, dynamic factor UDS and professional domain factor MMS, with the following formula:

η, a second coefficient whose symbol is not legible in the source, and γ are coefficients used to adjust the calculated user rating ability. In this embodiment, 0.6 ≤ η ≤ 0.8 and γ is 0 or 0.1.

The static factor USS is calculated from the conversion values SS of the user's age, professional title and educational background, with the following formula:

where each conversion value SS lies in the interval [0, 1].

The following table illustrates the conversion values SS for the static factor USS:

The dynamic factor UDS is the dynamic score of the user's evaluation ability. A dynamic factor is one that can change continually over time; it is determined by the activity level AD, based on the user's monthly number of system logins, with the following formula:

The activity level is initially defined as 1, and fr_i is the user's number of logins per month. User activity is positively correlated with rating ability, because active users improve their rating ability in the course of using the data.

The professional domain factor MMS is the user's professional domain match score. If the user's professional domain is the same as the domain of the document being rated, the professional domain factor MMS applies, i.e. γ is 0.1; otherwise it does not apply, i.e. γ is 0.
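The domain rule and the weighted sum for US can be sketched as below. The γ values of 0.1 and 0 come from the text; everything else is an assumption for illustration: the function names, the plain string comparison of domains, and the name `mu` for the second coefficient, whose symbol and range are missing from this text.

```python
def domain_gamma(user_domain: str, doc_domain: str) -> float:
    """Return gamma: 0.1 when the rater works in the document's domain, else 0."""
    return 0.1 if user_domain == doc_domain else 0.0

def rating_ability(uss, uds, mms, eta, mu, gamma):
    """US as a weighted sum of the three factors.

    eta   : weight of the static factor USS (0.6 to 0.8 in the embodiment)
    mu    : weight of the dynamic factor UDS (hypothetical name)
    gamma : weight of the professional domain factor MMS (0 or 0.1)
    """
    return eta * uss + mu * uds + gamma * mms

# Hypothetical rater: same domain as the document being rated.
gamma = domain_gamma("vehicle engineering", "vehicle engineering")
us = rating_ability(uss=0.8, uds=1.0, mms=1.0, eta=0.7, mu=0.2, gamma=gamma)
```

A rater from another domain gets γ = 0, so the domain term drops out and US depends only on the static and dynamic factors.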

A data retrieval and ranking method, comprising the following steps:

Step 1: the user enters a search term i.

Step 2: the search term i is matched against the documents in the database to obtain a matched data set.

Step 3: based on the degree of match between the document indexes in the matched data set and the search term, the query relevance RC is calculated using formula (1);

the user preference keywords are matched against the document indexes in the matched data set, and based on the degree of match between the document indexes and the user preference keywords, the demand relevance DC is calculated using formulas (2) and (3);

the data quality of the documents in the matched data set is evaluated automatically from user behavior information, and the data quality relevance QC is calculated using formulas (4), (5), (6), (7) and (8);

from the query relevance RC, the demand relevance DC and the data quality relevance QC, the ranking relevance RankC of each document in the matched data set is calculated using formula (9).

Step 4: the search results are sorted by ranking relevance RankC in descending order, so that documents with higher RankC values rank first, and the results are returned to the user.
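Steps 3 and 4 can be sketched as follows, taking RC, QC and DC as already computed. Formula (9) does not state the logarithm base, so the natural logarithm is assumed here; the function names, the tuple layout, and the treatment of a non-positive sum are likewise illustrative assumptions.

```python
import math

def rank_c(rc: float, qc: float, dc: float) -> float:
    """Formula (9): RankC = log(RC + QC + DC), natural log assumed."""
    total = rc + qc + dc
    if total <= 0:
        return float("-inf")  # log is undefined here; push the document last
    return math.log(total)

def rank_results(docs):
    """Sort (doc_id, rc, qc, dc) tuples by RankC, highest first."""
    return sorted(docs, key=lambda d: rank_c(d[1], d[2], d[3]), reverse=True)

# Hypothetical matched data set with precomputed relevance components.
matched = [
    ("doc-a", 0.2, 0.1, 0.1),
    ("doc-b", 1.5, 0.9, 0.6),
    ("doc-c", 0.8, 0.4, 0.3),
]
ordered_ids = [doc_id for doc_id, *_ in rank_results(matched)]
```

Because the logarithm is monotonically increasing, the resulting order is the same as sorting by the sum RC + QC + DC directly; the logarithm only compresses the scale of the scores.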

As described above, similar technical solutions can be derived from the content given in conjunction with the drawings. However, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

Claims

1. A data retrieval engine and ranking system, characterized by comprising:

a user management module for managing user information and recording user age, professional title, educational background, technical field and user preference keyword table data;

a database for storing documents by category and responding to user requests;

a relevance calculation module for computing and ranking search results;

the relevance calculation module including a user behavior statistics submodule and a ranking relevance calculation submodule;

the user behavior statistics submodule being used to track user preference keywords, records of users' usage behavior on documents, records of users' comments on and ratings of documents, and user login count records, the usage behavior records including browse count, download count, favorite count and recommend count;

the ranking relevance calculation submodule being used to compute the ranking relevance and order the search results by the ranking relevance;

the ranking relevance RankC being determined by the query relevance RC, the demand relevance DC and the data quality relevance QC, with the following formula:

RankC = log(RC + QC + DC) (9)

the query relevance RC being calculated using the TF-IDF method;

the demand relevance DC being obtained by adding the preference keyword similarity KeySim and the behavior preference category similarity ClassSim;

the data quality relevance QC being obtained by adding the user-to-document usage factor, the user-to-document comment count factor and the user-to-document rating factor;

the preference keyword similarity being the cosine similarity between the document subject vector DSV, formed from each document's index, and the user preference vector UPV, formed from the user preference keyword table, the user preference keyword table consisting of the user-defined interest tag words saved in the user management module of the data retrieval and ranking system,

DSV(a₁,w₁；a₂,w₂；...；a_m,w_m)

UPV(b₁,w₁；b₂,w₂；...；b_n,w_n)

the behavior preference category similarity being determined by the user's usage behavior on documents, the usage behavior records coming from the user behavior statistics submodule in the relevance calculation module of the system,

where df(t_a), df(t_b), df(t_c) and df(t_d) are the numbers of times the user has browsed, downloaded, favorited and recommended documents in the category to which the document belongs;

the data quality relevance QC being obtained by adding the user-to-document usage factor, the user-to-document comment count factor and the user-to-document rating factor,

where FD_i and FD are, respectively, the number of times the document has been downloaded and the download count of the most-downloaded document among all documents; FL_i and FL are, respectively, the number of times the document has been browsed and the browse count of the most-browsed document among all documents; FF_i and FF are, respectively, the number of times the document has been favorited and the favorite count of the most-favorited document among all documents; FR_i and FR are, respectively, the number of times the document has been recommended and the recommend count of the most-recommended document among all documents; CM_i and CM are, respectively, the number of times the document has been commented on and the comment count of the most-commented document among all documents; and Score is the document rating factor;

the user-to-document usage factor combining download and browse behavior, assigning a weight α to the arithmetic mean of the two, and combining favorite and recommend behavior, assigning a weight 1 - α to the arithmetic mean of the two;

the user-to-document comment count factor assigning a separate weight β to comment behavior;

the document rating factor Score being obtained by weighting the users' ratings e_j of the document by the user rating ability US,

where j is the rating number;

the user rating ability US being obtained by summing the weighted static factor USS, dynamic factor UDS and professional domain factor MMS,

the static factor USS being calculated from the conversion values SS of the user's age, professional title and educational background,

the dynamic factor UDS being determined by the activity level AD, which is based on the user's monthly number of system logins,

where fr_i is the user's number of logins per month;

if the user's professional domain is the same as the domain of the document being rated, the professional domain factor MMS applies, i.e. γ is 0.1;

if the user's professional domain differs from the domain of the document being rated, the professional domain factor MMS does not apply, i.e. γ is 0.

2. The data retrieval engine and ranking system according to claim 1, characterized in that, in the TF-IDF method:

i is the search term;

TFi(d) is the frequency with which search term i occurs in document d;

N is the total number of documents;

DF is the number of documents containing search term i.

3. A retrieval and ranking method for the system according to claim 1, characterized by comprising the following steps:

Step 1: the user enters a search term i;

Step 2: the search term i is matched against the documents in the database to obtain a matched data set;

Step 3: the ranking relevance RankC of each document in the matched data set is calculated from the query relevance RC, the demand relevance DC and the data quality relevance QC,

RankC = log(RC + QC + DC) (9);

the query relevance RC being calculated using the TF-IDF method;

DSV(a₁,w₁；a₂,w₂；...；a_m,w_m)

UPV(b₁,w₁；b₂,w₂；...；b_n,w_n)

where df(t_a), df(t_b), df(t_c) and df(t_d) are the numbers of times the user has browsed, downloaded, favorited and recommended documents in the category to which the document belongs;

where j is the rating number;

where fr_i is the user's number of logins per month;

if the user's professional domain differs from the domain of the document being rated, the professional domain factor MMS does not apply, i.e. γ is 0;