CN110334269B - Information retrieval method and system - Google Patents

Info

Publication number
CN110334269B
Authority
CN
China
Prior art keywords
webpage
document
relevance
time sequence
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910622980.1A
Other languages
Chinese (zh)
Other versions
CN110334269A (en
Inventor
董文轩
程洁丹
晏裕生
姚晗
孙孟阳
江洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Institute Of Marine Technology & Economy
Original Assignee
China Institute Of Marine Technology & Economy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Institute Of Marine Technology & Economy
Priority to CN201910622980.1A
Publication of CN110334269A
Application granted
Publication of CN110334269B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information retrieval method and an information retrieval system. The method and system first calculate the relevance between the keyword set to be searched and each webpage document in the webpage document set of the data source to be searched in the field of national defense science and technology intelligence; they then output the webpage documents whose relevance is greater than or equal to a similarity threshold, and output the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low. Outputting the webpage documents with higher relevance as retrieval results ensures the coverage rate of the retrieval results, while outputting the webpage documents with lower relevance to the user in order of time sequence from high to low meets the requirement of high timeliness. Information retrieval in the field of national defense science and technology intelligence carried out with this method and system can therefore meet the requirements of high timeliness and high coverage rate at the same time.

Description

Information retrieval method and system
Technical Field
The present invention relates to the field of information retrieval, and in particular, to an information retrieval method and system.
Background
Information retrieval refers to the process of finding the information a user needs from a large information collection by means of a retrieval method suited to the user's needs. The core problem of information retrieval is result ordering, i.e., how to place the information the user needs most at the front of the returned list. News-type information retrieval, as a part of information retrieval, uses a retrieval method to provide the news, developments, policies, viewpoints and similar information a user requires, and is mainly characterized by high timeliness and personalization. Information retrieval in the field of national defense science and technology intelligence is a special case of it that demands both high timeliness and high coverage rate, but existing retrieval methods cannot meet both requirements at the same time.
Disclosure of Invention
The invention aims to provide an information retrieval method and an information retrieval system, which can simultaneously meet the requirements of high timeliness and high coverage rate of information retrieval in the field of national defense science and technology information.
In order to achieve the purpose, the invention provides the following scheme:
an information retrieval method, the method comprising:
acquiring a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents;
calculating the correlation between the keyword set to be searched and each webpage document;
and outputting the webpage documents whose relevance is greater than or equal to a similarity threshold, and outputting the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low.
Optionally, the calculating the relevance between the keyword set to be searched and each of the web documents specifically includes:
and calculating the relevance of the keyword set to be searched and each webpage document by adopting a BM25 model.
Optionally, the outputting the webpage document with the relevance greater than or equal to the similarity threshold specifically includes:
and outputting the webpage documents with the relevance larger than or equal to the similarity threshold value in the order of high relevance to low relevance.
Optionally, the outputting the web documents with the relevance smaller than the similarity threshold from high to low according to the time sequence specifically includes:
acquiring time sequence parameters of each webpage document with the correlation smaller than the similarity threshold, wherein the time sequence parameters comprise: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;
calculating the time sequence of each webpage document according to the time sequence parameters;
and outputting the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low.
Optionally, the time sequence parameters include the release time, the update time, the total number of clicks, the total number of downloads, the total page dwell time, and the webpage content update acceleration, and calculating the time sequence of each webpage document according to the time sequence parameters specifically includes:
according to the formula:
Figure BDA0002126122310000021
calculating the time sequence of the i-th webpage document, where $1 \le i \le I$ and $I$ represents the number of webpage documents whose relevance is smaller than the similarity threshold; $S_i$ represents the time sequence of the i-th webpage document; $D_i$ represents the total number of downloads of the i-th webpage document; $C_i$ represents the total number of clicks of the i-th webpage document; $P_i$ represents the total page dwell time of the i-th webpage document; $T2_i$ represents the update time of the i-th webpage document; $T1_i$ represents the release time of the i-th webpage document; $G_i$ represents the webpage content update acceleration of the i-th webpage document.
An information retrieval system, the system comprising:
a data acquisition module, configured to acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents;
the correlation calculation module is used for calculating the correlation between the keyword set to be searched and each webpage document;
and the retrieval output module is used for outputting the webpage documents with the relevance greater than or equal to the similarity threshold value and outputting the webpage documents with the relevance less than the similarity threshold value in sequence from high to low according to the time sequence.
Optionally, the correlation calculation module includes:
and the correlation calculation unit is used for calculating the correlation between the keyword set to be searched and each webpage document by adopting a BM25 model.
Optionally, the retrieval output module includes:
and the high-similarity document output unit is used for outputting the webpage documents of which the relevance is greater than or equal to the similarity threshold value in the order of high relevance to low relevance.
Optionally, the retrieval output module includes:
a time sequence parameter obtaining unit, configured to obtain a time sequence parameter of each web document whose correlation is smaller than the similarity threshold, where the time sequence parameter includes: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;
the time sequence calculating unit is used for calculating the time sequence of each webpage document according to the time sequence parameters;
and the time sequence document output unit is used for outputting the webpage documents with the relevance smaller than the similarity threshold value according to the time sequence from high to low.
Optionally, the time sequence parameters include the release time, the update time, the total number of clicks, the total number of downloads, the total page dwell time, and the webpage content update acceleration, and the time sequence calculating unit includes:
a timing calculation subunit configured to:
Figure BDA0002126122310000031
calculating the time sequence of the i-th webpage document, where $1 \le i \le I$ and $I$ represents the number of webpage documents whose relevance is smaller than the similarity threshold; $S_i$ represents the time sequence of the i-th webpage document; $D_i$ represents the total number of downloads of the i-th webpage document; $C_i$ represents the total number of clicks of the i-th webpage document; $P_i$ represents the total page dwell time of the i-th webpage document; $T2_i$ represents the update time of the i-th webpage document; $T1_i$ represents the release time of the i-th webpage document; $G_i$ represents the webpage content update acceleration of the i-th webpage document.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The information retrieval method and system provided by the invention first calculate the relevance between the keyword set to be searched and each webpage document in the webpage document set of the data source to be searched in the field of national defense science and technology intelligence; they then output the webpage documents whose relevance is greater than or equal to a similarity threshold, and output the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low. Outputting the webpage documents with higher relevance as retrieval results ensures the coverage rate of the retrieval results, while outputting the webpage documents with lower relevance to the user in order of time sequence from high to low meets the requirement of high timeliness. Information retrieval in the field of national defense science and technology intelligence carried out with this method and system can therefore meet the requirements of high timeliness and high coverage rate at the same time.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of an information retrieval method according to an embodiment of the present invention;
Fig. 2 is a block diagram of an information retrieval system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide an information retrieval method and an information retrieval system, which can simultaneously meet the requirements of high timeliness and high coverage rate of information retrieval in the field of national defense science and technology information.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of an information retrieval method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 101: Acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents.
Step 102: Calculate the relevance between the keyword set to be searched and each webpage document. In this embodiment, a BM25 model is used to calculate the relevance between the keyword set to be searched and each of the webpage documents.
Step 103: Output the webpage documents whose relevance is greater than or equal to the similarity threshold, and output the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low.
In practical application, the webpage documents whose relevance is greater than or equal to the similarity threshold can be output to the user in descending order of relevance: the webpage document with the highest relevance is placed first, the one with the second-highest relevance second, and so on.
The outputting the webpage documents with the relevance smaller than the similarity threshold value according to the sequence from high to low in time sequence specifically comprises:
acquiring time sequence parameters of each webpage document with the correlation smaller than the similarity threshold, wherein the time sequence parameters comprise: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;
calculating the time sequence of each webpage document according to the time sequence parameters;
and outputting the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low.
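The two-tier output of step 103 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the invention: the names `ScoredDoc` and `rank_results` and the field `time_sequence` are hypothetical, and the time sequence value of each document is assumed to have been computed beforehand from the time sequence parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredDoc:
    doc_id: str
    relevance: float      # relevance to the keyword set (e.g. a BM25 score)
    time_sequence: float  # assumed to be precomputed from the time sequence parameters


def rank_results(docs: List[ScoredDoc], threshold: float) -> List[ScoredDoc]:
    """Return the documents in output order.

    Documents with relevance >= threshold come first, in descending relevance;
    the remaining documents follow, in descending time sequence.
    """
    high = [d for d in docs if d.relevance >= threshold]
    low = [d for d in docs if d.relevance < threshold]
    high.sort(key=lambda d: d.relevance, reverse=True)
    low.sort(key=lambda d: d.time_sequence, reverse=True)
    return high + low


# Illustrative data only.
docs = [
    ScoredDoc("a", 0.9, 1.0),
    ScoredDoc("b", 0.2, 5.0),
    ScoredDoc("c", 0.7, 3.0),
    ScoredDoc("d", 0.1, 8.0),
]
ordered_ids = [d.doc_id for d in rank_results(docs, threshold=0.5)]
```

Here documents "a" and "c" pass the threshold and are ordered by relevance, while "d" and "b" fall below it and are ordered by their time sequence values.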
In this embodiment, the time sequence parameters include the release time, the update time, the total number of clicks, the total number of downloads, the total page dwell time, and the webpage content update acceleration, and calculating the time sequence of each webpage document according to the time sequence parameters specifically includes:
according to the formula:
Figure BDA0002126122310000051
calculating the time sequence of the i-th webpage document, where $1 \le i \le I$ and $I$ represents the number of webpage documents whose relevance is smaller than the similarity threshold; $S_i$ represents the time sequence of the i-th webpage document; $D_i$ represents the total number of downloads of the i-th webpage document; $C_i$ represents the total number of clicks of the i-th webpage document; $P_i$ represents the total page dwell time of the i-th webpage document; $T2_i$ represents the update time of the i-th webpage document; $T1_i$ represents the release time of the i-th webpage document; $G_i$ represents the webpage content update acceleration of the i-th webpage document.
Fig. 2 is a block diagram of an information retrieval system according to an embodiment of the present invention. As shown in fig. 2, the system includes:
The data acquisition module 201 is configured to acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, where the webpage document set includes a plurality of webpage documents.
The correlation calculation module 202 is configured to calculate the relevance between the keyword set to be searched and each of the webpage documents.
The retrieval output module 203 is configured to output the webpage documents whose relevance is greater than or equal to the similarity threshold, and to output the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low.
The correlation calculation module 202 includes:
and the correlation calculation unit is used for calculating the correlation between the keyword set to be searched and each webpage document by adopting a BM25 model.
The retrieval output module 203 includes:
and the high-similarity document output unit is used for outputting the webpage documents of which the relevance is greater than or equal to the similarity threshold value in the order of high relevance to low relevance.
The retrieval output module 203 further includes:
a time sequence parameter obtaining unit, configured to obtain a time sequence parameter of each web document whose correlation is smaller than the similarity threshold, where the time sequence parameter includes: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;
the time sequence calculating unit is used for calculating the time sequence of each webpage document according to the time sequence parameters;
and the time sequence document output unit is used for outputting the webpage documents with the relevance smaller than the similarity threshold value according to the time sequence from high to low.
In this embodiment, the time sequence parameters include the release time, the update time, the total number of clicks, the total number of downloads, the total page dwell time, and the webpage content update acceleration, and the time sequence calculating unit includes:
a timing calculation subunit configured to:
Figure BDA0002126122310000071
calculating the time sequence of the i-th webpage document, where $1 \le i \le I$ and $I$ represents the number of webpage documents whose relevance is smaller than the similarity threshold; $S_i$ represents the time sequence of the i-th webpage document; $D_i$ represents the total number of downloads of the i-th webpage document; $C_i$ represents the total number of clicks of the i-th webpage document; $P_i$ represents the total page dwell time of the i-th webpage document; $T2_i$ represents the update time of the i-th webpage document; $T1_i$ represents the release time of the i-th webpage document; $G_i$ represents the webpage content update acceleration of the i-th webpage document.
The specific implementation process of the invention is as follows:
S1: Obtain the webpage document set D of the data source to be searched in the field of national defense science and technology intelligence, $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ represents the i-th webpage document in D.
S2: Obtain the query text input by the user and segment it to obtain the keyword set to be searched $Q = \{q_1, q_2, \ldots, q_u\}$, where $q_i$ represents the i-th keyword to be searched in the keyword set, $1 \le i \le u$, and $u$ represents the number of keywords to be searched. Each webpage document $d_i$ is represented as $\langle Q, f_i, r_i \rangle$, where Q is the user's keyword set to be searched; $f_i$ is the feature set of webpage document $d_i$; and $r_i$ is the relevance judgment between the document and the keyword set Q, taking values in {0, 1}, where 0 means irrelevant and 1 means relevant. Specifically, when determining the keyword set to be searched, an optimal segmentation is found for each webpage document $d_i$ using the unsupervised feature selection method of the RSR (Regularized Self-Representation) algorithm. The specific steps are as follows:
(1) The feature set of webpage document $d_i$ is $f_i = \{f_{i1}, f_{i2}, \ldots, f_{im}\}$, and each specific feature $f_{ij}$ can be linearly expressed by the other features or by itself as:
Figure BDA0002126122310000072
where $1 \le i \le n$, $1 \le j \le k \le m$, $w_{jk}$ denotes the relation coefficient between $f_{ij}$ and $f_{ik}$, $e_{ij}$ represents a weighted term, and $f_{ij}$ represents the j-th feature of the i-th document.
(2) For the feature set $f_i$ of the document, solve for the optimal
Figure BDA0002126122310000073
using an extremum algorithm:
Figure BDA0002126122310000074
where W represents the coefficient matrix of webpage document $d_i$, $W = [w_{ij}] \in \mathbb{R}^{m \times m}$; the $\ell_{2,1}$ norm on E makes the algorithm robust to outliers, and an $\ell_{2,1}$ regularization term on W is added to avoid trivial solutions; $\lambda$ is a non-zero regularization weighting parameter.
Let
Figure BDA0002126122310000075
where $w_i$ is row i of
Figure BDA0002126122310000076
According to the formula
Figure BDA0002126122310000077
the coefficient corresponding to each feature can be obtained, where $v = \{v_1, v_2, \ldots, v_m\}$; that is, the coefficient corresponding to the j-th feature $f_{ij}$ of webpage document $d_i$ is $v_j$.
(3) Count the word frequency $x_i$ with which the keyword to be searched $q_i$ appears in the document features. According to the formula
Figure BDA0002126122310000081
obtain the keyword set coefficient $t_i$ of the document. Sort by $t_i$ from largest to smallest and take the segmentation with the largest $t_i$ as the optimal segmentation, thereby obtaining the keyword set to be searched $Q = \{q_1, q_2, \ldots, q_u\}$.
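For reference, the l2,1 norm mentioned in step (2) is the sum of the Euclidean norms of the rows of a matrix. The sketch below computes it in plain Python and also shows one common way of scoring features, by the Euclidean norm of each row of a coefficient matrix W. Whether this row-norm scoring corresponds to the coefficient v_j of this embodiment is an assumption, because the corresponding formulas are not reproduced in the text; the function names and the example matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def row_norms(w: Matrix) -> List[float]:
    """Euclidean (l2) norm of each row of the matrix."""
    return [math.sqrt(sum(x * x for x in row)) for row in w]


def l21_norm(w: Matrix) -> float:
    """l2,1 norm: the sum of the l2 norms of the rows."""
    return sum(row_norms(w))


def rank_features(w: Matrix) -> List[int]:
    """Zero-based feature indices ordered by descending row norm (assumed scoring)."""
    norms = row_norms(w)
    return sorted(range(len(norms)), key=lambda j: norms[j], reverse=True)


# Illustrative 3 x 3 coefficient matrix; the all-zero row marks an unused feature.
W = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
norm_value = l21_norm(W)
feature_order = rank_features(W)
```

Because the l2,1 norm penalizes whole rows, minimizing it tends to drive entire rows of W to zero, which is why row norms are a natural indicator of feature importance in this kind of method.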
S3: For each webpage document $d_i$, the content is divided into 7 content fields: the web address (URL), title, body content, document tags (meta keywords), tag description (meta description), anchor text (i.e., link text in the webpage), and search time log. Each webpage document is represented and indexed in the search engine by these fields.
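The seven content fields of step S3 can be pictured as a simple record type. The following is a minimal sketch: the class name `WebPageDocument`, the attribute names, the use of plain strings for every field, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class WebPageDocument:
    url: str = ""               # web address (URL)
    title: str = ""             # title
    body: str = ""              # body content
    meta_keywords: str = ""     # document tags (meta keywords)
    meta_description: str = ""  # tag description (meta description)
    anchor_text: str = ""       # link text in the webpage
    search_time_log: str = ""   # search time log


# Names of the content fields under which a document would be indexed.
FIELD_NAMES = [f.name for f in fields(WebPageDocument)]

doc = WebPageDocument(
    url="http://example.com/report",
    title="Example report",
    body="example body text",
)
```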
S4: Calculate the relevance between the keyword set to be searched and each webpage document in the document set D using the BM25 model, and obtain the relevance ranking result of the n webpage documents in the document set D through ranking and screening.
The specific calculation method is as follows:
(1) First, calculate the relevance $R(q_i, d_i)$ between each keyword $q_i$ in the keyword set Q to be searched and each content field of each webpage document $d_i$, then according to the formula
Figure BDA0002126122310000082
perform an accumulation operation to obtain the final relevance $S(Q, d_i)$ between the keyword set Q to be searched and the webpage document $d_i$, where $P_i$ represents the weight of the keyword. The relevance $R(q_i, d_i)$ is calculated as follows:
$R(q_i, d_i) = \frac{fq_i \times (k_1 + 1)}{fq_i + K} \times \frac{qf_i \times (k_2 + 1)}{qf_i + k_2}$, where $K = k_1 \times (1 - b + b \times dl_i / avgdl)$, $qf_i$ is the frequency of keyword $q_i$ in the query statement Q, $fq_i$ is the frequency of keyword $q_i$ in webpage document $d_i$, $k_1$, $k_2$ and $b$ are adjustment factors, which can generally be set to $k_1 = 1$, $k_2 = 2$, $dl_i$ is the length of webpage document $d_i$, and $avgdl$ is the average length of all webpage documents in the document set D,
$avgdl = \frac{1}{n} \sum_{i=1}^{n} dl_i$
(2) Sort all webpage documents in the document set D by their relevance value $S(Q, d_i)$ from largest to smallest to obtain a document set in descending order of relevance.
(3) Acquire a relevance threshold T and use it to divide the descending document set into two parts: the first part is the set of documents whose relevance is greater than or equal to T, and the second part is the set of documents whose relevance is smaller than T.
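Steps (1) to (3) can be sketched in Python as follows. This is an illustrative sketch under several assumptions rather than the implementation of the invention: documents are treated as flat token lists instead of seven separate content fields, the keyword weights default to 1 when none are supplied, b = 0.75 is an assumed value (the text only gives k1 = 1 and k2 = 2), the length normalization uses the ratio of document length to average length as in the standard BM25 formulation, and all function names and the sample corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def bm25_term_score(fq: int, qf: int, dl: int, avgdl: float,
                    k1: float = 1.0, k2: float = 2.0, b: float = 0.75) -> float:
    """R(q, d) for one keyword: document-side factor times query-side factor."""
    K = k1 * (1.0 - b + b * dl / avgdl)
    return (fq * (k1 + 1.0) / (fq + K)) * (qf * (k2 + 1.0) / (qf + k2))


def relevance(query: List[str], doc_tokens: List[str], avgdl: float,
              weights: Optional[Dict[str, float]] = None) -> float:
    """S(Q, d): weighted sum of the per-keyword scores."""
    qf = Counter(query)     # frequency of each keyword in the query
    tf = Counter(doc_tokens)  # frequency of each token in the document
    total = 0.0
    for term in qf:
        p = 1.0 if weights is None else weights.get(term, 1.0)
        total += p * bm25_term_score(tf[term], qf[term], len(doc_tokens), avgdl)
    return total


def split_by_threshold(scores: Dict[str, float],
                       t: float) -> Tuple[List[str], List[str]]:
    """Sort by descending relevance, then split at the relevance threshold t."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    first = [doc_id for doc_id, s in ranked if s >= t]
    second = [doc_id for doc_id, s in ranked if s < t]
    return first, second


# Illustrative corpus only.
corpus = {
    "d1": ["radar", "system", "radar", "test"],
    "d2": ["ship", "engine", "report", "summary"],
    "d3": ["radar", "ship", "design", "note"],
}
avgdl = sum(len(tokens) for tokens in corpus.values()) / len(corpus)
query = ["radar", "ship"]
scores = {doc_id: relevance(query, tokens, avgdl) for doc_id, tokens in corpus.items()}
first_half, second_half = split_by_threshold(scores, t=1.2)
```

In this example every document has the average length, so K equals k1, and the document that contains both keywords ("d3") scores highest. Only "d2" falls below the threshold and would be passed on to the time sequence ordering.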
S5: For each webpage document in the document set whose relevance is smaller than the relevance threshold T, acquire the release time T1, the update time T2, the total number of clicks C (each single mouse click by a user anywhere on the webpage counts as 1 click; the default value is 0), the total number of downloads D (each download operation on the webpage content triggered by a user counts as 1 download; the default value is 0), the total page dwell time P, and the webpage content update acceleration G. The value of the webpage content update acceleration G varies with the speed of the webpage content update time interval.
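The parameters collected in step S5 can be held in a small record with counters. The sketch below is illustrative only: the class name `TimingParameters`, the method names, the use of Unix timestamps for T1 and T2, seconds for P, and the default of 0 for P and G are assumptions; the text specifies a default of 0 only for the click and download totals.

```python
from dataclasses import dataclass


@dataclass
class TimingParameters:
    publish_time: float               # T1, assumed to be a Unix timestamp
    update_time: float                # T2, assumed to be a Unix timestamp
    clicks: int = 0                   # C, default 0 as stated in S5
    downloads: int = 0                # D, default 0 as stated in S5
    dwell_time: float = 0.0           # P, assumed unit: seconds
    update_acceleration: float = 0.0  # G, assumed default

    def record_click(self) -> None:
        """One mouse click anywhere on the page counts as 1 click."""
        self.clicks += 1

    def record_download(self) -> None:
        """One triggered download of page content counts as 1 download."""
        self.downloads += 1

    def record_dwell(self, seconds: float) -> None:
        """Add one visit's dwell time to the running total."""
        self.dwell_time += seconds


# Illustrative usage with made-up values.
params = TimingParameters(publish_time=1_560_000_000.0, update_time=1_562_000_000.0)
params.record_click()
params.record_click()
params.record_download()
params.record_dwell(12.5)
```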
S6: Calculate the time sequence of each webpage document according to the formula
Figure BDA0002126122310000091
S7: Output the webpage documents whose relevance is smaller than the similarity threshold T to the user in order of time sequence from high to low.
The retrieval method and system combine the relevance of the retrieval topic with the time sequence of information release and sort the retrieval results according to the user's actual degree of need. This improves the current state of information searching for intelligence personnel, places the results the user cares about at the front, and meets the requirements of high relevance and high timeliness for information retrieval results in the field of national defense science and technology intelligence.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (6)

1. An information retrieval method, the method comprising:
acquiring a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents;
calculating the correlation between the keyword set to be searched and each webpage document;
outputting the webpage documents with the relevance larger than or equal to a similarity threshold, and outputting the webpage documents with the relevance smaller than the similarity threshold in sequence from high to low according to the time sequence;
the outputting the webpage documents with the relevance smaller than the similarity threshold value according to the sequence from high to low in time sequence specifically comprises:
acquiring time sequence parameters of each webpage document with the correlation smaller than the similarity threshold, wherein the time sequence parameters comprise: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;
calculating the time sequence of each webpage document according to the time sequence parameters, which specifically comprises the following steps:
according to the formula:
Figure FDA0002977483760000011
calculating the time sequence of the i-th webpage document, where $1 \le i \le I$ and $I$ represents the number of webpage documents whose relevance is smaller than the similarity threshold; $S_i$ represents the time sequence of the i-th webpage document; $D_i$ represents the total number of downloads of the i-th webpage document; $C_i$ represents the total number of clicks of the i-th webpage document; $P_i$ represents the total page dwell time of the i-th webpage document; $T2_i$ represents the update time of the i-th webpage document; $T1_i$ represents the release time of the i-th webpage document; $G_i$ represents the webpage content update acceleration of the i-th webpage document;
and outputting the webpage documents whose relevance is smaller than the similarity threshold in order of time sequence from high to low.
2. The method according to claim 1, wherein the calculating the relevance of the keyword set to be searched to each of the web documents specifically comprises:
calculating the relevance between the keyword set to be searched and each webpage document using a BM25 model.
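As a rough illustration of BM25 relevance scoring, the sketch below implements the commonly used Okapi BM25 form with a smoothed, non-negative IDF. The claim does not state which BM25 variant or parameter values are used, so the defaults `k1=1.5` and `b=0.75`, the whitespace-tokenized inputs, and the function name `bm25_scores` are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    keywords: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 1.5,   # term-frequency saturation (assumed value)
    b: float = 0.75,   # length normalization strength (assumed value)
) -> List[float]:
    """Score each tokenized document against the keyword set with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(keywords):
            if tf[term] == 0:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Usage: three toy documents, one keyword.
corpus = [["radar", "signal", "radar"], ["ship", "engine"], ["radar", "ship"]]
scores = bm25_scores(["radar"], corpus)
```

Documents that contain none of the keywords score 0, so they fall below any positive similarity threshold.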
3. The method according to claim 1, wherein outputting the web page document whose relevance is greater than or equal to the similarity threshold specifically includes:
outputting the webpage documents whose relevance is greater than or equal to the similarity threshold in descending order of relevance.
4. An information retrieval system, the system comprising:
a data acquisition module, configured to acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents;
a correlation calculation module, configured to calculate the relevance between the keyword set to be searched and each webpage document;
a retrieval output module, configured to output the webpage documents whose relevance is greater than or equal to a similarity threshold, and to output the webpage documents whose relevance is smaller than the similarity threshold in descending order of time sequence;
the retrieval output module comprises:
a time sequence parameter obtaining unit, configured to obtain time sequence parameters of each webpage document whose relevance is smaller than the similarity threshold, wherein the time sequence parameters comprise at least one of: the release time, the update time, the total number of clicks, the total number of downloads, the total page dwell time, and the webpage content update acceleration;
a time sequence calculating unit, configured to calculate a time sequence of each web document according to the time sequence parameter, where the time sequence calculating unit includes:
a timing calculation subunit, configured to use the formula:
Figure FDA0002977483760000021
to calculate the time sequence of the i-th webpage document, where 1 ≤ i ≤ I; I represents the number of webpage documents whose relevance is smaller than the similarity threshold; S_i represents the time sequence of the i-th webpage document; D_i represents the total download count of the i-th webpage document; C_i represents the total click count of the i-th webpage document; P_i represents the total page dwell time of the i-th webpage document; T2_i represents the update time of the i-th webpage document; T1_i represents the publishing time of the i-th webpage document; G_i represents the webpage content update acceleration of the i-th webpage document;
and a time sequence document output unit, configured to output the webpage documents whose relevance is smaller than the similarity threshold in descending order of time sequence.
5. The system of claim 4, wherein the correlation computation module comprises:
a correlation calculation unit, configured to calculate the relevance between the keyword set to be searched and each webpage document using a BM25 model.
6. The system of claim 4, wherein the retrieval output module comprises:
a high-similarity document output unit, configured to output the webpage documents whose relevance is greater than or equal to the similarity threshold in descending order of relevance.
CN201910622980.1A 2019-07-11 2019-07-11 Information retrieval method and system Active CN110334269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910622980.1A CN110334269B (en) 2019-07-11 2019-07-11 Information retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910622980.1A CN110334269B (en) 2019-07-11 2019-07-11 Information retrieval method and system

Publications (2)

Publication Number Publication Date
CN110334269A CN110334269A (en) 2019-10-15
CN110334269B true CN110334269B (en) 2021-05-07

Family

ID=68146347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910622980.1A Active CN110334269B (en) 2019-07-11 2019-07-11 Information retrieval method and system

Country Status (1)

Country Link
CN (1) CN110334269B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1306258A (en) * 2001-03-09 2001-08-01 北京大学 Method for judging position correlation of a group of query keys or words on network page
CN101477556A (en) * 2009-01-22 2009-07-08 苏州智讯科技有限公司 Method for discovering hot sport in internet mass information
CN101625680A (en) * 2008-07-09 2010-01-13 东北大学 Document retrieval method in patent field
CN102982153A (en) * 2012-11-29 2013-03-20 北京亿赞普网络技术有限公司 Information retrieval method and device
CN104991962A (en) * 2015-07-22 2015-10-21 无锡天脉聚源传媒科技有限公司 Method and apparatus for generating recommendation information
CN107977405A (en) * 2017-11-16 2018-05-01 北京三快在线科技有限公司 Data reordering method, data sorting device, electronic equipment and readable storage medium storing program for executing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text

Also Published As

Publication number Publication date
CN110334269A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
CN109408703B (en) Information recommendation method and system, device, electronic equipment and storage medium thereof
CN107145496B (en) Method for matching image with content item based on keyword
US9020947B2 (en) Web knowledge extraction for search task simplification
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
US8612435B2 (en) Activity based users' interests modeling for determining content relevance
US8150841B2 (en) Detecting spiking queries
CN111708740A (en) Mass search query log calculation analysis system based on cloud platform
US20070143300A1 (en) System and method for monitoring evolution over time of temporal content
US20080077569A1 (en) Integrated Search Service System and Method
WO2002019158A2 (en) Method and system for personalisation of digital information
WO2014149199A1 (en) Method and system for multi-phase ranking for content personalization
CN103324669A (en) Method and client for processing web page bookmark
CN107145497B (en) Method for selecting image matched with content based on metadata of image and content
US20090132517A1 (en) Socially-derived relevance in search engine results
CN102163228A (en) Method, apparatus and device for determining sorting result of resource candidates
CN105760443A (en) Project recommending system, device and method
CN102930038A (en) Combined method of search result similar items and system of the same
CN102364467A (en) Network search method and system
CN105095209A (en) Document clustering method, document clustering device and network equipment
CN108959580A (en) A kind of optimization method and system of label data
CN104615723B (en) The determination method and apparatus of query word weighted value
Hoang et al. Academic event recommendation based on research similarity and exploring interaction between authors
CN108509449B (en) Information processing method and server
CN117593089A (en) Credit card recommendation method, apparatus, device, storage medium and program product
CN110334269B (en) Information retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant