CN110334269B - Information retrieval method and system

Info

Publication number: CN110334269B
Application number: CN201910622980.1A
Authority: CN
Inventors: 董文轩; 程洁丹; 晏裕生; 姚晗; 孙孟阳; 江洋
Original assignee: China Institute Of Marine Technology & Economy
Current assignee: China Institute Of Marine Technology & Economy
Priority date: 2019-07-11
Filing date: 2019-07-11
Publication date: 2021-05-07
Anticipated expiration: 2039-07-11
Also published as: CN110334269A

Abstract

The invention discloses an information retrieval method and system. The method and system first calculate the relevance between a keyword set to be searched and each webpage document in the webpage document set of a data source to be searched in the field of national defense science and technology intelligence. They then output the webpage documents whose relevance is greater than or equal to a similarity threshold, and output the webpage documents whose relevance is less than the similarity threshold in descending order of time sequence. Outputting the more relevant webpage documents as retrieval results ensures the coverage of the results, while outputting the less relevant webpage documents to the user in descending order of time sequence meets the requirement of high timeliness. Information retrieval in the field of national defense science and technology intelligence carried out with the method and system can therefore meet the requirements of high timeliness and high coverage at the same time.

Description

Information retrieval method and system

Technical Field

The present invention relates to the field of information retrieval, and in particular, to an information retrieval method and system.

Background

Information retrieval refers to the process of finding the information a user needs in a large collection of information, using a given retrieval method according to the user's needs. The core problem of information retrieval is result ranking, that is, how to place the information the user needs most at the top of the returned list. News and intelligence retrieval, as a branch of information retrieval, uses a given retrieval method to provide the news, developments, policies, viewpoints and similar information that a user requires, and is characterized mainly by high timeliness and personalization. Intelligence retrieval in the field of national defense science and technology is a special case of such retrieval that demands both high timeliness and high coverage, but existing retrieval methods cannot meet both requirements at the same time.

Disclosure of Invention

The invention aims to provide an information retrieval method and system that can meet the requirements of high timeliness and high coverage at the same time for information retrieval in the field of national defense science and technology intelligence.

In order to achieve the purpose, the invention provides the following scheme:

an information retrieval method, the method comprising:

acquiring a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents;

calculating the correlation between the keyword set to be searched and each webpage document;

and outputting the webpage documents whose relevance is greater than or equal to a similarity threshold, and outputting the webpage documents whose relevance is less than the similarity threshold in descending order of time sequence.

Optionally, the calculating the relevance between the keyword set to be searched and each of the web documents specifically includes:

and calculating the relevance of the keyword set to be searched and each webpage document by adopting a BM25 model.

Optionally, the outputting the webpage document with the relevance greater than or equal to the similarity threshold specifically includes:

and outputting the webpage documents with the relevance larger than or equal to the similarity threshold value in the order of high relevance to low relevance.

Optionally, the outputting the web documents with the relevance smaller than the similarity threshold from high to low according to the time sequence specifically includes:

acquiring time sequence parameters of each webpage document with the correlation smaller than the similarity threshold, wherein the time sequence parameters comprise: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;

calculating the time sequence of each webpage document according to the time sequence parameters;

and outputting the webpage documents with the relevance smaller than the similarity threshold value in the order of high chronological order to low chronological order.
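
The output rule described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `two_tier_output`, the callables `relevance` and `time_sequence`, and the sample tuples are hypothetical, and the time sequence score is taken as a given number because its formula is not available in this text.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Doc = TypeVar("Doc")


def two_tier_output(
    docs: Iterable[Doc],
    relevance: Callable[[Doc], float],
    time_sequence: Callable[[Doc], float],
    threshold: float,
) -> Tuple[List[Doc], List[Doc]]:
    """Split documents at the similarity threshold and order each tier."""
    docs = list(docs)
    # Tier 1: relevance >= threshold, highest relevance first.
    high = sorted(
        (d for d in docs if relevance(d) >= threshold),
        key=relevance,
        reverse=True,
    )
    # Tier 2: relevance < threshold, highest time sequence score first.
    low = sorted(
        (d for d in docs if relevance(d) < threshold),
        key=time_sequence,
        reverse=True,
    )
    return high, low


# Hypothetical documents: (name, relevance, time sequence score).
sample = [("a", 0.9, 1.0), ("b", 0.2, 5.0), ("c", 0.6, 2.0), ("d", 0.4, 7.0)]
high, low = two_tier_output(sample, lambda d: d[1], lambda d: d[2], threshold=0.5)
print([d[0] for d in high])  # ['a', 'c']
print([d[0] for d in low])   # ['d', 'b']
```

Passing the two scores as callables keeps the sketch independent of how relevance and time sequence are actually computed.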

Optionally, the time sequence parameters comprise the publishing time, the update time, the total click count, the total download count, the total page dwell time and the webpage content update acceleration, and calculating the time sequence of each webpage document according to the time sequence parameters specifically comprises:

according to the formula:

calculating the time sequence of the $i$-th webpage document, where $1 \le i \le I$, $I$ denotes the number of webpage documents whose relevance is less than the similarity threshold, and $S_i$ denotes the time sequence of the $i$-th webpage document; $d_i$ denotes the total download count of the $i$-th webpage document; $c_i$ denotes the total click count of the $i$-th webpage document; $p_i$ denotes the total page dwell time of the $i$-th webpage document; $t2_i$ denotes the update time of the $i$-th webpage document; $t1_i$ denotes the publishing time of the $i$-th webpage document; $g_i$ denotes the webpage content update acceleration of the $i$-th webpage document.

An information retrieval system, the system comprising:

a data acquisition module, configured to acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents;

the correlation calculation module is used for calculating the correlation between the keyword set to be searched and each webpage document;

and the retrieval output module is used for outputting the webpage documents with the relevance greater than or equal to the similarity threshold value and outputting the webpage documents with the relevance less than the similarity threshold value in sequence from high to low according to the time sequence.

Optionally, the correlation calculation module includes:

and the correlation calculation unit is used for calculating the correlation between the keyword set to be searched and each webpage document by adopting a BM25 model.

Optionally, the retrieval output module includes:

and the high-similarity document output unit is used for outputting the webpage documents of which the relevance is greater than or equal to the similarity threshold value in the order of high relevance to low relevance.

Optionally, the retrieval output module includes:

a time sequence parameter obtaining unit, configured to obtain a time sequence parameter of each web document whose correlation is smaller than the similarity threshold, where the time sequence parameter includes: at least one of the release time, the update time, the total number of clicks, the total number of downloads, the total length of the dwell time of the page and the acceleration of updating the webpage content;

the time sequence calculating unit is used for calculating the time sequence of each webpage document according to the time sequence parameters;

and the time sequence document output unit is used for outputting the webpage documents with the relevance smaller than the similarity threshold value according to the time sequence from high to low.

Optionally, the time sequence parameters comprise the publishing time, the update time, the total click count, the total download count, the total page dwell time and the webpage content update acceleration, and the time sequence calculating unit comprises:

a timing calculation subunit configured to:

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The information retrieval method and system provided by the invention first calculate the relevance between a keyword set to be searched and each webpage document in the webpage document set of a data source to be searched in the field of national defense science and technology intelligence. They then output the webpage documents whose relevance is greater than or equal to a similarity threshold, and output the webpage documents whose relevance is less than the similarity threshold in descending order of time sequence. Outputting the more relevant webpage documents as retrieval results ensures the coverage of the results, while outputting the less relevant webpage documents to the user in descending order of time sequence meets the requirement of high timeliness. Information retrieval in the field of national defense science and technology intelligence carried out with the method and system can therefore meet the requirements of high timeliness and high coverage at the same time.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of an information retrieval method according to an embodiment of the present invention;

Fig. 2 is a block diagram of an information retrieval system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a flowchart of an information retrieval method according to an embodiment of the present invention. As shown in fig. 1, the method includes:

Step 101: Acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, wherein the webpage document set comprises a plurality of webpage documents.

Step 102: Calculate the relevance between the keyword set to be searched and each webpage document. In this embodiment, a BM25 model is used to calculate the relevance between the keyword set to be searched and each webpage document.

Step 103: Output the webpage documents whose relevance is greater than or equal to the similarity threshold, and output the webpage documents whose relevance is less than the similarity threshold in descending order of time sequence.

In practical applications, the webpage documents whose relevance is greater than or equal to the similarity threshold may be output to the user in descending order of relevance: the webpage document with the highest relevance is placed first, the one with the second-highest relevance is placed second, and so on.

The outputting the webpage documents with the relevance smaller than the similarity threshold value according to the sequence from high to low in time sequence specifically comprises:

In this embodiment, the time sequence parameters comprise the publishing time, the update time, the total click count, the total download count, the total page dwell time and the webpage content update acceleration, and calculating the time sequence of each webpage document according to the time sequence parameters specifically comprises:

according to the formula:

Fig. 2 is a block diagram of an information retrieval system according to an embodiment of the present invention. As shown in fig. 2, the system includes:

The data acquisition module 201 is configured to acquire a keyword set to be searched and a webpage document set of a data source to be searched in the field of national defense science and technology intelligence, where the webpage document set includes a plurality of webpage documents.

The correlation calculation module 202 is configured to calculate the relevance between the keyword set to be searched and each of the webpage documents.

The retrieval output module 203 is configured to output the webpage documents whose relevance is greater than or equal to the similarity threshold, and to output the webpage documents whose relevance is less than the similarity threshold in descending order of time sequence.

The correlation calculation module 202 includes:

The retrieval output module 203 includes:

The retrieval output module 203 further includes:

In this embodiment, the time sequence parameters comprise the publishing time, the update time, the total click count, the total download count, the total page dwell time and the webpage content update acceleration, and the time sequence calculating unit comprises:

a timing calculation subunit configured to:

The specific implementation process of the invention is as follows:

S1: Obtain the webpage document set D of the data source to be searched in the field of national defense science and technology intelligence, $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ denotes the $i$-th webpage document in D.

S2: Obtain the query text input by the user and segment it to obtain the keyword set to be searched $Q = \{q_1, q_2, \ldots, q_u\}$, where $q_i$ denotes the $i$-th keyword to be searched in the set, $1 \le i \le u$, and $u$ denotes the number of keywords to be searched. Each webpage document $d_i$ is represented as $\langle Q, f_i, r_i \rangle$, where $Q$ is the user's keyword set to be searched, $f_i$ is the feature set of webpage document $d_i$, and $r_i$ is the relevance judgment between the document and the keyword set $Q$, taking values in $\{0, 1\}$, where 0 means irrelevant and 1 means relevant. Specifically, when determining the keyword set to be searched, an optimal segmentation is found for each webpage document $d_i$ using the unsupervised feature selection method of the RSR (Regularized Self-Representation) algorithm, with the following steps:

(1) The feature set of webpage document $d_i$ is $f_i = \{f_{i1}, f_{i2}, \ldots, f_{im}\}$, and each specific feature $f_{ij}$ can be linearly expressed by the other features or by itself as:

where $1 \le i \le n$, $1 \le j \le k \le m$, $w_{jk}$ denotes the relationship coefficient between $f_{ij}$ and $f_{ik}$, $e_{ij}$ denotes a weighted term, and $f_{ij}$ denotes the $j$-th feature of the $i$-th document.

(2) For the feature set $f_i$ of the document, solve for the optimum using an extremum algorithm

where $W$ denotes the coefficient matrix of webpage document $d_i$, $W = [w_{ij}] \in \mathbb{R}^{m \times m}$; the $\ell_{2,1}$ norm is applied to $E$ to make the algorithm robust to outliers, and an $\ell_{2,1}$ regularization term on $W$ is added to avoid trivial solutions; $\lambda$ is a non-zero regularization weight.

Let

where $w_i$ is the $i$-th row of the resulting matrix. According to the formula

the coefficient corresponding to each feature can be obtained, where $v = \{v_1, v_2, \ldots, v_m\}$; that is, the coefficient corresponding to the $j$-th feature $f_{ij}$ of webpage document $d_i$ is $v_j$.

(3) Count the word frequency $x_i$ with which the keyword to be searched $q_i$ appears in the document features. According to the formula

the keyword set coefficient $t_i$ of the document is obtained. The values $t_i$ are sorted in descending order, and the segmentation with the largest $t_i$ is taken as the optimal segmentation, which yields the keyword set to be searched $Q = \{q_1, q_2, \ldots, q_u\}$.
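
For orientation, the regularized self-representation model as it is commonly stated in the feature selection literature can be written as below. The formulas that belong to steps (1) and (2) are not available in this text, so whether they match this form exactly is an assumption. The symbol $X$ (hypothetical here) denotes the data matrix whose columns are the $m$ features, $E = X - XW$ is the residual, and the feature coefficient $v_j$ is taken as the Euclidean norm of the $j$-th row of the solved $W$.

```latex
\hat{W} = \arg\min_{W \in \mathbb{R}^{m \times m}}
          \lVert X - XW \rVert_{2,1} + \lambda \lVert W \rVert_{2,1},
\qquad
\lVert A \rVert_{2,1} = \sum_{i} \Bigl( \sum_{j} A_{ij}^{2} \Bigr)^{1/2},
\qquad
v_j = \lVert \hat{w}_j \rVert_{2}
```

A row of $\hat{W}$ with a large norm indicates a feature that contributes strongly to reconstructing the others, which is why the row norm can serve as a feature weight.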

S3: The content of each webpage document $d_i$ is divided into 7 content fields: the web address (URL), the title, the body content, the document tags (meta keywords), the tag description (meta description), the anchor text (that is, the link text in a webpage), and the search time log. Each webpage document is represented and indexed in the search engine by these fields.
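
The seven-field representation in S3 could be modeled roughly as follows. This is a sketch under assumptions: the class name `WebPageDocument`, the attribute names, the field types, and the `fields_for_index` helper are hypothetical and only illustrate keeping each field separately addressable for indexing.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WebPageDocument:
    """One webpage document split into the seven content fields of S3."""
    url: str
    title: str
    body: str
    meta_keywords: List[str] = field(default_factory=list)
    meta_description: str = ""
    anchor_texts: List[str] = field(default_factory=list)
    # Assumed here to be a list of timestamp strings; the source does not
    # specify the format of the search time log.
    search_time_log: List[str] = field(default_factory=list)

    def fields_for_index(self) -> Dict[str, str]:
        """Return field name -> text so each field can be indexed separately."""
        return {
            "url": self.url,
            "title": self.title,
            "body": self.body,
            "meta_keywords": " ".join(self.meta_keywords),
            "meta_description": self.meta_description,
            "anchor_texts": " ".join(self.anchor_texts),
            "search_time_log": " ".join(self.search_time_log),
        }


doc = WebPageDocument(
    url="https://example.org/news/1",
    title="Example title",
    body="Example body text",
    meta_keywords=["example", "news"],
)
index_entry = doc.fields_for_index()
print(sorted(index_entry))
```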

S4: Calculate the relevance between the keyword set to be searched and each webpage document in the document set D using the BM25 model, and obtain the relevance ranking of the n webpage documents in D by sorting and screening.

The specific calculation method is as follows:

(1) First, for each keyword $q_i$ in the keyword set to be searched $Q$, calculate its relevance $R(q_i, d_i)$ to each content field of each webpage document $d_i$; then, according to the formula

perform an accumulation to obtain the final relevance $S(Q, d_i)$ between the keyword set to be searched $Q$ and the webpage document $d_i$, where $P_i$ denotes the weight of the keyword. The relevance $R(q_i, d_i)$ is calculated as follows:

$R(q_i, d_i) = \dfrac{fq_i \times (k_1 + 1)}{fq_i + K} \times \dfrac{qf_i \times (k_2 + 1)}{qf_i + k_2}$, where $K = k_1 \times (1 - b + b \times dl_i / avgdl)$; $qf_i$ is the frequency of keyword $q_i$ in the query statement $Q$; $fq_i$ is the frequency of keyword $q_i$ in webpage document $d_i$; $k_1$, $k_2$ and $b$ are adjustment factors, which can generally be set to $k_1 = 1$, $k_2 = 2$; $dl_i$ is the length of webpage document $d_i$; and $avgdl$ is the average length of all webpage documents, that is, of the document set D.

(2) Sort all the webpage documents in the document set D in descending order of their relevance value $S(Q, d_i)$ to obtain a document set in descending order of relevance.

(3) Acquire a relevance threshold T and use it to divide the relevance-ordered document set into two parts: the first part is the set of documents whose relevance is greater than or equal to T, and the second part is the set of documents whose relevance is less than T.
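
The scoring and ranking in steps (1) and (2) can be sketched in Python as below. Several points are assumptions rather than statements of the source: the accumulation is taken to be a weighted sum of $R$ over the distinct query keywords with weights defaulting to 1, the length normalization uses the standard BM25 ratio `dl / avgdl`, `b = 0.75` is a commonly used value that the source does not give, and each document is treated as a single token list instead of being scored per content field. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def term_relevance(fq: int, qf: int, dl: int, avgdl: float,
                   k1: float = 1.0, k2: float = 2.0, b: float = 0.75) -> float:
    """R(q, d) for one keyword, with K = k1 * (1 - b + b * dl / avgdl)."""
    K = k1 * (1.0 - b + b * dl / avgdl)
    doc_part = fq * (k1 + 1.0) / (fq + K)
    query_part = qf * (k2 + 1.0) / (qf + k2)
    return doc_part * query_part


def document_score(query: Sequence[str], doc: Sequence[str], avgdl: float,
                   weights: Optional[Dict[str, float]] = None) -> float:
    """S(Q, d): weighted sum of R(q, d) over the distinct query keywords."""
    weights = weights or {}
    total = 0.0
    for term in set(query):
        fq = doc.count(term)    # frequency of the keyword in the document
        qf = query.count(term)  # frequency of the keyword in the query
        total += weights.get(term, 1.0) * term_relevance(fq, qf, len(doc), avgdl)
    return total


def rank(query: Sequence[str], docs: Dict[str, List[str]],
         weights: Optional[Dict[str, float]] = None) -> List[Tuple[str, float]]:
    """Score every document and sort in descending order of relevance."""
    avgdl = sum(len(tokens) for tokens in docs.values()) / len(docs)
    scored = [(name, document_score(query, tokens, avgdl, weights))
              for name, tokens in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical tokenized documents.
docs = {
    "d1": ["radar", "system", "radar"],
    "d2": ["ship", "design"],
    "d3": ["radar", "ship", "news", "report"],
}
ranking = rank(["radar"], docs)
print([name for name, _ in ranking])  # ['d1', 'd3', 'd2']
```

The sketch assumes a non-empty document set with at least one token overall; a production version would guard against `avgdl` being zero.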

S5: For each webpage document in the document set whose relevance is less than the relevance threshold T, acquire the publishing time T1, the update time T2, the total click count C, the total download count D, the total page dwell time P and the webpage content update acceleration G. For the total click count C, each single mouse click by a user anywhere on the webpage counts as 1 click, and the default value is 0. For the total download count D, each download operation triggered by a user on the webpage content counts as 1 download, and the default value is 0. The value of the webpage content update acceleration G varies with how quickly or slowly the webpage content update interval changes.
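
One way to hold the S5 parameters and apply the counting rules (one click per mouse click, one download per download operation, both defaulting to 0) is sketched below. The class name `TimeSequenceParams`, its method names, and the use of seconds for dwell time are hypothetical; the update acceleration G is stored as a supplied value because the source does not give a formula for it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimeSequenceParams:
    """Time sequence parameters of one webpage document (S5)."""
    publish_time: datetime            # T1
    update_time: datetime             # T2
    clicks: int = 0                   # C, default 0
    downloads: int = 0                # D, default 0
    dwell_seconds: float = 0.0        # P, unit assumed to be seconds
    update_acceleration: float = 0.0  # G, supplied externally

    def record_click(self) -> None:
        """Each single mouse click anywhere on the page counts as 1 click."""
        self.clicks += 1

    def record_download(self) -> None:
        """Each triggered download of page content counts as 1 download."""
        self.downloads += 1

    def record_visit(self, seconds: float) -> None:
        """Add one visit's dwell time to the running total."""
        self.dwell_seconds += seconds


params = TimeSequenceParams(
    publish_time=datetime(2019, 7, 1),
    update_time=datetime(2019, 7, 10),
)
params.record_click()
params.record_click()
params.record_download()
params.record_visit(12.5)
print(params.clicks, params.downloads, params.dwell_seconds)  # 2 1 12.5
```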

S6: According to the formula

the time sequence of each webpage document is calculated.

S7: Output the webpage documents whose relevance is less than the similarity threshold T to the user in descending order of time sequence.

The retrieval method and system combine the relevance of the retrieval topic with the time sequence of information release and sort the retrieval results by how much the user actually needs them. This improves the search experience of intelligence personnel, places the results the user cares about at the very front, and meets the requirements of high relevance and high timeliness for retrieval results in the field of national defense science and technology intelligence.

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. An information retrieval method, the method comprising:

outputting the webpage documents with the relevance larger than or equal to a similarity threshold, and outputting the webpage documents with the relevance smaller than the similarity threshold in sequence from high to low according to the time sequence;

calculating the time sequence of each webpage document according to the time sequence parameters, which specifically comprises the following steps:

according to the formula:

calculating the time sequence of the $i$-th webpage document, where $1 \le i \le I$, $I$ denotes the number of webpage documents whose relevance is less than the similarity threshold, and $S_i$ denotes the time sequence of the $i$-th webpage document; $d_i$ denotes the total download count of the $i$-th webpage document; $c_i$ denotes the total click count of the $i$-th webpage document; $p_i$ denotes the total page dwell time of the $i$-th webpage document; $t2_i$ denotes the update time of the $i$-th webpage document; $t1_i$ denotes the publishing time of the $i$-th webpage document; $g_i$ denotes the webpage content update acceleration of the $i$-th webpage document;

2. The method according to claim 1, wherein the calculating the relevance of the keyword set to be searched to each of the web documents specifically comprises:

3. The method according to claim 1, wherein outputting the web page document whose relevance is greater than or equal to the similarity threshold specifically includes:

4. An information retrieval system, the system comprising:

the retrieval output module is used for outputting the webpage documents with the relevance larger than or equal to the similarity threshold value and outputting the webpage documents with the relevance smaller than the similarity threshold value from high to low according to the time sequence;

the retrieval output module comprises:

a time sequence calculating unit, configured to calculate a time sequence of each web document according to the time sequence parameter, where the time sequence calculating unit includes:

a timing calculation subunit configured to:

calculating the time sequence of the $i$-th webpage document, where $1 \le i \le I$, $I$ denotes the number of webpage documents whose relevance is less than the similarity threshold, and $S_i$ denotes the time sequence of the $i$-th webpage document; $d_i$ denotes the total download count of the $i$-th webpage document; $c_i$ denotes the total click count of the $i$-th webpage document; $p_i$ denotes the total page dwell time of the $i$-th webpage document; $t2_i$ denotes the update time of the $i$-th webpage document; $t1_i$ denotes the publishing time of the $i$-th webpage document; $g_i$ denotes the webpage content update acceleration of the $i$-th webpage document;

5. The system of claim 4, wherein the correlation computation module comprises:

6. The system of claim 4, wherein the search output module comprises: