CN116933766A

CN116933766A - Ad-hoc information retrieval model based on triple word frequency scheme

Info

Publication number: CN116933766A
Application number: CN202310648601.2A
Authority: CN
Inventors: 陈朝峰; 叶保丹; 郭乃瑄; 曹瑞; 王媛媛; 周锋; 徐森; 王如刚
Original assignee: Yancheng Institute of Technology; Yancheng Institute of Technology Technology Transfer Center Co Ltd
Current assignee: Yancheng Institute of Technology; Yancheng Institute of Technology Technology Transfer Center Co Ltd
Priority date: 2023-06-02
Filing date: 2023-06-02
Publication date: 2023-10-24
Anticipated expiration: 2043-06-02
Also published as: CN116933766B

Abstract

The invention provides an Ad-hoc information retrieval model based on a triple word frequency scheme, which belongs to the technical field of information, and the construction of the retrieval model comprises the following steps: step one, calculating LTF, RTF and PTF ranking functions, and carrying out normalization processing to obtain corresponding representation functions TF1, TF2 and TF3; wherein LTF represents a length regularized term frequency; RTF represents relative term frequency within the document; PTF represents a proximity-based term frequency; and step two, carrying out linear combination on the TF1, TF2 and TF3 functions, and obtaining a retrieval model based on the linear combination result of the TF1, TF2 and TF3 functions. The retrieval models TriATF and TriATF constructed by the invention linearly combine the LTF, RTF and PTF, fully utilize the advantages of word frequencies, realize more efficient document retrieval, and are simple and practical.

Description

Ad-hoc information retrieval model based on triple word frequency scheme

Technical Field

The invention belongs to the technical field of information, and particularly relates to an Ad-hoc information retrieval model based on a triple word frequency scheme.

Background

The core of modern document retrieval models, such as vector space models or probabilistic models, is the use of a document ranking function. Given any query document, its numerical score is calculated using a ranking function. Thus, all documents in a given collection may be correctly ranked. Most of these ranking functions adopt heuristic methods to solve the relationship among factors such as term-frequency (TF), document-frequency (DF), and Document Length (DL), wherein TF generally plays a central role in these functions.

In the vector space-based search model and the probability search model, the document length is directly applied to the normalized TF. Such a model is referred to herein as a document length normalized term frequency (document length normalized term frequency, LTF) model. A parameter is introduced in the ranking function of these models to control DL, thereby applying normalization to TF. However, in the language model, the probability of generating a term in a document is determined by directly using DL as a divisor. Regarding vector space models, it is generally believed that the significance of a term can be better captured when TF is not only normalized by DL, but also relative to the average term frequency limit within a document. Such models are referred to herein as relative document term frequency (relative term frequency, RTF) models. Meanwhile, a term frequency based on proximity is introduced in a framework based on BM25 and language modeling, wherein the associated proximity of two given terms in a document is quantitatively captured by measuring the distance between co-occurrences of the two terms, which is referred to herein as a term frequency based on Proximity (PTF) model.

Essentially, LTF, RTF and PTF provide three different approaches to normalize/modulate TF factors. LTFs and RTFs are term independent and are more suitable for long queries or short queries. The existing effective retrieval models based on vector space frames include MATF, MVD and PDM. And the PTF is a model for inquiring term correlation, calculates the number of times of term co-occurrence in a specified distance and is used for measuring the term relevance. In the prior art, the PTF has been integrated into a model based on BM25 and language modeling framework, and achieves good results. In the vector space framework, PTF is used with a pseudo-related feedback mechanism to measure candidate terms in a feedback document.

Weighting schemes based on word frequency are widely used in modern information retrieval models. Many existing models use document length normalized term frequencies (document length normalized term frequency, LTFs), relative intra-document term frequencies (relative term frequency, RTFs), or proximity-based term frequencies (PTFs) to normalize/adjust term frequency factors to achieve efficient retrieval of documents. However, three adjustment factor methods of LTF, RTF and PTF are not combined together to provide a search model for information retrieval.

If three adjustment factor methods of LTF, RTF and PTF can be combined together, the characteristics of each method are integrated for information retrieval, and the problem of effective retrieval of documents with different lengths, different characteristics and different forms can be solved.

Disclosure of Invention

The invention aims at: providing an Ad-hoc information retrieval model based on a Triple word frequency scheme, wherein the Triple word frequency scheme (Triple estimates-based Term Frequency Scheme, triATF) is a method for linearly combining three regulating factors of LTF, RTF and PTF; the invention combines LTF and RTF with the query length factor, and finally combines with PTF linearly, and constructs a retrieval model in a word frequency weighting mode so as to realize effective retrieval of the document.

The technical content is as follows: an Ad-hoc information retrieval model based on a triple word frequency scheme, wherein the construction of the retrieval model comprises the following steps:

step one, calculating LTF, RTF and PTF ranking functions, and carrying out normalization processing to obtain corresponding representation functions TF1, TF2 and TF3; wherein LTF represents a length regularized term frequency; RTF represents relative term frequency within the document; PTF represents a proximity-based term frequency;

and step two, carrying out linear combination on the TF1, TF2 and TF3 functions, and obtaining a retrieval model based on the linear combination result of the TF1, TF2 and TF3 functions.

Further, in the normalization process in the first step, the normalization function used is f (x) =x/(1+x), where x represents one of LTF, RTF and PTF ranking functions, and f (x) represents the corresponding representation function TF1, TF2 or TF3.

Further, in the first step, the LTF ranking function is normalized to obtain a representation function TF ₁ The operation steps are as follows:

the LTF ranking function is expressed as:

wherein LTF (t, D) represents a length regularized term frequency; d represents a document; t represents a term in the document D; tf (t, D) represents the frequency of the vocabulary t in document D; avdl refers to the average document length in a given document collection C; |d| represents the length of the document;

using the function f (x) =x/(1+x) to normalize LTF (t, D), there is:

TF ₁ (t，D)＝LTF(t，D)/(1+LTF(t，D)) (2)

wherein TF is ₁ (t, D) represents normalized length regularized term frequencies.

Further, in the first step, the RTF ranking function is normalized to obtain a representation function TF ₂ The specific operation is as follows:

the RTF ranking function is expressed as:

wherein RTF (t, D) represents the relative term frequency within the document; d represents a document; t represents a term in the document D; tf (t, D) represents the frequency of the vocabulary t in document D; avgtf (D) represents the average word frequency of document D;

using the function f (x) =x/(1+x) to normalize RTF (t, D), there is:

TF ₂ (t，D)＝RTF(t，D)/(1+RTF(t，D)) (4)

wherein TF is ₂ (t, D) represents normalized relative intra-document term frequency.

Further, in the first step, the PTF ranking function is normalized to obtain a representation function TF ₃ The specific operation is as follows:

the PTF ranking function formula is:

PTF(t，D)＝∑ _{q∈Q，q≠t} ，Prox(t，q，D) (5)

wherein Q represents a query, Q represents one query term in Q; PTF (t, D) represents a proximity-based term frequency; d represents a document; t represents a term in the document D;

prox (t, q, D) represents the proximity between term t and query term q in document D, and the formula is:

wherein dist (t, q, D) is the distance between term t and query term q in document D, σ is a normalization parameter;

using the function f (x) =x/(1+x) to normalize the PTF (t, D), there is:

TF ₃ (t，D)＝RTF(t，D)/(1+RTF(t，D)) (8)

TF ₃ (t, D) represents normalized proximity-based termsFrequency.

Further, prox (t, q, D) is optimized by adding the IDF part of the query term, and the Prox (t, q, D) is optimized;

further, the specific operation steps of the second step are as follows:

step (1), using the query length factor ω=2/[ 1+log2 (1+|q|)]To combine TF ₁ And TF (TF) ₂ The method comprises the following steps of:

TF _(1，2) (t，D)＝(1-ω)·TF ₁ (t，D)+ω·TF ₂ (t，D) (9)

wherein D represents a document; t represents a term in the document D; omega represents a query length factor; TF (TF) ₁ (t, D) represents normalized length regularized term frequencies; TF (TF) ₂ (t, D) represents normalized relative intra-document term frequency;

step (2), then TF is reacted _(1,2) (t, D) and TF ₃ Linear combination, yielding the following results:

TF _tri (t，D)＝(1-λ)·(TF _(1，2) (t，D))+λ·TF ₃ (t，D) (10)

wherein lambda represents TF _(1,2) (t, D) and TF ₃ Weight coefficient of linear combination, value range [0,1]；

And (3) optimizing and improving the IDF by the following formula to obtain the TDF:

wherein IDF (t) represents the inverse document frequency, defined as log2[ (N-N (t) +0.5)/(N (t) +0.5) ], where N is the number of documents in the document collection and N (t) is the number of documents containing t; ctf (t) represents the frequency of t in the literature collection; TDF (t) represents an improved IDF (t);

step (4) combining TF ₁ 、TF ₂ And TF (TF) ₃ Function linearityCombining the result and optimizing the improved inverse document frequency to obtain a retrieval model:

Score(Q，D)＝∑ _t∈Q TF _tri (t，D)·TDF(t) (12)

wherein Score (Q, D) represents the obtained search model; TF (TF) _tri (t, D) represents TF _(1,2) (t, D) and TF ₃ Linear combination; TDF (t) represents an improved IDF (t).

Compared with the prior art, the invention has the beneficial effects that:

(1) The retrieval models TriATF and TriATF constructed by the invention linearly combine the LTF, RTF and PTF, fully utilize the advantages of word frequencies, realize more efficient document retrieval, and are simple and practical.

(2) The search models TriATF and TriATF are also capable of flexibly controlling the contribution degree of LTF, RTF and PTF to the model, and optimizing the search model. By adjusting interpolation parameters between the LTF and the RTF, the overall weights of the LTF and the RTF relative to the PTF, namely parameters applied to the PTF, are controlled, and the contribution weights of three word frequencies to the model are controlled.

Drawings

FIG. 1 shows sensitivity of λ in TriATF and TriATF constructed according to the present invention;

FIG. 2 shows the sensitivity of σ in the TriATF and TriATF constructed according to the present invention.

Detailed Description

The following detailed description of the technical solution of the present invention will be given with reference to the accompanying drawings and specific embodiments.

The invention aims to provide an Ad-hoc information retrieval model based on a triple word frequency scheme, namely a retrieval model constructed based on linear combination of LTF, RTF and PTF, so as to realize effective retrieval of documents. Essentially, LTF, RTF and PTF provide three different approaches to normalize/modulate TF factors. The invention combines LTF, RTF and PTF for the first time, adjusts interpolation parameters between them through linear combination, controls weight parameters between them, and further builds a more effective retrieval model.

In the present invention, let D, C and Q denote a document, a document collection, and a query, respectively (where Q is one query term in Q). Use | indicates their length/size. The term avdl refers to the average document length in a given document collection. Term frequency (tf), further defined as the frequency of query terms q in document D, avgtf (D) is the average term frequency for a given document D, ctf (q) represents the frequency of q in the document collection. The inverse document frequency IDF (q) is defined as log2[ (N-N (q) +0.5)/(N (q) +0.5) ], where N is the number of documents in the document collection and N (q) is the number of documents containing q.

Example 1

An Ad-hoc information retrieval model based on a triple word frequency scheme, wherein the construction of the retrieval model comprises the following steps:

step one, calculating LTF, RTF and PTF ranking functions, and carrying out normalization processing to obtain corresponding representation functions TF ₁ 、TF ₂ And TF (TF) ₃ ；

Wherein LTF represents a length regularized term frequency; RTF represents relative term frequency within the document; PTF represents a proximity-based term frequency;

(1) Normalizing the LTF ranking function to obtain a representation function TF ₁ The calculation steps are as follows:

assuming that the longer the document length, its relevance should be reduced for a given query. The LTF ranking function takes into account the aggregate average document length, and the frequency of a term in the average length document does not change its value in the function.

The LTF ranking function is expressed as:

wherein LTF (t, D) represents a length regularized term frequency; d represents a document; t represents document D

The term of (a); tf (t, D) represents the frequency of the vocabulary t in document D; avdl refers to the average document length in a given document collection C; |d| represents the length of the document;

using the function f (x) =x/(1+x) to normalize LTF (t, D), there is:

TF ₁ (t，D)＝LTF(t，D)/(1+LTF(t，D)) (2)

wherein TF is ₁ (t, D) represents normalized length regularized term frequencies; LTF (t, D) represents a length regularized term frequency; d represents a document; t represents a term in the document D;

(2) Normalizing the RTF ranking function to obtain a representation function TF ₂ The calculation steps are as follows:

the RTF ranking function measures the importance of a term by using the frequency of the term in the document relative to the average term frequency of the document; in the RTF ranking function, the representation function TF is obtained by normalization processing according to the number of unique terms in the document ₂ ；

The RTF ranking function is expressed as:

using the function f (x) =x/(1+x) to normalize RTF (t, D), there is:

TF ₂ (t，D)＝RTF(t，D)/(1+RTF(t，D) (4)

wherein TF is ₂ (t, D) represents normalized relative intra-document term frequency; RTF (t, D) represents the relative term frequency within the document; d represents a document; t represents a term in the document D;

(3) Normalizing the PTF ranking function to obtain a representation function TF ₃ The calculation steps are as follows:

the PTF ranking function represents the frequency of a term and its proximity (i.e., relative location information associated with the query term), and the PTF ranking function formula is:

PTF(t，D)＝∑ _{q∩Q，q≠t} ，Prox(t，q，D) (5)

wherein Q represents a query, Q represents one query term in Q; prox (t, q, D) represents the proximity between term t and query term q in document D; PTF (t, D) represents a proximity-based term frequency; d represents a document; t represents a term in the document D;

the ideal proximity function should be a convex function and satisfy monotonic decreases with respect to the distance between the term and the query term; the following exponential function satisfies the requirements:

wherein dist (t, q, D) is the distance between term t and query term q in document D, σ is a normalization parameter; d represents a document; t represents a term in the document D;

using the function f (x) =x/(1+x) to normalize the PTF (t, D), there is:

TF ₃ (t，D)＝PTF(t，D)/(1+PTF(t，D)) (8)

TF ₃ (t, D) represents the normalized proximity-based term frequency.

Step two, for TF ₁ 、TF ₂ And TF (TF) ₃ And (3) performing linear combination on the functions, and obtaining a retrieval model TriATF based on the linear combination result of the functions of TF1, TF2 and TF3.

The specific operation steps of the second step are as follows:

step (1), in view of the fact that LTF and RTF perform better with different query lengths, using the query length factor ω=2/[ 1+log2 (1+|Q|)]To combine TF ₁ And TF (TF) ₂ The method comprises the following steps of:

TF _(1，2) (t，D)＝(1-ω)·TF ₁ (t，D)+ω·TF ₂ (t，D) (9)

TF _tri (t，D)＝(1-λ)·(TF _(1，2) (t，D))+λ·TF ₃ (t，D) (10)

Obviously, the TF obtained _tri The (t, D) function provides an overall balanced consideration, both term independent and frequency of relevance to the query term.

step (4) combining TF ₁ 、TF ₂ And TF (TF) ₃ And (3) linearly combining the result with the optimized and improved inverse document frequency to obtain a retrieval model, and marking the retrieval model as TriATF:

Score(Q，D)＝∑ _t∈Q TF _tri (t，D）·TDF(t) (12)

wherein Score (Q, D) represents the search model proposed by the present invention; TF (TF) _tri (t, D) represents TF _(1,2) (t, D) and TF ₃ Linear combination; TDF (t) represents an improved IDF (t).

Example 2:

for equation (6) in example 1, the significance of the query term can be further considered by adding the IDF portion of the query term to optimize the function, equation (6) being optimized as:

wherein IDF (Q) represents the inverse document frequency, defined as log2[ (N-N (Q) +0.5)/(N (Q) +0.5) ], where N is the number of documents in document D, N (Q) is the number of documents containing Q, Q is a query term in Q;

when equation (7) is used instead of equation (6), the resulting search model is denoted as TriATF.

Example 3:

the effectiveness of the proposed model of the present invention was evaluated using the official evaluation index of TREC, i.e. mean average accuracy (Mean Average Precision, MAP).

The invention mainly uses 8 public data sets of TREC to test the TriATF and TriATF constructed in the embodiment 1 and the embodiment 2 and the existing retrieval models DLM, BM25 and MATF, CRTER2 to verify the validity of the TriATF and TriATF retrieval models constructed in the invention; the 8 public data sets using TREC are specifically AP88-89, DISK1&2, DISK4&5, ROBUST04, WSJ, W2G, W10G, GOV2. For each dataset, 50 to 250 queries were tested. In all test experiments, the title field of the TREC query was used for retrieval. The unavaluated query is deleted. And carrying out word drying treatment on each term by using a Porter English stem device. And deleting the standard English stop words. Finally, the present invention uses the official evaluation index MAP of TREC to evaluate the effectiveness of the proposed model of the present invention. All statistical tests were based on the Wilcoxon paired symbol rank test.

Table 1 is a MAP value comparing the constructed search models of the present invention, triATF and TriATF, with the existing search models DLM, BM25 and MATF. Wherein the best results are highlighted in bold; the significance level was 0.05 using the Wilcoxon paired symbol rank test. The superscripts "l", "b" and "m" represent DLM, BM25 and MATF, respectively, and the percentage improvement is listed.

TABLE 1

As shown in table 1, the search model tritf (and tritf) performs significantly better in almost all eight sets than the other models from a MAP perspective.

Table 2 is a MAP value comparing the constructed search models of the present invention, triATF and TriATF, with the existing search model CRTER2. Wherein the best results for each column are highlighted in bold, with a significance level of 0.05 using the Wilcoxon paired symbol rank test, highlighted with the diamond symbol "o".

TABLE 2

MAP	AP88-89	DISK1&2	DISK4&5	ROBUST04	WSJ	WT2G	WT10G	GOV2
									CRTER2	0.2934	0.2483	0.2334	0.2585	0.3548	0.3592	0.2207	0.3210
TriATF	0.2978	0.2494	0.2437 ^◇	0.2690 ^◇	0.3596	0.3669	0.2292	0.3327 ^◇
									TriATF ^＊	0.2953	0.2502	0.2434 ^◇	0.2697 ^◇	0.3554	0.3673	0.2278	0.3334 ^◇

Table 2 shows the comparison of the search models TriATF and TriATF with the existing search model CRTER2, which are superior to the existing search model CRTER2 in all sets of consideration from the MAP point of view.

Fig. 1 shows the sensitivity of the tri-atf and λ in tri-atf; fig. 1 plots the MAP values of the proposed models (TriATF and TriATF) over 8 sets as the interpolation coefficient λ varies from 0 to 1. As can be seen from the figure, the search model we propose (TriATF and TriATF x) generally performs well on most collections when lambda is fixed at 0.2. Considering that when λ is equal to 0, the TriATF is equivalent to MATF, it is clear that in all these comparisons, the TriATF is superior to MATF.

Fig. 2 is the sensitivity of σ in tritf and tritf; fig. 2 plots the MAP values for the tri and tri at different σ. The graph shows that the performance of the model proposed by the present invention is very stable over a large part of the sigma range. However, it was observed that the performance of the model was slightly degraded when the σ value was large. Thus, for a new set, the recommended parameter value should be λ 0.2 and σ5 without too much knowledge of the correlation.

The retrieval model TriATF (and variants thereof) presented in the present invention is a linear combination model based on term frequency normalized to document Length (LTF), relative intra-document term frequency (RTF), or proximity-based term frequency (PTF).

The retrieval model TriATF (and variants thereof) of the present invention can accurately reflect the characteristics of TF-based models, and is divided into three normalization/adjustment TF factors, namely term independent and query dependent (LTF and RTF) and those query dependent models (PTF), for the first time, combined together, and the retrieval results are more efficient than retrieval models with a single normalization/adjustment TF factor. In embodiment 3, the validity of the model proposed by the present invention is verified on 8 published TREC datasets, and the model proposed by the present invention is proved to be valid on the pseudo-correlation feedback models (DLM, BM25, MATF, CRTER 2) of the current level on the average accuracy MAP of the evaluation standard mean of the general search system evaluation index, and is most valid for almost all selected queries compared with the best prior art BM25, DLM, MATF, CRTER2, and can be applied to the fields of text search, information search, data mining, etc.

Claims

1. The Ad-hoc information retrieval model based on the triple word frequency scheme is characterized by comprising the following steps of:

2. The Ad-hoc information retrieval model according to claim 1, wherein the normalization function used in the normalization in the first step is f (x) =x/(1+x), where x represents one of LTF, RTF and PTF ranking functions, and f (x) represents the corresponding representation function TF1, TF2 or TF3.

3. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 2, wherein in said step one, the LTF ranking function is normalized to obtain the representation function TF ₁ The operation steps are as follows:

the LTF ranking function is expressed as:

using the function f (x) =x/(1+x) to normalize LTF (t, D), there is:

TF ₁ (t，D)＝LTF(t，D)/(1+LTF(tD)) (2)

4. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 2, wherein in said step one, the RTF ranking function is normalized to obtain the representation function TF ₂ The specific operation is as follows:

the RTF ranking function is expressed as:

using the function f (x) =x/(1+x) to normalize RTF (t, D), there is:

TF ₂ (t，D)＝RTF(t，D)/(1+RTF(tD)) (4)

5. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 2, wherein in said step one, the PTF ranking function is normalized to obtain the representation function TF ₃ The specific operation is as follows:

the PTF ranking function formula is:

PTF(t，D)＝∑ _{q∈Q，q≠t} ，Prox(t，q，D) (5)

using the function f (x) =x/(1+x) to normalize the PTF (t, D), there is:

TF ₃ (t，D)＝PTF(t，D)/(1+PTF(t，D)) (8)

TF ₃ (t, D) represents the normalized proximity-based term frequency.

6. The Ad-hoc information retrieval model based on a triple word frequency scheme according to claim 5, wherein Prox (t, q, D) is optimized by adding IDF part of query term, prox (t, q, D) is optimized;

7. the Ad-hoc information retrieval model based on the triple word frequency scheme according to any one of claims 2-6, wherein the specific operation steps of the step two are as follows:

TF _(1，2) (t，D)＝(1-ω)·TF ₁ (t，D)+ω·TF ₂ (t，D) (9)

TF _tri (t，D)＝(1-λ)·(TF _(1，2) (t，D))+λ·TF ₃ (t，D) (10)

wherein lambda represents TF _(1，2) (t, D) and TF ₃ Weight coefficient of linear combination, value range [0,1]；

step (4) combining TF ₁ 、TF ₂ And TF (TF) ₃ And (3) linearly combining the result with the optimized and improved inverse document frequency to obtain a retrieval model:

Score(Q，D)＝∑ _t∈Q TF _tri (t，D)·TDF(t) (12)