CN116933766A - Ad-hoc information retrieval model based on triple word frequency scheme - Google Patents

Ad-hoc information retrieval model based on triple word frequency scheme

Info

Publication number
CN116933766A
CN116933766A
Authority
CN
China
Prior art keywords
document
term
frequency
ptf
rtf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310648601.2A
Other languages
Chinese (zh)
Other versions
CN116933766B (en
Inventor
陈朝峰
叶保丹
郭乃瑄
曹瑞
王媛媛
周锋
徐森
王如刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Institute of Technology
Yancheng Institute of Technology Technology Transfer Center Co Ltd
Original Assignee
Yancheng Institute of Technology
Yancheng Institute of Technology Technology Transfer Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Institute of Technology, Yancheng Institute of Technology Technology Transfer Center Co Ltd filed Critical Yancheng Institute of Technology
Priority to CN202310648601.2A priority Critical patent/CN116933766B/en
Publication of CN116933766A publication Critical patent/CN116933766A/en
Application granted granted Critical
Publication of CN116933766B publication Critical patent/CN116933766B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an Ad-hoc information retrieval model based on a triple word frequency scheme, belonging to the technical field of information. The construction of the retrieval model comprises the following steps: step one, calculating the LTF, RTF and PTF ranking functions and normalizing them to obtain the corresponding representation functions TF1, TF2 and TF3, wherein LTF denotes the length regularized term frequency, RTF the relative intra-document term frequency, and PTF the proximity-based term frequency; step two, linearly combining the TF1, TF2 and TF3 functions, and obtaining the retrieval model from the result of the linear combination. The retrieval models TriATF and TriATF* constructed by the invention linearly combine LTF, RTF and PTF, fully exploit the advantages of each word frequency scheme, achieve more efficient document retrieval, and are simple and practical.

Description

Ad-hoc information retrieval model based on triple word frequency scheme
Technical Field
The invention belongs to the technical field of information, and particularly relates to an Ad-hoc information retrieval model based on a triple word frequency scheme.
Background
The core of modern document retrieval models, such as vector space models or probabilistic models, is a document ranking function. Given a query, the ranking function computes a numerical score for each document, so that all documents in a given collection can be ranked. Most ranking functions rely on heuristics to balance factors such as term frequency (TF), document frequency (DF), and document length (DL), with TF generally playing the central role.
In vector-space and probabilistic retrieval models, the document length is applied directly to normalize TF. Such a model is referred to herein as a document length normalized term frequency (LTF) model. A parameter is introduced into the ranking function of these models to control the influence of DL, thereby normalizing TF. In language models, by contrast, the probability of generating a term in a document is obtained by using DL directly as a divisor. Regarding vector space models, it is generally believed that the significance of a term is better captured when TF is normalized not only by DL but also relative to the average term frequency within the document. Such models are referred to herein as relative intra-document term frequency (RTF) models. In addition, a proximity-based term frequency has been introduced into BM25-based and language modeling frameworks, where the association between two given terms in a document is quantified by measuring the distance between their co-occurrences; this is referred to herein as the proximity-based term frequency (PTF) model.
Essentially, LTF, RTF and PTF provide three different ways to normalize/adjust the TF factor. LTF and RTF are independent of term co-occurrence, and each is better suited to queries of a particular length. Existing effective retrieval models based on the vector space framework include MATF, MVD and PDM. PTF, in contrast, models query-term association: it counts co-occurrences of terms within a specified distance as a measure of term relatedness. In the prior art, PTF has been integrated into models based on the BM25 and language modeling frameworks with good results; in the vector space framework, PTF is used together with a pseudo-relevance feedback mechanism to weigh candidate terms in feedback documents.
Weighting schemes based on word frequency are widely used in modern information retrieval models. Many existing models use the document length normalized term frequency (LTF), the relative intra-document term frequency (RTF), or the proximity-based term frequency (PTF) to normalize/adjust the term frequency factor and so achieve efficient document retrieval. However, no existing retrieval model combines the three adjustment methods LTF, RTF and PTF together.
If the three adjustment methods LTF, RTF and PTF could be combined, integrating the strengths of each for information retrieval, the problem of effectively retrieving documents of different lengths, characteristics and forms could be solved.
Disclosure of Invention
The invention aims to provide an Ad-hoc information retrieval model based on a triple word frequency scheme (Triple estimates-based Term Frequency Scheme, TriATF), a method that linearly combines the three adjustment factors LTF, RTF and PTF. The invention combines LTF and RTF through a query length factor, then combines the result linearly with PTF, and constructs a retrieval model by term frequency weighting so as to achieve effective document retrieval.
The technical content is as follows: an Ad-hoc information retrieval model based on a triple word frequency scheme, wherein the construction of the retrieval model comprises the following steps:
step one, calculating LTF, RTF and PTF ranking functions, and carrying out normalization processing to obtain corresponding representation functions TF1, TF2 and TF3; wherein LTF represents a length regularized term frequency; RTF represents relative term frequency within the document; PTF represents a proximity-based term frequency;
and step two, carrying out linear combination on the TF1, TF2 and TF3 functions, and obtaining a retrieval model based on the linear combination result of the TF1, TF2 and TF3 functions.
Further, in the normalization process in step one, the normalization function used is f(x) = x/(1+x), where x represents one of the LTF, RTF and PTF ranking functions, and f(x) represents the corresponding representation function TF1, TF2 or TF3.
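The role of the normalization function f(x) = x/(1+x) can be illustrated with a short sketch (a minimal illustration, not part of the patented method itself): it maps any non-negative raw score into [0, 1) while preserving order, which is what allows the three representation functions to be combined linearly on a common scale.

```python
def saturate(x: float) -> float:
    """Normalization function f(x) = x / (1 + x) from the text.

    Maps any raw ranking-function value x >= 0 into [0, 1),
    monotonically, so TF1, TF2 and TF3 share a common scale.
    """
    return x / (1.0 + x)

# Larger raw values saturate toward (but never reach) 1.
print(saturate(0.0), saturate(1.0), saturate(9.0))  # 0.0 0.5 0.9
```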
Further, in step one, the LTF ranking function is normalized to obtain the representation function TF1. The operation steps are as follows:
The LTF ranking function is expressed as:
LTF(t,D) = tf(t,D) · log2(1 + avdl/|D|)    (1)
wherein LTF(t,D) represents the length regularized term frequency; D represents a document; t represents a term in document D; tf(t,D) represents the frequency of term t in document D; avdl refers to the average document length in a given document collection C; |D| represents the length of the document;
using the function f(x) = x/(1+x) to normalize LTF(t,D), there is:
TF1(t,D) = LTF(t,D)/(1 + LTF(t,D))    (2)
wherein TF1(t,D) represents the normalized length regularized term frequency.
Further, in step one, the RTF ranking function is normalized to obtain the representation function TF2. The specific operation is as follows:
The RTF ranking function is expressed as:
RTF(t,D) = log2(1 + tf(t,D)) / log2(1 + avgtf(D))    (3)
wherein RTF(t,D) represents the relative intra-document term frequency; D represents a document; t represents a term in document D; tf(t,D) represents the frequency of term t in document D; avgtf(D) represents the average term frequency of document D;
using the function f(x) = x/(1+x) to normalize RTF(t,D), there is:
TF2(t,D) = RTF(t,D)/(1 + RTF(t,D))    (4)
wherein TF2(t,D) represents the normalized relative intra-document term frequency.
Further, in step one, the PTF ranking function is normalized to obtain the representation function TF3. The specific operation is as follows:
The PTF ranking function formula is:
PTF(t,D) = Σ_{q∈Q, q≠t} Prox(t,q,D)    (5)
wherein Q represents a query and q represents one query term in Q; PTF(t,D) represents the proximity-based term frequency; D represents a document; t represents a term in document D;
Prox(t,q,D) represents the proximity between term t and query term q in document D, and its formula is:
Prox(t,q,D) = exp(-dist(t,q,D)/σ)    (6)
wherein dist(t,q,D) is the distance between term t and query term q in document D, and σ is a normalization parameter;
using the function f(x) = x/(1+x) to normalize PTF(t,D), there is:
TF3(t,D) = PTF(t,D)/(1 + PTF(t,D))    (8)
wherein TF3(t,D) represents the normalized proximity-based term frequency.
Further, Prox(t,q,D) is optimized by adding the IDF part of the query term.
further, the specific operation steps of the second step are as follows:
step (1), using the query length factor ω=2/[ 1+log2 (1+|q|)]To combine TF 1 And TF (TF) 2 The method comprises the following steps of:
TF (1,2) (t,D)=(1-ω)·TF 1 (t,D)+ω·TF 2 (t,D) (9)
wherein D represents a document; t represents a term in the document D; omega represents a query length factor; TF (TF) 1 (t, D) represents normalized length regularized term frequencies; TF (TF) 2 (t, D) represents normalized relative intra-document term frequency;
step (2), then TF is reacted (1,2) (t, D) and TF 3 Linear combination, yielding the following results:
TF tri (t,D)=(1-λ)·(TF (1,2) (t,D))+λ·TF 3 (t,D) (10)
wherein lambda represents TF (1,2) (t, D) and TF 3 Weight coefficient of linear combination, value range [0,1];
And (3) optimizing and improving the IDF by the following formula to obtain the TDF:
wherein IDF (t) represents the inverse document frequency, defined as log2[ (N-N (t) +0.5)/(N (t) +0.5) ], where N is the number of documents in the document collection and N (t) is the number of documents containing t; ctf (t) represents the frequency of t in the literature collection; TDF (t) represents an improved IDF (t);
step (4) combining TF 1 、TF 2 And TF (TF) 3 Function linearityCombining the result and optimizing the improved inverse document frequency to obtain a retrieval model:
Score(Q,D)=∑ t∈Q TF tri (t,D)·TDF(t) (12)
wherein Score (Q, D) represents the obtained search model; TF (TF) tri (t, D) represents TF (1,2) (t, D) and TF 3 Linear combination; TDF (t) represents an improved IDF (t).
Compared with the prior art, the invention has the beneficial effects that:
(1) The retrieval models TriATF and TriATF* constructed by the invention linearly combine LTF, RTF and PTF, fully exploit the advantages of each word frequency, achieve more efficient document retrieval, and are simple and practical.
(2) The retrieval models TriATF and TriATF* can also flexibly control the contribution of LTF, RTF and PTF to the model, thereby optimizing it: the interpolation parameter between LTF and RTF, together with the overall weight of LTF and RTF relative to PTF (i.e. the parameter applied to PTF), controls the contribution weights of the three word frequencies.
Drawings
FIG. 1 shows the sensitivity of λ in the TriATF and TriATF* models constructed according to the present invention;
FIG. 2 shows the sensitivity of σ in the TriATF and TriATF* models constructed according to the present invention.
Detailed Description
The following detailed description of the technical solution of the present invention will be given with reference to the accompanying drawings and specific embodiments.
The invention aims to provide an Ad-hoc information retrieval model based on a triple word frequency scheme, i.e. a retrieval model constructed from a linear combination of LTF, RTF and PTF, so as to achieve effective document retrieval. Essentially, LTF, RTF and PTF provide three different ways to normalize/adjust the TF factor. The invention combines LTF, RTF and PTF for the first time, adjusting the interpolation parameters between them through linear combination and controlling their relative weights, thereby constructing a more effective retrieval model.
In the present invention, let D, C and Q denote a document, a document collection, and a query, respectively (with q one query term in Q). |·| denotes their length/size; avdl refers to the average document length in the given document collection. tf(q,D) denotes the frequency of query term q in document D; avgtf(D) is the average term frequency of a given document D; ctf(q) represents the frequency of q in the document collection. The inverse document frequency IDF(q) is defined as log2[(N - n(q) + 0.5)/(n(q) + 0.5)], where N is the number of documents in the document collection and n(q) is the number of documents containing q.
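The notation above can be sketched directly; for example, the stated IDF definition computes as follows (a minimal sketch of the formula as defined in the text, with illustrative counts):

```python
import math

def idf(n_q: int, N: int) -> float:
    """IDF(q) = log2[(N - n(q) + 0.5) / (n(q) + 0.5)], as defined above.

    N   -- number of documents in the collection
    n_q -- number of documents containing query term q
    """
    return math.log2((N - n_q + 0.5) / (n_q + 0.5))

# A rare term receives a higher IDF than a common one.
print(idf(5, 10000) > idf(5000, 10000))  # True
```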
Example 1
An Ad-hoc information retrieval model based on a triple word frequency scheme, wherein the construction of the retrieval model comprises the following steps:
Step one: calculate the LTF, RTF and PTF ranking functions, and normalize them to obtain the corresponding representation functions TF1, TF2 and TF3;
wherein LTF represents the length regularized term frequency; RTF represents the relative intra-document term frequency; PTF represents the proximity-based term frequency.
(1) Normalize the LTF ranking function to obtain the representation function TF1. The calculation steps are as follows:
It is assumed that, for a given query, the longer a document is, the more its relevance should be reduced. The LTF ranking function takes the collection's average document length into account: the frequency of a term in a document of average length does not change its value under the function.
The LTF ranking function is expressed as:
LTF(t,D) = tf(t,D) · log2(1 + avdl/|D|)    (1)
wherein LTF(t,D) represents the length regularized term frequency; D represents a document; t represents a term in document D; tf(t,D) represents the frequency of term t in document D; avdl refers to the average document length in a given document collection C; |D| represents the length of the document;
using the function f(x) = x/(1+x) to normalize LTF(t,D), there is:
TF1(t,D) = LTF(t,D)/(1 + LTF(t,D))    (2)
wherein TF1(t,D) represents the normalized length regularized term frequency; LTF(t,D) represents the length regularized term frequency; D represents a document; t represents a term in document D.
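Equations (1) and (2) can be sketched as follows. The exact LTF form, tf · log2(1 + avdl/|D|), is an assumption consistent with the stated property that a term in an average-length document keeps its raw frequency; the function and variable names are illustrative:

```python
import math

def tf1(tf_td: float, doc_len: float, avdl: float) -> float:
    """TF1(t,D): length regularized TF, then saturated into [0, 1)."""
    ltf = tf_td * math.log2(1.0 + avdl / doc_len)  # equation (1)
    return ltf / (1.0 + ltf)                       # equation (2)

# For a document of exactly average length, LTF(t,D) = tf(t,D),
# so a term occurring 3 times yields TF1 = 3 / (1 + 3) = 0.75.
print(tf1(3.0, 100.0, 100.0))  # 0.75
```

Note how the same raw frequency is damped in a longer-than-average document and boosted in a shorter one, matching the length-penalty assumption above.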
(2) Normalize the RTF ranking function to obtain the representation function TF2. The calculation steps are as follows:
The RTF ranking function measures the importance of a term by the frequency of the term in the document relative to the document's average term frequency; in the RTF ranking function, the representation function TF2 is obtained by normalizing with respect to the number of unique terms in the document.
The RTF ranking function is expressed as:
RTF(t,D) = log2(1 + tf(t,D)) / log2(1 + avgtf(D))    (3)
wherein RTF(t,D) represents the relative intra-document term frequency; D represents a document; t represents a term in document D; tf(t,D) represents the frequency of term t in document D; avgtf(D) represents the average term frequency of document D;
using the function f(x) = x/(1+x) to normalize RTF(t,D), there is:
TF2(t,D) = RTF(t,D)/(1 + RTF(t,D))    (4)
wherein TF2(t,D) represents the normalized relative intra-document term frequency; RTF(t,D) represents the relative intra-document term frequency; D represents a document; t represents a term in document D.
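Equations (3) and (4) can likewise be sketched. The logarithmic RTF form is an assumption consistent with the description (term frequency measured relative to the document's average term frequency); the names are illustrative:

```python
import math

def tf2(tf_td: float, avgtf_d: float) -> float:
    """TF2(t,D): relative intra-document TF, then saturated into [0, 1)."""
    rtf = math.log2(1.0 + tf_td) / math.log2(1.0 + avgtf_d)  # equation (3)
    return rtf / (1.0 + rtf)                                 # equation (4)

# A term occurring exactly at the document's average term frequency has
# RTF = 1 and hence TF2 = 0.5, regardless of document length.
print(tf2(4.0, 4.0))  # 0.5
```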
(3) Normalize the PTF ranking function to obtain the representation function TF3. The calculation steps are as follows:
The PTF ranking function reflects the frequency of a term together with its proximity (i.e. relative position information with respect to the query terms). The PTF ranking function formula is:
PTF(t,D) = Σ_{q∈Q, q≠t} Prox(t,q,D)    (5)
wherein Q represents a query and q represents one query term in Q; Prox(t,q,D) represents the proximity between term t and query term q in document D; PTF(t,D) represents the proximity-based term frequency; D represents a document; t represents a term in document D.
The ideal proximity function should be convex and monotonically decreasing with respect to the distance between the term and the query term; the following exponential function satisfies these requirements:
Prox(t,q,D) = exp(-dist(t,q,D)/σ)    (6)
wherein dist(t,q,D) is the distance between term t and query term q in document D, and σ is a normalization parameter; D represents a document; t represents a term in document D;
using the function f(x) = x/(1+x) to normalize PTF(t,D), there is:
TF3(t,D) = PTF(t,D)/(1 + PTF(t,D))    (8)
wherein TF3(t,D) represents the normalized proximity-based term frequency.
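A sketch of equations (5), (6) and (8) follows. Both the exponential kernel used for Prox (the source equation is not fully legible; exp(-dist/σ) is a convex, monotonically decreasing choice as the text requires) and taking dist(t,q,D) as the minimum positional gap between occurrences are assumptions for illustration:

```python
import math

def prox(dist: float, sigma: float) -> float:
    """Assumed proximity kernel: convex, monotonically decreasing in distance."""
    return math.exp(-dist / sigma)

def tf3(t_positions, q_positions_by_term, sigma=5.0):
    """TF3(t,D): sum proximities over the other query terms, then saturate.

    t_positions         -- positions of term t in document D
    q_positions_by_term -- {query term q (q != t): positions of q in D}
    dist is taken here as the minimum positional gap (an assumption).
    """
    ptf = 0.0
    for positions in q_positions_by_term.values():
        d = min(abs(p - r) for p in t_positions for r in positions)
        ptf += prox(d, sigma)                     # equation (5)
    return ptf / (1.0 + ptf)                      # equation (8)

# One query term adjacent to t (distance 1) with sigma = 5:
score = tf3([10], {"other": [11]}, sigma=5.0)
print(0.0 < score < 1.0)  # True
```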
Step two: linearly combine the TF1, TF2 and TF3 functions, and obtain the retrieval model TriATF based on the result of the linear combination of the TF1, TF2 and TF3 functions.
The specific operation steps of step two are as follows:
Step (1): in view of the fact that LTF and RTF each perform better at different query lengths, the query length factor ω = 2/[1 + log2(1 + |Q|)] is used to combine TF1 and TF2 as follows:
TF(1,2)(t,D) = (1-ω)·TF1(t,D) + ω·TF2(t,D)    (9)
wherein D represents a document; t represents a term in document D; ω represents the query length factor; TF1(t,D) represents the normalized length regularized term frequency; TF2(t,D) represents the normalized relative intra-document term frequency;
Step (2): TF(1,2)(t,D) and TF3(t,D) are then linearly combined, yielding:
TFtri(t,D) = (1-λ)·TF(1,2)(t,D) + λ·TF3(t,D)    (10)
wherein λ represents the weight coefficient of the linear combination of TF(1,2)(t,D) and TF3(t,D), with value range [0,1].
Clearly, the resulting TFtri(t,D) function provides an overall balanced weighting, accounting both for term-independent frequency information and for the frequency of association with the query terms.
Step (3): the IDF is optimized and improved by the following formula to obtain the TDF:
TDF(t) = IDF(t) · AEF(t)/(1 + AEF(t)),  where AEF(t) = ctf(t)/n(t)    (11)
wherein IDF(t) represents the inverse document frequency, defined as log2[(N - n(t) + 0.5)/(n(t) + 0.5)], where N is the number of documents in the document collection and n(t) is the number of documents containing t; ctf(t) represents the frequency of t in the document collection; TDF(t) represents the improved IDF(t);
Step (4): combining the linear combination result of the TF1, TF2 and TF3 functions with the optimized and improved inverse document frequency yields the retrieval model, denoted TriATF:
Score(Q,D) = Σ_{t∈Q} TFtri(t,D) · TDF(t)    (12)
wherein Score(Q,D) represents the retrieval model proposed by the invention; TFtri(t,D) represents the linear combination of TF(1,2)(t,D) and TF3(t,D); TDF(t) represents the improved IDF(t).
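The Example 1 pipeline, equations (9) through (12), can be sketched end to end. The TDF form used here (IDF scaled by an average elite-set frequency AEF = ctf(t)/n(t)) is an assumption in the style of the MATF model, since the source equation is not legible; all function names and inputs are illustrative:

```python
import math

def tf_tri(tf1_v, tf2_v, tf3_v, query_len, lam=0.2):
    """Equations (9)-(10): combine TF1/TF2 by query length, then add TF3."""
    omega = 2.0 / (1.0 + math.log2(1.0 + query_len))   # query length factor
    tf12 = (1.0 - omega) * tf1_v + omega * tf2_v       # equation (9)
    return (1.0 - lam) * tf12 + lam * tf3_v            # equation (10)

def tdf(n_t, N, ctf_t):
    """Assumed equation (11): IDF boosted by AEF(t) = ctf(t)/n(t)."""
    idf = math.log2((N - n_t + 0.5) / (n_t + 0.5))
    aef = ctf_t / n_t
    return idf * aef / (1.0 + aef)

def score(query_terms, doc_tfs, coll_stats, N, lam=0.2):
    """Equation (12): Score(Q,D) = sum over t in Q of TFtri(t,D) * TDF(t).

    doc_tfs    -- {t: (TF1, TF2, TF3) for this document}
    coll_stats -- {t: (n(t), ctf(t))}
    """
    q_len = len(query_terms)
    total = 0.0
    for t in query_terms:
        t1, t2, t3 = doc_tfs.get(t, (0.0, 0.0, 0.0))
        n_t, ctf_t = coll_stats[t]
        total += tf_tri(t1, t2, t3, q_len, lam) * tdf(n_t, N, ctf_t)
    return total
```

Note that for a single-term query ω = 2/(1 + log2 2) = 1, so TF(1,2) reduces entirely to the TF2 (RTF) component, illustrating how the query length factor shifts weight between the two schemes.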
Example 2:
For equation (6) in Example 1, the significance of the query term can be further taken into account by adding the IDF part of the query term, optimizing the function. Equation (6) is optimized as:
Prox*(t,q,D) = IDF(q) · exp(-dist(t,q,D)/σ)    (7)
wherein IDF(q) represents the inverse document frequency, defined as log2[(N - n(q) + 0.5)/(n(q) + 0.5)], where N is the number of documents in the document collection, n(q) is the number of documents containing q, and q is a query term in Q.
When equation (7) is used instead of equation (6), the resulting retrieval model is denoted TriATF*.
Example 3:
the effectiveness of the proposed model of the present invention was evaluated using the official evaluation index of TREC, i.e. mean average accuracy (Mean Average Precision, MAP).
The invention mainly uses 8 public data sets of TREC to test the TriATF and TriATF constructed in the embodiment 1 and the embodiment 2 and the existing retrieval models DLM, BM25 and MATF, CRTER2 to verify the validity of the TriATF and TriATF retrieval models constructed in the invention; the 8 public data sets using TREC are specifically AP88-89, DISK1&2, DISK4&5, ROBUST04, WSJ, W2G, W10G, GOV2. For each dataset, 50 to 250 queries were tested. In all test experiments, the title field of the TREC query was used for retrieval. The unavaluated query is deleted. And carrying out word drying treatment on each term by using a Porter English stem device. And deleting the standard English stop words. Finally, the present invention uses the official evaluation index MAP of TREC to evaluate the effectiveness of the proposed model of the present invention. All statistical tests were based on the Wilcoxon paired symbol rank test.
Table 1 compares the MAP values of the constructed retrieval models TriATF and TriATF* with those of the existing retrieval models DLM, BM25 and MATF. The best results are highlighted in bold; the significance level is 0.05 under the Wilcoxon paired signed rank test. The superscripts "l", "b" and "m" denote DLM, BM25 and MATF, respectively, and the percentage improvement is listed.
TABLE 1
As shown in Table 1, from the MAP perspective the retrieval models TriATF (and TriATF*) perform significantly better than the other models on almost all eight collections.
Table 2 compares the MAP values of the constructed retrieval models TriATF and TriATF* with those of the existing retrieval model CRTER2. The best result in each column is highlighted in bold; statistically significant differences at the 0.05 level under the Wilcoxon paired signed rank test are marked with a diamond symbol.
TABLE 2

MAP       AP88-89   DISK1&2   DISK4&5   ROBUST04   WSJ      WT2G     WT10G    GOV2
CRTER2    0.2934    0.2483    0.2334    0.2585     0.3548   0.3592   0.2207   0.3210
TriATF    0.2978    0.2494    0.2437    0.2690     0.3596   0.3669   0.2292   0.3327
TriATF*   0.2953    0.2502    0.2434    0.2697     0.3554   0.3673   0.2278   0.3334
Table 2 shows the comparison of the retrieval models TriATF and TriATF* with the existing retrieval model CRTER2; from the MAP perspective, they outperform CRTER2 on all collections considered.
Fig. 1 shows the sensitivity of λ in TriATF and TriATF*: it plots the MAP values of the proposed models over the 8 collections as the interpolation coefficient λ varies from 0 to 1. As can be seen from the figure, the proposed retrieval models (TriATF and TriATF*) generally perform well on most collections when λ is fixed at 0.2. Considering that TriATF is equivalent to MATF when λ equals 0, it is clear that TriATF outperforms MATF in all these comparisons.
Fig. 2 shows the sensitivity of σ in TriATF and TriATF*: it plots the MAP values of TriATF and TriATF* at different values of σ. The graph shows that the performance of the proposed models is very stable over most of the σ range, although performance degrades slightly when σ is large. Thus, for a new collection about which little is known, the recommended parameter values are λ = 0.2 and σ = 5.
The retrieval model TriATF (and its variant TriATF*) presented in the invention is a linear combination model based on the document length normalized term frequency (LTF), the relative intra-document term frequency (RTF), and the proximity-based term frequency (PTF).
The retrieval model TriATF (and its variant) accurately reflects the characteristics of TF-based models: for the first time, the three kinds of normalized/adjusted TF factors, namely the factors independent of other query terms (LTF and RTF) and the query-proximity-dependent factor (PTF), are combined together, and the retrieval results are more effective than those of retrieval models with a single normalized/adjusted TF factor. In Example 3, the validity of the proposed models is verified on 8 published TREC datasets: measured by the standard evaluation metric MAP, they outperform the state-of-the-art baselines (DLM, BM25, MATF, CRTER2) on almost all selected queries, and can be applied to fields such as text search, information retrieval, and data mining.

Claims (7)

1. An Ad-hoc information retrieval model based on a triple word frequency scheme, characterized in that the construction of the retrieval model comprises the following steps:
step one, calculating LTF, RTF and PTF ranking functions, and carrying out normalization processing to obtain corresponding representation functions TF1, TF2 and TF3; wherein LTF represents a length regularized term frequency; RTF represents relative term frequency within the document; PTF represents a proximity-based term frequency;
and step two, carrying out linear combination on the TF1, TF2 and TF3 functions, and obtaining a retrieval model based on the linear combination result of the TF1, TF2 and TF3 functions.
2. The Ad-hoc information retrieval model according to claim 1, wherein the normalization function used in the normalization in the first step is f (x) =x/(1+x), where x represents one of LTF, RTF and PTF ranking functions, and f (x) represents the corresponding representation function TF1, TF2 or TF3.
3. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 2, wherein in step one the LTF ranking function is normalized to obtain the representation function TF1, the operation steps being as follows:
the LTF ranking function is expressed as:
LTF(t,D) = tf(t,D) · log2(1 + avdl/|D|)    (1)
wherein LTF(t,D) represents the length regularized term frequency; D represents a document; t represents a term in document D; tf(t,D) represents the frequency of term t in document D; avdl refers to the average document length in a given document collection C; |D| represents the length of the document;
using the function f(x) = x/(1+x) to normalize LTF(t,D), there is:
TF1(t,D) = LTF(t,D)/(1 + LTF(t,D))    (2)
wherein TF1(t,D) represents the normalized length regularized term frequency.
4. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 2, wherein in step one the RTF ranking function is normalized to obtain the representation function TF2, the specific operation being as follows:
the RTF ranking function is expressed as:
RTF(t,D) = log2(1 + tf(t,D)) / log2(1 + avgtf(D))    (3)
wherein RTF(t,D) represents the relative intra-document term frequency; D represents a document; t represents a term in document D; tf(t,D) represents the frequency of term t in document D; avgtf(D) represents the average term frequency of document D;
using the function f(x) = x/(1+x) to normalize RTF(t,D), there is:
TF2(t,D) = RTF(t,D)/(1 + RTF(t,D))    (4)
wherein TF2(t,D) represents the normalized relative intra-document term frequency.
5. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 2, wherein in step one the PTF ranking function is normalized to obtain the representation function TF3, the specific operation being as follows:
the PTF ranking function formula is:
PTF(t,D) = Σ_{q∈Q, q≠t} Prox(t,q,D)    (5)
wherein Q represents a query and q represents one query term in Q; PTF(t,D) represents the proximity-based term frequency; D represents a document; t represents a term in document D;
Prox(t,q,D) represents the proximity between term t and query term q in document D, and its formula is:
Prox(t,q,D) = exp(-dist(t,q,D)/σ)    (6)
wherein dist(t,q,D) is the distance between term t and query term q in document D, and σ is a normalization parameter;
using the function f(x) = x/(1+x) to normalize PTF(t,D), there is:
TF3(t,D) = PTF(t,D)/(1 + PTF(t,D))    (8)
wherein TF3(t,D) represents the normalized proximity-based term frequency.
6. The Ad-hoc information retrieval model based on the triple word frequency scheme according to claim 5, wherein Prox(t,q,D) is optimized by adding the IDF part of the query term.
7. The Ad-hoc information retrieval model based on the triple word frequency scheme according to any one of claims 2-6, wherein the specific operation steps of the step two are as follows:
step (1), using the query length factor ω = 2/(1 + log2(1 + |Q|)) to combine TF1 and TF2:
TF(1,2)(t, D) = (1 − ω)·TF1(t, D) + ω·TF2(t, D)    (9)
wherein D represents a document; t represents a term in the document D; ω represents the query length factor; TF1(t, D) represents the normalized length-regularized term frequency; TF2(t, D) represents the normalized relative intra-document term frequency;
step (2), TF(1,2)(t, D) and TF3(t, D) are then linearly combined, yielding:
TFtri(t, D) = (1 − λ)·TF(1,2)(t, D) + λ·TF3(t, D)    (10)
wherein λ represents the weight coefficient of the linear combination of TF(1,2)(t, D) and TF3(t, D), with value range [0, 1];
step (3), the IDF is optimized and improved by the following formula to obtain the TDF,
wherein IDF(t) represents the inverse document frequency, defined as log2[(N − n(t) + 0.5)/(n(t) + 0.5)], where N is the number of documents in the document collection and n(t) is the number of documents containing t; ctf(t) represents the frequency of t in the document collection; TDF(t) represents the improved IDF(t);
step (4), the result of linearly combining TF1, TF2 and TF3 is combined with the optimized and improved inverse document frequency to obtain the retrieval model:
Score(Q, D) = ∑_{t∈Q} TFtri(t, D)·TDF(t)    (12)
wherein Score(Q, D) represents the relevance score assigned to document D for query Q by the obtained retrieval model; TFtri(t, D) represents the linear combination of TF(1,2)(t, D) and TF3(t, D); TDF(t) represents the improved IDF(t).
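An illustrative end-to-end Python sketch of the combination in step two. TF1 (the length-regularized term frequency of claim 3) and the exact TDF formula are not reproduced in this text, so the sketch takes the three normalized frequencies and the TDF values as precomputed inputs supplied by the caller; the default λ and the example numbers are hypothetical:

```python
import math

def query_length_factor(query_len: int) -> float:
    """omega = 2 / (1 + log2(1 + |Q|)) from step (1)."""
    return 2.0 / (1.0 + math.log2(1.0 + query_len))

def score(query, tf1, tf2, tf3, tdf, lam=0.5):
    """Score(Q, D) = sum over t in Q of TFtri(t, D) * TDF(t), eq. (12).

    tf1/tf2/tf3 map each term to its normalized frequency component;
    tdf maps each term to the improved inverse document frequency TDF(t)
    (whose exact formula is not shown here); lam is the weight λ of eq. (10)."""
    w = query_length_factor(len(query))
    total = 0.0
    for t in query:
        tf12 = (1.0 - w) * tf1.get(t, 0.0) + w * tf2.get(t, 0.0)  # eq. (9)
        tftri = (1.0 - lam) * tf12 + lam * tf3.get(t, 0.0)        # eq. (10)
        total += tftri * tdf.get(t, 0.0)                          # eq. (12)
    return total

# Hypothetical single-term query: omega = 2 / (1 + log2(2)) = 1.0
s = score(["retrieval"], {"retrieval": 0.5}, {"retrieval": 0.6},
          {"retrieval": 0.2}, {"retrieval": 2.0})
print(s)  # (0.5*0.6 + 0.5*0.2) * 2.0 ≈ 0.8
```

Note that ω depends only on the query length, so short queries lean on TF2 while longer queries shift weight toward TF1, which is the stated purpose of the query length factor.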
CN202310648601.2A 2023-06-02 2023-06-02 Ad-hoc information retrieval model based on triple word frequency scheme Active CN116933766B (en)

Publications (2)

Publication Number Publication Date
CN116933766A true CN116933766A (en) 2023-10-24
CN116933766B 2024-08-16

Family

ID=88388515

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000033215A1 (en) * 1998-11-30 2000-06-08 Justsystem Corporation Term-length term-frequency method for measuring document similarity and classifying text
CN107247745A (en) * 2017-05-23 2017-10-13 华中师范大学 A kind of information retrieval method and system based on pseudo-linear filter model
CN108520033A (en) * 2018-03-28 2018-09-11 华中师范大学 Enhancing pseudo-linear filter model information search method based on superspace simulation language
CN110008407A (en) * 2019-04-09 2019-07-12 苏州浪潮智能科技有限公司 A kind of information retrieval method and device
CN111444414A (en) * 2019-09-23 2020-07-24 天津大学 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task
CN113761890A (en) * 2021-08-17 2021-12-07 汕头市同行网络科技有限公司 BERT context sensing-based multi-level semantic information retrieval method
CN114840639A (en) * 2022-04-12 2022-08-02 杭州电子科技大学 ConceptNet-based information retrieval query expansion method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BU, Zhiqiong et al.: "Research on Ad hoc Information Retrieval Method Based on LDA Model", Application Research of Computers, vol. 32, no. 5, 31 May 2015 (2015-05-31) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant