CN110619036A - Full-text retrieval system based on improved IF-IDF algorithm - Google Patents

Full-text retrieval system based on improved IF-IDF algorithm

Info

Publication number
CN110619036A
Authority
CN
China
Prior art keywords
document
index
word
idf
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910787265.3A
Other languages
Chinese (zh)
Other versions
CN110619036B (en)
Inventor
俞佳慧 (Yu Jiahui)
何新 (He Xin)
马轩 (Ma Xuan)
姜楠 (Jiang Nan)
王子龙 (Wang Zilong)
黄炎焱 (Huang Yanyan)
项凯南 (Xiang Kainan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN201910787265.3A priority Critical patent/CN110619036B/en
Publication of CN110619036A publication Critical patent/CN110619036A/en
Application granted granted Critical
Publication of CN110619036B publication Critical patent/CN110619036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/34: Browsing; Visualisation therefor
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0623: Item investigation
    • G06Q30/0625: Directed, with specific intent or strategy
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a full-text retrieval system based on an improved TF-IDF algorithm. The system supports two types of search, commodity search and merchant search, giving users varied search modes; it adopts the IKAnalyzer word segmenter, which segments quickly and has good overall performance; it uses an indexer whose weights can be set according to business requirements, making it more user-friendly; and retrieval is performed with the improved TF-IDF algorithm, which performs well and achieves high accuracy.

Description

Full-text retrieval system based on improved TF-IDF algorithm
Technical Field
The invention relates to the technical field of full-text retrieval, and in particular to a full-text retrieval system based on an improved TF-IDF algorithm.
Background
Faced with unstructured data in diverse formats, and especially at large data volumes, traditional search methods consume a great deal of time; full-text retrieval technology was developed to address this. Full-text retrieval rests on full-text retrieval theory: for unstructured data of various formats, the data source is reorganized so that it has a certain structure before it is searched, which improves search speed.
Lucene is an open-source full-text search engine toolkit provided and supported by the Apache Software Foundation's Jakarta project group. It is not a complete full-text search engine but a full-text search engine architecture, providing a complete query engine, an index engine, and a partial text-analysis engine.
A similarity model is used to score the search results that satisfy the query conditions; this is a key link in a full-text retrieval system. Each result is scored for its degree of similarity to the query condition according to the similarity model and preset constraints, and the results are sorted by score so that those the user most expects to see are returned first.
Lucene uses a Vector Space Model (VSM). The vector space model reduces data-similarity computation to vector operations in a vector space: the similarity of two pieces of data in each attribute dimension is expressed as spatial similarity, which makes the model intuitive and easy to understand. The model typically includes the following main attributes:
Term: the vector space model divides a data document into N index items through word segmentation, each represented by a Term. For an e-commerce system, the index items of a record, after its indexed fields are segmented, can be written Term_1, Term_2, …, Term_n, abbreviated T_i.
Document: a Document represents one document, so each data record after segmentation can be written as the space vector Document = {Term_1, Term_2, …, Term_n}, abbreviated D_i.
Query: the user's query condition, which after word segmentation can be written as the space vector Query = {Term_a, Term_b, Term_c}, abbreviated Q.
W: each Term has a corresponding weight W, so each space vector has a corresponding weight vector W_{D_i} = {W_1, W_2, …, W_n}, and the query has Q = {W_a, W_b, W_c}.
Cos(Q, D_i): the cosine distance between two space vectors, which the vector space model uses to represent their degree of similarity. Computing cosine values between the query vector and the document vectors yields a set {Cos(Q, D_1), Cos(Q, D_2), …, Cos(Q, D_n)}; the larger a cosine value in this set, the higher the relevance of the corresponding document to the query condition, so returning the documents with the highest similarity yields the results that match the user's expectation.
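The cosine comparison Cos(Q, D_i) described above can be sketched in a few lines of Java; the weight vectors here are illustrative stand-ins, not values produced by any real scorer.

```java
// Minimal sketch of the vector space model's cosine score Cos(Q, Di):
// the query and each document are weight vectors over a shared term set.
public class CosineSim {
    // Cosine of the angle between two equal-length weight vectors.
    public static double cosine(double[] q, double[] d) {
        double dot = 0, nq = 0, nd = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            nq  += q[i] * q[i];
            nd  += d[i] * d[i];
        }
        if (nq == 0 || nd == 0) return 0; // a zero vector matches nothing
        return dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }

    public static void main(String[] args) {
        double[] q  = {1, 1, 0};  // query weights over terms {t1, t2, t3}
        double[] d1 = {2, 2, 0};  // same direction as q, so cosine is 1
        double[] d2 = {0, 0, 3};  // orthogonal to q, so cosine is 0
        System.out.println(cosine(q, d1)); // ranked first
        System.out.println(cosine(q, d2));
    }
}
```

Documents are then returned in descending order of this cosine value, exactly as the set {Cos(Q, D_1), …, Cos(Q, D_n)} is used above.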
Lucene's similarity model is based on the existing TF-IDF algorithm with additional influence factors added. It applies to most scenarios, but some problems remain:
1) Search accuracy is reduced because the position information of entries in documents is not considered. On the one hand, different index domains of a document have different importance: for example, the commodity name should have a higher search priority, and giving the special-product category too high a priority reduces the discrimination between documents; the algorithm does not consider the influence of the index domain in which a keyword appears on the score. On the other hand, and more importantly, after a document passes through the segmenter's stop-word operation, the semantic information of the resulting new keyword may change, producing scores other than those expected; the algorithm does not consider the influence of the stop-word operation on document scores.
2) Reliability is insufficient because the relevance between entries and documents is not considered. Keywords are related to one another; terms such as "pastry" frequently co-occur with closely related terms. Since the similarity formula in Lucene is based on the cosine-similarity formula of the vector space model, it makes an implicit assumption about the relevance of a query term and a document, namely that relevance tracks the similarity between the two. This assumption rests only on experience and lacks the support of a specific theoretical model, which affects the accuracy and reliability of the ranking of search results.
Disclosure of Invention
The invention aims to provide a full-text retrieval system based on an improved TF-IDF algorithm that addresses the influence of the index domain in which a keyword appears on scoring, the influence of the stop-word operation on document scoring, and the insufficient reliability caused by ignoring the correlation between entries and documents.
The technical solution for realizing the purpose of the invention is as follows: a full-text retrieval system based on an improved TF-IDF algorithm comprises an index domain module, a word segmentation device module, an indexer module and a retriever module, wherein:
the index domain module is configured according to business requirements so that, during indexing and retrieval, content is located by the correct domain name;
the word segmentation device module is used for segmenting words according to the word bank;
the indexer module is used for setting index domain weight for the service data source, creating an index and determining a storage mode of an index document;
the retriever module is used for configuring a retriever, analyzing retrieval conditions, generating a syntax tree, retrieving and sorting, and packaging the sorted result, which is returned to the client through an interface for display; the similarity score is computed with the improved TF-IDF algorithm, which sets weights according to index-domain priority.
Compared with the prior art, the invention has the following notable advantages: (1) the improved TF-IDF algorithm addresses the influence of the index domain in which a keyword appears on scoring; by setting index-domain weights according to the priorities of commodity name, brand, keywords, and so on, the scores discriminate better between documents; (2) to address the influence of the stop-word operation on document scoring, an influence factor LocScore is introduced into the similarity model to reflect the effect of the stop-word operation on entry positions; (3) to address the insufficient reliability caused by ignoring the correlation between entries and documents, the idea of a concept retrieval model is introduced, and an influence factor SimScore based on the naive Bayes classification algorithm is added to the similarity model to reflect the influence of probabilistic correlation on the similarity score.
Drawings
Fig. 1 is a schematic diagram of the full-text retrieval system based on the improved TF-IDF algorithm of the present invention.
FIG. 2 is a schematic diagram of the tokenizer module of the present invention.
FIG. 3 is a schematic diagram of the indexer module index object interaction process of the present invention.
FIG. 4 is a schematic diagram of the Boolean query syntax tree of the retriever module of the present invention.
FIG. 5 is a graph comparing accuracy in examples.
FIG. 6 is a graph comparing recall in examples.
FIG. 7 is a comparison graph of F1 values in the example.
Detailed Description
As shown in fig. 1, a full-text retrieval system based on an improved TF-IDF algorithm comprises an index domain module, a word segmentation module, an indexer module, and a retriever module; wherein:
the index domain module is set according to the service requirement, so that the content can be retrieved according to the correct domain name during indexing and retrieval;
the word segmentation module adopts an IKAnalyzer word segmentation device to segment words according to the word stock;
the indexer module sets index domain weight for the service data source, creates an index and determines a storage mode of an index document;
the retriever module is mainly responsible for configuring a retriever, analyzing retrieval conditions, generating a syntax tree, retrieving and sorting, packaging sorting results, and returning the sorting results to the client side through an interface for displaying.
In a full-text retrieval system, the similarity scoring algorithm adopts an improved TF-IDF algorithm, weight is set for the priority of an index domain, and an influence factor LocScore is introduced to reflect the influence of stop word operation on the position relation of terms.
The index domain module comprises commodity search and merchant search, wherein the commodity search mainly searches contents in a commodity main table, and the merchant search mainly searches contents in a merchant main table.
The word segmentation module selects the IKAnalyzer segmenter, a Java-based Chinese word-segmentation toolkit, which performs word segmentation using a forward iterative finest-granularity segmentation algorithm combined with a word stock.
The indexer module is mainly responsible for setting higher weight for the commodity name and merchant name fields and storing the index file after word segmentation.
The retriever module mainly parses the query conditions received from the client through the interface using the segmenter, then builds a Boolean query syntax tree from the segmented keywords according to fixed rules. Following the syntax tree, the retriever matches against the index file to obtain a result set; the data in the set is scored by the improved TF-IDF algorithm and ranked by score, and the ranked result is packaged and returned to the client through the interface for display.
For the improved TF-IDF algorithm, different weights are set for the index domains on the basis of the TF-IDF algorithm; an influence factor LocScore is introduced to reflect the influence of the stop-word operation on entry positions; and an influence factor SimScore based on the naive Bayes classification algorithm is introduced to reflect the influence of probabilistic correlation on the similarity score. The improved TF-IDF algorithm remedies the original algorithm's failure to consider the influence of a keyword's index domain on scoring, the influence of the stop-word operation on document scoring, and the insufficient reliability caused by ignoring the correlation between entries and documents.
The improved TF-IDF algorithm sets different weights for the index domains on the basis of the TF-IDF algorithm, whose principle is given by the following formulas:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)
idf_i = log( D / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{i,j} = tf_{i,j} × idf_i    (3)
where n_{i,j} is the number of occurrences of entry t_i in document d_j, Σ_k n_{k,j} is the total number of entries in document d_j, D is the total number of documents in the index repository, and |{ j : t_i ∈ d_j }| is the number of documents in the index repository that contain t_i;
the term frequency tf is the frequency with which an entry occurs in a given document of the index repository; the inverse text frequency idf grows as the proportion of documents containing the entry among all documents shrinks;
multiplying the term frequency tf by the inverse text frequency idf gives the similarity score of entry t_i with respect to document d_j.
In Lucene, the similarity model is based on the TF-IDF algorithm; its principle is given by formula (4):
Score(q, d) = cN(q, d) × qN(q) × Σ_{t ∈ q} [ tf(t, d) × idf(t)² × tB(t) × norm(t, d) ]    (4)
where q is the matching condition built from the keywords input by the user, d is the document in which a matching result lies, t is an entry obtained by parsing after the segmentation component's word segmentation, tf(t, d) is the frequency with which entry Term_t occurs in document d, idf is the inverse text frequency of the entry, cN is a scoring factor determined by the number of query entries that appear in the document, qN is the variance sum of the query entries, qB is the weight of the keyword, tB is the weight of the entry, norm is a normalization factor, and dB is the document weight.
An influence factor LocScore is introduced into the similarity model to reflect the influence of the stop-word operation on entry positions. LocScore is computed as:
LocScore(q, d) = 1, if d contains the keyword without stop-word filtering; 0.7, if d contains the keyword only after stop-word filtering    (5)
that is, a document that contains the keyword without any stop-word filtering is given the higher score of 1, and a document that contains the keyword only as a result of stop-word filtering is given the lower score of 0.7, reflecting the positional similarity between the entries and the document;
the similarity model introduces an influence factor SimScore based on a naive Bayes classification algorithm to reflect the influence of the probability correlation on the similarity score, wherein a Bayes formula is shown as a formula (6); the document D matched with each query item Q in the model can be divided into two groups of a relevant document set R and an irrelevant document set NR, so that P (R | D) is the conditional probability that the document D belongs to the relevant document set R, P (NR | D) is the conditional probability that the document D belongs to the irrelevant document set NR, and when P (R | D) > P (NR | D), the query item Q is relevant to the document D; according to the Bayesian formula, the following can be obtained:
when in useWhen the query term Q is related to the document D, anThe larger the value of (D), the higher the relevance of the document D;
defining document D as a set of binary vectors D ═ D1,d2,…,dn) Wherein d isi1 indicates that the keyword has appeared in the document, di0 denotes that the keyword does not appear in the document under the assumption of conditional independence between the attributes on which the bayesian classification is based, and the calculation method for defining the impact factor SimScore is as shown in equation (7):
in the formula piFor the probability of occurrence in a document in the set R of keywords i related documents,siThe probability that it appears in a document in the set of irrelevant documents NR;
the improved Lucene similarity model is as follows:
NewScore(q, d) = α × Score(q, d) + β × LocScore(q, d) + γ × SimScore(q, d)    (8)
where α + β + γ = 1.
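The weighted combination in equation (8) can be sketched directly; the factor values and weights below are illustrative stand-ins, not outputs of a real Lucene scorer.

```java
// Sketch of equation (8): the base Lucene score and the two new impact
// factors are mixed with weights alpha, beta, gamma that must sum to 1.
public class NewScore {
    public static double newScore(double score, double locScore, double simScore,
                                  double alpha, double beta, double gamma) {
        if (Math.abs(alpha + beta + gamma - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
        return alpha * score + beta * locScore + gamma * simScore;
    }

    public static void main(String[] args) {
        // Two documents with equal base and SimScore values: the one whose
        // keyword match survived without stop-word filtering (LocScore 1.0
        // versus 0.7) is ranked higher.
        double a = newScore(0.6, 1.0, 0.5, 0.5, 0.3, 0.2);
        double b = newScore(0.6, 0.7, 0.5, 0.5, 0.3, 0.2);
        System.out.println(a); // 0.70
        System.out.println(b); // 0.61
    }
}
```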
The invention performs full-text retrieval by building an inverted index table for the database in advance, giving high retrieval efficiency and complete retrieval results. Scoring the result documents with the similarity model based on the improved TF-IDF algorithm improves algorithm performance and yields high accuracy.
The present invention will be described in detail with reference to examples.
Examples
A full-text retrieval system based on an improved TF-IDF algorithm consists of four parts: an index domain module, a segmenter module, an indexer module, and a retriever module.
The index domain module is the foundation of the full-text retrieval system. The index structure in Lucene is divided into four parts: index segment, index document, index domain, and index item. The index item, index document, and index segment are generated automatically by the system when an index is created, while the index domain consists of a domain name and the indexed content item and is usually set by developers according to business requirements, so that lookups during indexing and retrieval can be positioned by the correct domain name.
The search function of the system is divided into two types of commodity search and merchant search, and the index domain design of the two types of data is shown in the following tables 1 and 2:
TABLE 1 Commodity search index field
The commodity search function mainly searches the commodity master table (item table). Because users' search conditions for commodities usually include information such as "brand", "commodity name", and "commodity characteristics", indexes are built on the important fields of this table: the commodity name field and the commodity feature keyword are segmented before indexing, while the commodity number item_id, the price, the special-product category class_name, and the commodity brand are indexed directly without segmentation. Most fields use the strategy of being stored in the index document; the commodity specification item_params and the commodity detail description are not stored because of their size and are fetched from the database by commodity ID after a search. During retrieval, the index items that were not segmented are matched exactly.
TABLE 2 Merchant search index field
The merchant search function mainly searches the merchant master table (shop table). Users' retrieval conditions for merchants usually include information such as "store name", "location", and "main commodities", so indexes are built on the important fields of this table: the shop name field, the business district location, and the main commodity category major are segmented before indexing; the merchant number shop_id and the per-capita consumption per_price are indexed directly without segmentation; and the merchant details note are not stored.
As shown in fig. 2, the IKAnalyzer selected for the segmenter module is a Java-based Chinese word-segmentation toolkit that performs segmentation with a forward iterative finest-granularity segmentation algorithm combined with a word stock, and has a high-speed processing capability of about 800,000 characters per second. The forward iterative finest-granularity algorithm in IKAnalyzer is a dictionary-matching segmentation algorithm that does not need to know the length of the word to be matched in advance: it splits the text by iteratively searching, layer by layer, from the longest dictionary word sharing a prefix down to the shortest, until no further split is possible, achieving a finest-granularity segmentation effect. It needs only a single scan of the input text to obtain all possible segmentation results.
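As a rough feel for dictionary-driven forward matching, the sketch below implements a plain forward-maximum-matching pass. This is a deliberate simplification, not IKAnalyzer's actual finest-granularity algorithm, and the toy dictionary is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified dictionary-based forward matching segmentation: at each position,
// try the longest candidate word first and shrink toward a single character.
public class ForwardMatch {
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = null;
            for (int j = end; j > i; j--) {          // longest match first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = text.substring(i, i + 1); // fall back to one char
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("雨花", "雨花茶", "茶饼");
        // The longest dictionary entry "雨花茶" wins at position 0.
        System.out.println(segment("雨花茶饼", dict, 3));
    }
}
```

IKAnalyzer goes further than this sketch by emitting overlapping splits at every granularity level rather than committing to one longest match.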
IKAnalyzer was compared experimentally with StandardAnalyzer, mmseg4j, and imdict on plain-text corpora from "Corpus Online"; the segmentation speed and segmentation accuracy of each segmenter are shown in tables 3 and 4, respectively:
TABLE 3 word segmentation speed comparison
TABLE 4 word segmentation accuracy comparison
As the tables show, each segmenter's speed increases gradually with text length. StandardAnalyzer works well on English but, owing to its unigram segmentation method, does not achieve ideal segmentation results on Chinese. mmseg4j's simple mode segments quickly but with mediocre accuracy, while its complex mode raises accuracy at some cost in speed. imdict's segmentation accuracy is high, but its complex algorithm makes it slow. IKAnalyzer's accuracy lies between mmseg4j's and imdict's while its speed is high, so its overall performance is better and it is suitable as the segmenter for this system.
The dictionaries in IKAnalyzer mainly comprise a main dictionary, a quantifier dictionary, a stop-word dictionary, and an extended-word dictionary; in addition, users can add their own dedicated dictionaries by configuring the IKAnalyzer.cfg.xml file. The system automatically filters out words in a document that appear in the stop-word dictionary and the extended stop-word dictionary, giving keyword segmentation better granularity, and the extended dictionary helps it recognize proper nouns. In this system, the word "brand" is added to the extended stop-word dictionary ext_stopword.dic to filter out its possible influence on searches for branded commodities, and the names of the various special-product categories are added to the extended-word dictionary so that the proper names of special products are segmented correctly.
As shown in fig. 3, the workflow of the indexer module falls into three main steps. First, data-source conversion: business data sources come in many formats, so for Lucene to recognize them they must be uniformly converted into plain-text character streams. The system's original data is stored in different fields of the database; following the index fields, storage strategy, and index strategy decided when the index domains were designed, the contents are taken out of the fields in turn to construct Field objects, which are then added to a Document object, with weights set on them according to business requirements. Each data record corresponds to one Document object; higher weights are set for the commodity-name and merchant-name fields, and per-record weights are set according to the recommend_weight of the commodity or merchant. Second, segmentation analysis: for the created and configured Document object, IKAnalyzer is configured as the IndexWriter's segmenter in the manner described in the previous section, and the Document object is passed to the IndexWriter for segmentation via the addDocument() method. Third, index storage: after the IndexWriter finishes segmentation, the generated index is automatically saved as the index file MyIndexer under the dir directory.
As shown in fig. 4, the main workflow of the retriever module is likewise divided into three steps. First, retriever configuration: an index searcher and a query generator are created according to the retrieval requirements and assigned a path, query fields, and a segmenter; for example, commodity search queries the ItemName, Brand, and Keyword fields, and filtering queries fields such as price and score. Second, condition parsing and syntax-tree generation: once the retriever is configured, the search condition is parsed with the segmenter to obtain a query object, and a Boolean query syntax tree is generated from it according to business rules, for example OR logic over commodity name, brand, and keywords in commodity search, and AND logic for filter conditions such as price. Third, retrieval and ranking: from the Query object built from the Boolean query tree, the retriever matches against the index file through the IndexSearcher, the matching results are scored by the TF-IDF algorithm in Lucene, and the top n results are returned as a TopDocs result set via the IndexSearcher's search() method. The interface then packages the result set as a list and sends it back to the client.
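The Boolean query tree described above, OR over the name/brand/keyword clauses combined by AND with a price filter, can be sketched with plain predicates. This is a toy model, not Lucene's BooleanQuery API, and the field names mirror the ItemName/Brand/Keyword fields mentioned in the text.

```java
import java.util.function.Predicate;

// Toy Boolean query tree: (name OR brand OR keyword) AND price-filter,
// mirroring the commodity-search rules described in the retriever workflow.
public class BoolQuery {
    record Doc(String name, String brand, String keyword, double price) {}

    public static Predicate<Doc> build(String term, double maxPrice) {
        Predicate<Doc> name    = d -> d.name().contains(term);
        Predicate<Doc> brand   = d -> d.brand().contains(term);
        Predicate<Doc> keyword = d -> d.keyword().contains(term);
        Predicate<Doc> price   = d -> d.price() <= maxPrice;
        return name.or(brand).or(keyword).and(price);
    }

    public static void main(String[] args) {
        Predicate<Doc> q = build("tea", 50.0);
        System.out.println(q.test(new Doc("yuhua tea", "zhongshan", "gift", 30.0))); // true
        System.out.println(q.test(new Doc("yuhua tea", "zhongshan", "gift", 99.0))); // false: price filter
    }
}
```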
The similarity scoring algorithm is an important link in the full-text retrieval system: it scores the retrieved result set so that the results best matching the user's expectation are returned. In Lucene, result documents are scored with a similarity model based on the TF-IDF algorithm, which builds on the vector space model and computes keyword-document similarity by jointly considering a keyword's term frequency and its discrimination across documents. The principle of the TF-IDF algorithm is shown below:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)
idf_i = log( D / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{i,j} = tf_{i,j} × idf_i    (3)
where n_{i,j} is the number of occurrences of entry t_i in document d_j, Σ_k n_{k,j} is the total number of entries in document d_j, D is the total number of documents in the index repository, and |{ j : t_i ∈ d_j }| is the number of documents in the index repository that contain t_i. The TF-IDF algorithm thus rests on two quantities:
(1) word frequency tf
The frequency with which an entry occurs in a given document of the index repository. The higher tf is, the more strongly the entry is associated with that document.
(2) Inverse text frequency idf
The smaller the proportion of documents containing the entry among all documents, the larger idf is, and the better the entry distinguishes documents. For example, in a data set of 1000 records where 500 records contain entry A and 100 records contain entry B, entry B discriminates better than entry A within that data set. Multiplying the term frequency tf by the inverse text frequency idf gives the similarity score of entry t_i with respect to document d_j.
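The 1000-record example above works out directly with the idf formula; the numbers are exactly those from the text.

```java
// idf for the discrimination example: in 1000 records, entry A appears in 500
// and entry B in 100, so idf gives B the larger weight (better discrimination).
public class IdfExample {
    public static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        System.out.println(idf(1000, 500)); // entry A: ln 2  (about 0.693)
        System.out.println(idf(1000, 100)); // entry B: ln 10 (about 2.303)
    }
}
```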
In Lucene, the similarity model is based on the TF-IDF algorithm; its principle is shown in formula (4):
Score(q, d) = cN(q, d) × qN(q) × Σ_{t ∈ q} [ tf(t, d) × idf(t)² × tB(t) × norm(t, d) ]    (4)
where:
q: the matching condition (query) built from the entry keywords provided by the user;
d: the document in which a matching result lies;
t: an entry (term) obtained by parsing after the segmentation component's word segmentation;
tf(t, d): the frequency with which entry Term_t occurs in document d;
idf: the inverse text frequency of the entry;
cN: a scoring factor, determined by the number of query entries that appear in the document;
qN: the variance sum of the query entries; qB: the weight of the keyword; tB: the weight of the entry;
norm: the normalization factor; dB: the document weight.
The similarity model in Lucene adds several additional influence factors on top of the TF-IDF algorithm. The scoring factor cN(q,d) is based on how many of the query terms (after the search condition is segmented) occur in a document, so that documents containing more of the query terms obtain a higher weight. The query normalization factor qN(q) does not affect the ranking order, but can be set by the user to scale all score values up or down as a whole. The matching weight tB allows the user to assign higher weights to certain fields so that they are matched preferentially. The length normalization factor norm(t,d) can be configured so that longer or shorter search results receive higher scores.
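A simplified sketch of this classic Lucene scoring function follows. It is only an approximation under stated assumptions: qB and dB are fixed at 1, Lucene's traditional sqrt/log shapes are used for tf and idf, and the toy corpus is hypothetical:

```python
import math

def lucene_score(query_terms, doc, corpus, boosts=None):
    """Simplified sketch of formula (4):
    Score(q,d) = cN(q,d) * qN(q) * sum_t tf * idf^2 * tB * norm.

    cN is the coordination factor, qN the query normalization factor,
    tB a per-term boost, norm a document-length normalization;
    qB and dB are taken as 1 here for brevity.
    """
    boosts = boosts or {}
    n = len(corpus)
    matched, total, sum_sq = 0, 0.0, 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        idf = 1.0 + math.log(n / (df + 1))      # Lucene-style idf
        sum_sq += idf * idf                     # accumulates the query norm
        count = doc.count(t)
        if count == 0:
            continue
        matched += 1
        tf = math.sqrt(count)                   # dampened term frequency
        norm = 1.0 / math.sqrt(len(doc))        # length normalization
        total += tf * idf * idf * boosts.get(t, 1.0) * norm
    cN = matched / len(query_terms)             # coordination factor
    qN = 1.0 / math.sqrt(sum_sq) if sum_sq else 0.0
    return cN * qN * total

corpus = [["yuhua", "tea"], ["tea", "cake"], ["green", "tea", "leaf"]]
query = ["yuhua", "tea"]
# The document matching both query terms outranks the one matching only "tea".
print(lucene_score(query, corpus[0], corpus) > lucene_score(query, corpus[1], corpus))
```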
It can be seen that the similarity model in Lucene has the following characteristics:
(1) the higher the frequency of occurrence of keywords in the hit document, the higher the score of the document.
(2) The higher the frequency of occurrence of the keyword in documents other than the hit document, the lower the score of the hit document.
(3) The location where the keyword appears in the hit document does not affect the scoring of the document.
In most application scenarios, Lucene scores result sets well. However, the degree of similarity between a keyword and a document relates not only to the keyword's frequency of occurrence but also to its positional characteristics within the document, especially when stop-word filtering causes adjacent terms to be spliced together. For example, when a user searches for "Yuhua tea", of the two commodity records "Zhongshan Yuhua tea" and "Yuhua tea cake", the former should clearly receive the higher score. However, because the stop word "brand" is filtered out, the word frequency of the keyword "Yuhua tea" is the same in both documents, and the latter may even obtain a higher score because of the different weights assigned to the documents. As a result, during the operation of the specialty-goods commodity platform, searches may return specialty categories that do not match expectations, reducing retrieval accuracy.
On the other hand, keywords are not independent of one another: related terms tend to co-occur. Since the similarity formula in Lucene is based on the cosine similarity of the vector space model, it carries an implicit assumption about the relevance of a query term to a document, namely that relevance corresponds by default to the similarity between the two. This assumption rests on experience alone and lacks the support of a concrete theoretical model, which affects the accuracy and reliability of the ranking of search results.
Therefore, the algorithm is improved from two directions: the positional characteristics of the keywords, and the probabilistic correlation between the keywords and the documents.
Regarding the reduced search accuracy caused by ignoring the position information of entries within documents, two aspects matter. On the one hand, different index domains of a document carry different importance: the commodity name should have a higher search priority, and if the specialty-type domain is given too high a priority, the degree of distinction between documents drops; the original algorithm does not consider the influence of the index domain in which a keyword appears on the score. More importantly, after a document passes through the stop-word operation of the system's word segmenter, the semantic information of the resulting spliced keyword can change, producing scores other than those expected; the original algorithm does not consider the influence of the stop-word operation on document scores either.
For the former, Lucene allows a different weight to be designed for each index domain; weights are set for the index domains according to the priority order "commodity name", "brand", "keyword", "specialty type", so that the scores discriminate better between different documents. For the latter, an influence factor LocScore is introduced into the similarity model to reflect the influence of the stop-word operation on the positional relationship of the entries. The calculation of LocScore is shown in formula (5):

LocScore(q,d) = 1,   if d contains the keyword without any stop-word filtering;
LocScore(q,d) = 0.7, if d contains the keyword only after stop-word filtering.    (5)

In the formula, a document that contains the keyword without stop-word filtering is given the higher score of 1, while a document that contains the keyword only as a result of stop-word filtering is given the lower score of 0.7, reflecting the positional similarity between the entries and the document.
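This LocScore rule can be sketched on tokenized text as follows. The stop-word list and the English tokens are hypothetical placeholders; the patent operates on Chinese text segmented by IKAnalyzer:

```python
STOP_WORDS = {"brand", "de"}   # hypothetical stop-word list

def loc_score(keyword_tokens, doc_tokens):
    """LocScore per formula (5): 1.0 when the keyword occurs
    contiguously in the raw token stream, 0.7 when it occurs only
    after stop words are removed, 0.0 when it does not occur."""
    k = len(keyword_tokens)

    def contains(tokens):
        return any(tokens[i:i + k] == keyword_tokens
                   for i in range(len(tokens) - k + 1))

    if contains(doc_tokens):                        # genuine contiguous match
        return 1.0
    filtered = [t for t in doc_tokens if t not in STOP_WORDS]
    return 0.7 if contains(filtered) else 0.0       # spliced match only

print(loc_score(["yuhua", "tea"], ["zhongshan", "yuhua", "tea"]))      # 1.0
print(loc_score(["yuhua", "tea"], ["yuhua", "brand", "tea", "cake"]))  # 0.7
```

The second call models the "Yuhua tea" example: the match exists only because the stop word "brand" was filtered out, so the document receives the lower 0.7 score.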
For the problem of insufficient reliability caused by ignoring the correlation between entries and documents, the idea of a probabilistic retrieval model is introduced. Its main idea is as follows: the document set is divided into a relevant document set and a non-relevant document set, converting the measurement of relevance into a text classification problem, so that retrieval follows the probability ranking principle and the ranked results are returned according to the relevance between the user's search condition and the documents.
An influence factor SimScore based on the naive Bayes classification algorithm is introduced into the similarity model to reflect the influence of probabilistic correlation on the similarity score, with the Bayes formula shown as formula (6). The documents D matched by each query term Q can be divided into two groups, a relevant document set R and a non-relevant document set NR, so that P(R|D) is the conditional probability that document D belongs to R and P(NR|D) is the conditional probability that D belongs to NR; when P(R|D) > P(NR|D), the query term Q is relevant to document D. According to the Bayes formula:

P(R|D) = P(D|R)·P(R) / P(D),   P(NR|D) = P(D|NR)·P(NR) / P(D)    (6)

Hence, when P(D|R)·P(R) / (P(D|NR)·P(NR)) > 1, the query term Q is relevant to document D, and the larger this ratio, the higher the relevance of document D.
Document D is defined as a binary vector D = (d_1, d_2, …, d_n), where d_i = 1 indicates that keyword i appears in the document and d_i = 0 indicates that it does not. Under the assumption of conditional independence between attributes, on which naive Bayes classification is based, the calculation of the influence factor SimScore is defined as shown in formula (7):

SimScore(Q,D) = Σ_{i: d_i=1} log[ p_i(1 − s_i) / (s_i(1 − p_i)) ]    (7)

In the formula, p_i is the probability that keyword i appears in a document of the relevant document set R, and s_i is the probability that it appears in a document of the non-relevant document set NR.
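A sketch of this factor follows. Since the patent's image for formula (7) is not reproduced in the text, the per-term log-odds form of the standard binary independence model is assumed, with p_i and s_i supplied as precomputed dictionaries:

```python
import math

def sim_score(present_terms, p, s):
    """SimScore sketch: sum over the query terms present in the
    document of log( p_i (1 - s_i) / ( s_i (1 - p_i) ) ), where p_i
    is the term's occurrence probability in the relevant set R and
    s_i its probability in the non-relevant set NR."""
    return sum(math.log(p[t] * (1 - s[t]) / (s[t] * (1 - p[t])))
               for t in present_terms)

# A term far more common in relevant documents contributes positively;
# an uninformative term (p == s) contributes nothing.
print(sim_score(["yuhua"], {"yuhua": 0.8}, {"yuhua": 0.2}) > 0)   # True
print(sim_score(["tea"], {"tea": 0.5}, {"tea": 0.5}))             # 0.0
```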
The improved Lucene similarity model is shown in formula (8):
NewScore(q,d)=α×Score(q,d)+β×LocScore(q,d)+γ×SimScore(q,d) (8)
In the formula, α + β + γ = 1, and the score of the original scoring formula should remain dominant in the model, so α = 0.8. In the data verification, five different sets of coefficients were used to test the five specialty datasets of Table 6, and the average F1 values were compared to reflect the retrieval effect; the experiments show that the search results best match expectations when β = 0.08 and γ = 0.12.
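The combination in formula (8) is then a one-line weighted sum; coefficient values come from the text, and the three component scores are assumed to be precomputed:

```python
def new_score(score, loc, sim, alpha=0.8, beta=0.08, gamma=0.12):
    """NewScore(q,d) per formula (8); the three coefficients must sum
    to 1, with alpha = 0.8 keeping the original Lucene score dominant."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * score + beta * loc + gamma * sim

print(round(new_score(0.5, 1.0, 0.0), 2))   # 0.48 = 0.8*0.5 + 0.08*1.0
```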
For the performance evaluation of the retrieval system, indexes such as Precision (Precision), Recall (Recall), F1 value, MAP value, NDCG and the like are generally adopted in the industry for evaluation. In the system, the Lucene system using the improved TF-IDF algorithm is evaluated by adopting the accuracy, the recall rate and the F1 value.
The various types of data relevant to the assessment are classified according to table 5:
TABLE 5 evaluation data classification Table
The evaluation parameters can be calculated as follows:
(1) Precision P = number of returned records that meet expectations A / total number of returned records (A + B)

(2) Recall R = number of returned commodity records that meet expectations A / number of correct commodity records that should have been returned (A + C)

(3) F1 value = 2 × precision P × recall R / (precision P + recall R)
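These three measures are straightforward to compute from the confusion counts of Table 5; the counts below are hypothetical, not the patent's measured data:

```python
def evaluate(A, B, C):
    """Precision, recall and F1 from Table 5's counts:
    A = relevant records returned, B = irrelevant records returned,
    C = relevant records that were not returned."""
    precision = A / (A + B)
    recall = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = evaluate(A=90, B=10, C=30)   # hypothetical counts
print(p, r)          # 0.9 0.75
print(round(f1, 3))  # 0.818
```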
In the data verification, 150 records of each type of specialty commodity were added to the database; a specialty name such as "Yuhua tea" was entered at the client, and all search results were then exported in the background. The experimental results are shown in the following tables:
TABLE 6 Pre-improvement Performance of the Algorithm
TABLE 7 algorithm Performance after improvement
To analyze the search results more intuitively, comparative line graphs of the three evaluation indexes are shown in Figs. 5 to 7. After the influence factors are added, the precision and F1 value of the algorithm for searching specialty commodities increase, while the effect on recall is small. This is probably because the names of most specialty commodities are fairly standardized and can be retrieved accurately; the increase in precision indicates that the improved algorithm filters out some inaccurate search results caused by word segmentation, so the performance of the algorithm is improved. In addition, the lower recall for "Yuhua stone" may be due to the greater variety of Yuhua stone commodities and their less standardized commodity names.
The invention designs a full-text retrieval system based on an improved TF-IDF algorithm. The system sets different weights for the index domains; introduces an influence factor LocScore to reflect the influence of the stop-word operation on the positional relationship of the entries; and introduces an influence factor SimScore based on the naive Bayes classification algorithm to reflect the influence of probabilistic correlation on the similarity score, thereby improving the retrieval results. All parts of the system are of modular design, and functional modules can be added as specific requirements dictate. The system overcomes the shortcomings of low accuracy, inaccurate scoring and low speed in existing retrieval systems, and has good market prospects.

Claims (6)

1. A full-text retrieval system based on an improved TF-IDF algorithm is characterized by comprising an index domain module, a word segmentation module, an indexer module and a retriever module; wherein:
the index domain module is configured according to service requirements, so that indexing and retrieval are carried out against the correct domain names;
the word segmentation device module is used for segmenting words according to the word bank;
the indexer module is used for setting index domain weight for the service data source, creating an index and determining a storage mode of an index document;
the retriever module is used for configuring a retriever, analyzing the retrieval conditions, generating a syntax tree, retrieving and sorting, and packaging the sorted result to be returned to the client through an interface for display; the similarity score adopts the improved TF-IDF algorithm, which sets weights according to the priority of the index domains.
2. The improved TF-IDF algorithm based full-text retrieval system of claim 1, wherein said index domain module comprises a commodity search and a merchant search, for retrieving content in the commodity main table and the merchant main table respectively.
3. The full-text retrieval system based on the improved TF-IDF algorithm as claimed in claim 2, wherein the index structure of the index domain module is divided into four parts: index segment, index document, index domain and index entry; the index entry, index document and index segment are generated automatically by the system when the index is created, while the index domain consists of the domain name and the content item being indexed.
4. The full-text retrieval system based on the improved TF-IDF algorithm as claimed in claim 1, wherein the tokenizer module is the IKAnalyzer tokenizer, a Java-based Chinese word segmentation package that performs segmentation using a forward iterative finest-granularity segmentation algorithm combined with a thesaurus.
5. The full-text retrieval system based on the improved TF-IDF algorithm as claimed in claim 1, wherein after receiving the query conditions sent from the client via the interface, the retriever module analyzes them with the word segmenter and constructs a Boolean query syntax tree according to the rules from the keywords obtained by segmentation; according to the syntax tree, the retriever performs matching against the index file and obtains a result set, the data in the set are scored by the improved TF-IDF algorithm and ranked by score, and the ranked result is packaged and returned to the client through the interface for display.
6. The full-text retrieval system based on the improved TF-IDF algorithm as claimed in claim 5, wherein: the improved TF-IDF algorithm is based on the TF-IDF algorithm with different weights set for the index domains, and the principle of the TF-IDF algorithm is shown in the following formula:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j},   idf_i = log( |D| / |{j : t_i ∈ d_j}| ),   tfidf_{i,j} = tf_{i,j} × idf_i    (3)

in the formula, n_{i,j} is the number of occurrences of the entry t_i in document d_j, Σ_k n_{k,j} is the total number of entry occurrences in document d_j, |D| is the total number of documents in the index repository, and |{j : t_i ∈ d_j}| is the number of documents in the index repository containing t_i; the word frequency tf refers to the frequency of occurrence of an entry in a certain document of the index library; the inverse text frequency idf reflects the proportion of all documents that contain the entry;
multiplying the word frequency tf by the inverse text frequency idf yields the similarity score of the entry t_i with the document d_j;
in Lucene, the similarity model is based on the TF-IDF algorithm, whose principle is shown in formula (4):

Score(q,d) = cN(q,d) × qN(q) × Σ_{t∈q} [ tf(t,d) × idf(t)² × tB(t) × norm(t,d) ]    (4)

in the formula, q is the matching condition of the input keywords provided by the user, d is the document where a matching result is located, t is an entry obtained by parsing after the word segmentation component segments the query, tf(t,d) is the frequency of occurrence of term t in document d, idf(t) is the inverse text frequency of the entry, cN is a scoring factor determined by the number of query entries appearing in the document, qN is the query normalization factor computed from the sum of squared weights of the query entries, qB is the weight of the keyword, tB is the weight of the entry, norm is a normalization factor, and dB is the document weight;
an influence factor LocScore is introduced into the similarity model to reflect the influence of the stop-word operation on the positional relationship of the entries, and is calculated as follows:

LocScore(q,d) = 1 if the document contains the keyword without stop-word filtering, and LocScore(q,d) = 0.7 if the document contains the keyword only after stop-word filtering    (5)

the document containing the keyword without stop-word filtering is given the higher score of 1, and the document containing the keyword only after stop-word filtering is given the lower score of 0.7, so as to reflect the positional similarity between the entries and the document;
the similarity model introduces an influence factor SimScore based on the naive Bayes classification algorithm to reflect the influence of probabilistic correlation on the similarity score, with the Bayes formula shown as formula (6); the documents D matched by each query term Q can be divided into two groups, a relevant document set R and a non-relevant document set NR, so that P(R|D) is the conditional probability that document D belongs to R, P(NR|D) is the conditional probability that document D belongs to NR, and when P(R|D) > P(NR|D) the query term Q is relevant to document D; according to the Bayes formula:

P(R|D) = P(D|R)·P(R) / P(D),   P(NR|D) = P(D|NR)·P(NR) / P(D)    (6)

when P(D|R)·P(R) / (P(D|NR)·P(NR)) > 1, the query term Q is relevant to document D, and the larger this ratio, the higher the relevance of document D;
document D is defined as a binary vector D = (d_1, d_2, …, d_n), where d_i = 1 indicates that the keyword appears in the document and d_i = 0 indicates that it does not; under the assumption of conditional independence between attributes on which naive Bayes classification is based, the influence factor SimScore is defined as shown in formula (7):

SimScore(Q,D) = Σ_{i: d_i=1} log[ p_i(1 − s_i) / (s_i(1 − p_i)) ]    (7)

in the formula, p_i is the probability that keyword i appears in a document of the relevant document set R, and s_i is the probability that it appears in a document of the non-relevant document set NR;
the improved Lucene similarity model is as follows:
NewScore(q,d)=α×Score(q,d)+β×LocScore(q,d)+γ×SimScore(q,d) (8)
wherein α + β + γ = 1.
CN201910787265.3A 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm Active CN110619036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910787265.3A CN110619036B (en) 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm


Publications (2)

Publication Number Publication Date
CN110619036A true CN110619036A (en) 2019-12-27
CN110619036B CN110619036B (en) 2023-07-18

Family

ID=68922470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910787265.3A Active CN110619036B (en) 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm

Country Status (1)

Country Link
CN (1) CN110619036B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599463A (en) * 2020-05-09 2020-08-28 吾征智能技术(北京)有限公司 Intelligent auxiliary diagnosis system based on sound cognition model
CN112183087A (en) * 2020-09-27 2021-01-05 武汉华工安鼎信息技术有限责任公司 System and method for sensitive text recognition
CN112883143A (en) * 2021-02-25 2021-06-01 华侨大学 Elasticissearch-based digital exhibition searching method and system
CN113343046A (en) * 2021-05-20 2021-09-03 成都美尔贝科技股份有限公司 Intelligent search sequencing system
CN113468393A (en) * 2021-06-09 2021-10-01 北京达佳互联信息技术有限公司 Index generation method and device, electronic equipment and storage medium
CN113486156A (en) * 2021-07-30 2021-10-08 北京鼎普科技股份有限公司 ES-based associated document retrieval method
CN115455147A (en) * 2022-09-09 2022-12-09 浪潮卓数大数据产业发展有限公司 Full-text retrieval method and system
CN117290460A (en) * 2023-11-24 2023-12-26 中孚信息股份有限公司 Method, system, device and storage medium for calculating similarity of massive texts

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778276A (en) * 2015-04-29 2015-07-15 北京航空航天大学 Multi-index combining and sequencing algorithm based on improved TF-IDF (term frequency-inverse document frequency)
CN106055546A (en) * 2015-10-08 2016-10-26 北京慧存数据科技有限公司 Optical disk library full-text retrieval system based on Lucene




Similar Documents

Publication Publication Date Title
CN110619036B (en) Full text retrieval system based on improved TF-IDF algorithm
US11741173B2 (en) Related notes and multi-layer search in personal and shared content
US8468156B2 (en) Determining a geographic location relevant to a web page
US9280535B2 (en) Natural language querying with cascaded conditional random fields
US8156102B2 (en) Inferring search category synonyms
Demidova et al. DivQ: diversification for keyword search over structured databases
US7765178B1 (en) Search ranking estimation
CN107729336B (en) Data processing method, device and system
KR101700585B1 (en) On-line product search method and system
US7409383B1 (en) Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
US20090327223A1 (en) Query-driven web portals
US9734192B2 (en) Producing sentiment-aware results from a search query
US6286000B1 (en) Light weight document matcher
WO2021057250A1 (en) Commodity search query strategy generation method and apparatus
US20070226202A1 (en) Generating keywords
JPH1125108A (en) Automatic extraction device for relative keyword, document retrieving device and document retrieving system using these devices
US20100042610A1 (en) Rank documents based on popularity of key metadata
CN113486156A (en) ES-based associated document retrieval method
JP2001184358A (en) Device and method for retrieving information with category factor and program recording medium therefor
JP2004310561A (en) Information retrieval method, information retrieval system and retrieval server
CN107423298B (en) Searching method and device
JPH11154160A (en) Data retrieval system
Gupta et al. A survey of existing question answering techniques for Indian languages
JP2715981B2 (en) Search result evaluation device
Kim et al. Improving keyword match for semantic search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant