CN110619036B - Full text retrieval system based on improved TF-IDF algorithm - Google Patents


Info

Publication number
CN110619036B
CN110619036B (application CN201910787265.3A; publication CN110619036A)
Authority
CN
China
Prior art keywords
document
index
documents
word segmentation
module
Prior art date
Legal status
Active
Application number
CN201910787265.3A
Other languages
Chinese (zh)
Other versions
CN110619036A (en)
Inventor
俞佳慧
何新
马轩
姜楠
王子龙
黄炎焱
项凯南
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN201910787265.3A priority Critical patent/CN110619036B/en
Publication of CN110619036A publication Critical patent/CN110619036A/en
Application granted granted Critical
Publication of CN110619036B publication Critical patent/CN110619036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0623Item investigation
    • G06Q30/0625Directed, with specific intent or strategy
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a full-text retrieval system based on an improved TF-IDF algorithm, consisting of an index domain module, a word segmentation module, an indexer module and a retriever module. The system supports two search types, commodity search and merchant search, giving it flexible search modes; it adopts the IKAnalyzer word segmenter, which segments quickly with good overall performance; its indexer can set weights according to business requirements, making it more adaptable to user needs; and retrieval uses the improved TF-IDF algorithm, which performs well and achieves high accuracy.

Description

Full text retrieval system based on improved TF-IDF algorithm
Technical Field
The invention relates to the technical field of full text retrieval, in particular to a full text retrieval system based on an improved TF-IDF algorithm.
Background
In the face of unstructured data in various formats, particularly at large data volumes, conventional methods require a great deal of time, which is why full-text retrieval techniques were developed. Full-text retrieval is based on full-text retrieval theory: unstructured data of various formats are reorganized into a fixed structure derived from the data source and then searched over that structure, thereby improving search speed.
Lucene is an open-source full-text search engine toolkit provided and maintained by the Jakarta project group of the Apache Software Foundation. It is not a complete full-text search engine but a full-text search engine library, providing complete query and indexing engines and a partial text-analysis engine.
Scoring the matching search results with a similarity model is a key link in a full-text retrieval system: each result is scored for its degree of similarity to the query conditions according to the similarity model and preset constraints, and the results are sorted by score so that the results the user most expects to see are returned first.
Lucene uses the vector space model (Vector Space Model, VSM). The vector space model reduces data-similarity computation to vector operations in a vector space: the similarity of two pieces of data across their attribute dimensions is expressed as spatial similarity, which is intuitive and easy to understand. The model generally includes the following main attributes:
term: vector space model typically constructs a data document into N index terms by word segmentation, each index Term can be represented by Term, for E-commerce system, term can be used to represent all index Term of each data after word segmentation of the indexed field 1 、Term 2 、...Term n Can be abbreviated as T i
Document: documents may represent a Document, so that each piece of data may be segmented by documents= { Term 1 ,Term 2 ,...,Term n Represented as space vectors, abbreviated as D i
Query: the Query condition representing the user can be obtained by query= { Term after word segmentation a ,Term b ,Term c Denoted spatial vector, abbreviated as Q.
W: each Term has a corresponding weight W, so each space vector has its corresponding weight vector
Cos(Q,D i ): the cosine distance between two spatial vectors can be used in a vector space model to represent the degree of similarity of the two spatial vectors. A set { Cos (Q, D) can be obtained by calculating the cosine values of the spatial vectors 1 ),Cos(Q,D 2 ),...,Cos(Q,D n ) Document and query condition corresponding to which cosine value is bigger in the collectionThe higher the correlation, the more similar the legal document is found, which is more consistent with the expected result of the user.
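The cosine ranking described above can be sketched as follows. The term weights here are made-up illustrative values, not weights any particular system would produce:

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Query Q and two documents D1, D2 as weighted term vectors (hypothetical weights).
Q  = {"yuhua": 1.0, "tea": 1.0}
D1 = {"zhongshan": 0.3, "yuhua": 0.9, "tea": 0.8}
D2 = {"tea": 0.7, "cake": 0.9}

scores = {"D1": cosine(Q, D1), "D2": cosine(Q, D2)}
ranking = sorted(scores, key=scores.get, reverse=True)  # highest cosine first
```

Documents are then returned in `ranking` order, so the document whose vector points closest to the query's direction comes first.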
Within Lucene, the similarity model is based on the existing TF-IDF algorithm with additional influencing factors added, so it can serve most scenarios, but some problems remain:
1) Search accuracy drops because positional information of terms in a document is not considered. On the one hand, different index domains of a document differ in importance: a commodity name, for example, should have higher search priority, and if a specialty category is weighted too highly, the distinguishability between documents decreases; the algorithm does not account for how the index domain in which a keyword occurs affects the score. On the other hand, and more importantly, after the segmenter applies stop-word removal to a document, the semantic information of the resulting spliced keywords may change, producing unexpected scores; the algorithm does not account for the effect of stop-word removal on document scores.
2) Reliability is insufficient because the correlation between terms and documents is not considered. Keywords are often interrelated: terms for closely related goods, such as different pastry names, frequently co-occur. Since the similarity formula in Lucene is based on the cosine-similarity formula of the vector space model, it makes an implicit assumption about the relevance of query terms and documents, namely that relevance coincides with the similarity between the two. This assumption rests on experience alone and lacks the support of a concrete theoretical model, which affects the accuracy and reliability of the ranking of search results.
Disclosure of Invention
The invention aims to provide a full-text retrieval system based on an improved TF-IDF algorithm that solves the problem of the influence of different keyword index domains on scoring, the problem of the influence of stop-word removal on document scoring, and the problem of insufficient reliability caused by ignoring term-document correlation.
The technical scheme for realizing the purpose of the invention is as follows: the full text retrieval system based on the improved TF-IDF algorithm comprises an index domain module, a word segmentation module, an indexer module and a retriever module, wherein:
the index domain module is set according to the service requirement, and the content is searched according to the correct domain name during indexing and searching;
the word segmentation device module is used for segmenting words of the search conditions according to the word stock;
the indexer module is used for setting index domain weights for the service data sources, creating indexes and determining the storage mode of the index documents;
the retriever module is used for configuring a retriever, analyzing the retrieval conditions, generating a grammar tree, retrieving and sequencing, packaging the sequencing result and returning the sequencing result to the client through an interface for display; the similarity score uses an improved TF-IDF algorithm to weight the priority of the index domain.
Compared with the prior art, the invention has the following notable advantages: (1) the improved TF-IDF algorithm sets weights for the index domains according to the priorities of commodity name, brand, keywords and so on, addressing the influence of the index domain in which a keyword occurs, so that scores distinguish documents better; (2) to address the influence of stop-word removal on document scoring, an influence factor LocScore is introduced into the similarity model to reflect the effect of stop-word removal on term positional relationships; (3) to address the insufficient reliability caused by ignoring term-document correlation, ideas from concept retrieval models are introduced, and an influence factor SimScore based on the naive Bayes classification algorithm is added to the similarity model to reflect the influence of probabilistic correlation on the similarity score.
Drawings
Fig. 1 is a schematic diagram of a full text retrieval system based on an improved TF-IDF algorithm of the present invention.
Fig. 2 is a schematic diagram of a segmenter module of the present invention.
FIG. 3 is a schematic diagram of an indexer module indexing object interactions process of the present invention.
FIG. 4 is a schematic diagram of a Boolean query syntax tree of the retriever module of the present invention.
FIG. 5 is a graph of accuracy versus time in an example.
FIG. 6 is a graph comparing recall rates in an example.
FIG. 7 is a graph showing comparison of F1 values in examples.
Detailed Description
As shown in fig. 1, a full text retrieval system based on an improved TF-IDF algorithm includes an index domain module, a word segmentation module, an indexer module, and a retriever module; wherein:
the index domain module is set according to the service requirement, so that the content can be searched according to the correct domain name during indexing and searching;
the word segmentation device module adopts an IKAnalyzer word segmentation device to segment the search condition according to the word stock;
the indexer module sets index domain weights for the service data sources, creates indexes and determines the storage mode of the index documents;
the retriever module is mainly responsible for configuring the retriever, analyzing the retrieval conditions, generating a grammar tree, retrieving and sequencing, packaging the sequencing result and returning the sequencing result to the client for display through an interface.
In the full-text retrieval system, an improved TF-IDF algorithm is adopted in a similarity scoring algorithm, weights are set for priorities of index domains, and an influence factor LocScore is introduced to reflect influence of word stopping operation on term position relations.
The index domain module comprises commodity searching and merchant searching, wherein the commodity searching mainly searches contents in a commodity main table, and the merchant searching mainly searches contents in a merchant main table.
The word segmentation device module selects an IKAnalyzer word segmentation device, is a Chinese word segmentation package based on Java, adopts a forward iteration finest granularity segmentation algorithm, and performs word segmentation by combining a word stock.
The indexer module is mainly responsible for setting higher weights for commodity names and merchant name fields and storing index files after word segmentation.
The retriever module mainly analyzes the query condition sent by the client through the interface by using a word segmentation device after receiving the query condition, and then constructs a Boolean query grammar tree according to a certain rule by using keywords obtained by word segmentation. According to the grammar tree, the retriever matches the index file and obtains a result set, the data in the set is scored by the improved TF-IDF algorithm and then is ranked according to the score, and the ranked result is packaged and then returned to the client for display through an interface.
The improved TF-IDF algorithm sets different weights for the index domains on the basis of the TF-IDF algorithm, introduces the influence factor LocScore to reflect the effect of stop-word removal on term positional relationships, and introduces the influence factor SimScore, based on the naive Bayes classification algorithm, to reflect the influence of probabilistic correlation on similarity scores. It thereby addresses the problems the original algorithm ignores: the influence of the index domain in which keywords occur, the influence of stop-word removal on document scoring, and the insufficient reliability caused by ignoring term-document correlation.
The improved TF-IDF algorithm sets different weights for the index domains on top of the TF-IDF algorithm, whose principle is:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{j : t_i ∈ d_j}| )    (2)

tfidf_{i,j} = tf_{i,j} × idf_i    (3)

where n_{i,j} is the number of occurrences of term t_i in document d_j, Σ_k n_{k,j} is the total number of occurrences of all terms in document d_j, |D| is the total number of documents in the index library, and |{j : t_i ∈ d_j}| is the number of documents in the index library containing t_i.

The term frequency tf is the frequency of occurrence of a term within a given document in the index library; the inverse document frequency idf reflects the proportion of documents containing the term among all documents.

Multiplying the term frequency tf by the inverse document frequency idf yields the similarity score of term t_i with respect to document d_j.
Within Lucene, the similarity model is based on the TF-IDF algorithm; its principle is shown in formula (4):

Score(q,d) = cN(q,d) × qN(q) × Σ_{t∈q} [ tf(t,d) × idf(t)² × tB(t) × norm(t,d) ]    (4)

where q is the matching condition composed of the user's input keywords, d is the document containing the match, t is a term produced by analysis after segmentation by the segmentation component, tf(t,d) is the frequency of occurrence of term t in document d, idf is the inverse document frequency of the term, cN is a scoring factor determined by the number of query terms occurring in the document, qN is the query normalization factor computed from the sum of squares of the query-term weights, qB is the keyword weight, tB is the term weight, norm is a normalization factor, and dB is the document weight.
An influence factor LocScore is introduced into the similarity model to reflect the influence of stop-word removal on the positional relationship of terms. LocScore is computed as:

LocScore(q,d) = 1,   if the document contains the keyword without stop-word filtering having been applied;
LocScore(q,d) = 0.7, if the document contains the keyword only after stop-word filtering.    (5)

The higher score of 1 is given to documents that contain the keyword without stop-word filtering, and the lower score of 0.7 to documents where the keyword match arises from stop-word filtering, thereby reflecting the positional similarity relation between terms and documents;
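The piecewise LocScore factor is trivial to express in code; the 0.0 return for a document that does not contain the keyword at all is an assumption added here for completeness, not stated in the text:

```python
def loc_score(contains_keyword: bool, match_from_stopword_filtering: bool) -> float:
    """Piecewise LocScore factor: 1.0 for a genuine keyword occurrence,
    0.7 when the match only exists because stop-word removal spliced terms,
    0.0 (an assumption) when the keyword is absent altogether."""
    if not contains_keyword:
        return 0.0
    return 0.7 if match_from_stopword_filtering else 1.0
```

For example, "Zhongshan brand Yuhua tea" would get factor 1.0 for the query "Yuhua tea", while "Yuhua brand tea cake", whose match is created by filtering the stop word "brand", would get 0.7.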
the similarity model introduces an influence factor SimScare based on a naive Bayesian classification algorithm to reflect the influence of probability correlation on similarity scores, wherein a Bayesian formula is shown in a formula (6); the documents D matched with each query term Q in the model can be divided into two groups of related document sets R and non-related document sets NR, so that P (R|D) is the conditional probability that the document D belongs to the related document sets R, P (NR|D) is the conditional probability that the document D belongs to the non-related document sets NR, and when P (R|D) > P (NR|D), the query term Q is related to the document D; from the Bayesian formula, it can be derived that:
when (when)At this time, query term Q is related to document D, and +.>The larger the value of (2), the higher the relevance of document D;
Define document D as a binary vector D = (d_1, d_2, ..., d_n), where d_i = 1 indicates that keyword i occurs in the document and d_i = 0 indicates that it does not. Based on the conditional-independence assumption between attributes in Bayesian classification, the influence factor SimScore is defined as in formula (7):

SimScore(Q,D) = Σ_{i: d_i = 1} log( p_i (1 − s_i) / ( s_i (1 − p_i) ) )    (7)
where p_i is the probability that keyword i occurs in a document of the relevant document set R, and s_i is the probability that it occurs in a document of the non-relevant document set NR;
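A sketch of the binary-independence SimScore computation. The closed form used here, the standard naive-Bayes retrieval weight summed over keywords present in the document, is an assumption consistent with the p_i and s_i definitions above, and the probability values are made up for illustration:

```python
import math

def sim_score(d, p, s):
    """Sum of log(p_i(1-s_i) / (s_i(1-p_i))) over keywords with d_i = 1.
    d: binary keyword-occurrence vector for the document;
    p[i]: P(keyword i occurs | relevant set R);
    s[i]: P(keyword i occurs | non-relevant set NR)."""
    score = 0.0
    for i, d_i in enumerate(d):
        if d_i == 1:
            score += math.log(p[i] * (1 - s[i]) / (s[i] * (1 - p[i])))
    return score

d = (1, 1, 0)          # keywords 0 and 1 occur in the document
p = [0.8, 0.6, 0.5]    # illustrative probabilities, not measured values
s = [0.2, 0.3, 0.5]
```

A keyword much more common in relevant documents than in non-relevant ones (p_i >> s_i) contributes a large positive term, which is how probabilistic correlation raises the score.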
the improved Lucene similarity model is as follows:
NewScore(q,d)=α×Score(q,d)+β×LocScore(q,d)+γ×SimScore(q,d) (8)
where α+β+γ=1.
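Formula (8) is a straightforward convex combination. The 0.6/0.2/0.2 default split below is an illustrative assumption; the patent only requires that the weights sum to 1:

```python
def new_score(score, loc_score, sim_score, alpha=0.6, beta=0.2, gamma=0.2):
    """NewScore(q,d) = alpha*Score + beta*LocScore + gamma*SimScore,
    with alpha + beta + gamma = 1 (formula (8)). Default weights are
    illustrative assumptions, not values fixed by the patent."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9, "weights must sum to 1"
    return alpha * score + beta * loc_score + gamma * sim_score
```

Raising a document's LocScore from 0.7 to 1.0, all else equal, raises its combined score by beta * 0.3, which is how the positional factor reorders otherwise tied results.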
The invention performs full-text retrieval by building an inverted index table for the database in advance, achieving high retrieval efficiency and complete retrieval results. Scoring result documents with the similarity model based on the improved TF-IDF algorithm improves algorithm performance and yields high accuracy.
The present invention will be described in detail with reference to examples.
Examples
The full text retrieval system based on the improved TF-IDF algorithm comprises four parts, including an index domain module, a word segmentation module, an indexer module and a retriever module.
The index domain module is the basis of the full-text retrieval system. The index structure in Lucene is divided into four parts: index segment, index document, index domain and index item. Index items, index documents and index segments are generated automatically by the system when an index is created, while an index domain consists of a domain name and the indexed content items and is usually set by the developer according to business requirements, so that content can be located by the correct domain name during indexing and retrieval.
The search function of the system is divided into two types of commodity search and merchant search, and index fields of the two types of data are designed as shown in the following tables 1 and 2:
table 1 commodity search index field
The commodity search function mainly searches content in the commodity main table. A user's search conditions for a commodity usually include information such as brand, commodity name and commodity features, so the important fields of the table are selected for indexing: the commodity name item_name and the commodity feature keyword are segmented before indexing, while the commodity number item_id, the price and the commodity category class_name are indexed without segmentation. Most domains use a storage strategy of being stored in the index file; the commodity specification item_params and the commodity detail description are not stored because of their size and are fetched from the database by commodity ID after retrieval. Index items that are not segmented are matched exactly at search time.
Table 2 merchant search index field
The merchant search function mainly searches content in the merchant main table (shop table). A user's search conditions usually include information such as store name, location and main commodity category, so the important fields of the shop table are selected for indexing: the store name, the business district location and the main commodity category major are segmented before indexing, while the merchant number shop_id and the per-capita consumption price are indexed without segmentation, and the merchant details note field is not stored.
As shown in FIG. 2, IKAnalyzer is a Java-based Chinese word segmentation toolkit that adopts a forward-iteration finest-granularity segmentation algorithm, segments text with the aid of a lexicon, and processes text at high speed (about 800,000 characters per second). The forward-iteration finest-granularity algorithm in IKAnalyzer is a piece-by-piece matching segmentation algorithm: the length of the word to be matched need not be known in advance, and the text is segmented by iteratively searching, layer by layer, from the longest dictionary word with a given prefix down to the shortest, until no further split is possible, achieving the finest-granularity segmentation effect. It needs only a single scan of the input text to obtain all possible segmentation results.
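A greatly simplified sketch of dictionary-based forward matching, assuming a greedy longest-prefix variant (the real IKAnalyzer additionally emits finer-grained overlapping terms, which this sketch omits):

```python
def forward_max_match(text, lexicon, max_len=5):
    """Greedy forward maximum matching: at each position, try the longest
    dictionary word first; unmatched single characters pass through as-is."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy lexicon; entries chosen for the "Yuhua tea" example discussed later.
lexicon = {"中山", "雨花", "雨花茶", "茶饼"}
```

On "中山雨花茶" this yields ["中山", "雨花茶"]: the longer dictionary entry "雨花茶" wins over "雨花", illustrating the longest-first matching order.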
By comparing IKAnalyzer with StandardAnalyzer, mmseg4j and imdict on the plain-text corpus from "Corpus Online", the segmentation speed and segmentation accuracy of each segmenter were measured, as shown in Tables 3 and 4 respectively:
TABLE 3 word segmentation speed contrast
Table 4 word segmentation accuracy contrast
As the tables show, the segmentation speed of each segmenter gradually increases with text length. StandardAnalyzer works well on English but, because of its unigram segmentation method, cannot obtain ideal segmentation results on Chinese. The simple mode of mmseg4j segments quickly but with mediocre accuracy; its complex mode improves accuracy but reduces speed. imdict achieves higher segmentation accuracy, but its complex algorithm makes it slower. IKAnalyzer's accuracy lies between mmseg4j and imdict while its speed is high, giving it good overall performance and making it suitable as the segmenter of this system.
The dictionaries in IKAnalyzer mainly comprise a main dictionary, a quantifier dictionary, a stop-word dictionary and an extension dictionary; in addition, a user can add a proprietary dictionary by configuring the IKAnalyzer.cfg.xml file. The system automatically filters out words from the stop-word dictionary and the extended stop-word dictionary in documents, so that keyword segmentation has better granularity, while the extension dictionary helps recognize certain proper nouns. In this system, the word "brand" is added to the extended stop-word dictionary ext_stopword.dic to filter out its possible influence when searching branded goods, and each specialty category name is added to the extension dictionary ext.dic so that specialty proper nouns can be better segmented.
As shown in fig. 3, the indexer module works in three main steps. First, the data source is converted: because the format of the business data source may vary, it must be converted into a plain-text character stream that Lucene can recognize. Raw data in the system are stored in different fields of the database; contents are fetched field by field according to the index domains, storage strategy and indexing strategy determined during index-domain design, and Field objects are constructed from them. The Field objects are then added to a Document object, and weights are set for fields and documents according to business requirements: each specific data record corresponds to one Document object, higher weights are set for the commodity-name and merchant-name fields, and per-record weights are set according to the commodity's or merchant's recommendable_weight. Second, the segmenter analyzes the content: for the created and configured Document object, IKAnalyzer is configured as the segmenter of the IndexWriter in the manner described above, and the Document object is passed to the IndexWriter for segmentation through the addDocument() method. Third, the index file is stored: after the IndexWriter finishes segmentation, the generated index is automatically saved as the index file MyIndxer under the dir directory.
As shown in fig. 4, the main workflow of the retriever module is likewise divided into three steps. First, the retriever is configured: an IndexSearcher and a query generator are created according to the retrieval requirement, and a path, query fields and a segmenter are specified for them; for example, a commodity search queries the ItemName, Brand and Keyword fields, and fields such as price and rating are queried during filtering. Second, the input conditions are parsed and a syntax tree is generated: after the retriever is configured, the segmenter parses the search conditions to obtain Query objects, and a Boolean query syntax tree is generated for them according to business requirements; OR logic is used over commodity name, brand and keywords in commodity search, and AND logic for filter conditions such as price. Third, retrieval and ranking: according to the Query object built from the Boolean query tree, the IndexSearcher matches the index file, the scores of the matching results are computed with the TF-IDF-based algorithm in Lucene, and the search() method of the IndexSearcher returns the top-n result set TopDocs. The result set is then packaged into a list by the interface and sent back to the client.
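The match-then-rank flow above can be sketched with a toy inverted index and OR semantics over the query terms. The `overlap` scorer is a deliberate stand-in for the real TF-IDF scoring, and all data below are made up:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """token -> set of ids of documents containing that token."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t].add(doc_id)
    return index

def search(index, docs, query_terms, score_fn, top_n=10):
    # OR semantics: any document containing at least one query term matches;
    # matched documents are then ranked by the supplied scoring function.
    hits = set()
    for t in query_terms:
        hits |= index.get(t, set())
    return sorted(hits, key=lambda d: score_fn(query_terms, docs[d]), reverse=True)[:top_n]

def overlap(query_terms, doc_tokens):
    # Stand-in scorer: total occurrences of query terms in the document.
    return sum(doc_tokens.count(t) for t in query_terms)

docs = {
    1: ["zhongshan", "yuhua", "tea"],
    2: ["yuhua", "cake"],
    3: ["coffee", "bean"],
}
index = build_inverted_index(docs)
top = search(index, docs, ["yuhua", "tea"], overlap, top_n=2)
```

Document 1 matches both query terms and document 2 only one, so the ranked result is [1, 2]; swapping in a TF-IDF-style `score_fn` changes the ordering criterion without touching the matching step.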
The similarity scoring algorithm is an important part of a full-text retrieval system: it scores the retrieved result set so that the results best matching the user's expectations are returned. Lucene scores result documents with a similarity model based on the TF-IDF algorithm; TF-IDF builds on the vector space model and computes keyword similarity by jointly considering the keyword's term frequency and its distinguishing power across documents. The principle of the TF-IDF algorithm is:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{j : t_i ∈ d_j}| )    (2)

tfidf_{i,j} = tf_{i,j} × idf_i    (3)

where n_{i,j} is the number of occurrences of term t_i in document d_j, Σ_k n_{k,j} is the total number of occurrences of all terms in d_j, |D| is the total number of documents in the index library, and |{j : t_i ∈ d_j}| is the number of documents containing t_i. The TF-IDF algorithm is thus based mainly on two quantities:
(1) Word frequency tf
The frequency of occurrence of a term within a given document in the index library: the higher tf is, the higher the relevance of the term to that document.
(2) Inverse text frequency idf
The proportion of documents containing the term among all documents. The larger idf is, the smaller that proportion, indicating a higher distinguishing power of the term among documents. For example, in a dataset containing 1000 records, if 500 records contain term A and 100 records contain term B, then term B distinguishes documents better than A in this dataset. Multiplying the term frequency tf by the inverse document frequency idf yields the similarity score of term t_i with respect to document d_j.
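A quick check of the arithmetic in this example, using the idf definition of formula (2):

```python
import math

total = 1000          # records in the dataset
docs_with_a = 500     # records containing term A
docs_with_b = 100     # records containing term B

idf_a = math.log(total / docs_with_a)   # idf of the common term A
idf_b = math.log(total / docs_with_b)   # idf of the rarer term B
```

idf_b comes out larger than idf_a, confirming that the rarer term B carries more distinguishing power.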
Within Lucene, the similarity model is based on the TF-IDF algorithm; its principle is shown in formula (4):

Score(q,d) = cN(q,d) × qN(q) × Σ_{t∈q} [ tf(t,d) × idf(t)² × tB(t) × norm(t,d) ]    (4)

q — the matching condition (query) composed of the user's input keywords;
d — the document containing the matching result;
t — a term obtained by analysis after segmentation by the segmentation component;
tf(t,d) — the frequency of occurrence of term t in document d;
idf — the inverse document frequency of the term;
cN — a scoring factor determined by the number of query terms present in the document;
qN — the query normalization factor computed from the sum of squares of the query-term weights; qB is the keyword weight and tB the term weight;
norm — a normalization factor; dB is the document weight.
The similarity model in Lucene adds several influence factors on top of the basic TF-IDF algorithm. The scoring factor cN(q,d) depends on how many of the segmented query terms appear in the document, so documents containing more query terms receive a higher weight. The query normalization qN(q) does not affect the ranking result but can be set by the user to scale the scores up or down as a whole. The boost tB lets the user assign higher weights to certain fields so that matches in those fields are preferred. The length normalization factor norm(t,d) can be configured so that longer or shorter matching documents receive a higher score.
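The shape of this computation can be sketched as follows. This is an approximation of Lucene's practical scoring function using the symbols above, not Lucene itself: all idf values, boosts and norms are invented, and details such as the exact sumOfSquaredWeights computation are simplified.

```python
import math

def lucene_style_score(query_terms, doc_terms, idf, t_boost, norm):
    matched = [t for t in query_terms if t in doc_terms]
    cN = len(matched) / len(query_terms)                       # coord factor cN(q,d)
    qN = 1 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))  # simplified query norm qN(q)
    total = sum(
        math.sqrt(doc_terms.count(t))  # Lucene-style tf = sqrt(raw frequency)
        * idf[t] ** 2
        * t_boost.get(t, 1.0)          # term boost tB(t), default 1
        * norm                         # length/document normalization norm(t,d)
        for t in matched
    )
    return cN * qN * total

# Invented idf values for a two-term query over a short document.
idf = {"yuhua": 2.0, "tea": 1.0}
score = lucene_style_score(["yuhua", "tea"], ["yuhua", "tea", "tea"], idf, {}, 1.0)
print(round(score, 4))  # 2.4213
```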
It can be seen that the similarity model in Lucene has the following characteristics:
(1) The higher the frequency of occurrence of keywords in hit documents, the higher the score of the document.
(2) The higher the frequency of occurrence of keywords in other documents than the hit document, the lower the score of the hit document.
(3) The location where the keywords appear in the hit document does not affect the scoring of the document.
In most application scenarios Lucene scores a result set well, but the degree of similarity between keywords and documents depends not only on the occurrence frequency of the keywords but also on where the keywords occur in the document, especially when stop-word filtering causes adjacent terms to be spliced together. For example, when a user searches for "Yuhua tea", between the two commodity records "Zhongshan brand Yuhua tea" and "Yuhua brand tea cake" the former should clearly score higher. However, because the stop word "brand" is filtered out, the keyword "Yuhua tea" has the same term frequency in both documents, and "Yuhua brand tea cake" may even obtain a higher score due to the different weights assigned to the documents. On a specialty-store commodity platform this means the search may return results containing unwanted specialty categories, reducing retrieval accuracy.
On the other hand, keywords are often correlated with one another; for example, occurrences of terms such as "pastry" and "cake" frequently go together. Since the similarity formula in Lucene derives from the cosine similarity of the vector space model, it carries an implicit assumption about the relevance of a query term to a document, namely that relevance tracks the similarity between the two. This assumption rests on experience alone and lacks the support of a specific theoretical model, which affects the accuracy and reliability of the ranking of search results.
The algorithm is therefore improved by taking into account both the positional characteristics of keywords and the probabilistic relevance between keywords and documents.
Regarding the loss of search accuracy caused by ignoring the positional information of terms in a document, there are two aspects. First, the index fields of a document differ in importance: the commodity name, for instance, should have a higher search priority, while giving the specialty category too high a priority reduces the discrimination between documents; the basic algorithm does not consider how the index field in which a keyword occurs affects the score. More importantly, after the stop-word system removes stop words from a document, the spliced text may yield new keywords whose semantics have changed, producing unexpected scores; the basic algorithm does not consider the influence of stop-word removal on document scores either.
For the former, Lucene allows a different weight to be assigned to each index field, so the fields are weighted according to the priorities of commodity name, brand, keyword and specialty category, giving the scores a higher degree of discrimination between documents. For the latter, an influence factor LocScore is introduced into the similarity model to reflect the effect of stop-word removal on the positional relationship of terms. LocScore is computed as shown in formula (5):
LocScore(q,d) = { 1, the keyword occurs in d without stop-word filtering; 0.7, the keyword occurs in d only after stop-word filtering }   (5)
That is, a document that contains the keyword without any stop-word filtering receives the higher score of 1, while a document that contains the keyword only as a result of stop-word filtering receives the slightly lower score of 0.7, reflecting the positional similarity relation between the term and the document.
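A sketch of how LocScore might be computed. The stop-word list, whitespace tokenization and example documents below are invented stand-ins (with "brand" playing the role of the filtered Chinese particle); the patent does not specify the implementation.

```python
STOP_WORDS = {"brand"}  # invented stand-in stop-word list

def loc_score(keyword, raw_doc):
    tokens = raw_doc.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]  # after stop-word filtering

    def contains(seq, phrase):
        # contiguous phrase match within a token sequence
        n = len(phrase)
        return any(seq[i:i + n] == phrase for i in range(len(seq) - n + 1))

    phrase = keyword.lower().split()
    if contains(tokens, phrase):
        return 1.0  # keyword present without stop-word filtering
    if contains(kept, phrase):
        return 0.7  # keyword appears only once stop words are removed
    return 0.0

print(loc_score("yuhua tea", "zhongshan brand yuhua tea"))  # 1.0
print(loc_score("yuhua tea", "yuhua brand tea cake"))       # 0.7
```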
The problem of insufficient reliability caused by ignoring the correlation between terms and documents is addressed by introducing the idea of the probabilistic retrieval model. Its main idea is to divide the document collection into a relevant document set and an irrelevant document set, turning the relevance measurement into a text classification problem; retrieval then follows the probability ranking principle, and results are returned ranked by the relevance of the user's search condition to the document collection.
An influence factor SimScore based on the naive Bayes classification algorithm is introduced into the similarity model to reflect the effect of probabilistic relevance on the similarity score; the Bayes formula is shown in formula (6). The documents D matched by a query term Q are divided into a relevant document set R and an irrelevant document set NR, so that P(R|D) is the conditional probability that document D belongs to R and P(NR|D) the conditional probability that it belongs to NR; when P(R|D) > P(NR|D), the query term Q is relevant to document D. From the Bayes formula
P(R|D) = P(D|R)P(R) / P(D),  P(NR|D) = P(D|NR)P(NR) / P(D)   (6)
it can be derived that:
when P(D|R)P(R) / (P(D|NR)P(NR)) > 1, the query term Q is relevant to document D, and the larger this ratio, the higher the relevance of document D.
Document D is defined as a set of binary vectors D = (d_1, d_2, ..., d_n), where d_i = 1 indicates that keyword i occurs in the document and d_i = 0 that it does not. Based on the conditional independence assumption between attributes on which Bayesian classification relies, the influence factor SimScore is defined as shown in formula (7):
SimScore(q,d) = Σ_{i: d_i=1} log[ p_i (1 − s_i) / ( s_i (1 − p_i) ) ]   (7)
where p_i is the probability that keyword i occurs in a document of the relevant document set R, and s_i the probability that it occurs in a document of the irrelevant document set NR.
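Formula (7) can be sketched directly. The probabilities p_i and s_i below are invented training estimates; in the system they would be estimated from the relevant and irrelevant document sets.

```python
import math

def sim_score(d, p, s):
    # Sum the binary-independence log-odds contribution of each
    # query term present in the document (d_i == 1).
    return sum(
        math.log(p[i] * (1 - s[i]) / (s[i] * (1 - p[i])))
        for i in range(len(d))
        if d[i] == 1
    )

p = [0.8, 0.6]  # invented P(term_i | R)
s = [0.2, 0.4]  # invented P(term_i | NR)
print(round(sim_score([1, 1], p, s), 4))  # 3.5835  (both terms present)
print(round(sim_score([1, 0], p, s), 4))  # 2.7726  (only the more discriminative term)
```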
The improved Lucene similarity model is shown in formula 8:
NewScore(q,d)=α×Score(q,d)+β×LocScore(q,d)+γ×SimScore(q,d) (8)
In the formula, the score from the original scoring formula dominates the model, so α = 0.8 is taken; in the data verification, five sets of different coefficients were tested against the five categories of specialty data in table 6, and the average F1 values were compared to reflect the retrieval effect. Experiments show that with β = 0.08 and γ = 0.12 the retrieval results best match expectations.
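The combined score of formula (8), using the reported coefficients (α = 0.8, β = 0.08, γ = 0.12, summing to 1). The component scores below are invented; in the system they come from the Lucene score, LocScore and SimScore respectively.

```python
ALPHA, BETA, GAMMA = 0.8, 0.08, 0.12  # coefficients reported in the text

def new_score(score, loc_score, sim_score):
    # NewScore(q,d) = alpha*Score + beta*LocScore + gamma*SimScore
    return ALPHA * score + BETA * loc_score + GAMMA * sim_score

# Invented component scores: a document matched without stop-word filtering
# (LocScore 1.0) vs one matched only after filtering (LocScore 0.7).
print(round(new_score(2.0, 1.0, 1.5), 3))  # 1.86
print(round(new_score(2.0, 0.7, 1.5), 3))  # 1.836
```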
For the performance evaluation of a retrieval system, metrics such as precision, recall, the F1 value, MAP and NDCG are generally adopted in the industry. In this system, precision, recall and the F1 value are used to evaluate the Lucene system with the improved TF-IDF algorithm.
The data involved in the evaluation are classified as in table 5:
Table 5 evaluation data classification
                                   Returned    Not returned
Records meeting expectations           A            C
Records not meeting expectations       B            D
The evaluation metrics are then calculated as follows:
(1) Precision P = (number of returned commodity records that meet expectations, A) / (total number of returned commodity records, A + B)
(2) Recall R = (number of returned commodity records that meet expectations, A) / (number of correct commodity records that should have been returned, A + C)
(3) F1 value = 2 × P × R / (P + R)
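The three metrics can be computed from the classification counts A, B, C above; the counts used here are invented.

```python
def precision(a, b):
    return a / (a + b)  # P = A / (A + B)

def recall(a, c):
    return a / (a + c)  # R = A / (A + C)

def f1(p, r):
    return 2 * p * r / (p + r)  # harmonic mean of P and R

a, b, c = 120, 30, 30  # invented confusion counts
p, r = precision(a, b), recall(a, c)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.8 0.8 0.8
```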
In the data verification, 150 records of each specialty category were added to the database; a specialty name such as "Yuhua tea" was entered at the client, and all search results were then exported in the background. The experimental results are shown in the tables below:
table 6 algorithm pre-improvement performance
Table 7 algorithm performance after improvement
For a more intuitive analysis of the search results, comparison line charts for the three evaluation metrics are shown in figs. 5 to 7. After the influence factors are added, the algorithm improves the precision and F1 value of specialty search but has little effect on recall. This is probably because most specialty products are named fairly regularly and can be retrieved accurately; the rise in precision indicates that the algorithm indeed filters out some inaccurate results caused by word segmentation, improving overall performance. In addition, the lower recall for "Yuhua" is probably because Yuhua stone comes in a large variety of commodities whose naming is not standardized enough.
This invention designs a full-text retrieval system based on an improved TF-IDF algorithm. The system sets different weights for the index fields; introduces the influence factor LocScore to reflect the effect of stop-word removal on the positional relationship of terms; and introduces the influence factor SimScore, based on the naive Bayes classification algorithm, to reflect the effect of probabilistic relevance on the similarity score, thereby improving the retrieval results. The system has a modular design, and functional modules can be added as required. It addresses the shortcomings of low accuracy, imprecise scoring and low speed in existing retrieval systems and has good market prospects.

Claims (2)

1. The full-text retrieval system based on the improved TF-IDF algorithm is characterized by comprising an index domain module, a word segmentation module, an indexer module and a retriever module; wherein:
the index domain module is set according to the service requirement, and the content is searched according to the correct domain name during indexing and searching; the index domain module comprises commodity searching and merchant searching, and the content is searched in a commodity main table and a merchant main table respectively; the index structure of the index domain module is divided into four parts, namely an index segment, an index document, an index domain and an index item, wherein the index item, the index document and the index segment are automatically generated by a system when an index is created, and the index domain consists of a domain name and an indexed content item;
the word segmentation device module is used for segmenting the search conditions according to the word stock; the word segmentation device module selects the IKAnalyzer word segmenter, a Java-based Chinese word segmentation package that adopts a forward-iteration finest-granularity segmentation algorithm and performs word segmentation in combination with a word stock;
the indexer module is used for setting index domain weights for the service data sources, creating indexes and determining the storage mode of the index documents;
the retriever module is used for configuring a retriever, analyzing the retrieval conditions, generating a grammar tree, retrieving and sequencing, packaging the sequencing result and returning the sequencing result to the client through an interface for display; the similarity scoring adopts an improved TF-IDF algorithm, and weights are set for the priority of the index domains; the retriever module analyzes the query condition sent by the client through the interface by using a word segmentation device after receiving the query condition, and constructs a Boolean query grammar tree according to rules by using keywords obtained by word segmentation; according to the grammar tree, the retriever matches the index file and obtains a result set, the data in the set is scored by the improved TF-IDF algorithm and then is ranked according to the score, and the ranked result is packaged and then returned to the client for display through an interface.
2. The improved TF-IDF algorithm based full text retrieval system according to claim 1, wherein: the improved TF-IDF algorithm is based on the TF-IDF algorithm with different weights set for the index domains, and the principle of the TF-IDF algorithm is as follows:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j},  idf_i = log( |D| / |{j : t_i ∈ d_j}| ),  tfidf_{i,j} = tf_{i,j} × idf_i   (3)
where n_{i,j} is the number of occurrences of term t_i in document d_j, Σ_k n_{k,j} is the total number of term occurrences in document d_j, |D| is the total number of documents in the index library, and |{j : t_i ∈ d_j}| is the number of documents in the index library containing t_i; the term frequency tf refers to the occurrence frequency of a term in a given document in the index library; the inverse text frequency idf refers to the proportion of the number of documents containing the term to the total number of documents;
the term t can be obtained by multiplying the term frequency tf by the inverse text frequency idf i And document d j Is a similarity score of (2);
within Lucene, the similarity model is based on the TF-IDF algorithm, and its principle is shown in formula (4):
Score(q,d) = cN(q,d) × qN(q) × Σ_{t∈q} [ tf(t,d) × idf(t)² × tB(t) × norm(t,d) ]   (4)
wherein q is the matching condition built from the keywords the user entered, d is the document where the matching result is located, t is a term obtained by analysis after segmentation by the word segmentation component, tf(t,d) is the frequency of occurrence of term t in document d, idf(t) is the inverse text frequency of the term, cN is a scoring factor, qN is the query normalization factor computed from the squared weights of the query terms, tB is the weight of the term, and norm is a normalization factor;
an influence factor LocScore is introduced into the similarity model to reflect the influence of stop-word removal on the positional relationship of terms, and LocScore is calculated as:
LocScore(q,d) = { 1, the keyword occurs in d without stop-word filtering; 0.7, the keyword occurs in d only after stop-word filtering }   (5)
that is, documents containing the keyword without stop-word filtering are given the higher score of 1, and documents containing the keyword only as a result of stop-word filtering are given the slightly lower score of 0.7, reflecting the positional similarity relation between terms and documents;
the similarity model introduces an influence factor SimScore based on the naive Bayes classification algorithm to reflect the influence of probabilistic relevance on the similarity score, the Bayes formula being shown in formula (6); the documents D matched by each query term Q in the model are divided into a relevant document set R and an irrelevant document set NR, so that P(R|D) is the conditional probability that document D belongs to R, P(NR|D) is the conditional probability that document D belongs to NR, and when P(R|D) > P(NR|D), the query term Q is relevant to document D; from the Bayes formula
P(R|D) = P(D|R)P(R) / P(D),  P(NR|D) = P(D|NR)P(NR) / P(D)   (6)
it can be derived that when P(D|R)P(R) / (P(D|NR)P(NR)) > 1, the query term Q is relevant to document D, and the larger this ratio, the higher the relevance of document D;
document D is defined as a set of binary vectors D = (d_1, d_2, ..., d_n), wherein d_i = 1 indicates that keyword i occurs in the document and d_i = 0 indicates that it does not; based on the conditional independence assumption between attributes on which Bayesian classification relies, the influence factor SimScore is defined as shown in formula (7):
SimScore(q,d) = Σ_{i: d_i=1} log[ p_i (1 − s_i) / ( s_i (1 − p_i) ) ]   (7)
where p_i is the probability that keyword i occurs in a document of the relevant document set R, and s_i is the probability that it occurs in a document of the irrelevant document set NR;
the improved Lucene similarity model is as follows:
NewScore(q,d)=α×Score(q,d)+β×LocScore(q,d)+γ×SimScore(q,d) (8)
where α+β+γ=1.
CN201910787265.3A 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm Active CN110619036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910787265.3A CN110619036B (en) 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910787265.3A CN110619036B (en) 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm

Publications (2)

Publication Number Publication Date
CN110619036A CN110619036A (en) 2019-12-27
CN110619036B true CN110619036B (en) 2023-07-18

Family

ID=68922470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910787265.3A Active CN110619036B (en) 2019-08-25 2019-08-25 Full text retrieval system based on improved TF-IDF algorithm

Country Status (1)

Country Link
CN (1) CN110619036B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599463B (en) * 2020-05-09 2023-07-14 吾征智能技术(北京)有限公司 Intelligent auxiliary diagnosis system based on sound cognition model
CN112183087A (en) * 2020-09-27 2021-01-05 武汉华工安鼎信息技术有限责任公司 System and method for sensitive text recognition
CN112883143A (en) * 2021-02-25 2021-06-01 华侨大学 Elasticissearch-based digital exhibition searching method and system
CN113343046B (en) * 2021-05-20 2023-08-25 成都美尔贝科技股份有限公司 Intelligent search ordering system
CN113468393A (en) * 2021-06-09 2021-10-01 北京达佳互联信息技术有限公司 Index generation method and device, electronic equipment and storage medium
CN113486156A (en) * 2021-07-30 2021-10-08 北京鼎普科技股份有限公司 ES-based associated document retrieval method
CN115455147A (en) * 2022-09-09 2022-12-09 浪潮卓数大数据产业发展有限公司 Full-text retrieval method and system
CN117290460A (en) * 2023-11-24 2023-12-26 中孚信息股份有限公司 Method, system, device and storage medium for calculating similarity of massive texts

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778276A (en) * 2015-04-29 2015-07-15 北京航空航天大学 Multi-index combining and sequencing algorithm based on improved TF-IDF (term frequency-inverse document frequency)
CN106055546A (en) * 2015-10-08 2016-10-26 北京慧存数据科技有限公司 Optical disk library full-text retrieval system based on Lucene

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778276A (en) * 2015-04-29 2015-07-15 北京航空航天大学 Multi-index combining and sequencing algorithm based on improved TF-IDF (term frequency-inverse document frequency)
CN106055546A (en) * 2015-10-08 2016-10-26 北京慧存数据科技有限公司 Optical disk library full-text retrieval system based on Lucene

Also Published As

Publication number Publication date
CN110619036A (en) 2019-12-27

Similar Documents

Publication Publication Date Title
CN110619036B (en) Full text retrieval system based on improved TF-IDF algorithm
US8156102B2 (en) Inferring search category synonyms
US8468156B2 (en) Determining a geographic location relevant to a web page
US7765178B1 (en) Search ranking estimation
US7496567B1 (en) System and method for document categorization
US9734192B2 (en) Producing sentiment-aware results from a search query
US7620627B2 (en) Generating keywords
US20090094223A1 (en) System and method for classifying search queries
US20080133479A1 (en) Method and system for information retrieval with clustering
US20100042610A1 (en) Rank documents based on popularity of key metadata
JP2016532173A (en) Semantic information, keyword expansion and related keyword search method and system
US9208236B2 (en) Presenting search results based upon subject-versions
WO2021057250A1 (en) Commodity search query strategy generation method and apparatus
US20180032608A1 (en) Flexible summarization of textual content
CN110390094B (en) Method, electronic device and computer program product for classifying documents
CN109885813A (en) A kind of operation method, system, server and the storage medium of the text similarity based on word coverage
CN111026710A (en) Data set retrieval method and system
CN113486156A (en) ES-based associated document retrieval method
CN112507133A (en) Method, device, processor and storage medium for realizing association search based on financial product knowledge graph
CN115858731A (en) Method, device and system for matching laws and regulations of law and regulation library
JP2001184358A (en) Device and method for retrieving information with category factor and program recording medium therefor
CN115630144A (en) Document searching method and device and related equipment
CN111831884A (en) Matching system and method based on information search
CN107423298B (en) Searching method and device
CN112613320A (en) Method and device for acquiring similar sentences, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant