WO2020115530A1 - Scatter-gather approach to flat clustering for frequent text based search queries

Info

Publication number
WO2020115530A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
documents
term
frequency
matrix
Application number
PCT/IB2018/059693
Other languages
French (fr)
Inventor
Pratik Sharma
Original Assignee
Pratik Sharma
Priority date
2018-12-06
Filing date
2018-12-06
Publication date
2020-06-11
Application filed by Pratik Sharma
Priority to PCT/IB2018/059693
Publication of WO2020115530A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • Term-Document (or Document-Term) matrix: a matrix that describes the frequency of terms that occur in a collection of documents.
  • TF-IDF: Term Frequency-Inverse Document Frequency.
  • The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
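
The definitions above can be illustrated with a minimal Python sketch. It is not taken from the patent text: the stop-word list, the whitespace tokenizer, the helper names (`tokenize`, `term_document_matrix`, `tf_idf_matrix`) and the weighting `tf * log(N / df)` are all assumptions, the last being one common TF-IDF variant among several.

```python
import math
from collections import Counter

# Assumed stop-word list; the source only names "the, of, for, etc."
STOP_WORDS = {"the", "of", "for", "a", "an", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term vocab[j] in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def tf_idf_matrix(rows):
    """Weight raw counts by idf = log(N / df) (assumed variant)."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # df[j]: number of documents that contain term j (at least 1 by construction)
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    return [[r[j] * math.log(n_docs / df[j]) for j in range(n_terms)] for r in rows]

docs = [
    "the price of gold",
    "gold price forecast",
    "recipe for apple pie",
]
vocab, counts = term_document_matrix(docs)
weights = tf_idf_matrix(counts)
```

In this toy corpus "forecast" occurs in one document out of three and so receives a higher weight than "gold", which occurs in two; a term that is absent from a document has weight zero there.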

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention applies a scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which describes the frequency of terms that occur in the collection. From this matrix we create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix, which is used to compute document similarity and to create a flat set of clusters of related documents. In the scatter-gather approach, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found.

Description

Scatter-Gather Approach To Flat Clustering For Frequent Text Based Search Queries
Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. This invention applies a scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which describes the frequency of terms that occur in the collection. Frequently occurring but insignificant terms such as "the", "of" and "for" are not considered. Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection of documents: its value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. From the Term-Document matrix we create a TF-IDF matrix, which is used to compute document similarity and to create a flat set of clusters of related documents. In the scatter-gather approach, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found.
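
The loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method as specified in the source, which leaves most details open: cosine similarity as the document similarity measure, farthest-first seeding followed by a single assignment pass as the flat clustering step, "a cluster is relevant if any of its documents contains a query term" as the gather rule, and "stop when gathering no longer narrows the set" as the test for a cluster of interest. All function names and the stop-word list are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "for", "a", "an", "and", "in", "to", "is"}  # assumed list

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (term -> weight) using idf = log(N / df), an assumed variant."""
    counts = [Counter(tokens(t)) for t in texts]
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(len(texts) / df[t]) for t, tf in c.items()} for c in counts]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def flat_cluster(vectors, k):
    """Scatter step: farthest-first seeds, then assign each document to its most similar seed."""
    if not vectors:
        return []
    seeds = [0]
    while len(seeds) < min(k, len(vectors)):
        rest = [i for i in range(len(vectors)) if i not in seeds]
        # next seed: the document least similar to every seed chosen so far
        seeds.append(min(rest, key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds)))
    clusters = [[] for _ in seeds]
    for i, v in enumerate(vectors):
        best = max(range(len(seeds)), key=lambda j: cosine(v, vectors[seeds[j]]))
        clusters[best].append(i)
    return [c for c in clusters if c]

def scatter_gather(docs, query, k=2, max_rounds=10):
    """Alternate scatter (cluster) and gather (keep relevant clusters) until nothing narrows."""
    query_terms = set(tokens(query))
    current = list(range(len(docs)))
    for _ in range(max_rounds):
        # TF-IDF is recomputed on the current subset, so each round re-clusters from scratch
        vectors = tfidf_vectors([docs[i] for i in current])
        clusters = [[current[i] for i in c] for c in flat_cluster(vectors, k)]
        relevant = [c for c in clusters if any(query_terms & set(tokens(docs[i])) for i in c)]
        gathered = sorted(i for c in relevant for i in c)
        if gathered == current:  # fixed point, treated here as the cluster of interest
            break
        current = gathered
    return current

docs = ["gold price", "gold mining", "silver price", "apple pie", "apple tart"]
result = scatter_gather(docs, "mining")  # indices of the documents in the final cluster
```

For the query "mining" the first round keeps the cluster of metal-related documents, the second round re-clusters those three and keeps only the one containing the query term, and the third round confirms that nothing narrows further, so `result` is `[1]`. A production system would more likely use a library clustering routine and a similarity-based relevance score; the rules used here were chosen only to keep the sketch short and deterministic.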

Claims

The following is the claim for this invention:
1. Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. In this invention we use scatter-gather approach to flat clustering for frequently used text based search queries. Now for a given set of documents we compute the Term-Document or Document-Term matrix which is a matrix that describes the frequency of terms that occur in a collection of documents. (Note here we do not consider frequently occurring insignificant terms like the, of, for, etc.) Also here we have Term Frequency-Inverse Document Frequency (TF-IDF) which is a numerical statistic that is intended to reflect how important a word is to a document in a collection of documents. The Term Frequency-Inverse Document Frequency (TF-IDF) value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. Now from the above Term-Document matrix we create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix which is used to compute document similarity and create a flat set of clusters of documents which relate to each other. Now in the scatter-gather approach for a particular frequently used text based search query of a user we group the clusters of documents with relevant information and the resulting set is again clustered. The above process is repeated until a cluster of interest is found. The above novel technique of using scatter-gather approach to flat clustering for frequently used text based search queries is the claim for this invention.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
PCT/IB2018/059693 (WO2020115530A1, en) | 2018-12-06 | 2018-12-06 | Scatter-gather approach to flat clustering for frequent text based search queries

Publications (1)

Publication Number | Publication Date
WO2020115530A1 (en) | 2020-06-11

Family

ID=70974613

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
PCT/IB2018/059693 (WO2020115530A1, en) | Scatter-gather approach to flat clustering for frequent text based search queries | 2018-12-06 | 2018-12-06

Country Status (1)

Country | Link
WO (1) | WO2020115530A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111914536A (en) * | 2020-08-06 | 2020-11-10 | 北京嘀嘀无限科技发展有限公司 | Viewpoint analysis method, viewpoint analysis device, viewpoint analysis equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6038561A (en) * | 1996-10-15 | 2000-03-14 | Manning & Napier Information Services | Management and analysis of document information text

Legal Events

Code | Title | Description
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18942477; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | EP: PCT application non-entry in European phase | Ref document number: 18942477; Country of ref document: EP; Kind code of ref document: A1