WO2020115530A1 - Scatter-gather approach to flat clustering for frequent text based search queries - Google Patents
- Publication number
- WO2020115530A1 (application PCT/IB2018/059693)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- documents
- term
- frequency
- matrix
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
Definitions
- Term-Document (or Document-Term) matrix: a matrix that describes the frequency of terms that occur in a collection of documents.
- TF-IDF (Term Frequency-Inverse Document Frequency): a value that increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
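The publication describes TF-IDF only in words and gives no formula. One common formulation that matches the wording (a count that grows with occurrences in the document, offset by how many documents contain the word) is sketched below. The symbols are assumptions introduced here for illustration: tf(t, d) is the number of times term t appears in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t. Other weighting variants exist and the source does not say which one is intended.

```latex
\[
  \operatorname{tfidf}(t, d) \;=\; \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)}
\]
```

Under this variant, a term that appears in every document has df(t) = N and therefore a weight of zero.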
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This invention applies the scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which describes the frequency of terms that occur in the collection. From this matrix we create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix, which is used to compute document similarity and to create a flat set of clusters of related documents. In the scatter-gather step, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found.
Description
Scatter-Gather Approach To Flat Clustering For Frequent Text Based Search Queries
Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. This invention applies the scatter-gather approach to flat clustering for frequently used text-based search queries.

For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which describes the frequency of terms that occur in the collection. Frequently occurring insignificant terms such as "the", "of" and "for" are not considered. Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection of documents. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. From the Term-Document matrix we create a TF-IDF matrix, which is used to compute document similarity and to create a flat set of clusters of related documents.

In the scatter-gather step, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found.
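The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the application: the publication does not name a similarity measure, a clustering algorithm, a seeding strategy, or a stopping rule. Here cosine similarity, a simple centroid-based flat clustering seeded with the first k documents, and a size threshold (`target_size`) stand in for those unspecified parts. All function names, parameters, the stop-word list, and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Assumed stop-word list; the source only names "the, of, for, etc."
STOP_WORDS = {"the", "of", "for", "a", "an", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf(docs):
    """Build term counts per document, then weight them by log(N / df)."""
    rows = [Counter(tokenize(d)) for d in docs]      # document-term counts
    df = Counter(t for row in rows for t in row)     # documents containing each term
    n = len(rows)
    return [{t: tf * math.log(n / df[t]) for t, tf in row.items()} for row in rows]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Sum of the member vectors (direction is what matters for cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def scatter(ids, vectors, k, rounds=5):
    """Flat clustering: assign each document to its most similar centroid."""
    k = min(k, len(ids))
    centers = [vectors[i] for i in ids[:k]]          # naive seeding (assumption)
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for i in ids:
            best = max(range(k), key=lambda c: cosine(vectors[i], centers[c]))
            clusters[best].append(i)
        centers = [centroid([vectors[i] for i in c]) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return [c for c in clusters if c]

def gather(clusters, vectors, query_vec, keep):
    """Merge the `keep` clusters whose centroids best match the query."""
    ranked = sorted(
        clusters,
        key=lambda c: cosine(query_vec, centroid([vectors[i] for i in c])),
        reverse=True,
    )
    return sorted(i for c in ranked[:keep] for i in c)

def scatter_gather(docs, query, k=3, keep=1, target_size=2, max_rounds=10):
    """Repeat scatter and gather until the working set is small enough."""
    vectors = tfidf(docs)
    query_vec = Counter(tokenize(query))
    ids = list(range(len(docs)))
    for _ in range(max_rounds):
        if len(ids) <= target_size:
            break
        selected = gather(scatter(ids, vectors, k), vectors, query_vec, keep)
        if len(selected) == len(ids):                # no further narrowing
            break
        ids = selected
    return [docs[i] for i in ids]

docs = [
    "jaguar car engine speed",
    "jaguar cat jungle prey",
    "bank loan interest rate",
    "jaguar car dealer price",
    "jaguar cat habitat jungle",
    "bank account interest savings",
]
print(scatter_gather(docs, "jaguar jungle cat"))
```

In this sketch the query stands in for the user's choice of relevant clusters; an interactive system could instead let the user pick clusters by hand at each gather step. Seeding with the first k documents keeps the example deterministic but is sensitive to document order, so a real system would likely use a more robust seeding scheme.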
Claims
1. Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. This invention applies the scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which describes the frequency of terms that occur in the collection. Frequently occurring insignificant terms such as "the", "of" and "for" are not considered. Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection of documents. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. From the Term-Document matrix we create a TF-IDF matrix, which is used to compute document similarity and to create a flat set of clusters of related documents. In the scatter-gather step, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found. The above novel technique of using the scatter-gather approach to flat clustering for frequently used text-based search queries is the claim for this invention.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2018/059693 WO2020115530A1 (en) | 2018-12-06 | 2018-12-06 | Scatter-gather approach to flat clustering for frequent text based search queries |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020115530A1 (en) | 2020-06-11 |
Family
ID=70974613
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2018/059693 WO2020115530A1 (en) | 2018-12-06 | 2018-12-06 | Scatter-gather approach to flat clustering for frequent text based search queries |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020115530A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914536A (en) * | 2020-08-06 | 2020-11-10 | 北京嘀嘀无限科技发展有限公司 | Viewpoint analysis method, viewpoint analysis device, viewpoint analysis equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6038561A (en) * | 1996-10-15 | 2000-03-14 | Manning & Napier Information Services | Management and analysis of document information text |
- 2018-12-06: WO application PCT/IB2018/059693 (WO2020115530A1), status: active, Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18942477; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 18942477; Country of ref document: EP; Kind code of ref document: A1 |