WO2020115530A1

WO2020115530A1 - Scatter-gather approach to flat clustering for frequent text based search queries

Info

Publication number: WO2020115530A1
Application number: PCT/IB2018/059693
Authority: WO
Inventors: Pratik Sharma
Original assignee: Pratik Sharma
Priority date: 2018-12-06
Filing date: 2018-12-06
Publication date: 2020-06-11

Abstract

This invention applies a scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which records how often each term occurs in each document of the collection. From this matrix we derive a Term Frequency-Inverse Document Frequency (TF-IDF) matrix, which is used to compute document similarity and to create a flat set of clusters of related documents. In the scatter-gather step, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information and cluster the resulting set again. This process is repeated until a cluster of interest is found.

Description

Scatter-Gather Approach To Flat Clustering For Frequent Text Based Search Queries

Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. This invention applies a scatter-gather approach to flat clustering for frequently used text-based search queries.

For a given set of documents, we compute the Term-Document (or Document-Term) matrix, which records how often each term occurs in each document of the collection. Frequently occurring but insignificant terms such as "the", "of" and "for" are not considered. Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document within a collection of documents. The TF-IDF value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. From the Term-Document matrix we derive a TF-IDF matrix, which is used to compute document similarity and to create a flat set of clusters of related documents.

In the scatter-gather step, for a particular frequently used text-based search query of a user, we group the clusters of documents that contain relevant information and cluster the resulting set again. This process is repeated until a cluster of interest is found.
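The steps described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention; the description does not name a clustering algorithm, a similarity measure, a relevance test or a stopping rule, so the following are all assumptions made for illustration: a short stop-word list, a smoothed IDF, cosine similarity, a small k-means-style flat clustering with farthest-first seeding, a binary query vector, the rule that a cluster is "relevant" when its centroid shares at least one term with the query, and a hypothetical `target_size` that stands in for "a cluster of interest is found". All function and parameter names are hypothetical.

```python
import math
from collections import Counter

# Assumed stop-word list; the description only names "the", "of", "for", etc.
STOP_WORDS = {"the", "of", "for", "a", "an", "and", "in", "to", "is"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]      # term-document counts
    df = Counter(t for c in counts for t in c)         # documents containing t
    n = len(docs)
    # Smoothed IDF (an assumption): terms found in many documents weigh less.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
            for c in counts]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)                                # adds weights term by term
    return {t: w / len(vectors) for t, w in total.items()}

def flat_cluster(ids, vecs, k, rounds=10):
    """Partition document ids into at most k flat clusters (k-means style)."""
    seeds = [ids[0]]
    while len(seeds) < min(k, len(ids)):               # farthest-first seeding
        rest = [i for i in ids if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(cosine(vecs[i], vecs[s])
                                                 for s in seeds)))
    centers = [vecs[s] for s in seeds]
    groups = []
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for i in ids:
            best = max(range(len(centers)),
                       key=lambda c: cosine(vecs[i], centers[c]))
            groups[best].append(i)
        groups = [g for g in groups if g]              # drop empty clusters
        centers = [centroid([vecs[i] for i in g]) for g in groups]
    return groups

def scatter_gather(docs, query, k=2, target_size=2, max_rounds=5):
    """Repeatedly cluster (scatter), keep relevant clusters, and merge (gather)."""
    vecs = tfidf_vectors(docs)
    q = {t: 1.0 for t in tokenize(query)}              # binary query vector
    ids = list(range(len(docs)))
    for _ in range(max_rounds):
        if len(ids) <= target_size:                    # assumed stopping rule
            break
        groups = flat_cluster(ids, vecs, k)            # scatter
        relevant = [g for g in groups
                    if cosine(q, centroid([vecs[i] for i in g])) > 0]
        gathered = sorted(i for g in relevant for i in g)   # gather
        if not gathered:
            return []
        if len(gathered) == len(ids):                  # no further narrowing
            break
        ids = gathered
    return ids

DOCS = [
    "python code for sorting",
    "python code for the list",
    "chocolate cake recipe",
    "vanilla cake recipe",
    "bread baking recipe",
    "chicken soup recipe",
]
print(scatter_gather(DOCS, "chocolate cake"))  # → [2, 3]
```

On this toy corpus the first round separates the programming documents from the recipe documents, and the following rounds re-cluster the gathered recipe documents until only the two cake recipes remain. A production system would more likely use a sparse-matrix library and a tuned clustering method; the pure-Python version is used here only to keep the sketch self-contained.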

Claims

Following is the claim for this invention:

1. Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. In this invention we use scatter-gather approach to flat clustering for frequently used text based search queries. Now for a given set of documents we compute the Term-Document or Document-Term matrix which is a matrix that describes the frequency of terms that occur in a collection of documents. (Note here we do not consider frequently occurring insignificant terms like the, of, for, etc.) Also here we have Term Frequency-Inverse Document Frequency (TF-IDF) which is a numerical statistic that is intended to reflect how important a word is to a document in a collection of documents. The Term Frequency-Inverse Document Frequency (TF-IDF) value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. Now from the above Term-Document matrix we create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix which is used to compute document similarity and create a flat set of clusters of documents which relate to each other. Now in the scatter-gather approach for a particular frequently used text based search query of a user we group the clusters of documents with relevant information and the resulting set is again clustered. The above process is repeated until a cluster of interest is found. The above novel technique of using scatter-gather approach to flat clustering for frequently used text based search queries is the claim for this invention.