US20210349929A1 - Recursive agglomerative clustering of time-structured communications - Google Patents

Recursive agglomerative clustering of time-structured communications

Info

Publication number
US20210349929A1
US20210349929A1 (application US17/384,972; US202117384972A)
Authority
US
United States
Prior art keywords: label, document, clusters, term, cluster
Legal status: Abandoned (the status is an assumption and is not a legal conclusion)
Application number
US17/384,972
Inventor
Viacheslav Seledkin
David Yan
Marina Chilingaryan
Current Assignee
Findo Inc
Visier Solutions Inc
Original Assignee
Findo Inc
Yvaai Inc
Application filed by Findo Inc and Yvaai Inc
Priority to US17/384,972
Assigned to YVA.AI, INC. (change of name; assignor: Findo Inc.)
Assigned to FINDO, INC. (assignment of assignors' interest; assignors: Marina Chilingaryan, David Yan, Viacheslav Seledkin)
Publication of US20210349929A1
Assigned to VISIER SOLUTIONS INC. (assignment of assignors' interest; assignor: YVA.AI, INC.)
Priority to US17/950,067 (published as US20230078263A1)

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30: Information retrieval of unstructured textual data
              • G06F 16/33: Querying
                • G06F 16/3331: Query processing
                  • G06F 16/334: Query execution
                    • G06F 16/3347: Query execution using vector based model
              • G06F 16/35: Clustering; Classification
                • G06F 16/358: Browsing; Visualisation therefor
            • G06F 16/90: Details of database functions independent of the retrieved data types
              • G06F 16/93: Document management systems


Abstract

An example method of document cluster labeling comprises: selecting a current document cluster of a plurality of document clusters; initializing a label associated with the current document cluster; selecting a term from a list of terms comprised by the document cluster; appending the term to the label associated with the current document cluster; responsive to determining that the label is found in a label dictionary, iteratively selecting a next term from the list of terms comprised by the document cluster and appending the next term to the label associated with the current document cluster; responsive to failing to locate the label in the label dictionary, inserting the label into the label dictionary; and associating the label with the current document cluster.

Description

    REFERENCE TO RELATED APPLICATIONS
  • This application is a divisional of U.S. patent application Ser. No. 15/972,952, filed May 7, 2018, which claims the benefit of U.S. Patent Application No. 62/504,390, filed May 10, 2017. The above-referenced applications are incorporated by reference herein in their respective entireties.
  • TECHNICAL FIELD
  • The present disclosure is generally related to computer systems, and is more specifically related to systems and methods of processing structured communications.
  • BACKGROUND
  • In the digital age, users of electronic communication systems, such as electronic mail and other messaging systems, are forced to deal with unprecedentedly large volumes of information, a volume that grows exponentially with the increasing number of files, contacts, documents, and other types of data communicated between the users on a daily basis. This dramatic increase can be explained by a number of factors: the number of activities and projects that users are involved in keeps growing; the electronic communication solutions at the users' disposal have expanded, ranging from electronic mail and messengers to integrated business communication platforms; and the number of data sources grows with each technological and software advancement.
  • SUMMARY
  • An example method of document clustering may comprise: representing each document of a plurality of documents by a vector comprising a first plurality of real values, wherein each real value of the first plurality of real values reflects a first frequency-based metric of a term comprised by the document; partitioning the plurality of documents into a first set of document clusters based on distances between vectors representing the documents; representing each document cluster of the first set of document clusters by a vector comprising a second plurality of real values, wherein each real value of the second plurality of real values reflects a second frequency-based metric of a term comprised by the document cluster; and partitioning the first set of document clusters into a second set of document clusters based on distances between vectors representing the document clusters of the first set of document clusters.
  • Another example method of document clustering may comprise: representing each document cluster of a first set of document clusters by a vector comprising a plurality of real values, wherein each real value reflects a frequency-based metric of a term comprised by the document cluster, wherein the frequency-based metric is provided by a function of a ratio of a number of largest document clusters in the set of document clusters and a number of the largest clusters which include the term; and partitioning the first set of document clusters into a second set of document clusters based on distances between vectors representing document clusters of the set of document clusters.
  • Another example method of document clustering may comprise: representing each document of a plurality of documents by a vector comprising a plurality of real values, wherein each real value reflects a frequency-based metric of a term comprised by the document; and partitioning the plurality of documents into a set of document clusters based on distances between vectors representing the documents, wherein a distance between a first vector representing a first document of the plurality of documents and a second vector representing a second document of the plurality of documents is provided by a function of a time-sensitive factor and a content-sensitive factor, wherein the time-sensitive factor is determined based on at least one of: a first time identifier associated with the first document and a second time identifier associated with the second document.
  • An example method of document cluster labeling may comprise: selecting a current document cluster of a plurality of document clusters; initializing a label associated with the current document cluster; selecting a term from a list of terms comprised by the document cluster; appending the term to the label associated with the current document cluster; responsive to determining that the label is found in a label dictionary, iteratively selecting a next term from the list of terms comprised by the document cluster and appending the next term to the label associated with the current document cluster; responsive to failing to locate the label in the label dictionary, inserting the label into the label dictionary; and associating the label with the current document cluster.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present disclosure is illustrated by way of examples, and not by way of limitation, and may be more fully understood with references to the following detailed description when considered in connection with the figures, in which:
  • FIG. 1 schematically illustrates an example recursive agglomerative clustering procedure implemented in accordance with one or more aspects of the present disclosure;
  • FIG. 2 depicts a flow diagram of an example method of recursive clustering, in accordance with one or more aspects of the present disclosure;
  • FIG. 3 depicts a flow diagram of an example method of document cluster labeling, in accordance with one or more aspects of the present disclosure; and
  • FIG. 4 schematically illustrates a component diagram of an example computer system which may perform the methods described herein.
  • DETAILED DESCRIPTION
  • Described herein are systems and methods for recursive agglomerative clustering of time-structured communications.
  • The efficiency of handling large volumes of information conveyed by multiple documents may be improved by performing document classification, i.e., associating each textual document with a category of documents. Document clustering is a classification methodology which involves grouping a set of documents into a plurality of clusters, such that the number of clusters and/or the distinguishing characteristics of each cluster may not be known a priori.
  • Results of document clustering may be visualized by representing each document by a vector (or a point) in the hyperspace of document features. Various document clustering methodologies are based on the notion of the local density in the vicinity of the point representing a document, where the density is measured by the number of neighboring points found within the vicinity of a given point. Thus, a cluster may be represented by a group of points that has a relatively higher density than its surrounding areas. The documents that are not assigned to any cluster may be considered outliers conveying informational noise.
  • In an illustrative example, according to the DBSCAN algorithm, documents may be assigned to clusters by a procedure that groups together the points that have a relatively high number of nearby neighbors (e.g., the number of neighbors exceeding a threshold value), marking as outliers the points that lie in low-density regions. The algorithm preserves mutual reachability of documents within a single cluster; that is, for any pair of documents from a certain cluster, there should be a path which is completely contained within the cluster and which passes through the core of the cluster. In another illustrative example, according to the OPTICS algorithm, the problem of detecting meaningful clusters in a data set of varying density is addressed by linearly ordering the points such that spatially closest points become neighbors in the ordering. Additionally, a special value is stored for each point, representing the density that needs to be accepted for a cluster in order for both points to belong to the same cluster.
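  • By way of illustration only (not part of the original disclosure), the following Python sketch shows how such a density-based procedure might be invoked on a precomputed matrix of pairwise document distances using scikit-learn's DBSCAN; the eps and min_samples values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# D is a precomputed, symmetric matrix of pairwise document distances,
# e.g., produced by the time- and content-sensitive metric described below.
rng = np.random.default_rng(0)
D = rng.random((100, 100))
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)

labels = DBSCAN(eps=0.4, min_samples=5, metric="precomputed").fit_predict(D)
# labels[i] == -1 marks document i as an outlier (informational noise);
# all other values are cluster identifiers.
```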
  • However, the inventors noted that applying various local density-based clustering methods to electronic mail messages does not always produce satisfactory results. Electronic mail messages represent a special type of textual document, in that they follow a certain structure which specifies mandatory fields (such as sender, receiver, one or more timestamps, etc.) and optional fields which may be left blank (such as the subject of the message, the body of the message, references to related messages, etc.). Bodies of electronic mail messages are usually shorter than those of other document types, which may impair the ability of common document classification methods to produce useful results when applied to electronic mail messages, since common classification methods usually operate on document features extracted from document bodies. Furthermore, being unaware of the electronic mail message structure that describes various metadata fields, common classification methods may fail to extract and utilize useful information conveyed by those metadata fields.
  • The present disclosure addresses the above-noted and other deficiencies of common document classification methods, by providing methods of recursive agglomerative clustering which take into account document metadata, such as timestamps, message subjects, and sending/receiving party identifiers, as described in more detail herein below. Thus, implementations of the present disclosure represent improvements to the functionality of general purpose and/or specialized computer systems.
  • The systems and methods described herein facilitate efficient navigation through large collections of documents, by classifying the documents and visually representing the classification results. In certain implementations, a clustering procedure may operate on the document features that are extracted from the sender and recipient identifiers specified by each message, such as the sender address (specified by From: field of the electronic mail message header) and one or more recipient addresses (specified by To: and Cc: fields of the electronic mail message header). In order to further improve the clustering quality, the clustering procedure may include several consecutive stages, such that each stage employs a special technique of re-weighting the components of the document feature vector. Clustering methods of the present disclosure do not require any supervised learning, thus efficiently implementing the data-driven approach to data classification.
  • The systems and methods described herein may be implemented by hardware (e.g., general purpose and/or specialized processing devices, and/or other devices and associated circuitry), software (e.g., instructions executable by a processing device), or a combination thereof. Various aspects of the methods and systems are described herein by way of examples, rather than by way of limitation. In particular, certain specific examples are referenced and described herein for illustrative purposes only and do not limit the scope of the present disclosure to any particular bus width values.
  • As noted herein above, a document (e.g., an electronic mail message) may be represented by a vector of features, which are derived from the terms extracted from the document body and/or document metadata. Accordingly, a named entity extraction pipeline may be employed to extract the named entities from To:, Cc:, and/or From: fields of a corpus of electronic mail messages (e.g., a user's electronic mailbox). In certain implementations, another named entity extraction pipeline may be employed to extract the named entities from the body and/or subject line of the electronic messages. In certain implementations, yet another extraction pipeline may be employed for extracting document timestamps.
  • Each extracted entity name may be case-normalized and transformed into one or more terms, such that each term would comprise one or more tokens (words) of the entity name. In an illustrative example, the entity name “John Smith” would produce the following terms: “John,” “Smith,” and “John Smith.”
  • Electronic mail addresses may be tokenized into the name part and domain part. In an illustrative example, the electronic mail address JohnSmith@data.services.com would produce the following name terms: “John,” “Smith,” and “John Smith” and the following domain terms: “Data,” “Services,” “Data Services.” The top-most domain (e.g., .com, .org, etc.) may be discarded as it usually does not convey any useful information.
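  • A minimal Python sketch of this term extraction follows (illustrative only; the helper names and the CamelCase-splitting heuristic are assumptions, as the disclosure does not prescribe a particular tokenizer):

```python
import re

def name_terms(raw: str) -> list[str]:
    # Split "JohnSmith" or "John Smith" into case-normalized tokens,
    # then emit each token plus the full multi-token term.
    tokens = [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", raw)]
    return tokens + ([" ".join(tokens)] if len(tokens) > 1 else [])

def address_terms(address: str) -> list[str]:
    # Tokenize an e-mail address into name terms and domain terms,
    # discarding the top-most domain (.com, .org, etc.).
    local, _, domain = address.partition("@")
    labels = [d.lower() for d in domain.split(".")[:-1]]
    terms = name_terms(local) + labels
    if len(labels) > 1:
        terms.append(" ".join(labels))
    return terms

# address_terms("JohnSmith@data.services.com")
# -> ['john', 'smith', 'john smith', 'data', 'services', 'data services']
```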
  • Every document may then be mapped to a multi-dimensional sparse vector in the hyperspace of the document features, e.g., using the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme, according to which each document is represented by a vector of TF-IDF values.
  • Term frequency (TF) represents the frequency of occurrence of a given word in the document:

  • $$\mathrm{tf}(t,d) = \frac{n_t}{\sum_k n_k}$$
  • where $t$ is the word identifier, $d$ is the document identifier, $n_t$ is the number of occurrences of the word $t$ within document $d$, and $\sum_k n_k$ is the total number of words within document $d$.
  • Inverse document frequency (IDF) is the logarithmic ratio of the number of documents in the analyzed corpus to the number of documents containing the given word:

  • $$\mathrm{idf}(t,d) = \log\frac{N_d}{\mathrm{df}_t}$$
  • where $N_d$ is the number of documents in the corpus being analyzed, and $\mathrm{df}_t$ is the number of documents which contain the word $t$.
  • Thus, each document may be represented by a vector of TF-IDF values corresponding to the words comprised by the document:
  • $$V_d = [w_1, w_2, \ldots, w_n], \qquad w_t = \mathrm{tf}_t \cdot \log\frac{N_d}{\mathrm{df}_t}$$
  • where $\mathrm{tf}_t$ is the term frequency of term $t$ in document $d$, $N_d$ is the number of documents, and $\mathrm{df}_t$ is the number of documents containing term $t$.
  • Communication between people may be viewed as a time structured process, hence, in certain implementations, the clustering procedure may further take into account the timestamps of the documents. Accordingly, the distance between two documents in the hyperspace of the document features may be represented by a product of the time-sensitive factor and the content-sensitive factor as follows:
  • $$S(\bar{V}_{d1}, \bar{V}_{d2}) = S_{time} \cdot S_{con}$$
  • $$S_{time} = 1 + \frac{|t_{d1} - t_{d2}|}{T}, \qquad S_{con} = \frac{2}{\pi}\arccos\left(\frac{\bar{V}_{d1} \cdot \bar{V}_{d2}}{\lVert \bar{V}_{d1} \rVert \, \lVert \bar{V}_{d2} \rVert}\right)$$
  • where $T$ is the time sensitivity parameter, $t_{d1}$, $t_{d2}$ are the document timestamps, and $\bar{V}_{d1}$, $\bar{V}_{d2}$ are the document vectors.
  • The normalized angular form of $S_{con}$ is chosen instead of the cosine similarity in order to produce a normalized distance metric whose values range from 0 to 1.
  • While various implementations of clustering procedures may suffer from very high computational complexity due to the need to compute distance metric values for a large number of document pairs, the methods and systems of the present disclosure alleviate this issue by skipping the computation of the expensive $S_{con}$ component whenever the computationally cheap $S_{time}$ component exceeds a certain threshold.
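  • A sketch of this distance computation with the early exit follows (illustrative only; the default time-sensitivity parameter T and the cutoff are assumed values, not prescribed by the disclosure):

```python
import math

def distance(v1: dict, v2: dict, t1: float, t2: float,
             T: float = 7 * 24 * 3600, cutoff: float | None = None) -> float:
    """S = S_time * S_con over sparse TF-IDF dicts and timestamps (seconds)."""
    s_time = 1.0 + abs(t1 - t2) / T
    if cutoff is not None and s_time > cutoff:
        # The pair is already far apart in time: skip the expensive
        # content-sensitive factor and report the pair as distant.
        return math.inf
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return math.inf
    cos = min(1.0, max(-1.0, dot / (n1 * n2)))
    s_con = (2.0 / math.pi) * math.acos(cos)   # normalized angular distance
    return s_time * s_con
```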
  • In order to further improve the clustering quality, the clustering procedure may include several consecutive stages, such that each stage employs a special technique of re-weighting the components of the document feature vector. FIG. 1 schematically illustrates an example recursive agglomerative clustering procedure implemented in accordance with one or more aspects of the present disclosure. The clustering procedure may start by utilizing the above-described or a similar distance metric to perform the initial clustering operation 110 for partitioning a large number of input documents into a relatively large number of clusters.
  • The inventors noted that terms which are shared by a large number of clusters are noisy, and that reducing their weight may be beneficial for increasing the clustering quality. The inventors further noted that the majority of such noisy terms occur within a small number of large clusters formed by the initial clustering operation. Based on these observations, reweighting operation 120 of FIG. 1 may re-calculate the TF-IDF metrics as described in more detail herein below.
  • Treating every cluster as a document, the IDF component of the term weight may be defined as follows:
  • $$IDF_t = \log\frac{NC_0}{cf_{t,0}}$$
  • where $NC_0$ is the number of clusters produced by the initial clustering operation, and $cf_{t,0}$ is the number of those clusters containing term $t$.
  • Furthermore, taking only the $NC_{top}$ largest clusters into account:
  • $$IDF_{top,t} = \log\frac{NC_{top,0}}{cf_{top,t,0}}$$
  • where $cf_{top,t,0}$ is the number of top clusters containing term $t$. By design, $IDF_{top,t}$ has a small value for terms shared by a large number of top clusters.
  • In order to alleviate the negative effect of noisy terms, the IDF metric may be modified as follows:

  • $$IDF_{opt,t} = \begin{cases} IDF_{top,t}, & \text{if } IDF_{top,t} < LC \\ IDF_t, & \text{otherwise} \end{cases}$$
  • where $LC$ is a global clustering parameter which balances the choice between noisy and information-bearing terms.
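  • The re-weighting rule may be sketched as follows (illustrative only; representing each cluster as a bag of terms, ranking the top clusters by size, and the choices of NC_top and LC are assumptions left to the caller):

```python
import math
from collections import Counter

def idf_opt(clusters: list[list[str]], n_top: int, lc: float) -> dict[str, float]:
    """clusters holds one list of member terms per cluster, each cluster
    being treated as a document. Returns IDF_opt,t: the top-cluster IDF
    when it is small (< LC), i.e., for noisy terms shared by many large
    clusters, and the plain per-cluster IDF otherwise."""
    cf = Counter(t for c in clusters for t in set(c))        # cf_t,0
    top = sorted(clusters, key=len, reverse=True)[:n_top]    # NC_top largest
    cf_top = Counter(t for c in top for t in set(c))         # cf_top,t,0
    out = {}
    for t, n in cf.items():
        idf_t = math.log(len(clusters) / n)
        idf_top = math.log(len(top) / cf_top[t]) if t in cf_top else idf_t
        out[t] = idf_top if idf_top < lc else idf_t
    return out
```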
  • Clustering operation 130 of FIG. 1 treats every initial cluster as a document and associates the following vector with every cluster:

  • $$V_{c,0} = [w_{1,0}, w_{2,0}, \ldots, w_{n,0}], \qquad w_{t,0} = \mathrm{tf}_{t,0} \cdot IDF_{opt,t}$$
  • where $\mathrm{tf}_{t,0}$ is the term frequency of term $t$ in cluster $c$.
  • The resulting vectors are then clusterized by a density-based clustering procedure. In an illustrative example, documents may be assigned to clusters by a procedure that groups together the points that have a relatively high number of nearby neighbors (e.g., the number of neighbors found within a specified vicinity of a given point should exceed a threshold value), marking as outliers the points that lie in the remaining low-density regions. Thus, clustering operation 130 of FIG. 1 produces a significantly lower number of clusters as compared to the initial number of clusters: while some of the clusters produced by initial clustering operation 110 may survive the subsequent clustering operation 130, at least some of the initial clusters would be merged by the subsequent clustering operation 130.
  • In certain implementations the reweighting and clustering operations 120-130 may be iteratively repeated until the number of clusters has stabilized (i.e., is not significantly changed by performing the last reweighting/clustering operation). Iteratively applying clustering and reweighting steps gradually improves the clustering quality through aggregation of small clusters produced by the previous iteration, followed by discrimination of noisy features. Thus, the clustering procedure produces a relatively small number of large clusters reflecting the user activity structured by communication and temporal aspects.
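  • The outer loop may accordingly be sketched as follows (illustrative only; cluster_fn and reweight_fn stand for the density-based clustering and re-weighting steps described above, and the stabilization tolerance and iteration cap are assumed values):

```python
def recursive_clustering(doc_vectors, cluster_fn, reweight_fn,
                         tol: float = 0.05, max_iter: int = 10):
    """Cluster the documents, then repeatedly treat the resulting
    clusters as documents with re-weighted vectors until the number
    of clusters stabilizes."""
    clusters = cluster_fn(doc_vectors)            # initial clustering (110)
    for _ in range(max_iter):
        cluster_vectors = reweight_fn(clusters)   # re-weighting step (120)
        merged = cluster_fn(cluster_vectors)      # agglomerative pass (130)
        if abs(len(merged) - len(clusters)) <= tol * len(clusters):
            return merged                         # cluster count stabilized
        clusters = merged
    return clusters
```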
  • FIG. 2 depicts a flow diagram of an example method 200 of recursive clustering, in accordance with one or more aspects of the present disclosure. Method 200 produces the initial sets of document clusters and then iteratively treats the clusters produced by the previous iteration as documents which are further clusterized, as described in more detail herein above with reference to FIG. 1. Method 200 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer system (e.g., the computer system 1000 of FIG. 4) implementing the method. In certain implementations, method 200 may be performed by a single processing thread. Alternatively, method 200 may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations of the method. In an illustrative example, the processing threads implementing method 200 may be synchronized (e.g., using semaphores, critical sections, and/or other thread synchronization mechanisms). Alternatively, the processing threads implementing method 200 may be executed asynchronously with respect to each other.
  • At block 210, the computer system implementing the method may receive a document corpus comprising a plurality of documents. In an illustrative example, the document corpus may be provided by an electronic mailbox comprising a plurality of electronic mail messages.
  • At block 220, the computer system may associate each document of the document corpus with a vector of real values, such that each real value reflects a frequency-based metric of a term comprised by the document. In various illustrative examples, the term may be provided by an identifier of a named entity comprised by the document or a time identifier (such as a timestamp) associated with the document. The frequency-based metric may be provided by a TF-IDF metric, as described in more detail herein above.
  • At block 230, the computer system may partition the corpus of documents into an initial set of document clusters by a density-based clustering procedure which utilizes a distance-based metric reflecting distances between the vectors representing the documents. In an illustrative example, the distance between two vectors representing two documents may be reflected by a function of a time-sensitive factor and a content-sensitive factor. The time-sensitive factor may take into account the difference between the timestamps of the documents. The content-sensitive factor may be computed based on the TF-IDF metric values of the terms comprised by the documents. Thus, the distance metric may be expressed by the following equations:
  • $$S(\bar{V}_{d1}, \bar{V}_{d2}) = S_{time} \cdot S_{con}, \qquad S_{time} = 1 + \frac{|t_{d1} - t_{d2}|}{T}, \qquad S_{con} = \frac{2}{\pi}\arccos\left(\frac{\bar{V}_{d1} \cdot \bar{V}_{d2}}{\lVert \bar{V}_{d1} \rVert \, \lVert \bar{V}_{d2} \rVert}\right),$$
  • as described in more detail herein above.
  • At block 240, the computer system may represent by a vector of real values each document cluster of the set of document clusters produced by the previous iteration, such that each real value reflects a frequency-based metric of a term comprised by the document cluster. In an illustrative example, the frequency-based metric may be provided by a function which reflects the ratio of the number of largest document clusters in the set of document clusters and the number of the largest clusters which include the term, which may be expressed by the following equations

  • $$IDF_{opt,t} = \begin{cases} IDF_{top,t}, & \text{if } IDF_{top,t} < LC \\ IDF_t, & \text{otherwise,} \end{cases}$$
  • as described in more detail herein above.
  • At block 250, the computer system may partition the set of document clusters produced by the previous iteration into a new set of document clusters by a density-based clustering procedure which utilizes a distance-based metric reflecting distances between the vectors representing the document clusters of the initial set of document clusters. In an illustrative example, each cluster may be represented by the following vector:

  • $$V_{c,0} = [w_{1,0}, w_{2,0}, \ldots, w_{n,0}], \qquad w_{t,0} = \mathrm{tf}_{t,0} \cdot IDF_{opt,t}$$
  • where $\mathrm{tf}_{t,0}$ is the term frequency of term $t$ in cluster $c$.
  • The same distance metric as described herein above with reference to block 230 may be utilized for performing operations of block 250.
  • Responsive to determining, at block 260, that a terminating condition has been met, the method may terminate; otherwise, the method may loop back to block 240. In an illustrative example, evaluating the terminating condition may involve ascertaining that the number of clusters has stabilized (i.e., has not significantly changed by performing the last reweighting/clustering operation), as described in more detail herein above.
  • As noted herein above, the classification results may be visually represented via a graphical user interface. Visually representing the clusters may involve assigning a human-readable label to every cluster. Such a label should be short, should reflect the cluster content, and should be distinct from other cluster labels.
  • The cluster labeling method operating in accordance with one or more aspects of the present disclosure may start by sorting the clusters by the respective numbers of documents comprised by each cluster. For each cluster starting from the topmost one, a sorted list of terms may be built according to the term weights. All partial features introduced by the above-described tokenization procedure, such as parts of entity names, may be discarded when producing the sorted lists of terms.
  • The labeling method may initialize and maintain a dictionary of labels that have already been used as cluster labels. For each cluster starting from the topmost one, the first label from its sorted list of terms may be designated as the label for the cluster. If the cluster label is not found in the label dictionary, the label may be appended to the label dictionary, and the method may loop back to processing the next cluster on the list. Otherwise, if the cluster label has already been found in the label dictionary, the next term from the sorted list of terms may be appended to the cluster label, which may be repeated iteratively until the modified label is not found in the label dictionary, as described in more detail herein below with reference to FIG. 3.
  • FIG. 3 depicts a flow diagram of an example method 300 of document cluster labeling, in accordance with one or more aspects of the present disclosure. Method 300 assigns a short, distinctive human-readable label to each cluster of a plurality of document clusters, as described in more detail herein above. Method 300 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer system (e.g., the computer system 1000 of FIG. 4) implementing the method. In certain implementations, method 300 may be performed by a single processing thread. Alternatively, method 300 may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations of the method. In an illustrative example, the processing threads implementing method 300 may be synchronized (e.g., using semaphores, critical sections, and/or other thread synchronization mechanisms). Alternatively, the processing threads implementing method 300 may be executed asynchronously with respect to each other.
  • At block 310, the computer system implementing the method may initialize, with an empty list, a label dictionary associated with a plurality of document clusters.
  • At block 315, the computer system may sort, in descending order, the plurality of document clusters by the respective number of documents comprised by each cluster.
  • At block 320, the computer system may initialize the pointer to the sorted list of clusters to select the first cluster from the sorted list of clusters.
  • At block 325, the computer system may initialize, with an empty value, a label associated with the currently selected cluster.
  • At block 330, the computer system may sort the list of terms of the currently selected cluster by term weight, in descending order. All partial features introduced by the above-described tokenization procedure, such as parts of entity names, may be discarded when producing the sorted list of terms.
  • At block 335, the computer system may initialize the pointer to the sorted list of terms to select the first term from the sorted list of terms of the currently selected cluster.
  • At block 340, the computer system may append the currently selected term to the label associated with the currently selected cluster.
  • Responsive to determining, at block 345, that the label is found in the label dictionary, the computer system may, at block 350, increment the pointer to the list of terms, and the method may loop back to block 340. Otherwise, responsive to determining, at block 345, that the label is not found in the label dictionary, the computer system may, at block 355, insert the label into the label dictionary.
  • At block 360, the computer system may associate the label with the currently selected cluster.
  • At block 365, the computer system may increment the pointer to the sorted list of clusters. Responsive to determining, at block 370, that the list of clusters has not yet been exhausted, the method may loop back to block 325; otherwise, the method may terminate.
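  • The complete labeling loop of blocks 310 through 370 can be sketched as follows, with each cluster given as a (document_count, term_weights) pair and partial features such as parts of entity names assumed to be already filtered out; the data shapes and names are illustrative:

```python
def label_clusters(clusters):
    """Assign a short, distinct label to each cluster, largest-first."""
    label_dictionary = set()                    # block 310
    labels = []
    # Block 315: process clusters in descending order of document count.
    for num_docs, term_weights in sorted(clusters, key=lambda c: -c[0]):
        # Block 330: terms sorted by descending weight.
        ranked = sorted(term_weights, key=term_weights.get, reverse=True)
        label = ""                              # block 325
        for term in ranked:                     # blocks 335-350
            label = (label + " " + term).strip()
            if label not in label_dictionary:   # block 345
                break
        label_dictionary.add(label)             # block 355
        labels.append((num_docs, label))        # block 360
    return labels
```

  • For example, if the two largest clusters both rank "invoice" as their top term, the first would receive the label "invoice" and the second would proceed to "invoice payment", preserving distinctness without sacrificing relevance.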
  • FIG. 4 schematically illustrates a component diagram of an example computer system 1000 which may perform the methods described herein. Example computer system 1000 may be connected to other computer systems in a LAN, an intranet, an extranet, and/or the Internet. Computer system 1000 may operate in the capacity of a server in a client-server network environment. Computer system 1000 may be a personal computer (PC), a set-top box (STB), a server, a network router, switch or bridge, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Further, while only a single example computer system is illustrated, the term “computer” shall also be taken to include any collection of computers that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein.
  • Example computer system 1000 may comprise a processing device 1002 (also referred to as a processor or CPU), a main memory 1004 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), etc.), a static memory 1006 (e.g., flash memory, static random access memory (SRAM), etc.), and a secondary memory (e.g., a data storage device 1018), which may communicate with each other via a bus 1030.
  • Processing device 1002 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, processing device 1002 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 1002 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In accordance with one or more aspects of the present disclosure, processing device 1002 may be configured to execute instructions implementing method 200 of recursive clustering and/or method 300 of document cluster labeling, in accordance with one or more aspects of the present disclosure.
  • Example computer system 1000 may further comprise a network interface device 1008, which may be communicatively coupled to a network 1020. Example computer system 1000 may further comprise a video display 1010 (e.g., a liquid crystal display (LCD), a touch screen, or a cathode ray tube (CRT)), an alphanumeric input device 1012 (e.g., a keyboard), a cursor control device 1014 (e.g., a mouse), and an acoustic signal generation device 1016 (e.g., a speaker).
  • Data storage device 1018 may include a computer-readable storage medium (or more specifically a non-transitory computer-readable storage medium) 1028 on which is stored one or more sets of executable instructions 1026. In accordance with one or more aspects of the present disclosure, executable instructions 1026 may comprise executable instructions encoding various functions of method 200 of recursive clustering and/or method 300 of document cluster labeling, in accordance with one or more aspects of the present disclosure.
  • Executable instructions 1026 may also reside, completely or at least partially, within main memory 1004 and/or within processing device 1002 during execution thereof by example computer system 1000, main memory 1004 and processing device 1002 also constituting computer-readable storage media. Executable instructions 1026 may further be transmitted or received over a network via network interface device 1008.
  • While computer-readable storage medium 1028 is shown in FIG. 4 as a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of executable instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine that cause the machine to perform any one or more of the methods described herein. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
  • Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
  • It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “identifying,” “determining,” “storing,” “adjusting,” “causing,” “returning,” “comparing,” “creating,” “stopping,” “loading,” “copying,” “throwing,” “replacing,” “performing,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
  • Examples of the present disclosure also relate to an apparatus for performing the methods described herein. This apparatus may be specially constructed for the required purposes, or it may be a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic disk storage media, optical storage media, flash memory devices, other type of machine-accessible storage media, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
  • The methods and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description below. In addition, the scope of the present disclosure is not limited to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure.
  • It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementation examples will be apparent to those of skill in the art upon reading and understanding the above description. Although the present disclosure describes specific examples, it will be recognized that the systems and methods of the present disclosure are not limited to the examples described herein, but may be practiced with modifications within the scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense. The scope of the present disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (20)

What is claimed is:
1. A method of document cluster labeling, the method comprising:
selecting, by a processing device, a current document cluster of a plurality of document clusters;
initializing a label associated with the current document cluster;
selecting a term from a list of terms comprised by the document cluster;
appending the term to the label associated with the current document cluster;
responsive to determining that the label is found in a label dictionary, iteratively selecting a next term from the list of terms comprised by the document cluster and appending the next term to the label associated with the current document cluster;
responsive to failing to locate the label in the label dictionary, inserting the label into the label dictionary; and
associating the label with the current document cluster.
2. The method of claim 1, further comprising:
sorting the plurality of document clusters by a number of documents comprised by a respective document cluster.
3. The method of claim 1, further comprising:
sorting the list of terms by a respective term weight.
4. The method of claim 1, further comprising:
excluding, from the list of terms, a term comprising at least part of an entity name.
5. The method of claim 1, further comprising:
visually representing, via a graphical user interface, one or more clusters of the plurality of document clusters in a visual association with respective labels.
6. The method of claim 1, wherein the plurality of document clusters comprise a plurality of electronic mail messages.
7. The method of claim 1, wherein the plurality of document clusters comprise a plurality of documents represented by respective vectors in a hyperspace of document features.
8. A system, comprising:
a memory; and
a processor coupled to the memory, wherein the processor is configured to:
select a current document cluster of a plurality of document clusters;
initialize a label associated with the current document cluster;
select a term from a list of terms comprised by the document cluster;
append the term to the label associated with the current document cluster;
responsive to determining that the label is found in a label dictionary, iteratively select a next term from the list of terms comprised by the document cluster and append the next term to the label associated with the current document cluster;
responsive to failing to locate the label in the label dictionary, insert the label into the label dictionary; and
associate the label with the current document cluster.
9. The system of claim 8, wherein the processor is further configured to:
sort the plurality of document clusters by a number of documents comprised by a respective document cluster.
10. The system of claim 8, wherein the processor is further configured to:
sort the list of terms by a respective term weight.
11. The system of claim 8, wherein the processor is further configured to:
exclude, from the list of terms, a term comprising at least part of an entity name.
12. The system of claim 8, wherein the processor is further configured to:
visually represent, via a graphical user interface, one or more clusters of the plurality of document clusters in a visual association with respective labels.
13. The system of claim 8, wherein the plurality of document clusters comprise a plurality of electronic mail messages.
14. The system of claim 8, wherein the plurality of document clusters comprise a plurality of documents represented by respective vectors in a hyperspace of document features.
15. A non-transitory computer-readable storage medium comprising executable instructions that, when executed by a computer system, cause the computer system to:
select a current document cluster of a plurality of document clusters;
initialize a label associated with the current document cluster;
select a term from a list of terms comprised by the document cluster;
append the term to the label associated with the current document cluster;
responsive to determining that the label is found in a label dictionary, iteratively select a next term from the list of terms comprised by the document cluster and append the next term to the label associated with the current document cluster;
responsive to failing to locate the label in the label dictionary, insert the label into the label dictionary; and
associate the label with the current document cluster.
16. The non-transitory computer-readable storage medium of claim 15, further comprising executable instructions that, when executed by the computer system, cause the computer system to:
sort the plurality of document clusters by a number of documents comprised by a respective document cluster.
17. The non-transitory computer-readable storage medium of claim 15, further comprising executable instructions that, when executed by the computer system, cause the computer system to:
sort the list of terms by a respective term weight.
18. The non-transitory computer-readable storage medium of claim 15, further comprising executable instructions that, when executed by the computer system, cause the computer system to:
exclude, from the list of terms, a term comprising at least part of an entity name.
19. The non-transitory computer-readable storage medium of claim 15, further comprising executable instructions that, when executed by the computer system, cause the computer system to:
visually represent, via a graphical user interface, one or more clusters of the plurality of document clusters in a visual association with respective labels.
20. The non-transitory computer-readable storage medium of claim 15, wherein the plurality of document clusters comprise a plurality of electronic mail messages.
US17/384,972 2017-05-10 2021-07-26 Recursive agglomerative clustering of time-structured communications Abandoned US20210349929A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/384,972 US20210349929A1 (en) 2017-05-10 2021-07-26 Recursive agglomerative clustering of time-structured communications
US17/950,067 US20230078263A1 (en) 2017-05-10 2022-09-21 Recursive agglomerative clustering of time-structured communications

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762504390P 2017-05-10 2017-05-10
US15/972,952 US11074285B2 (en) 2017-05-10 2018-05-07 Recursive agglomerative clustering of time-structured communications
US17/384,972 US20210349929A1 (en) 2017-05-10 2021-07-26 Recursive agglomerative clustering of time-structured communications

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/972,952 Division US11074285B2 (en) 2017-05-10 2018-05-07 Recursive agglomerative clustering of time-structured communications

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/950,067 Continuation-In-Part US20230078263A1 (en) 2017-05-10 2022-09-21 Recursive agglomerative clustering of time-structured communications

Publications (1)

Publication Number Publication Date
US20210349929A1 (en) 2021-11-11

Family

ID=64097885

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/972,952 Active 2038-12-05 US11074285B2 (en) 2017-05-10 2018-05-07 Recursive agglomerative clustering of time-structured communications
US17/384,972 Abandoned US20210349929A1 (en) 2017-05-10 2021-07-26 Recursive agglomerative clustering of time-structured communications

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/972,952 Active 2038-12-05 US11074285B2 (en) 2017-05-10 2018-05-07 Recursive agglomerative clustering of time-structured communications

Country Status (1)

Country Link
US (2) US11074285B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020081343A1 (en) 2018-10-15 2020-04-23 Ventana Medical Systems, Inc. Systems and methods for cell classification
CN110389932B (en) * 2019-07-02 2023-01-13 华北电力科学研究院有限责任公司 Automatic classification method and device for power files

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6185551B1 (en) * 1997-06-16 2001-02-06 Digital Equipment Corporation Web-based electronic mail service apparatus and method using full text and label indexing
US6442545B1 (en) * 1999-06-01 2002-08-27 Clearforest Ltd. Term-level text with mining with taxonomies
US20030061200A1 (en) * 2001-08-13 2003-03-27 Xerox Corporation System with user directed enrichment and import/export control
WO2005045564A2 (en) * 2003-10-30 2005-05-19 Microsoft Corporation Term database extension for label system
US7117432B1 (en) * 2001-08-13 2006-10-03 Xerox Corporation Meta-document management system with transit triggered enrichment
US8583747B2 (en) * 2004-03-31 2013-11-12 Google Inc. Labeling messages of conversations and snoozing labeled conversations in a conversation-based email system
US20140015855A1 (en) * 2012-07-16 2014-01-16 Canon Kabushiki Kaisha Systems and methods for creating a semantic-driven visual vocabulary
US9002848B1 (en) * 2011-12-27 2015-04-07 Google Inc. Automatic incremental labeling of document clusters
US20160103885A1 (en) * 2014-10-10 2016-04-14 Workdigital Limited System for, and method of, building a taxonomy

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275129B2 (en) * 2006-01-23 2016-03-01 Symantec Corporation Methods and systems to efficiently find similar and near-duplicate emails and files
US9355171B2 (en) * 2009-10-09 2016-05-31 Hewlett Packard Enterprise Development Lp Clustering of near-duplicate documents
WO2011044662A1 (en) * 2009-10-15 2011-04-21 2167959 Ontario Inc. System and method for grouping multiple streams of data
US9020271B2 (en) * 2012-07-31 2015-04-28 Hewlett-Packard Development Company, L.P. Adaptive hierarchical clustering algorithm
US10437869B2 (en) * 2014-07-14 2019-10-08 International Business Machines Corporation Automatic new concept definition
US20180024968A1 (en) * 2016-07-22 2018-01-25 Xerox Corporation System and method for domain adaptation using marginalized stacked denoising autoencoders with domain prediction regularization
US20180276294A1 (en) * 2017-03-24 2018-09-27 Nec Personal Computers, Ltd. Information processing apparatus, information processing system, and information processing method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999, June). OPTICS: ordering points to identify the clustering structure. In ACM Sigmod record (Vol. 28, No. 2, pp. 49-60). ACM. *

Also Published As

Publication number Publication date
US11074285B2 (en) 2021-07-27
US20180329989A1 (en) 2018-11-15

Similar Documents

Publication Publication Date Title
US20190012629A1 (en) Team performance supervisor
US10387455B2 (en) On-the-fly pattern recognition with configurable bounds
US10078688B2 (en) Evaluating text classifier parameters based on semantic features
US8762375B2 (en) Method for calculating entity similarities
US20210349929A1 (en) Recursive agglomerative clustering of time-structured communications
US10721201B2 (en) Systems and methods for generating a message topic training dataset from user interactions in message clients
US7908283B2 (en) Finding superlatives in an unordered list
Suleiman et al. SMS spam detection using H2O framework
US10460041B2 (en) Efficient string search
US20220207483A1 (en) Automatic document classification
US10936638B2 (en) Random index pattern matching based email relations finder system
Hadi et al. Aobtm: Adaptive online biterm topic modeling for version sensitive short-texts analysis
Proskurnia et al. Template induction over unstructured email corpora
CN113836316B (en) Processing method, training method, device, equipment and medium for ternary group data
CN111198983A (en) Sensitive information detection method, device and storage medium
Bordino et al. Advancing NLP via a distributed-messaging approach
Shehu et al. Enhancements to language modeling techniques for adaptable log message classification
CN113641823A (en) Text classification model training method, text classification device, text classification equipment and medium
US20230078263A1 (en) Recursive agglomerative clustering of time-structured communications
Hong et al. The adaptive SPAM mail detection system using clustering based on text mining
US20230325708A1 (en) Pairwise feature attribution for interpretable information retrieval
CN109840320B (en) Customization of text
Chen et al. Multi-granularity user interest modeling and interest drift detection
CN115146070A (en) Key value generation method, knowledge graph generation method, device, equipment and medium
CN114492393A (en) Text theme determination method and device and electronic equipment

Legal Events

Date Code Title Description
AS Assignment

Owner name: YVA.AI, INC., CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:FINDO INC.;REEL/FRAME:056986/0596

Effective date: 20181002

Owner name: FINDO, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SELEDKIN, VIACHESLAV;YAN, DAVID;CHILINGARYAN, MARINA;SIGNING DATES FROM 20181004 TO 20190114;REEL/FRAME:056983/0757

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: VISIER SOLUTIONS INC., CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YVA.AI, INC.;REEL/FRAME:059777/0733

Effective date: 20220426

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION