
Information exploration systems and methods

Info

Publication number
CA2629999A1
Authority
CA
Grant status
Application
Patent type
Prior art keywords
cluster
information
phrase
document
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CA 2629999
Other languages
French (fr)
Other versions
CA2629999C (en)
Inventor
Kevin B. Thompson
Matthew S. Sommer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kroll Ontrack Inc
Original Assignee
Engenium Corporation
Kevin B. Thompson
Matthew S. Sommer
Kroll Ontrack Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor ; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor ; File system structures therefor of unstructured textual data
    • G06F17/30705Clustering or classification
    • G06F17/3071Clustering or classification including class or cluster creation or modification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99931Database or file accessing
    • Y10S707/99933Query processing, i.e. searching
    • Y10S707/99935Query augmenting and refining, e.g. inexact access

Abstract

Disclosed information exploration system and method embodiments operate on a document set to determine a document cluster hierarchy. An exclusionary phrase index is determined for each cluster, and representative phrases are selected from the indexes. The selection process may enforce pathwise uniqueness and balanced sub-cluster representation. The representative phrases may be used as cluster labels in an interactive information exploration interface.

Description

Information Exploration Systems and Methods CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application relates to U.S. Patent 6,847,966, "Method and System for Optimally Searching A Document Database Using a Representative Semantic Space", filed April 24, 2002, by inventors Matthew Sommer and Kevin Thompson, and hereby incorporated herein by reference.

BACKGROUND

[0002] Information commonly comes in large quantities. In many cases, such as published reference works, cataloged library systems, or well-designed databases, the information is organized and indexed for easy information retrieval. In many other cases, such as litigation discovery, personal document collections, electronic records on local networks, and internet search results (to name just a few), the information is poorly organized and not indexed at all, making it difficult to locate and retrieve desired information.

[0003] In the past, information providers have scanned documents or otherwise obtained documents in electronic form and have applied automated searching techniques to aid users in their quest for information. Generally, such information providers employ term searching with Boolean operations (AND, OR, and NOT). Though computationally efficient, this automated searching technique suffers from a number of drawbacks. The primary drawback is the sensitivity of the search results to the choice of search terms. In a body of documents, the sought-after information may be hidden by its use of synonyms, misspellings, and different word forms (e.g., ice, iced, ices, icing, deice, re-ice, ...). A second major drawback is this search technique's failure to discern differences in term usage, and consequently this search technique returns a high percentage of irrelevant results (e.g., "icing" refers to frost formation, a sugared cake topping, and a hockey penalty).

[0004] These drawbacks can usually be overcome by a person having great familiarity with the information being sought, e.g., by structuring a query using terms commonly used in the sought-after document's subject area. Unfortunately, such familiarity is commonly not possessed by the searcher. Accordingly, information providers seek alternative searching techniques to offer their users. A searching technique would greatly benefit such information providers if it enabled users to find their desired information without necessitating some preexisting degree of familiarity with the sought-after information or the searching tool itself.
BRIEF DESCRIPTION OF THE DRAWINGS

[0005] A better understanding of the various disclosed embodiments can be obtained when the following detailed description is considered in conjunction with the following drawings, in which:

Fig. 1 shows an illustrative information exploration system embodied as a desktop computer;

Fig. 2 shows a block diagram of an illustrative information exploration system;
Fig. 3 shows an illustrative set of documents;

Fig. 4 shows an illustrative set of documents represented as vectors in a three dimensional concept space;

Fig. 5 shows a flow diagram of an illustrative information exploration method;
Fig. 6A shows an illustrative hierarchy of clusters in dendrogram form;

Fig. 6B shows an illustrative clustering tree with a branching factor of three;
Fig. 7A shows an illustrative suffix tree;

Fig. 7B shows an illustrative phrase index derived from the suffix tree of Fig. 7A;

Fig. 8 shows an illustrative phrase-to-leaf index;

Fig. 9 shows a flow diagram of an illustrative cluster-naming method; and

Figs. 10A and 10B show an illustrative information exploration interface.

[0006] While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

TERMINOLOGY

[0007] In the following discussion and in the claims, the terms "including"
and "comprising"
are used in an open-ended fashion, and thus should be interpreted to mean "including, but not limited to...".

[0008] Also, the term "couple" or "couples" is intended to mean either an indirect or direct electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, or through an indirect electrical connection via other devices and connections.

[0009] The term "set" is intended to mean one or more items forming a conceptual group. The term "phrase" is intended to mean a sequence of one or more words. The term "document" refers to a set of phrases. The term "cluster" is intended to mean a set of documents grouped on the basis of some similarity measure.

DETAILED DESCRIPTION

[0010] Information exploration methods and systems are disclosed herein. In some embodiments, these methods and systems take a set of documents and determine a hierarchy of clusters representing various document subsets. The clusters are labeled with phrases that identify themes common to the documents associated with the cluster. The cluster set's hierarchical nature enables a user to explore document set information at many levels. For example, a root cluster is labeled with phrases representing themes or characteristics shared by the document set as a whole, and subordinate clusters are labeled with phrases representing themes or characteristics that unify significant document subsets. By exploring the document set's themes and characteristics in a progression from general to specific, a user becomes familiarized with the document set in a fashion that efficiently guides the user to sought-after information. In other embodiments and variations, themes and characteristics representative of a selected document set are determined for a user in a dynamic, set-by-set fashion as the user identifies a document set of possible interest. These dynamic embodiments enable a user to quickly discern whether a selected document set is worthy of further analysis or not.

[0011] Figure 1 shows an illustrative system 100 for information exploration.
System 100 is shown as a desktop computer 100, although any electronic device having some amount of computing power coupled to a user interface may be configured to carry out the methods disclosed herein. Among other things, servers, portable computers, personal digital assistants (PDAs) and mobile phones may be configured to carry out aspects of the disclosed methods.

[0012] As shown, illustrative system 100 comprises a chassis 102, a display 104, and an input device 106. The chassis 102 comprises a processor, memory, and information storage devices.
One or more of the information storage devices may store programs and data on removable storage media such as a floppy disk 108 or an optical disc 110. The chassis 102 may further comprise a network interface that allows the system 100 to receive information via a wired or wireless network, represented in Figure 1 by a phone jack 112. The information storage media and information transport media (i.e., the networks) are collectively called "information carrier media."

[0013] The chassis 102 is coupled to the display 104 and the input device 106 to interact with a user. The display 104 and the input device 106 may together operate as a user interface. The display 104 is shown as a video monitor, but may take many alternative forms such as a printer, a speaker, or other means for communicating information to a user.
The input device 106 is shown as a keyboard, but may similarly take many alternative forms such as a button, a mouse, a keypad, a dial, a motion sensor, a camera, a microphone or other means for receiving information from a user. Both the display 104 and the input device 106 may be integrated into the chassis 102.

[0014] Figure 2 shows a simplified functional block diagram of system 100. The chassis 102 may comprise a display interface 202, a peripheral interface 204, a processor 206, a modem or other suitable network interface 208, a memory 210, an information storage device 212, and a bus 214. System 100 may be a bus-based computer, with the bus 214 interconnecting the other elements and carrying communications between them. The display interface 202 may take the form of a video card or other suitable display interface that accepts information from the bus 214 and transforms it into a form suitable for the display 104. Conversely, the peripheral interface 204 may accept signals from the keyboard 106 and other input devices such as a pointing device 216, and transform them into a form suitable for communication on the bus 214.

[0015] The processor 206 gathers information from other system elements, including input data from the peripheral interface 204, and program instructions and other data from the memory 210, the information storage device 212, or from a remote location via the network interface 208. The processor 206 carries out the program instructions and processes the data accordingly. The program instructions may further configure the processor 206 to send data to other system elements, comprising information for the user which may be communicated via the display interface 202 and the display 104.

[0016] The network interface 208 enables the processor 206 to communicate with remote systems via a network. The memory 210 may serve as a low-latency temporary store of information for the processor 206, and the information storage device 212 may serve as a long-term (but higher latency) store of information.

[0017] The processor 206, and hence the computer 100 as a whole, operates in accordance with one or more programs stored on the information storage device 212. The processor 206 may copy portions of the programs into the memory 210 for faster access, and may switch between programs or carry out additional programs in response to user actuation of the input device. The additional programs may be retrieved from the information storage device 212 or may be retrieved from remote locations via the network interface 208. One or more of these programs configures system 100 to carry out at least one of the information exploration methods disclosed herein.

[0018] Figure 3 shows an illustrative set of documents 302-306. The document set can originate from anywhere. Examples include internet search results, snippets from internet search results, electronic database records, and text document files. Each document includes one or more "sentences", i.e., one or more groups of words separated by punctuation or some other form of semantic separators. It is contemplated that the document set as a whole may be large, i.e., more than 1,000 documents, and potentially may include millions of documents.
With such document sets, the disclosed information exploration methods are expected to greatly reduce the time required for a user to become familiar with the general contents of the document set, as well as the time required for a user to locate a particular document of interest or documents containing information on a particular topic of interest.

[0019] In various embodiments, the disclosed information exploration methods employ clustering techniques to group similar documents together. Many such clustering techniques exist and may be employed. Examples of suitable clustering techniques include suffix tree clustering (see, e.g., O. Zamir and O. Etzioni, "Web Document Clustering: A
Feasibility Demonstration", SIGIR'98 Proceedings pp 46-54), agglomerative hierarchical clustering, divisive hierarchical clustering, K-means clustering, bisecting K-means clustering (see, e.g., M. Steinbach, et al., "A Comparison of Document Clustering Techniques", Technical Report #00-034, Univ. of Minnesota), Buckshot, Fractionation (see, e.g., D.R.
Cutting, et al., "Scatter/Gather: a cluster-based approach to browsing large document collections", SIGIR'92 Proceedings pp 318-329). Each of the foregoing references is hereby incorporated by reference.

[0020] Each of the foregoing clustering technique examples (except perhaps suffix tree clustering) can employ a variety of similarity measures to evaluate the "distance" or dissimilarity between any two documents. The similarity measures may be based on term frequency vectors (see, e.g., M. Steinbach, et al., cited previously), concept-space vectors, and other representations. However, clustering quality is an important consideration for the success of the disclosed information exploration methods. Clustering quality has been found to be significantly better when latent semantic analysis (LSA) principles are applied to obtain concept-space vectors. LSA-based clustering methods establish a relationship between documents in a document set and points in a high-dimensional concept space.

[0021] As described in U.S. Patent No. 6,847,966, "Method and System for Optimally Searching a Document Database Using a Representative Semantic Space", the document-to-concept-space relationship can be established by applying singular value decomposition to a terms-to-documents matrix. Briefly summarized, a terms-to-documents matrix is created having a row for each term and a column for each document. (Pervasive terms such as "the", "and", "in", "as", etc., may be eliminated from consideration.) Each matrix element is the number of times that row's term can be found in that column's document. In some embodiments, each row of the matrix is multiplied by a weighting factor to account for the discriminating power of different terms. For example, the weighting factor may be the term's inverse document frequency, i.e., the inverse of the number of documents in which that term appears. The terms-to-documents matrix A is then decomposed using singular value decomposition into three matrices: a term-to-concept matrix T, a diagonal matrix S, and a concept-to-document matrix D^T:

A = T S D^T   (1)

Each column of the concept-to-document matrix D^T provides the concept-space vector for a corresponding document. Once the term-to-concept matrix T and the diagonal matrix S have been established, the document-to-concept-space relationship can be expressed:

d^T = S^-1 T^T a,   (2)

where a is a column vector of term frequencies, i.e., the elements are the number of times that row's term appears in a given document, and d^T is a resulting column vector of concept coordinates, i.e., the concept space vector for the given document. In embodiments having weighting factors for the elements of the term-to-document matrix A, those weighting factors are also applied to the column vector a. The relationship given by equation (2) can be applied to documents that were not in the set of documents used to derive the matrices S and T, although the concept-space vector may need to be normalized. The term "pseudo-document vector" has been coined to refer to those concept-space vectors calculated for documents not in the original set of documents used to derive the S and T matrices. For further details, refer to U.S. Patent No. 6,847,966, which is hereby incorporated by reference.
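As an illustration of how equations (1) and (2) might be realized, the following sketch applies NumPy's singular value decomposition to a toy terms-to-documents matrix. The terms, the counts, and the omission of weighting factors are all assumptions made for brevity, not details taken from the patent.

```python
import numpy as np

# Toy terms-to-documents matrix A: one row per term, one column per
# document (term names and counts are invented for illustration).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "cat"
    [0.0, 3.0, 0.0, 1.0],   # "mouse"
    [1.0, 0.0, 2.0, 0.0],   # "cheese"
    [0.0, 1.0, 0.0, 2.0],   # "trap"
])

# Equation (1): A = T S D^T via singular value decomposition.
T, s, DT = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# Each column of DT is a document's concept-space vector.
doc_vectors = DT.T

def pseudo_document_vector(a, T, S):
    """Equation (2): d^T = S^-1 T^T a, normalized to unit length."""
    d = np.linalg.inv(S) @ T.T @ np.asarray(a, dtype=float)
    return d / np.linalg.norm(d)

# Fold in a new document mentioning "cat" and "cheese" once each.
d_new = pseudo_document_vector([1.0, 0.0, 1.0, 0.0], T, S)
```

Normalizing the folded-in vector corresponds to the patent's note that pseudo-document vectors may need to be normalized before comparison.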

[0022] Fig. 4 shows an illustrative document set in a concept space. Three perpendicular axes are shown (X, Y, and Z), each axis representing a different concept. A unit sphere 402 (i.e., a sphere of radius 1) is shown centered at the origin. Also shown is a plurality of concept-space vectors 404, which may include one or more pseudo-document vectors. Each of the vectors 404 is normalized to have unit length, meaning that each vector 404 is drawn from the origin to a point on unit sphere 402. Each of the vectors 404 is derived from a corresponding document in the document set, with the vector's direction representing some combination of the concepts represented by the illustrated axes. Although only three dimensions are shown, it is expected that in practice many more dimensions will be used.

[0023] Clustering techniques seek to group together documents concerning similar concepts.
In concept space, documents concerning similar concepts should be represented by vectors having similar orientations. Document similarity may thus be measured by determining the inner ("dot") product of the documents' concept space vectors. The dot product of two unit vectors equals the cosine of the angle between the vectors, meaning that aligned ("similar") concept vectors have a similarity of 1, while oppositely-aligned concept vectors have a similarity of -1. Other document similarity measures exist and may be employed. For examples of other similarity measures, see U.S. Patent Nos. 5,706,497, 6,633,868, 6,785,669, and 6,941,321, which are hereby incorporated herein by reference.
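A minimal sketch of this similarity measure (the function name is ours; the patent simply describes the inner product of unit-length concept vectors):

```python
import numpy as np

def similarity(u, v):
    """Cosine similarity of two concept-space vectors. After unit
    normalization this is just the dot product: aligned vectors give
    1.0, oppositely-aligned vectors give -1.0, orthogonal give 0.0."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v)))
```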

[0024] Fig. 5 shows an illustrative information exploration method that may be implemented by information exploration system 100. The method comprises four phases:
obtaining the documents 502, clustering the documents 504, labeling the clusters 516, and exploring the clusters 528. The method begins with block 502, in which the document set is obtained or otherwise made accessible. The documents may be obtained and processed incrementally, meaning that the operations represented by block 502 continue to occur even as the operations represented by subsequent processing blocks are initiated. Moreover, various processing operations described below may be serialized or parallelized. For example, the labeling operations 516 may be completed before exploration operations 528 are initiated. Alternatively, various labeling operations 516 and exploration operations 528 may occur concurrently. (When parallelized, the information exploration process may exploit available computational resources more effectively at the cost of added programming complexity.)

[0025] The illustrated clustering operations 504 begin with block 506, which represents the information exploration system's conversion of documents into concept-space vectors. The conversion relationship may be predetermined from a representative document set, or the relationship may be calculated anew for each document set. The relationship may be determined using singular value decomposition in accordance with equation (1), and applied in accordance with equation (2).

[0026] In block 508, the information exploration system 100 determines a number of clusters.
This number may be determined in a variety of ways, such as having a predetermined number, or a number based on a distribution of the document vectors. In some preferred embodiments, the number is determined from the size of the document set, and chosen to predispose the final clusters to have a predetermined target size. For example, a target cluster size may be chosen, such as 100 documents per cluster. The size of the document set may then be divided by the target cluster size and rounded up or down to obtain a number of clusters.
Alternatively, the number of clusters may be chosen as a function of the number of concept space dimensions, e.g., 2d or 2^d, where d is the number of concept space dimensions. As yet another alternative, the target cluster size may be allowed to vary as a nonlinear function of the document set size, so that the number of clusters is (e.g.)

n = ⌈N / (1 + log_a N)⌉,   (3)

where N is the document set size.
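Equation (3) can be sketched as follows; the patent's logarithm base is ambiguous, so it is exposed here as a parameter (base 2 is only a default assumption):

```python
import math

def num_clusters(N, base=2):
    """Equation (3): n = ceil(N / (1 + log_base N)).

    The log base is an assumption; the patent's 'log_a N' does not
    pin it down."""
    return math.ceil(N / (1 + math.log(N, base)))
```

With base 2, for example, a 1,000-document set yields num_clusters(1000) == 92.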

[0027] In the illustrated clustering operations 504, a bisecting K-means clustering technique is employed. Initially, the document set is treated as a single cluster. In block 510, the current number of clusters is compared to the desired number of clusters to determine whether the clustering operations are complete. Other stopping criteria may be used in addition, or alternatively to, the number of clusters. For example, the clustering operations may be considered complete if the average clustering error falls below a predetermined threshold.
One possible definition for average clustering error is:

E = (1/N) Σ_{k∈C} Σ_{i∈k} ||d_i^T − d̄_k^T||²,   (4)

where C is the set of clusters, d̄_k^T is the average concept-space document vector for cluster k (hereafter termed the "mean cluster vector"), and d_i^T is the ith concept-space document vector in cluster k.
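A sketch of equation (4); representing each cluster as a 2-D NumPy array whose rows are concept-space document vectors is our choice of data layout, not the patent's:

```python
import numpy as np

def average_clustering_error(clusters):
    """Equation (4): E = (1/N) * sum_k sum_{i in k} ||d_i - dbar_k||^2,
    where dbar_k is cluster k's mean ("mean cluster vector") and N is
    the total number of documents across all clusters."""
    N = sum(len(k) for k in clusters)
    E = 0.0
    for k in clusters:
        k = np.asarray(k, dtype=float)
        mean = k.mean(axis=0)            # the mean cluster vector
        E += ((k - mean) ** 2).sum()     # squared distances to the mean
    return E / N
```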

[0028] In block 512, the information exploration system 100 selects a cluster to be divided. In some embodiments, the largest undivided cluster is selected for division. In other embodiments, the cluster with the largest clustering error is selected. As the loop iterates, the clusters are iteratively split and split again until the stopping criterion is met.

[0029] In block 514, the selected cluster is processed to determine two sub-clusters. The sub-clusters may be determined using a K-means algorithm. The determination involves the random selection of two members of the original cluster as "seeds" for the sub-clusters. Since the initial sub-clusters have only one document vector each, the mean cluster vector for the sub-clusters equals the corresponding document vector. Each of the remaining members of the original cluster is in turn compared to the mean cluster vectors and grouped into the sub-cluster with the closest mean cluster vector. The mean cluster vector is updated as each new vector is grouped with the sub-cluster. Once each document vector has been processed, a tentative division of the original cluster has been determined. This process may be repeated multiple times, and the various resulting tentative divisions may be compared to determine the "best" division. The determination of a best division may be based on sub-cluster sizes, with more equal sizes being preferred. Alternatively, the determination of the best division may be based on average clustering error, with the smallest error being preferred. In some embodiments, the first tentative division is accepted if the disparity in sizes is no greater than 1:4. If the disparity is too great, the process is repeated to obtain a different tentative division of the cluster.
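One tentative bisection step can be sketched as below. This is only an illustration of the incremental assign-and-update procedure described in [0029]; the repetition and "best division" selection are omitted, and the function name and data layout are ours.

```python
import numpy as np

def bisect_cluster(docs, rng):
    """One tentative two-way split of a cluster: choose two random
    'seed' documents, then assign each remaining document to the
    sub-cluster with the nearer running mean vector, updating that
    mean after every assignment."""
    i, j = rng.choice(len(docs), size=2, replace=False)
    subs = [[docs[i]], [docs[j]]]
    means = [np.asarray(docs[i], dtype=float), np.asarray(docs[j], dtype=float)]
    for idx, d in enumerate(docs):
        if idx in (i, j):
            continue
        # Group with the sub-cluster whose mean cluster vector is closest.
        nearest = int(np.argmin([np.linalg.norm(d - m) for m in means]))
        subs[nearest].append(d)
        means[nearest] = np.mean(subs[nearest], axis=0)
    return [np.array(s) for s in subs]
```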

[0030] Once a tentative division is accepted, the original cluster is replaced with its two sub-clusters (although the original cluster is stored away for later use). The information exploration system repeats blocks 510-514 until the stopping criterion is met. The iterative subdividing of clusters creates a hierarchy of clusters as illustrated by Fig. 6A.

[0031] Fig. 6A shows an illustrative hierarchy of clusters in which a document set 602 is iteratively subdivided to obtain leaf clusters 604-616 (also identified by roman numerals I-VII). In the document set and the leaf clusters, letters A through P are used to represent documents. (Note that intermediate clusters can be reconstructed by combining leaf clusters as shown by the branches.) The cluster selection rule for this example is to select the largest remaining cluster, and the stopping criterion was that the largest cluster would have less than four documents. After the first division, the original cluster has been divided into {A,B,D,E,I,K,L,M,N} and {C,F,G,H,J,O,P}. The larger of these two clusters is then divided. At the stage identified by line 620, three clusters exist: {A,E,K,L,M,N}, {B,D,I}, and {C,F,G,H,J,O,P}. The third of these is now the largest cluster, so it is next to be subdivided.

[0032] The repeated two-way divisions produce a binary clustering tree (i.e., each intermediate node has two children). However, the divisions occur in a sequential manner, meaning that any number of children can be identified for each intermediate node. For example, a ternary clustering tree can be constructed in the following way: the first two divisions of the original document set produce three sub-clusters, as indicated by line 620. The first of these sub-clusters, after being divided twice, produces another three sub-clusters as indicated by line 622. The second original sub-cluster 610 is a leaf node, and has no children. The third original sub-cluster is divided twice, producing another three sub-clusters as indicated by line 624. Replacing lines 620-624 with nodes, the ternary tree shown in Fig. 6B results.

[0033] The desired branching factor for the cluster set hierarchy can be adjusted by moving lines 620-624 upwards or downwards along the tree (assuming enough divisions have been made to reach the desired branching factor). Thus the original binary cluster hierarchy can be converted into a ternary-, quaternary-, or n-ary cluster hierarchy. This configurability can be exploited in the exploration phase described further below.

[0034] Returning to Fig. 5, the illustrated cluster labeling operations 516 begin in block 518 with the construction of a suffix tree for each of the leaf clusters. A suffix tree is a data structure that is useful for the construction of a phrase index (block 520). A "true" suffix tree is an agglomeration of paths beginning at the root node and progressing branch-by-branch into the tree. Each branch represents one or more words. A path is defined for each sentence in the document set and for each suffix of those sentences. (A five-word sentence has four suffixes: the last four words of the sentence, the last three words of the sentence, the last two words of the sentence, and the last word of the sentence.) Every node of the suffix tree, except for the root and leaf nodes, has at least two children. If a node does not have two children, the node is eliminated and the word(s) associated with the links to and from the node are joined to form a multi-word phrase.

[0035] Before giving an example of a suffix tree, some slight simplifications will be discussed. The branching factor for a suffix tree can be quite high, leading to a potentially very large data structure. To improve the system performance, documents may be "cleaned" before being built into the suffix tree. The cleaning involves eliminating "stop words", i.e., any words in a predefined set of words. The predefined set includes pervasive words (such as "a", "an", "the"), numbers ("1", "42"), and any other words selected by the system designer as not being helpful. The cleaning further involves "stemming", a process in which word stems are retained, but prefixes and suffixes are dropped. Stemming makes "walk", "walker", and "walking" equivalent. Throughout the cleaning process, the position of a given term in the original document is stored for later use. As another simplification, the suffix tree depth is preferably limited to a predetermined maximum (e.g., a maximum of six branches from the root). Taken together, these simplifications preserve feasibility without a significant performance sacrifice.
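The cleaning step might be sketched as below. The stop list mirrors the pervasive terms mentioned earlier, and the crude suffix-stripping stand-in for a real stemmer (e.g., Porter's algorithm) is a deliberate simplification, not the patent's method.

```python
import re

STOP_WORDS = {"a", "an", "the", "and", "in", "as"}   # illustrative stop list

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a few common
    suffixes so that e.g. "walk", "walker", "walking" coincide."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(sentence):
    """Drop stop words and numbers, stem what remains, and keep each
    surviving term's position in the original sentence."""
    cleaned = []
    for pos, word in enumerate(sentence.lower().split()):
        word = re.sub(r"\W", "", word)           # strip punctuation
        if not word or word.isdigit() or word in STOP_WORDS:
            continue
        cleaned.append((crude_stem(word), pos))  # remember original position
    return cleaned
```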

[0036] Fig. 7A shows an illustrative suffix tree determined for a document set having three documents {B,D,I}, each with a single sentence:

B: "cat ate cheese"

D: "mouse ate cheese too"
I: "cat ate mouse too"

Document B has one sentence and two suffixes, each having a corresponding path in the suffix tree. The three paths begin at the root node 702, and end at a node labeled by a "B" in a square. For example, beginning at node 702, taking the branch labeled "cat ate" to node 704, followed by the branch labeled "cheese", reaches the "B" in the lower left corner. As another example, beginning at node 702 and taking the branch labeled "ate" to node 706, followed by the branch "cheese", yields a suffix path that reaches the second "B" four squares from the left of the figure. Similar paths can be found for the sentences and suffixes in documents D and I too.
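For a corpus this small, an index equivalent to the suffix tree's shared phrases can be built by brute force, enumerating every prefix of every suffix (real systems would use the suffix tree or a suffix array for efficiency; this is a sketch, not the patent's implementation):

```python
from collections import defaultdict

docs = {
    "B": "cat ate cheese",
    "D": "mouse ate cheese too",
    "I": "cat ate mouse too",
}

# Every contiguous word sequence is a prefix of some suffix, so this
# enumeration covers exactly the phrases the suffix tree represents.
phrase_docs = defaultdict(set)
for doc_id, sentence in docs.items():
    words = sentence.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase_docs[" ".join(words[start:end])].add(doc_id)

# Shared phrases: those appearing in at least two documents.
shared = {p: ids for p, ids in phrase_docs.items() if len(ids) >= 2}
```

This reproduces the nodes discussed above: "cat ate" is shared by B and I, "ate cheese" by B and D, and "ate" by all three documents.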

[0037] The suffix tree represents a phrase index. Each node (other than the root node) having at least two different documents (i.e., squares) in its descendants represents a phrase that is common to two or more documents. Thus node 704 represents the phrase "cat ate", which appears in documents B and I. Node 706 represents the phrase "ate", which appears in all three documents. Node 708 represents the phrase "ate cheese", which appears in documents B

[0038] Fig. 7B shows an illustrative master phrase index having all shared phrases from the document set {B,D,I}. The phrase index may be determined from a suffix tree as described above or it may be determined through other means including, e.g., a suffix array. In practice, the phrase index can get quite lengthy, and accordingly, the phrase indices that are determined for the leaf clusters in practicing various disclosed method embodiments may include only phrases that occur in some minimum number of documents of the corresponding leaf cluster.
In block 520 of Fig. 5, the information exploration system 100 constructs a master phrase-to-leaf cluster index by combining the phrase indices for each leaf cluster. For each phrase, the system 100 identifies the leaf clusters containing that phrase. Although it is not strictly necessary, the phrase index may also include an indication of one or more documents and posinons witnin tne onginal aocuments where the phrase appears. Also for each phrase, the phrase index includes a score determined by system 100. Different scoring strategies may be employed. In some embodiments, the score is a product of the phrase's coverage (the fraction of documents in which it appears), and a logaritlun of the phrase's length.
For example, the scores given in Fig. 7B are determined in accordance with:

Score = ( m / N ) log2 L ,  (5)

where m is the number of documents in which the phrase appears, N is the number of documents in the document set, and L is the number of words in the phrase.
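Equation (5) transcribes directly into code; the illustrative evaluation below is hypothetical, not a value taken from Fig. 7B:

```python
import math

def phrase_score(m, N, L):
    """Equation (5): coverage (m / N) times the base-2 log of phrase length L."""
    return (m / N) * math.log2(L)

# A two-word phrase appearing in 2 of 3 documents scores (2/3) * log2(2) = 2/3
example = phrase_score(m=2, N=3, L=2)
```

Note that log2(1) = 0, so under this formula a single-word phrase scores zero, which encodes the stated preference for longer phrases.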

[0039] In block 520 of Fig. 5, the master phrase index is sorted by score.
Fig. 8 shows an example of a sorted phrase-to-leaf index for the cluster hierarchy shown in Fig. 6A. In block 522, the information exploration system 100 iterates through the nodes in the cluster hierarchy, selecting representative phrases from the master phrase index to label each node. In at least some embodiments, the label selection strategy produces multi-phrase labels that include only phrases that do not appear in sibling clusters. For example, node 622 in Fig.
6A includes only documents in leaf clusters I, II, and III. Thus any phrase that the index indicates as appearing in leaf clusters IV, V, VI, or VII would not be selected as a label for node 622. (Note that in some embodiments the phrase index construction method may disregard a phrase's appearance in some leaf clusters if that phrase appears in fewer than some predetermined number of documents. Accordingly, this exclusionary principle may not be absolute.) The representative phrases are used to label the clusters during exploration operations 528.
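The exclusivity rule just described amounts to a filter over the phrase-to-leaf-cluster index. A minimal sketch, with hypothetical phrases and the leaf-cluster layout of Fig. 6A:

```python
def exclusive_phrases(phrase_to_leaves, node_leaves, all_leaves):
    """Keep only phrases that the index shows in no leaf cluster outside
    the given node's descendant leaves (the exclusivity rule)."""
    outside = set(all_leaves) - set(node_leaves)
    return sorted(p for p, leaves in phrase_to_leaves.items()
                  if leaves and not (set(leaves) & outside))

# Hypothetical index entries: node 622's leaves are I, II, and III
index = {"merger terms": {"I", "II"},        # confined to node 622's leaves
         "quarterly report": {"III", "IV"}}  # spills into leaf IV
labels = exclusive_phrases(index,
                           node_leaves={"I", "II", "III"},
                           all_leaves={"I", "II", "III", "IV", "V", "VI", "VII"})
# only "merger terms" survives; "quarterly report" also appears in leaf IV
```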

[0040] In some alternative embodiments, the information exploration system 100 may determine a node-specific score for each of the phrases in the master phrase index. This operation is optional, but may be desirable when determining which phrase is most descriptive of the cluster. Though the score determined in equation (5) provides some indication of how representative the phrases are of the cluster, node-specific scoring strategies may be preferred. For example, the phrases may be converted into concept-space document vectors using equation (2), and scored in accordance with the phrase vector's similarity to the cluster centroid, e.g., the average document vector for the cluster. (This similarity may be termed the phrase-vector-to-cluster-centroid similarity.) Further details of labeling strategies for block 522 are disclosed below in the description of Fig. 9.
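One way to realize this node-specific scoring, assuming the concept-space vectors produced by equation (2) are already available as plain lists of floats:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average document vector for a cluster."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def node_specific_score(phrase_vector, cluster_doc_vectors):
    """Phrase-vector-to-cluster-centroid similarity."""
    return cosine(phrase_vector, centroid(cluster_doc_vectors))

# A phrase vector pointing "between" two orthogonal document vectors
# aligns exactly with their centroid and scores 1.0
score = node_specific_score([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```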

[0041] Exploration operations 528 begin at the root of the clustering tree (e.g., node 620 of Fig. 6B). In block 530, the information exploration system 100 displays representative phrases for the clusters at the current position in the clustering tree.
Initially, a single root cluster would be displayed as a label comprising the representative phrases for that cluster. In block 532, the information exploration system 100 processes user input.
Expected user input includes a termination command and selection of a different node in the clustering tree. A
termination command causes the information exploration operations 528 to halt.
Otherwise, in block 534, the information exploration system changes the current position in the clustering tree, and in block 536, the information exploration system determines whether the current position is a leaf node. If not, the information exploration system returns to block 530.

[0042] If a leaf node has been reached, then in block 538, the information exploration system 100 shows a list of titles of the documents in the cluster, and allows the user to examine (in a separate window) the contents of documents selected by the user. In block 540, the information exploration system 100 determines whether user input is a termination command or a position change. As before, a termination command halts the exploration operations 528, and a position change sends the information control system 100 back to block 534.

[0043] Illustrative display screens are shown in Figs. 10A and 10B. In Fig.
10A, display screen 1002 shows a label 1004 representing a root cluster. Label 1004 appears on the left ("sibling") side of display screen 1002. When a user selects label 1004 (or any other labels on the sibling side of the screen), the right ("child") side of the display screen shows labels representing the sub-clusters of the selected cluster, and a "Back" link 1006.
When a user selects a label (e.g., label 1008) on the child side of the display screen, the contents of the child side of the screen are transferred to the sibling side of the screen, and the sub-clusters of the selected cluster are shown on the child side of the display. Fig. 10B
shows an example of display 1002 after label 1008 has been selected. If the selected cluster has no sub-clusters (i.e., the selected cluster is a leaf node), the information exploration system 100 shows on the right side of the display a list of document titles for the selected cluster.
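The two-panel navigation just described can be sketched as a small state update. The cluster structure and field names below are hypothetical, chosen only to model the Fig. 10A/10B behavior:

```python
def select_cluster(view, cluster):
    """Selecting a cluster on the child side transfers that panel to the
    sibling side (as in Fig. 10B); the child side then shows the selected
    cluster's sub-clusters, or its document titles if it is a leaf."""
    if cluster in view["child"]:
        view["sibling"] = view["child"]
    if cluster["children"]:
        view["child"] = cluster["children"]
    else:  # leaf node: list document titles instead of sub-clusters
        view["child"] = cluster["titles"]

leaf = {"label": "II", "children": [], "titles": ["Doc 4", "Doc 7"]}
root = {"label": "root", "children": [leaf], "titles": []}
view = {"sibling": [root], "child": []}
select_cluster(view, root)   # child side now shows root's sub-clusters
select_cluster(view, leaf)   # leaf selected: child side shows its titles
```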

[0044] Except when the current position is the root node, the "Back" link 1006 causes the information exploration system 100 to transfer the contents of the sibling side of the display to the child side of the display, and to display the selected cluster's parent cluster and siblings of the parent cluster. When the current position is the root node, the Back link 1006 causes the root node to be de-selected and clears the child side of the display.

[0045] Though cluster quality is important, the quality of the cluster labels is often even more important to the user. The representative phrases should, at a glance, give the user some understanding of what documents are to be found in the cluster. Some sophistication can therefore be justified in the selection process represented by block 526.

[0046] Fig. 9 shows an illustrative cluster-naming process. The process iterates systematically through the clustering tree, proceeding either in a bottom-up fashion (processing all nodes at a given depth in the tree before progressing closer to the root node) or in a top-down fashion.
Beginning with block 902, the information exploration system selects a first node of the clustering tree. Blocks 904-924 represent a loop that is executed to iterate through the master phrase index, restarting at the beginning of the phrase index each time a new node is selected.

In block 904, the information exploration system tests whether the end of the index has been reached without selecting enough representative phrases for the current node. If so, a flag is set in block 906 to waive the exclusivity requirement, and the loop iterations are restarted at the beginning of the phrase index.

[0047] In block 908, the information exploration system 100 selects the highest scoring, previously unselected phrase for the current node as a candidate phrase. As discussed previously, the scoring strategy may be designed to create a selection preference for longer phrases. Alternatively, a strict rule may be enforced on phrase length, restricting representative phrases to a length between two and five significant words, inclusive.
Throughout the phrase indexing process, each indexed phrase may be associated with at least one pointer to a document where the original phrase (i.e., before document cleaning and stemming operations) appears. When an indexed phrase is selected as a representative phrase, the information exploration system 100 provides an example of the original phrase as part of the cluster label.

[0048] In block 910, the information exploration system 100 tests the phrase exclusivity, i.e., whether the phrase appears in any leaf nodes that are not descendants of the current node. If the phrase is not exclusive, the information exploration system determines in block 912 whether the exclusivity requirement has been waived and replaced by a more relaxed coverage test, e.g., whether the phrase's coverage of the leaves of the current node is at least 20% higher than that phrase's coverage of leaves that are not descendants of the current cluster. If the exclusivity requirement has not been waived or the more relaxed coverage requirement is not satisfied, then the information exploration system returns to block 904.

[0049] Conversely, if the exclusivity or more relaxed coverage requirements are satisfied, then in block 914 the information exploration system 100 compares the candidate phrase to previously selected representative phrases for the current node. If the newly selected representative phrase is a superstring of one or more previously selected phrases, the one or more previously selected phrases are dropped unless the difference in cluster coverage exceeds a predetermined threshold, e.g., 20%. For example, if the newly selected phrase "chairman of the board" has a leaf cluster coverage of 30%, and the previously selected phrases "chairman" and "board" have leaf cluster coverages of 35% and 75%, respectively, the previously selected phrase "chairman" would be dropped and the previously selected phrase "board" would be retained due to the difference in leaf cluster coverage.
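The substring-dropping rule of block 914 can be sketched as follows, using the coverage values from the "chairman of the board" example above (word-level containment is simplified here to string containment):

```python
def apply_superstring_rule(selected, candidate, coverage, threshold=0.20):
    """When the candidate is a superstring of previously selected phrases,
    drop those substrings unless their leaf-cluster coverage exceeds the
    candidate's by more than the threshold."""
    kept = [p for p in selected
            if p not in candidate                              # not a substring
            or coverage[p] - coverage[candidate] > threshold]  # much better coverage
    return kept + [candidate]

coverage = {"chairman of the board": 0.30, "chairman": 0.35, "board": 0.75}
kept = apply_superstring_rule(["chairman", "board"],
                              "chairman of the board", coverage)
# "chairman" is dropped (0.35 - 0.30 <= 0.20);
# "board" is retained (0.75 - 0.30 > 0.20)
```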

[0050] In block 914, the information exploration system 100 also determines whether the newly selected phrase has more than 60% of its significant words (i.e., words that are not stop words) appearing in any one previously selected phrase. If so, the newly selected phrase will be dropped. Of course, the overlap threshold is programmable and can be set to other values.
If the candidate is dropped, the information exploration system returns to block 904.
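The significant-word overlap test of block 914 can be sketched as below; the stop-word list is illustrative only:

```python
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative list

def overlaps_too_much(candidate, selected, threshold=0.60):
    """True if more than `threshold` of the candidate's significant
    (non-stop) words appear in any single previously selected phrase."""
    significant = [w for w in candidate.split() if w not in STOP_WORDS]
    if not significant:
        return False
    for phrase in selected:
        words = set(phrase.split())
        if sum(w in words for w in significant) / len(significant) > threshold:
            return True
    return False

# both significant words of "board of directors" appear in one prior phrase
drop = overlaps_too_much("board of directors", ["directors board meeting"])
```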

[0051] If the previous tests are satisfied, the information exploration system 100 may further apply a test for path-wise uniqueness in optional block 916. In optional block 916, the information exploration system 100 drops candidate phrases that are path-wise non-unique.
When the process proceeds in a top-down fashion, the current node represents one end of a path from the root node. To aid in the exploration of the document set, the representative phrases used to label clusters preferably change as the clusters become smaller and more focused. Accordingly, the information exploration system 100 in block 916 drops phrases from the phrase index if those phrases have already been selected as representative of a previous cluster in the path. Thus a user, in following any given path from root node to leaf node in the clustering tree, will not encounter any representative phrase more than once, making the representative phrases "path-wise" unique.
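The path-wise uniqueness constraint reduces to filtering candidates against the labels already chosen along the root-to-node path; a minimal sketch with hypothetical labels:

```python
def pathwise_unique(candidates, ancestor_labels):
    """Remove phrases already selected for any ancestor cluster on the
    current root-to-node path, so no label repeats along any path."""
    used = set()
    for labels in ancestor_labels:
        used.update(labels)
    return [p for p in candidates if p not in used]

# "ate" and "cheese" already label ancestor clusters; only "cat ate" remains
remaining = pathwise_unique(["cat ate", "ate", "cheese"],
                            [["ate"], ["cheese"]])
```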

[0052] Other uniqueness tests could be used. For example, in a bottom-up process, the information system may drop phrases from the phrase index if those phrases have been selected as representative of any subordinate clusters of the current cluster, i.e., "descendants" in the tree such as children, grandchildren, etc. When clustering quality is high, the uniqueness of the representative phrases is expected to be inherent in the exclusionary phrase indices, and accordingly, block 916 may be treated as an optional operation.

[0053] In optional block 918, the information exploration system 100 determines whether one or more sub-clusters are being underrepresented by the selected phrases. For example, if five representative phrases are to be selected to represent a given cluster, and all of the five phrases that have been selected have less than 10% coverage of a given sub-cluster's leaves, the newly selected phrase may be dropped in favor of the next highest-scoring phrase having at least a 25% coverage of the given sub-cluster's leaves. The representation thresholds are programmable and may be allowed to vary based on cluster size. In at least some embodiments, at least one "slot" is reserved for each sub-cluster to assure that none of the sub-clusters go without representation in the cluster label. The number of reserved slots is thus equal to the chosen branching factor for the clustering tree. In some implementations of these embodiments, the reserved slot may be released for general use if a previously-selected phrase provides high coverage of the associated sub-cluster's leaf nodes.

[0054] In block 920, the information exploration system determines whether the selection of representative phrases for the current clustering tree node is complete. In some embodiments, this determination is simply a comparison with the desired number of representational phrases for the current node. In other embodiments, this determination is a comparison of the overall coverage of the selected representational phrases to a desired coverage threshold. In still other embodiments, this determination is a comparison of the coverages of selected phrases with coverages of available phrases to determine a rough cost-to-benefit estimate for selecting additional representational phrases.

[0055] If the selection is not complete, the information exploration system loops back to block 904. Otherwise the information exploration system 100 determines in block 922 whether there are more nodes in the clustering tree. If so, the information exploration system 100 selects the next node in block 924. If not, the information exploration system terminates the cluster-naming process.

[0056] A number of programmable or user-selectable parameters may be tailored for various applications of the disclosed information exploration methods and systems. For example, the leaf cluster size may be altered to provide a trade-off between clustering quality and the number of clusters. The branching factor of the clustering tree and maximum tree depth can be altered to suit user tastes. Similarly, the number of representative phrases can be tailored to trade off between viewing ease and viewable detail. The cluster-naming processes disclosed herein are applicable to any cluster set, irrespective of how that cluster set is obtained. Thus, the disclosed cluster-naming processes can be used with various clustering methods, which in turn can each be based on various similarity measures. Both hierarchical and non-hierarchical clustering methods can be used, though the information exploration system 100 may be expected to perform best with mutually exclusive cluster sets, i.e., sets of clusters that do not have any documents in more than one cluster.

[0057] Numerous other variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims (30)

1. A computer-implemented information exploration method that comprises:
processing a set of documents to identify a hierarchy of clusters; and selecting for each cluster in the hierarchy of clusters one or more phrases from the set of documents as representative phrases for that cluster.
2. The method of claim 1, wherein the selecting comprises:
selecting a predetermined number of phrases that are pathwise unique.
3. The method of claim 1, wherein the selecting comprises:

selecting a predetermined number of phrases having a balanced representation of any subordinate clusters.
4. The method of claim 1, wherein the selecting comprises:
determining a score for each phrase; and selecting a predetermined number of highest-scoring phrases.
5. The method of claim 4, wherein the score is a function of phrase length absent any stop words.
6. The method of claim 4, wherein the score is a function of at least one factor in the set consisting of document coverage, leaf cluster coverage, phrase frequency, and phrase-vector-to-cluster-centroid similarity.
7. The method of claim 1, wherein the selecting comprises:
constructing a phrase-to-leaf node index for the hierarchy of clusters;
8. The method of claim 7, wherein the selecting further comprises:

constructing a suffix tree for each leaf node in the hierarchy of clusters.
9. The method of claim 1, further comprising:

providing an interactive user interface that represents the set of documents as one or more clusters that can be selected to reveal smaller clusters, each of which can be selected in turn to reveal still smaller clusters.
10. The method of claim 9, wherein the user interface provides a corresponding representation for each displayed cluster, wherein the representation comprises the representative phrases for the cluster.
11. The method of claim 10, wherein the representation for a displayed cluster comprises a hypertext link.
12. The method of claim 9, wherein the user interface displays titles of documents in a user-selected cluster if the user-selected cluster has no subordinate clusters in the hierarchy of clusters.
13. The method of claim 1, wherein the processing comprises a bisecting K-means clustering operation.
14. The method of claim 1, wherein the processing comprises at least one clustering operation from a set consisting of suffix tree clustering, divisive hierarchical clustering, agglomerative hierarchical clustering, K-means clustering, Buckshot clustering, and Fractionation clustering.
15. The method of claim 1, wherein the processing comprises:

calculating a pseudo-document vector for each document in the set of documents; and computing the hierarchy of clusters from the pseudo-document vectors.
16. The method of claim 7, wherein the representative phrases for any given cluster are indicated by the phrase-to-leaf node index to be absent from any leaf nodes that are not descendants of the given cluster.
17. An information exploration system that comprises:
a display;

a user input device;

a memory that stores software; and a processor coupled to the memory to execute the software, wherein the software configures the processor to interact with a user via the display and user input device, and wherein the software further configures the processor to:

process a set of documents to identify a hierarchy of clusters;
determine a phrase index for the hierarchy of clusters;

select one or more phrases from the phrase index as representative phrases for each cluster in the hierarchy of clusters; and display the representative phrases as cluster labels on the display.
18. The information exploration system of claim 17, wherein the representative phrases are selected to be pathwise unique.
19. The information exploration system of claim 17, wherein the representative phrases are exclusionary.
20. The information exploration system of claim 17, wherein as part of selecting representative phrases for a cluster, the software configures the processor to provide a balanced representation of immediate sub-clusters of that cluster.
21. The information exploration system of claim 17, wherein as part of selecting representative phrases for each cluster, the software configures the processor to determine a score for each phrase.
22. The information exploration system of claim 21, wherein the score is a function of at least one factor in the set consisting of document coverage, leaf-node coverage, phrase frequency, and phrase-vector-to-cluster-centroid similarity.
23. Application instructions on an information carrier medium, wherein the instructions, when executed, effect an information exploration interface, the application instructions comprising:

a clustering process that determines a hierarchy of clusters for a set of documents;

a cluster-naming process that selects representative phrases for each cluster in the hierarchy of clusters; and an exploration process that interactively displays the representative phrases to a user.
24. The application instructions of claim 23, wherein the cluster-naming process selects representative phrases by:

determining a phrase-to-leaf node index for the hierarchy of clusters;
scoring each phrase in the phrase index; and selecting representative phrases from the phrase index, wherein the representative phrases for any given cluster are indicated by the phrase-to-leaf node index to be absent from any leaf nodes that are not descendants of the given cluster.
25. The application instructions of claim 24, wherein the scores are a function of at least one factor in the set consisting of document coverage, leaf-node coverage, phrase frequency, and phrase-vector-to-cluster-centroid distance.
26. The application instructions of claim 23, wherein as part of selecting representative phrases, the application instructions ensure that the representative phrases are pathwise unique.
27. The application instructions of claim 23, wherein as part of selecting representative phrases, the application instructions ensure that the representative phrases are also balanced representations of sub-clusters.
28. A computer-implemented information exploration method that comprises:
processing a set of documents to identify a set of mutually exclusive clusters;
creating a phrase index for the set of mutually exclusive clusters; and selecting one or more phrases from the phrase index as representative phrases for each cluster in the set of mutually exclusive clusters.
29. The method of claim 28, wherein the representative phrases for any given cluster are indicated by the phrase index to be absent from clusters that are not descendants of the given cluster.
30. The method of claim 28, wherein the selecting comprises:
determining a score for each phrase; and selecting a predetermined number of highest-scoring phrases that satisfy an exclusivity criterion.
CA 2629999 2005-11-15 2006-11-15 Information exploration systems and methods Active CA2629999C (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US11/274,435 2005-11-15
US11274435 US7676463B2 (en) 2005-11-15 2005-11-15 Information exploration systems and method
PCT/US2006/044367 WO2007059225A3 (en) 2005-11-15 2006-11-15 Information exploration systems and methods

Publications (2)

Publication Number Publication Date
CA2629999A1 true true CA2629999A1 (en) 2007-05-24
CA2629999C CA2629999C (en) 2014-12-23

Family

ID=38042113

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2629999 Active CA2629999C (en) 2005-11-15 2006-11-15 Information exploration systems and methods

Country Status (4)

Country Link
US (1) US7676463B2 (en)
CA (1) CA2629999C (en)
GB (1) GB0810333D0 (en)
WO (1) WO2007059225A3 (en)

Families Citing this family (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US7603351B2 (en) * 2006-04-19 2009-10-13 Apple Inc. Semantic reconstruction
US8131722B2 (en) * 2006-11-20 2012-03-06 Ebay Inc. Search clustering
US20080208847A1 (en) * 2007-02-26 2008-08-28 Fabian Moerchen Relevance ranking for document retrieval
US8166021B1 (en) 2007-03-30 2012-04-24 Google Inc. Query phrasification
US8166045B1 (en) 2007-03-30 2012-04-24 Google Inc. Phrase extraction using subphrase scoring
US7693813B1 (en) 2007-03-30 2010-04-06 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US8510312B1 (en) * 2007-09-28 2013-08-13 Google Inc. Automatic metadata identification
US7814108B2 (en) * 2007-12-21 2010-10-12 Microsoft Corporation Search engine platform
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US20090240498A1 (en) * 2008-03-19 2009-09-24 Microsoft Corporation Similiarity measures for short segments of text
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US8676815B2 (en) * 2008-05-07 2014-03-18 City University Of Hong Kong Suffix tree similarity measure for document clustering
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US20110078144A1 (en) * 2009-09-28 2011-03-31 Oracle International Corporation Hierarchical sequential clustering
US20110074789A1 (en) * 2009-09-28 2011-03-31 Oracle International Corporation Interactive dendrogram controls
US20110078194A1 (en) * 2009-09-28 2011-03-31 Oracle International Corporation Sequential information retrieval
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US8775444B2 (en) * 2010-10-29 2014-07-08 Xerox Corporation Generating a subset aggregate document from an existing aggregate document
US8751496B2 (en) 2010-11-16 2014-06-10 International Business Machines Corporation Systems and methods for phrase clustering
JP5617674B2 (en) * 2011-02-14 2014-09-26 日本電気株式会社 Article between similarity calculation device, the inter-document similarity calculation method, and, the inter-document similarity calculation program
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
JP5389130B2 (en) * 2011-09-15 2014-01-15 株式会社東芝 Document classification apparatus, method and program
JP5639562B2 (en) * 2011-09-30 2014-12-10 株式会社東芝 Service execution unit, the service execution method and a service execution program
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US8700583B1 (en) 2012-07-24 2014-04-15 Google Inc. Dynamic tiermaps for large online databases
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
WO2014074917A1 (en) * 2012-11-08 2014-05-15 Cooper & Co Ltd Edwin System and method for divisive textual clustering by label selection using variant-weighted tfidf
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
US9116974B2 (en) * 2013-03-15 2015-08-25 Robert Bosch Gmbh System and method for clustering data in input and output spaces
US9501506B1 (en) 2013-03-15 2016-11-22 Google Inc. Indexing system
US9122681B2 (en) 2013-03-15 2015-09-01 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US9483568B1 (en) 2013-06-05 2016-11-01 Google Inc. Indexing system
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
WO2014197334A3 (en) 2013-06-07 2015-01-29 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
JP2016521948A (en) 2013-06-13 2016-07-25 アップル インコーポレイテッド System and method for emergency call initiated by voice command
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US9606986B2 (en) 2014-09-29 2017-03-28 Apple Inc. Integrated word N-gram and class M-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks

Family Cites Families (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3220885B2 (en) 1993-06-18 2001-10-22 株式会社日立製作所 Keywords grant system
US7251637B1 (en) * 1993-09-20 2007-07-31 Fair Isaac Corporation Context vector generation and retrieval
EP0856175A4 (en) * 1995-08-16 2000-05-24 Univ Syracuse Multilingual document retrieval system and method using semantic vector matching
US5787422A (en) * 1996-01-11 1998-07-28 Xerox Corporation Method and apparatus for information accesss employing overlapping clusters
US5819258A (en) * 1997-03-07 1998-10-06 Digital Equipment Corporation Method and apparatus for automatically generating hierarchical categories from large document collections
US6137911A (en) * 1997-06-16 2000-10-24 The Dialog Corporation Plc Test classification system and method
US6134532A (en) * 1997-11-14 2000-10-17 Aptex Software, Inc. System and method for optimal adaptive matching of users to most relevant entity and information in real-time
US6216134B1 (en) * 1998-06-25 2001-04-10 Microsoft Corporation Method and system for visualization of clusters and classifications
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
US6360227B1 (en) * 1999-01-29 2002-03-19 International Business Machines Corporation System and method for generating taxonomies with applications to content-based recommendations
WO2000046701A1 (en) * 1999-02-08 2000-08-10 Huntsman Ici Chemicals Llc Method for retrieving semantically distant analogies
US6374217B1 (en) * 1999-03-12 2002-04-16 Apple Computer, Inc. Fast update implementation for efficient latent semantic language modeling
US6408295B1 (en) * 1999-06-16 2002-06-18 International Business Machines Corporation System and method of using clustering to find personalized associations
US6438539B1 (en) * 2000-02-25 2002-08-20 Agents-4All.Com, Inc. Method for retrieving data from an information network through linking search criteria to search strategy
US6658406B1 (en) * 2000-03-29 2003-12-02 Microsoft Corporation Method for selecting terms from vocabularies in a category-based system
JP3672234B2 (en) * 2000-06-12 2005-07-20 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Maschines Corporation Retrieve rank with how the document from the database, the computer system, and a recording medium
KR100426382B1 (en) * 2000-08-23 2004-04-08 학교법인 김포대학 Method for re-adjusting ranking document based cluster depending on entropy information and Bayesian SOM(Self Organizing feature Map)
US6895406B2 (en) * 2000-08-25 2005-05-17 Seaseer R&D, Llc Dynamic personalization method of creating personalized user profiles for searching a database of information
US7039638B2 (en) * 2001-04-27 2006-05-02 Hewlett-Packard Development Company, L.P. Distributed data clustering system and method
US6742003B2 (en) * 2001-04-30 2004-05-25 Microsoft Corporation Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
US7024400B2 (en) * 2001-05-08 2006-04-04 Sunflare Co., Ltd. Differential LSI space-based probabilistic document classifier
JP3845553B2 (en) * 2001-05-25 2006-11-15 International Business Machines Corporation Computer system and program for retrieving and ranking documents in a database
JP3870043B2 (en) * 2001-07-05 2007-01-17 International Business Machines Corporation System, computer program, and server for searching, detecting, and identifying main clusters and outlier clusters in a large-scale database
US20030093411A1 (en) * 2001-11-09 2003-05-15 Minor James M. System and method for dynamic data clustering
US20030154181A1 (en) * 2002-01-25 2003-08-14 Nec Usa, Inc. Document clustering with cluster refinement and model selection capabilities
US7480628B2 (en) * 2002-01-29 2009-01-20 Netcomponents, Inc. Smart multi-search method and system
JP3860046B2 (en) * 2002-02-15 2006-12-20 International Business Machines Corporation Program, system, and recording medium for information processing using random-sampling hierarchical structures
US6847966B1 (en) * 2002-04-24 2005-01-25 Engenium Corporation Method and system for optimally searching a document database using a representative semantic space
US7177863B2 (en) * 2002-04-26 2007-02-13 International Business Machines Corporation System and method for determining internal parameters of a data clustering program
JP3773888B2 (en) * 2002-10-04 2006-05-10 International Business Machines Corporation Data search system, data search method, program for executing a data search, computer-readable storage medium storing the program, graphical user interface system for displaying retrieved documents, computer-executable program for realizing the graphical user interface, and storage medium storing the program
WO2004042493A3 (en) * 2002-10-24 2006-03-02 Agency Science Tech & Res Method and system for discovering knowledge from text documents
US7280957B2 (en) * 2002-12-16 2007-10-09 Palo Alto Research Center, Incorporated Method and apparatus for generating overview information for hierarchically related information
US7225184B2 (en) * 2003-07-18 2007-05-29 Overture Services, Inc. Disambiguation of search phrases using interpretation clusters
US20050044487A1 (en) * 2003-08-21 2005-02-24 Apple Computer, Inc. Method and apparatus for automatic file clustering into a data-driven, user-specific taxonomy
US7346629B2 (en) * 2003-10-09 2008-03-18 Yahoo! Inc. Systems and methods for search processing using superunits
US7191175B2 (en) * 2004-02-13 2007-03-13 Attenex Corporation System and method for arranging concept clusters in thematic neighborhood relationships in a two-dimensional visual display space
US20050222987A1 (en) * 2004-04-02 2005-10-06 Vadon Eric R Automated detection of associations between search criteria and item categories based on collective analysis of user activity data
US20050267872A1 (en) * 2004-06-01 2005-12-01 Yaron Galai System and method for automated mapping of items to documents
US7567959B2 (en) * 2004-07-26 2009-07-28 Google Inc. Multiple index based information retrieval system
US7580929B2 (en) * 2004-07-26 2009-08-25 Google Inc. Phrase-based personalization of searches in an information retrieval system
US7711679B2 (en) * 2004-07-26 2010-05-04 Google Inc. Phrase-based detection of duplicate documents in an information retrieval system
CN100462961C (en) * 2004-11-09 2009-02-18 International Business Machines Corporation Method for organizing multiple files and device for displaying multiple files
US7356777B2 (en) * 2005-01-26 2008-04-08 Attenex Corporation System and method for providing a dynamic user interface for a dense three-dimensional scene
US7451124B2 (en) * 2005-05-12 2008-11-11 Xerox Corporation Method of analyzing documents
US8010480B2 (en) * 2005-09-30 2011-08-30 Google Inc. Selecting high quality text within identified reviews for display in review snippets
US20070078669A1 (en) * 2005-09-30 2007-04-05 Dave Kushal B Selecting representative reviews for display
US20070078670A1 (en) * 2005-09-30 2007-04-05 Dave Kushal B Selecting high quality reviews for display
US7558769B2 (en) * 2005-09-30 2009-07-07 Google Inc. Identifying clusters of similar reviews and displaying representative reviews from multiple clusters
US7599945B2 (en) * 2006-11-30 2009-10-06 Yahoo! Inc. Dynamic cluster visualization

Also Published As

Publication number Publication date Type
WO2007059225A2 (en) 2007-05-24 application
WO2007059225A3 (en) 2009-05-07 application
CA2629999C (en) 2014-12-23 grant
GB2452799A (en) 2009-03-18 application
US7676463B2 (en) 2010-03-09 grant
GB0810333D0 (en) 2008-07-09 grant
US20070112755A1 (en) 2007-05-17 application

Similar Documents

Publication Publication Date Title
Hotho et al. Ontology-based text document clustering
Zezula et al. Similarity search: the metric space approach
Huang Similarity measures for text document clustering
US7117206B1 (en) Method for ranking hyperlinked pages using content and connectivity analysis
US7693813B1 (en) Index server architecture using tiered and sharded phrase posting lists
Ding et al. Link analysis: hubs and authorities on the World Wide Web
US8166045B1 (en) Phrase extraction using subphrase scoring
Aggarwal et al. A survey of text clustering algorithms
Hofmann The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data
Chen et al. Dense subgraph extraction with application to community detection
Börner et al. Visualizing knowledge domains
US6038574A (en) Method and apparatus for clustering a collection of linked documents using co-citation analysis
Mecca et al. A new algorithm for clustering search results
Leung et al. Personalized web search with location preferences
US6611825B1 (en) Method and system for text mining using multidimensional subspaces
US7225183B2 (en) Ontology-based information management system and method
Pan et al. Gcap: Graph-based automatic image captioning
Almpanidis et al. Combining text and link analysis for focused crawling—An application for vertical search engines
Batsakis et al. Improving the performance of focused web crawlers
Lagus et al. Mining massive document collections by the WEBSOM method
Chang Mining the World Wide Web: an information search approach
Zhang et al. BIRCH: A new data clustering algorithm and its applications
US20050114331A1 (en) Near-neighbor search in pattern distance spaces
Khan et al. Ontology construction for information selection
Navigli et al. Inducing word senses to improve web search result clustering

Legal Events

Date Code Title Description
EEER Examination request