WO2019161258A1 - Guided discovery of information - Google Patents

Guided discovery of information

Info

Publication number
WO2019161258A1
Authority
WO
WIPO (PCT)
Prior art keywords
documents
words
occurrences
concept
word
Prior art date
Application number
PCT/US2019/018294
Other languages
French (fr)
Inventor
Patrick SHAFTO
Scott Hsin-Cheng YANG
Yue Yu
Pei Wang
Arash GIVCHI
Original Assignee
Rutgers, The State University Of New Jersey
Priority date
Filing date
Publication date
Application filed by Rutgers, The State University Of New Jersey
Publication of WO2019161258A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • FIG. 7 is a flow chart illustrating a method of guiding a search of a set of documents, according to an example embodiment.
  • The example method 700 includes determining 705 occurrences of concepts in the set of documents. Based on the occurrences of concepts, a first concept identifier associated with a first portion of the documents is determined 710. Second through N-1 concept identifiers are then determined 715 that are associated with second through N-1 portions of the documents, respectively. Each of the second through N-1 portions of the documents is a portion of remaining documents if the preceding portions of the documents were removed from the set of documents. An Nth concept identifier is then determined 720 that is associated with an Nth portion of the documents.
  • The Nth portion of the documents includes documents remaining if the first through N-1 portions of the documents were removed from the set of documents.
  • The first through Nth concept identifiers have varying levels of abstraction with respect to the set of documents. For a given concept identifier M, concept identifier M is broader than concept identifier M-1.
  • The first through Nth concept identifiers are then presented 725 as options for narrowing the set of documents.
  • FIG. 8 illustrates a computer network or similar digital processing environment in which the present embodiments may be implemented.
  • Client computer(s)/devices 50 and server computer(s) 60 provide processing, storage, and input/output devices executing application programs and the like.
  • Client computer(s)/devices 50 can also be linked through communications network 70 to other computing devices, including other client computer(s)/devices 50 and server computer(s) 60.
  • Communications network 70 can be part of a remote access network, a global network (e.g., the Internet), cloud computing servers or service, a worldwide collection of computers, local area or wide area networks, and gateways that currently use respective protocols (TCP/IP, Bluetooth, etc.) to communicate with one another.
  • Other electronic device/computer network architectures are suitable.
  • FIG. 9 is a diagram of the internal structure of a computer (e.g., client processor/device 50 or server computers 60) in the computer system of FIG. 8.
  • Each computer 50, 60 contains system bus 79, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system.
  • Bus 79 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, and network ports) that enables the transfer of information between the elements.
  • Attached to system bus 79 is I/O device interface 82 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, and speakers) to the computer 50, 60.
  • Network interface 86 allows the computer to connect to various other devices attached to a network (e.g., network 70 of FIG. 8).
  • Memory 90 provides volatile storage for computer software instructions 92 and data 94 used to implement many embodiments (e.g., code detailed above and in FIGS. 1-7, including example routines 100, 200, 600, and 700 and example data structure 300).
  • Disk storage 95 provides non-volatile storage for computer software instructions 92 and data 94 used to implement many embodiments.
  • Central processor unit 84 is also attached to system bus 79 and provides for the execution of computer instructions.
  • The computer 50, 60 can include a system for providing guided searching of a set of documents.
  • Components of the system include an interface 82, a datastore 90, 95, and a processor 84.
  • The interface 82 can be configured to present information to a user of the system and to accept input from the user.
  • The datastore 90, 95 can store (i) the set of documents or links to the documents and (ii) a data structure representing occurrences of words in the set of documents.
  • The processor 84 can be configured to determine, based on the occurrences of words, at least three words. A first of the words is associated with a first portion of the documents. A second of the words is associated with a second portion of the documents.
  • The second portion of the documents is a portion of remaining documents if the first portion of the documents were removed from the set of documents.
  • A third of the words is associated with a third portion of the documents.
  • The third portion of the documents includes documents remaining if at least the first and second portions of the documents were removed from the set of documents.
  • The first, second, and third words have varying levels of abstraction with respect to the set of documents, where the second word is broader than the first word, and the third word is broader than the first and second words.
  • The processor 84 can be further configured to cause the interface 82 to present the at least three words as options for narrowing the set of documents.
  • The processor routines 92 and data 94 are a computer program product (generally referenced 92), including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROM’s, CD-ROM’s, diskettes, and tapes) that provides at least a portion of the software instructions for the system.
  • Computer program product 92 can be installed by any suitable software installation procedure, as is well known in the art.
  • At least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection.
  • In some embodiments, the programs are a computer program propagated signal product 75 (FIG. 8) embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)).
  • Such carrier medium or signals provide at least a portion of the software instructions for the routines/program 92.
  • The propagated signal is an analog carrier wave or digital signal carried on the propagated medium.
  • The propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network.
  • The propagated signal is a signal that is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer.
  • The computer readable medium of computer program product 92 is a propagation medium that the computer system 50 may receive and read, such as by receiving the propagation medium and identifying a propagated signal embodied in the propagation medium, as described above for the computer program propagated signal product.
  • The term “carrier medium” or “transient carrier” encompasses the foregoing transient signals, propagated signals, propagated medium, storage medium, and the like.
  • The program product 92 may be implemented as Software as a Service (SaaS), or other installation or communication supporting end-users.
  • The disclosed methods and systems can be applied to any discrete or discretizable domain, such as, for example, images, video, medical data, financial data, military intelligence, and information on the internet.
  • The target items being searched through can be represented in a data structure as the columns of a matrix, and the rows can be any partial or complete representation of those target items, or vice-versa.
  • If each row simply identifies a single target item, the result is a diagonal matrix, which reduces to the trivial search problem of looking through all documents one-by-one.
  • When the rows are representations of items that span documents, we obtain more interesting search tasks.
  • In the case of text, each word tends to appear in multiple documents, and the problem is then to choose words that facilitate search given the pattern of occurrence of words in documents.
  • Rows may also abstract away from specific words to words of similar meanings (e.g., dog, puppy, doggie, German shepherd).
  • A machine learning algorithm that infers a discrete latent representation of the corpus can, therefore, be used to guide a search.
  • A standard document is a conjunction of words, which makes the documents-words format natural for the columns and rows.
  • In other domains, a standard row is a conjunction of feature-value pairs.
  • Each target to be searched over can be viewed as a conjunction of the specific elements that appear in it, and the corpus is the union of that. (This fact has a name in machine learning: the empirical distribution.)
  • The unique specific elements can be enumerated to form rows of the matrix over which a search may operate.
  • Machine learning algorithms can be used to create abstractions to facilitate search.
  • The machine learning algorithms infer a latent structure of the domain, which can be used as input into our algorithm to guide search over the domain. Because even continuous (a.k.a. numeric) distributions are approximated by their empirical distributions, these qualify as discretizable.
  • N can be a number representing how many options are offered to the user at each step. Given the value of N, there is an optimal structure for the domain, which ensures that the user has to make the minimal number of choices to find any target. This is a unique representation, and can be used to constrain any machine learning method used to define concepts that summarize the domain, ensuring optimal performance in guided search (see the sketch following this list).
  • Guided learning has a variety of applications, including facilitating human learning in educational contexts.
  • One other application is for explainable Artificial Intelligence (AI), where the goal is to provide a human-understandable explanation of what an AI system has learned or why it made a decision.
  • Guided learning can be used to reduce the large dataset used to train an AI system to a manageable subset that is representative of the full dataset, which can provide users with an understanding of the domain, as well as the capabilities of the AI system itself.
  • The explanation can take the form of guided learning about cases in the dataset that are similar (as defined by the system).
  • This guided approach can be used to explain machine learning algorithms, including deep learning.
  • Often, latent concepts are not human-interpretable, so a key element for explaining these models is using specific cases in the dataset to guide learning about the layers of the network.
  • The optimal structure for guided search also applies to learning, and this can be used to construct machine learning (including deep learning) systems that are optimized for guided search and guided learning.
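As a back-of-envelope illustration of the "minimal number of choices" point above (our reading of the claim, not a formula stated in the patent): if each step offers N options and each choice narrows the remaining set by roughly a factor of N, then any single target among D items is reached in about ceil(log_N D) choices.

```python
import math

# Hypothetical back-of-envelope sketch: steps needed to isolate one
# target when each user choice narrows the candidate set by roughly a
# factor of N (the number of options offered per step).

def steps_to_target(num_items: int, options_per_step: int) -> int:
    return math.ceil(math.log(num_items, options_per_step))

print(steps_to_target(1_000_000, 3))  # 13 choices for a million documents
```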

Abstract

The methods and systems disclosed herein provide an improved way to facilitate searching of information. Such information may be in the form of, for example, a database, texts, collection of images or videos, internet resources, other source of quantified information (i.e., data), or a combination thereof. One example embodiment is a method of guiding a search of a dataset. The example method includes partitioning the dataset into a plurality of portions based on occurrences of words or concepts in the dataset. Identifiers of words or concepts that are associated with each of the portions are determined based on the occurrences of words or concepts. The identifiers have varying levels of abstraction with respect to the dataset, where each identifier is broader than a preceding identifier. The identifiers are presented as options for narrowing the dataset.

Description

GUIDED DISCOVERY OF INFORMATION
RELATED APPLICATION(S)
[0001] This application claims the benefit of U.S. Provisional Application No.
62/631,610, filed on February 16, 2018. The entire teachings of the above application(s) are incorporated herein by reference.
GOVERNMENT SUPPORT
[0002] This invention was made with government support under FA8750-17-2-0146 awarded by DARPA. The government has certain rights in the invention.
BACKGROUND
[0003] Searching a corpus of information typically involves using keyword searches that return results matching one or more terms that a user selects for the search. One drawback with this searching method is that the results returned depend on the user’s selection of terms. If the user does not know that a particular term will return a desirable result, the user may not ever be provided with such a result.
SUMMARY
[0004] The methods and systems disclosed herein provide an improved way to facilitate searching of information. Such information may be in the form of a database, texts, collection of images or videos, internet resources, other source of quantified information (i.e., data), or a combination thereof. One example embodiment is a method of guiding a search of a dataset. The example method includes partitioning the dataset into a plurality of portions based on occurrences of words or concepts in the dataset. Identifiers of words or concepts that are associated with each of the portions are determined based on the occurrences of words or concepts. The identifiers have varying levels of abstraction with respect to the dataset, where each identifier is broader than a preceding identifier. The identifiers are presented as options for narrowing the dataset. In some embodiments, the dataset is a set of documents. In such an embodiment, the method includes partitioning the set of documents into a plurality of portions based on occurrences of words or concepts in the set of documents. Identifiers of words or concepts that are associated with each of the portions are determined based on the occurrences of words or concepts, and the identifiers are presented as options for narrowing the set of documents.
[0005] Another example embodiment is a method of guiding a search of a set of documents. The example method includes determining occurrences of words in the set of documents. Based on the occurrences of words, a first word associated with a first portion of the documents is determined, and a second word associated with a second portion of the documents is determined, where the second portion of the documents is a portion of remaining documents if the first portion of the documents were removed from the set of documents. At least a third word is determined, based on the occurrences of words, that is associated with a third portion of the documents, where the third portion of the documents includes documents remaining if the first and second portions of the documents were removed from the set of documents. The first, second, and third words have varying levels of abstraction with respect to the set of documents, where the second word is broader than the first word, and the third word is broader than the first and second words. The first, second, and third words are then presented as options for narrowing the set of documents. In many embodiments, documents can be returned that are associated with a selection of one of the words, and in some embodiments, the number of documents associated with each of the first, second, and third words can be presented. The first, second, and third portions of the documents may include a similar number of documents.
[0006] Determining the occurrences of words in the set of documents can include generating a matrix of document identifiers and words. A value in the matrix for a given document identifier and word can be set to a certain value if the word is present in the document identified by the document identifier.
[0007] In response to a selection of one of the words, a subsequent set of words can be determined to present as options for further narrowing documents associated with the selected word. The subsequent set of words can be determined based on the occurrences of words with respect to the documents associated with the selected word. In some embodiments, for each of the first, second, and third words, a corresponding subsequent set of words can be determined to present as options for further narrowing documents associated with the word. The subsequent set of words can be determined based on the occurrences of words with respect to the documents associated with the word.
[0008] Another example embodiment is a method of guiding a search of a set of documents. The example method includes determining occurrences of concepts in the set of documents. Based on the occurrences of concepts, a first concept identifier associated with a first portion of the documents is determined. Second through N-1 concept identifiers are then determined that are associated with second through N-1 portions of the documents, respectively. Each of the second through N-1 portions of the documents is a portion of remaining documents if the preceding portions of the documents were removed from the set of documents. An Nth concept identifier is then determined that is associated with an Nth portion of the documents. The Nth portion of the documents includes documents remaining if the first through N-1 portions of the documents were removed from the set of documents. The first through Nth concept identifiers have varying levels of abstraction with respect to the set of documents. For a given concept identifier M, concept identifier M is broader than concept identifier M-1. The first through Nth concept identifiers are then presented as options for narrowing the set of documents.
[0009] Another example embodiment is a system for providing guided searching of a set of documents. The system includes an interface, datastore, and processor. The interface is configured to present information to a user of the system and to accept input from the user. The datastore stores (i) the set of documents or links to the documents and (ii) a data structure representing occurrences of words in the set of documents. The processor is in
communication with the datastore and the interface and is configured to determine, based on the occurrences of words, at least three words. A first of the words is associated with a first portion of the documents. A second of the words is associated with a second portion of the documents, where the second portion of the documents is a portion of remaining documents if the first portion of the documents were removed from the set of documents. A third of the words is associated with a third portion of the documents, where the third portion of the documents includes documents remaining if at least the first and second portions of the documents were removed from the set of documents. The first, second, and third words have varying levels of abstraction with respect to the set of documents, where the second word is broader than the first word, and the third word is broader than the first and second words.
The processor is further configured to cause the interface to present the at least three words as options for narrowing the set of documents.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
[0011] FIG. 1 is a flow chart illustrating a method of guiding a search of a dataset, according to an example embodiment.
[0012] FIG. 2 is a flow chart illustrating a method of guiding a search of a set of documents, according to an example embodiment.
[0013] FIG. 3 is a schematic diagram illustrating a data structure representing
occurrences of concepts in a dataset, according to an example embodiment.
[0014] FIG. 4 is a schematic diagram illustrating a set of identifiers of concepts that are associated with portions of a dataset, according to an example embodiment.
[0015] FIG. 5 is a schematic diagram illustrating portions of a dataset and concepts associated with those portions, according to an example embodiment.
[0016] FIG. 6 is a flow chart illustrating a method of guiding a search of a set of documents, according to an example embodiment.
[0017] FIG. 7 is a flow chart illustrating a method of guiding a search of a set of documents, according to an example embodiment.
[0018] FIG. 8 is a schematic view of a computer network environment in which the example embodiments presented herein can be implemented.
[0019] FIG. 9 is a block diagram illustrating an example computer node of the network of FIG. 8.
DETAILED DESCRIPTION
[0020] A description of example embodiments follows.
[0021] Human learning is characterized by the cooperative transmission of data. In addition to direct observations and taking actions in one's own environment, humans also engage in purposeful selection of data with the goal of conveying knowledge about the world to less knowledgeable agents. Moreover, less knowledgeable agents assume purposeful, cooperative selection and leverage cooperation to augment learning. The cooperative selection of data, and learning from such data, plays a central role in theories of cognition, cognitive development, and cultural evolution. Indeed, this cooperative inference is argued to be the feature that drives accumulation of knowledge over generations.
[0022] The disclosed methods and systems facilitate discovery of information by, for example, when provided with an initial datum, suggesting subsequent candidate data from which a user can choose. The suggestions are designed to converge rapidly on a source of greatest interest for the user. For example, consider the problem of text retrieval. The user can provide a word that describes the general concept of interest. A set of candidate second words can be provided to add to the query. These candidates may optimize the chances that the user will be provided the document that is closest to the concept of interest. Put another way, the set of options can differentiate amongst the documents that are consistent with the previous elements of the query. Alternatively, instead of being in response to an initial word that describes the general concept of interest, a set of candidate initial words can be provided to the user.
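As a toy illustration of this refinement loop, the sketch below suggests candidate next words that differentiate among the documents still consistent with the query so far. The corpus, the function names, and the even-split scoring criterion are our assumptions for illustration; the patent describes more principled, triangular-structure-based criteria below.

```python
# Toy sketch of the query-refinement loop (names and scoring are
# illustrative assumptions, not the patent's procedure).

def consistent_docs(corpus, query):
    """Documents containing every word chosen so far."""
    return {d: words for d, words in corpus.items() if query <= words}

def candidate_words(corpus, query, k=3):
    """Suggest up to k next words, preferring words whose coverage is
    close to an even split of the remaining documents."""
    docs = consistent_docs(corpus, query)
    counts = {}
    for words in docs.values():
        for w in words - query:
            counts[w] = counts.get(w, 0) + 1
    target = len(docs) / 2
    return sorted(counts, key=lambda w: abs(counts[w] - target))[:k]

corpus = {
    "d1": {"dogs", "large", "setters"},
    "d2": {"dogs", "large"},
    "d3": {"dogs", "small"},
    "d4": {"cats", "small"},
}
print(candidate_words(corpus, {"dogs"}))  # e.g. ['large', 'setters', 'small']
```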
[0023] The disclosed methods and systems provide more rapid and effective search through massive repositories of information, which provides a solution for every field that has been affected by the big data revolution. Example practical applications include information filtering (the internet), recommender systems (e-commerce), and as a core approach to internet-scale educational technology.
[0024] FIG. 1 is a flow chart illustrating a method 100 of guiding a search of a dataset, according to an example embodiment. The example method 100 includes partitioning 105 the dataset into a plurality of portions based on occurrences of words or concepts in the dataset. Identifiers of words or concepts that are associated with each of the portions are determined 110 based on the occurrences of words or concepts. The identifiers have varying levels of abstraction with respect to the dataset, where each identifier is broader than a preceding identifier. The identifiers are presented 115 as options for narrowing the dataset.
[0025] FIG. 2 is a flow chart illustrating a method of guiding a search of a set of documents, according to an example embodiment. The example method 200 includes partitioning 205 the set of documents into a plurality of portions based on occurrences of words or concepts in the set of documents. Identifiers of words or concepts that are associated with each of the portions are determined 210 based on the occurrences of words or concepts. The identifiers have varying levels of abstraction with respect to the set of documents, where each identifier is broader than a preceding identifier. The identifiers are presented 215 as options for narrowing the set of documents.
[0026] FIG. 3 is a schematic diagram illustrating a data structure 300 representing occurrences of concepts 310 in a dataset 305. Occurrences of concepts 310 (e.g., words) in a dataset 305 (e.g., set of documents) can be represented by a data structure 300, such as a matrix. The data structure 300 can include identifiers of data 315 in the data set (e.g., document identifiers) and identifiers of concepts 310 (e.g., words). A value in the matrix for a given data identifier and concept identifier can be set to a certain value if the concept is present in the data identified by the data identifier. For example, the value can be set to “1” if the concept is present and “0” if the concept is not present.
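For concreteness, a minimal sketch of such an occurrence matrix follows; the tiny corpus and the use of NumPy are our assumptions, not part of the patent.

```python
import numpy as np

# Minimal sketch of the binary occurrence matrix described above:
# rows are concepts (here, words), columns are documents, and entry
# [i, j] is 1 if word i occurs in document j, else 0.

documents = {
    "d1": "irish setters are large dogs",
    "d2": "large dogs need daily exercise",
    "d3": "dogs are loyal companions",
}
doc_ids = sorted(documents)
vocab = sorted({w for text in documents.values() for w in text.split()})

occurrence = np.zeros((len(vocab), len(doc_ids)), dtype=np.int8)
for j, d in enumerate(doc_ids):
    present = set(documents[d].split())
    for i, w in enumerate(vocab):
        occurrence[i, j] = int(w in present)

print(occurrence.sum(axis=1))  # number of documents containing each word
```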
[0027] FIG. 4 is a schematic diagram illustrating a set of identifiers of concepts 405, 410, and 415 that are associated with portions 400 of a dataset. Using a representation 300 of the occurrences of concepts (e.g., words), a first concept (e.g., word) 405 associated with a first portion of the data (e.g., documents) can be determined, a second concept 410 associated with a second portion of the data can be determined, and at least a third concept 415 associated with a third portion of the data can be determined. Assuming there are three portions, the first concept 405 can be associated with about a third of the documents (subset II), the second concept 410 can be associated with about another third of the documents in addition to the first third (subset I), and the third concept 415 can be associated with about the remaining third of the documents, in addition to the first and second thirds (entire dataset).
As a specific example, the dataset could include documents regarding dogs. The first concept 405 may be “Irish setters,” the second concept 410 may be “large dogs,” and the third concept 415 may be “dogs.” The second concept 410 (large dogs) is broader than the first concept 405 (Irish setters), and the third concept 415 (dogs) is broader than both the first concept 405 (Irish setters) and the second concept 410 (large dogs). If a user were to choose the third concept 415 (dogs), it can be inferred that the user means, for example, “dogs but not large dogs.” If a user were to choose the second concept 410 (large dogs), it can be inferred that the user means, for example, “large dogs but not Irish setters.”
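This nested-subset reading of a user's choice can be made concrete with a small sketch; the document sets below are hypothetical stand-ins for the subsets of FIG. 4.

```python
# Hypothetical sketch of the inference above: choosing a broader concept
# is read as "that concept minus the narrower concepts already offered".

docs_for = {
    "irish setters": {"d1"},
    "large dogs": {"d1", "d2"},
    "dogs": {"d1", "d2", "d3"},
}
options = ["irish setters", "large dogs", "dogs"]  # narrowest to broadest

def inferred_docs(choice):
    idx = options.index(choice)
    narrower = set().union(set(), *(docs_for[o] for o in options[:idx]))
    return docs_for[choice] - narrower

print(inferred_docs("large dogs"))  # {'d2'}: large dogs but not Irish setters
print(inferred_docs("dogs"))        # {'d3'}: dogs but not large dogs
```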
[0028] FIG. 5 is a schematic diagram illustrating portions 505, 510, and 515 of a dataset and concepts 520 associated with those portions. To determine the set of concepts (e.g., words), a subset of N words can be selected that results in a triangular or near-triangular matrix 525. According to the example of FIG. 5, concept w1 is associated with a portion 505 of the data (d1-d4), concept w2 is associated with the middle portion 510, and concept w3 is associated with the last portion 515. It should be noted that concept w1 is broad enough to cover all portions 505, 510, and 515 of the data, and that concept w2 is broad enough to cover portions 510 and 515 of the data. This leverages principles of cooperative inference in humans to simultaneously search breadth and depth. This allows an inference from “user chooses w1” to “not d5-dm,” for example. The arrangement illustrated in FIG. 5 can be accomplished by searching the data over the set of concepts and selecting a set of N concepts that form an upper triangle 525. This can be optimized in a variety of ways. For example, preference may be given to even splits. Methods other than searching over all data (e.g., randomization) can also be used. On subsequent steps, the same operation can be performed on the subset of the dataset determined by a user’s choice of word.
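One simple way to test whether N candidate words admit the upper-triangular arrangement of FIG. 5 is to check that their document sets nest, broadest to narrowest. The sketch below is our illustration under that reading; it is not the patent's exact selection procedure.

```python
import numpy as np

# Sketch: N candidate words (rows of a binary word-document matrix) can
# be permuted into the upper-triangular pattern of FIG. 5 when their
# document sets form a nested chain, with the broadest word covering
# every document.

def is_near_triangular(rows):
    """rows: N x D binary array, one row per candidate word."""
    order = np.argsort(-rows.sum(axis=1))  # broadest first
    rows = rows[order]
    for k in range(1, len(rows)):
        # Every document containing the narrower word must also contain
        # the broader word, else no column permutation yields a triangle.
        if np.any(rows[k] > rows[k - 1]):
            return False
    return bool(rows[0].all())  # broadest word covers all documents

m = np.array([[1, 1, 1, 1, 1, 1],   # w1: covers d1-d6
              [0, 0, 1, 1, 1, 1],   # w2: covers d3-d6
              [0, 0, 0, 0, 1, 1]])  # w3: covers d5-d6
print(is_near_triangular(m))  # True
```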
[0029] The following are examples of partitioning a dataset and determining concepts associated with the partitions. One example is based on the Gibbs sampler approach for probabilistic topic models. The triangular patterned matrix is used to describe a pattern of zero and non-zero values for a hyperparameter typically notated as “alpha.” This is a parameter of the Multinomial-Dirichlet distribution on topics given documents. Zero values indicate that the corresponding topic cannot appear in the particular document. The interpretation of the matrix is then one of a structured form of induced sparsity. Specifically, the triangular pattern indicates that topics are structured such that some are shared across many documents, while others can only appear in a few documents. The particular ordering of documents can be determined through permutation. The number of topics is fixed a priori and can be optimized via a variety of methods. A second matrix, the word-topic matrix, can also be used, and is governed by a separate Multinomial-Dirichlet distribution. Inference is performed via Gibbs sampling. Each word in each document is randomly assigned to a topic. The words are then visited in order. For each word, consider which topics it fits with.
Fit can be determined by the product of two probabilities: topic given document and word given topic. Specifically, the probability of a topic, given a document, is the number of times that topic appears in that document plus the corresponding alpha, all over the total number of words in that document plus the sum of alphas for that document. The probability of a word, given a topic, is the number of times that word appears in that topic plus a corresponding beta parameter, all over the total number of words assigned to that topic plus the sum of betas for that topic. A new topic is then sampled by randomly choosing from among the topics in proportion to their probability.
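A sketch of this per-word Gibbs update follows. The count arrays, the shared scalar beta, and the assumption that the word's current assignment has already been removed from the counts are ours; the zero entries of alpha carry the triangular sparsity pattern, so barred topics keep zero probability.

```python
import numpy as np

# Sketch of one Gibbs update (illustrative names; assumes the word's
# current topic assignment was already decremented from the counts).
# ndt[d, t]: topic counts in document d; nwt[w, t]: counts of word w in
# topic t; nt[t]: total words assigned to topic t; alpha[d, t] holds the
# triangular zero/non-zero pattern; beta is a shared scalar.

rng = np.random.default_rng(0)

def sample_topic(d, w, ndt, nwt, nt, alpha, beta):
    n_words_in_doc = ndt[d].sum()
    p_topic_given_doc = (ndt[d] + alpha[d]) / (n_words_in_doc + alpha[d].sum())
    p_word_given_topic = (nwt[w] + beta) / (nt + beta * nwt.shape[0])
    p = p_topic_given_doc * p_word_given_topic  # "fit" for each topic
    p /= p.sum()
    return rng.choice(len(p), p=p)  # sample in proportion to probability

# Tiny demo: 3 documents, 2 topics, 4 vocabulary words.
alpha = np.array([[0.5, 0.0],   # second topic barred from document 0
                  [0.5, 0.5],
                  [0.5, 0.5]])
ndt = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 4.0]])
nwt = np.array([[2.0, 1.0], [1.0, 2.0], [1.0, 1.0], [1.0, 1.0]])
nt = nwt.sum(axis=0)
print(sample_topic(d=1, w=0, ndt=ndt, nwt=nwt, nt=nt, alpha=alpha, beta=0.1))
```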
[0030] Another example is as follows. For a given corpus with fixed number of documents and words: Sample alpha and beta values from an exponential distribution, then sample corresponding theta values for the topic given document distribution and phi values for the word given topic distributions. Generate a large collection of such sets of parameters associated with the details of the corpus of interest. Construct a Markov chain by
incrementally accepting or rejecting each set of parameters in this collection using the Metropolis algorithm computed on a Cooperative Index, which measures the probability of successfully communicating about a specific document using words via the latent topics. The Cooperative Index is computed as follows. Construct a word-document matrix by
multiplying the topic-document probabilities with the corresponding word-topic probabilities. Then, iteratively normalize the rows to sum to one, followed by the columns, repeating until the values in the matrix change less than a predetermined value. The Cooperative Index is then computed by considering, for each document, the probability of selecting a word, and the probability of inferring that document given that word, which are given by the row- and column-normalized versions of the matrix. The Cooperative Index is computed by marginalizing over words and documents. The Metropolis algorithm ensures that parameterizations that tend to yield a high Cooperative Index are those that are accepted. A search may then be guided by the best solution, or an average, where the weight of each parameterization can be computed based on the probability of the corpus given the parameters. Several variants are possible. The word-document matrix may be computed with small sets of words, such as pairs or triplets, to increase the efficacy of search. The Cooperative Index may be computed based on only the top N words, or another method of selecting good words, such as introducing a greediness parameter to exponentiate the probabilities, instead of simply marginalizing over all words. Efficient implementations will make use of linear algebra to compute the marginalizations.
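A compact sketch of that computation follows; the convergence tolerance, iteration cap, and uniform weighting over documents are our assumptions.

```python
import numpy as np

# Sketch of the Cooperative Index described above. phi: word-given-topic
# probabilities (words x topics); theta: topic-given-document
# probabilities (topics x documents). Tolerances are illustrative.

def cooperative_index(phi, theta, tol=1e-8, max_iters=1000):
    m = phi @ theta  # word-document matrix
    for _ in range(max_iters):
        prev = m.copy()
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to one...
        m = m / m.sum(axis=0, keepdims=True)  # ...then columns
        if np.abs(m - prev).max() < tol:
            break
    p_word_given_doc = m / m.sum(axis=0, keepdims=True)  # selecting a word
    p_doc_given_word = m / m.sum(axis=1, keepdims=True)  # inferring the doc
    # Marginalize over words and documents (documents weighted uniformly).
    return (p_word_given_doc * p_doc_given_word).sum() / m.shape[1]

phi = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])  # 3 words x 2 topics
theta = np.array([[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]])  # 2 topics x 3 docs
print(cooperative_index(phi, theta))
```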
[0031] The following is an example of searching a concept-data matrix for selecting concept identifiers to use in a guided search. Given a number of concepts, N, the following can be used to select a set of concept identifiers. For the sake of the example, data is described as being in the form of documents. The first concept, n1, should be selected such that it is present in 1/N of the documents. Because the concept-document matrix indicates whether a concept is present in a document, such a concept can be identified by summing over the documents and dividing by the total number of documents. This yields the proportion of documents in which a concept is present. If there are many concepts satisfying this constraint, concept n1 can be randomly chosen amongst those concepts having a proportion of roughly 1/N. Update the concept-document matrix by removing all documents in which concept n1 is present. Compute the sum over documents as before. Select a concept that is present in roughly 1/(N-1) of the documents. Random selection may be used to choose amongst candidates that satisfy the constraint. Repeat this process. For the last concept nN, select the concept to be present in all remaining documents. In situations where concept nN might not be present in all documents, N may be treated as a lower bound. That is, if, when one gets to the last concept nN, there is not a single concept remaining, one could, for example, select the concept that covers the most remaining documents, and then select an additional concept, n(N+1).
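The selection loop just described can be sketched as follows; the 0.1 tolerance around each target proportion and the fallback to the best-covering concept are our assumptions.

```python
import numpy as np

# Sketch of the concept-selection procedure above: pick a concept in
# about 1/N of the documents, drop the documents it covers, repeat with
# 1/(N-1), ..., ending with a concept covering all remaining documents.

rng = np.random.default_rng(0)

def select_concepts(occurrence, n, tol=0.1):
    """occurrence: binary concepts x documents matrix; returns row indices."""
    remaining = np.arange(occurrence.shape[1])
    chosen = []
    for step in range(n, 0, -1):
        sub = occurrence[:, remaining]
        props = sub.sum(axis=1) / len(remaining)
        candidates = np.flatnonzero(np.abs(props - 1.0 / step) < tol)
        # Fall back to the best-covering concept if nothing fits the target.
        pick = int(rng.choice(candidates)) if candidates.size else int(props.argmax())
        chosen.append(pick)
        remaining = remaining[sub[pick] == 0]  # drop covered documents
        if remaining.size == 0:
            break
    return chosen

occurrence = np.array([[1, 1, 0, 0, 0, 0],
                       [0, 0, 1, 1, 0, 0],
                       [0, 0, 1, 1, 1, 1],
                       [1, 1, 1, 1, 1, 1]], dtype=np.int8)
print(select_concepts(occurrence, n=3))  # e.g. [0, 1, 2]
```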
[0032] The concept identifier selection process is highly efficient, relying only on computing sums of the rows of the concept-document matrix. Even the most "naive" implementation has a complexity of O(CDN) if summing over all D documents for each of the C rows, for N concepts. Depending on computational complexity requirements, the process can be made even more efficient. For example, the most expensive costs are generally C and D, the numbers of concepts (e.g., words) and documents. C can be greatly reduced by computing only enough row sums to obtain the concepts needed. The expected number of row sums is far fewer than C, though the exact number depends on the details of the concept-document matrix itself. One could also optimize or randomize the trade-off between computational complexity and the number of options N.
[0033] While one version of the process can be executed at search time, the process could also, or alternatively, use pre-computation or caching to yield a much more efficient approach. For example, the first set of sums is the same for every user and would only need to be computed once. Similarly, given that plausible values of N are in the range of 2 to some small integer (a manageable number of options for a user to look through and compare), there are only so many possible sets of options a user may see and, therefore, these could also be pre-computed.
[0034] There are also ways to adapt the process to be more effective based on the structure of the concept-document matrix and based on user behavior. To exploit the structure of the concept-document matrix, methods such as Monte Carlo tree search could be used to find sets of concepts that tend to arrive at desirable sets of documents most quickly. Similarly, user satisfaction ratings (satisfaction with the documents found) can be collected at the end of searches and used to infer which concepts and sets of concepts are most effective.
[0035] When the user is provided the set of concepts as search options, the user may be given additional information and additional options. For example, each concept has an associated number of documents, which may be presented to the user to provide a sense of how many documents remain. Each concept can also be associated with a small set of representative documents, which the user may browse in order to understand the remaining set. The user may either select one of the presented concepts to further narrow the dataset, or choose to terminate the search, in which case the remaining set of documents may be provided to the user. Additional information, such as the presentation of multiple concepts (in some cases) and estimates of how much each concept will narrow the set of documents, can also be provided, along with various options related to the display of the information.
[0036] FIG. 6 is a flow chart illustrating a method 600 of guiding a search of a set of documents, according to an example embodiment. The example method 600 includes determining 605 occurrences of words in the set of documents. Based on the occurrences of words, a first word associated with a first portion of the documents is determined 610, and a second word associated with a second portion of the documents is determined 615. The second portion of the documents is a portion of remaining documents if the first portion of the documents were removed from the set of documents. At least a third word is determined 620, based on the occurrences of words, that is associated with a third portion of the documents. The third portion of the documents includes documents remaining if the first and second portions of the documents were removed from the set of documents. The first, second, and third words have varying levels of abstraction with respect to the set of documents, where the second word is broader than the first word, and the third word is broader than the first and second words. The first, second, and third words are presented 625 as options for narrowing the set of documents.
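As an illustration only, a sketch of the interactive loop underlying method 600 might look as follows, reusing the select_concepts helper sketched earlier; the function names, the default of three options, and the use of a callable for user input are all assumptions for this sketch.

```python
import numpy as np

def guided_search(M, words, n_options=3, choose=input):
    """Interactive narrowing loop corresponding to the flow of FIG. 6.

    M: binary word-document matrix (words x documents); words: row labels.
    `choose` is any callable returning the user's selection ('' to stop).
    """
    docs = np.arange(M.shape[1])                 # indices of remaining documents
    while len(docs) > 1:
        rows = select_concepts(M[:, docs], n_options)
        labels = [words[r] for r in rows]
        pick = choose(f"Narrow by one of {labels} (blank to stop): ").strip()
        if pick not in labels:
            break                                # user terminates the search
        r = rows[labels.index(pick)]
        docs = docs[M[r, docs] == 1]             # keep documents containing the pick
    return docs                                  # remaining documents to present
```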
[0037] FIG. 7 is a flow chart illustrating a method 700 of guiding a search of a set of documents, according to an example embodiment. The example method 700 includes determining 705 occurrences of concepts in the set of documents. Based on the occurrences of concepts, a first concept identifier associated with a first portion of the documents is determined 710. Second through N-1 concept identifiers are then determined 715 that are associated with second through N-1 portions of the documents, respectively. Each of the second through N-1 portions of the documents is a portion of remaining documents if the preceding portions of the documents were removed from the set of documents. An Nth concept identifier is then determined 720 that is associated with an Nth portion of the documents. The Nth portion of the documents includes documents remaining if the first through N-1 portions of the documents were removed from the set of documents. The first through Nth concept identifiers have varying levels of abstraction with respect to the set of documents. For a given concept identifier M, the concept identifier M is broader than concept identifier M-1. The first through Nth concept identifiers are then presented 725 as options for narrowing the set of documents.
[0038] FIG. 8 illustrates a computer network or similar digital processing environment in which the present embodiments may be implemented. Client computer(s)/devices 50 and server computer(s) 60 provide processing, storage, and input/output devices executing application programs and the like. Client computer(s)/devices 50 can also be linked through communications network 70 to other computing devices, including other client devices/processes 50 and server computer(s) 60. Communications network 70 can be part of a remote access network, a global network (e.g., the Internet), cloud computing servers or service, a worldwide collection of computers, local area or wide area networks, and gateways that currently use respective protocols (TCP/IP, Bluetooth, etc.) to communicate with one another. Other electronic device/computer network architectures are suitable.
[0039] FIG. 9 is a diagram of the internal structure of a computer (e.g., client processor/device 50 or server computers 60) in the computer system of FIG. 8. Each computer 50, 60 contains a system bus 79, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. Bus 79 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, and network ports) and enables the transfer of information between the elements. Attached to system bus 79 is I/O device interface 82 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, and speakers) to the computer 50, 60. Network interface 86 allows the computer to connect to various other devices attached to a network (e.g., network 70 of FIG. 8). Memory 90 provides volatile storage for computer software instructions 92 and data 94 used to implement many embodiments (e.g., code detailed above and in FIGS. 1-7, including example routines 100, 200, 600, and 700 and example data structure 300). Disk storage 95 provides non-volatile storage for computer software instructions 92 and data 94 used to implement many embodiments. Central processor unit 84 is also attached to system bus 79 and provides for the execution of computer instructions.
[0040] In the context of FIG. 9, the computer 50, 60 can include a system for providing guided searching of a set of documents. Components of the system include an interface 82, a datastore 90, 95, and a processor 84. The interface 82 can be configured to present information to a user of the system and to accept input from the user. The datastore 90, 95 can store (i) the set of documents or links to the documents and (ii) a data structure representing occurrences of words in the set of documents. The processor 84 can be configured to determine, based on the occurrences of words, at least three words. A first of the words is associated with a first portion of the documents. A second of the words is associated with a second portion of the documents. The second portion of the documents is a portion of remaining documents if the first portion of the documents were removed from the set of documents. A third of the words is associated with a third portion of the documents. The third portion of the documents includes documents remaining if at least the first and second portions of the documents were removed from the set of documents. The first, second, and third words have varying levels of abstraction with respect to the set of documents, where the second word is broader than the first word, and the third word is broader than the first and second words. The processor 84 can be further configured to cause the interface 82 to present the at least three words as options for narrowing the set of documents.
[0041] In one embodiment, the processor routines 92 and data 94 are a computer program product (generally referenced 92), including a computer-readable medium (e.g., a removable storage medium such as one or more DVD-ROMs, CD-ROMs, diskettes, and tapes) that provides at least a portion of the software instructions for the system. Computer program product 92 can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication, and/or wireless connection. In other embodiments, the programs are a computer program propagated signal product 75 (FIG. 8) embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)). Such carrier medium or signals provide at least a portion of the software instructions for the routines/program 92.
[0042] In alternate embodiments, the propagated signal is an analog carrier wave or digital signal carried on the propagated medium. For example, the propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network. In one embodiment, the propagated signal is a signal that is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer. In another embodiment, the computer-readable medium of computer program product 92 is a propagation medium that the computer system 50 may receive and read, such as by receiving the propagation medium and identifying a propagated signal embodied in the propagation medium, as described above for the computer program propagated signal product. Generally speaking, the term "carrier medium" or transient carrier encompasses the foregoing transient signals, propagated signals, propagated medium, storage medium, and the like. In other embodiments, the program product 92 may be implemented as Software as a Service (SaaS), or via another installation or communication mechanism supporting end-users.
[0043] While the above description provides guided search of a set of documents as a specific example, the disclosed methods and systems can be applied to any discrete or discretizable domain, such as, for example, images, video, medical data, financial data, military intelligence, and information on the internet. In these cases, the target items being searched through can be represented in a data structure as the columns of a matrix, and the rows can be any partial or complete representation of those target items, or vice versa. In the case where the rows are exactly the columns, the result is a diagonal matrix that forms a more trivial search problem of looking through all documents one by one. When the rows are representations of items that span documents, more interesting search tasks are obtained. When the columns are documents and the rows are individual words, we obtain one such case: each word tends to appear in multiple documents, and the problem is then to choose words that facilitate search given the pattern of occurrence of words in documents. Assuming the columns represent different documents, rows that abstract away from specific words to words of similar meanings (e.g., dog, puppy, doggie, German shepherd) can improve performance by relaxing the requirement that a specific word be in a document to require only that a concept (or topic) appear in a document. A machine learning algorithm that infers a discrete latent representation of the corpus can, therefore, be used to guide a search.
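A sketch of this representation follows; the function names and the whitespace tokenizer are assumptions, and `concept_map` is a hypothetical stand-in for the output of whatever machine learning algorithm groups words into concepts.

```python
import numpy as np

def build_matrix(documents, concept_map=None):
    """Binary occurrence matrix: rows are words (or concepts), columns are documents."""
    token_sets = [set(text.lower().split()) for text in documents]
    if concept_map is not None:
        # Abstract specific words to concept labels, e.g. dog/puppy -> "canine".
        token_sets = [{concept_map.get(w, w) for w in toks} for toks in token_sets]
    vocab = sorted(set().union(*token_sets))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(documents)), dtype=np.int8)
    for j, toks in enumerate(token_sets):
        for w in toks:
            M[index[w], j] = 1                 # word/concept w occurs in document j
    return M, vocab

# Hypothetical synonym grouping, as in the dog/puppy/doggie example above.
synonyms = {"dog": "canine", "puppy": "canine", "doggie": "canine"}
M, vocab = build_matrix(["my dog barks", "a puppy sleeps", "tax law updates"],
                        concept_map=synonyms)
```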
[0044] The disclosed methods and systems can be applied to any domain where the rows and columns are discrete or discretizable. For example, consider searching through a database for entries that satisfy general criteria. In this case, each entry can be a column and, in the simplest case, the rows can be the specific dimensions and relations that are the union of all values for all entries (all values that appear in the dataset), together with their feature or relation. For example, "height" = 69 inches could be one row in a table that contains information from a database about people. Just as concepts inferred by a machine learning algorithm can be substituted for words in documents, less specific characterizations can be substituted here (e.g., "height" is between 60 and 72 inches, or "height" is between 60 and 72 inches and "weight" is less than 200 pounds). Machine learning algorithms can be used to infer these concepts such that they provide a good characterization of the domain to be searched. Innumerable examples of candidate structures on the rows are possible, spanning any combination of images, video, words, speech, or numeric or categorical variables at a given time or across time points, or any concept or function defined or inferred. Similarly, the columns can be any aggregation of these into data structures. For example, in a corpus, a standard document is a conjunction of words, which makes the documents-words format natural for the columns and rows. Similarly, in a data table, a standard row is a conjunction of feature-value pairs. For richer datasets, this general idea is naturally extensible: each target to be searched over can be viewed as a conjunction of the specific elements that appear in it, and the corpus is the union of those. (This fact has a name in machine learning: the empirical distribution.) The unique specific elements can be enumerated to form the rows of the matrix over which a search may operate. Machine learning algorithms can be used to create abstractions that facilitate search. In this sense, the machine learning algorithms infer a latent structure of the domain, which can be used as input to the disclosed algorithm to guide search over the domain. Because even continuous (a.k.a. numeric) distributions are approximated by their empirical distributions, these also qualify as discretizable.
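For example, a database of people could be discretized as follows. This is a sketch only; the predicate labels and the helper name are illustrative assumptions, echoing the height/weight example in the text.

```python
import numpy as np

def discretize_table(entries, predicates):
    """Binary predicate-entry matrix: rows are predicates, columns are entries."""
    labels = list(predicates)
    M = np.array([[1 if predicates[p](e) else 0 for e in entries] for p in labels],
                 dtype=np.int8)
    return M, labels

people = [{"height": 69, "weight": 180}, {"height": 75, "weight": 210}]
concepts = {
    "height=69 inches":              lambda e: e["height"] == 69,
    "60<=height<=72":                lambda e: 60 <= e["height"] <= 72,
    "60<=height<=72 and weight<200": lambda e: (60 <= e["height"] <= 72
                                                and e["weight"] < 200),
}
M, labels = discretize_table(people, concepts)  # M is ready for guided search
```

The resulting matrix has the same shape as the concept-document matrix above, so the same selection and narrowing procedures apply unchanged.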
[0045] As described above, N can be a number representing how many options are offered to the user at each step. Given the values of N, there is an optimal structure for the domain, which ensures that the user has to make the minimal number of choices to find any target. This is a unique representation, and it can be used to constrain any machine learning method used to define concepts that summarize the domain, ensuring optimal performance in guided search.
[0046] Thus far, the disclosed methods and systems have been described in the context of search, but the disclosed approach also applies to guided learning. Learning differs from search in that a user wishes to learn about a subset of a domain, rather than simply find a specific item. Whereas in search the goal is often for the user to recover a single document, the goal in learning is to cover a representative subset of the dataset. The approach described above provides a framework for doing so. Whereas in search the user is presented with a set of options among which to choose, in learning the goal is for the learner to explore each of these items. Rather than looking through the data directly, the same approach used in guiding search can be used to define a collection of paths through the data, where the user is presented with a collection of concepts and targets. Intuitively, for a simpler domain such as animals, one path may be mammal-dog-Labrador, with a target that instantiates those concepts (e.g., a document about Labradors), and guidance in learning about the subdomain of mammals can consist of multiple different examples that cover the breadth of the concept. In general, the problem is much harder because the user may not know the structure of the domain, and machine learning algorithms can be used to automatically infer this structure from large amounts of data.
[0047] Guided learning has a variety of applications, including facilitating human learning in educational contexts. Another application is explainable Artificial Intelligence (AI). There, the goal is to provide a human-understandable explanation of what an AI system has learned or why it made a decision. Guided learning can be used to reduce the large dataset used to train an AI system to a manageable subset that is representative of the full dataset, which can provide users with an understanding of the domain, as well as of the capabilities of the AI system itself. For a specific decision, the explanation can take the form of guided learning about cases in the dataset that are similar (as defined by the system). This guided approach can be used to explain machine learning algorithms, including deep learning. In deep learning, latent concepts are not human-interpretable, so a key element in explaining these models is using specific cases in the dataset to guide learning about the layers of the network. Note that the optimal structure for guided search also applies to learning, and this can be used to construct machine learning (including deep learning) systems that are optimized for guided search and guided learning.
[0048] While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.

CLAIMS

What is claimed is:
1. A method of guiding a search of a set of documents, the method comprising:
determining occurrences of words in the set of documents;
determining, based on the occurrences of words, a first word associated with a first portion of the documents;
determining, based on the occurrences of words, a second word associated with a second portion of the documents, the second portion of the documents being a portion of remaining documents if the first portion of the documents were removed from the set of documents;
determining, based on the occurrences of words, a third word associated with a third portion of the documents, the third portion of the documents being documents remaining if the first and second portions of the documents were removed from the set of documents;
the first, second, and third words having varying levels of abstraction with respect to the set of documents, the second word being broader than the first word, and the third word being broader than the first and second words; and
presenting the first, second, and third words as options for narrowing the set of documents.
2. A method as in claim 1 further comprising:
in response to a selection of one of the words, returning documents associated with the selected word.
3. A method as in claim 1 further comprising:
in response to a selection of one of the words, determining a subsequent set of words to present as options for further narrowing documents associated with the selected word, the subsequent set of words being determined based on the occurrences of words with respect to the documents associated with the selected word.
4. A method as in claim 1 further comprising:
for each of the first, second, and third words, determining a corresponding subsequent set of words to present as options for further narrowing documents associated with the word, the subsequent set of words being determined based on the occurrences of words with respect to the documents associated with the word.
5. A method as in claim 1 wherein determining the occurrences of words in the set of documents includes generating a matrix of document identifiers and words, a value in the matrix for a given document identifier and word being set to a certain value if the word is present in the document identified by the document identifier.
6. A method as in claim 1 wherein the first, second, and third portions of the documents include a similar number of documents.
7. A method as in claim 1 wherein presenting the first, second, and third words includes presenting the number of documents associated with each of the first, second, and third words.
8. A method of guiding a search of a set of documents, the method comprising:
determining occurrences of concepts in the set of documents; determining, based on the occurrences of concepts, a first concept identifier associated with a first portion of the documents;
determining, based on the occurrences of concepts, second through N-1 concept identifiers associated with second through N-1 portions of the documents, respectively, each of the second through N-1 portions of the documents being a portion of remaining documents if the preceding portions of the documents were removed from the set of documents;
determining, based on the occurrences of concepts, an Nth concept identifier associated with an Nth portion of the documents, the Nth portion of the documents being documents remaining if the first through N-1 portions of the documents were removed from the set of documents; the first through Nth concept identifiers having varying levels of abstraction with respect to the set of documents;
for a given concept identifier M, the concept identifier M being broader than concept identifier M-1; and
presenting the first through Nth concept identifiers as options for narrowing the set of documents.
9. A method as in claim 8 wherein determining the occurrences of concepts in the set of documents includes determining occurrences of words in the set of documents, and wherein determining the concept identifiers includes determining words.
10. A method as in claim 8 further including:
in response to a selection of one of the concept identifiers, returning documents associated with the selected concept identifier.
11. A method as in claim 8 further including:
in response to a selection of one of the concept identifiers, determining a subsequent set of concept identifiers to present as options for further narrowing the documents associated with the selected concept identifier, the subsequent set of concept identifiers being determined based on the occurrences of concepts with respect to the documents associated with the selected concept identifier.
12. A method as in claim 8 further including:
for each of the concept identifiers, determining a corresponding subsequent set of concept identifiers to present as options for further narrowing documents associated with the concept identifier, the subsequent set of concept identifiers being determined based on the occurrences of concepts with respect to the documents associated with the concept identifier.
13. A method as in claim 8 wherein determining the occurrences of concepts in the set of documents includes generating a matrix of document identifiers and concept identifiers, a value in the matrix for a given document identifier and concept identifier being set to a certain value if the concept is present in the document identified by the document identifier.
14. A system for providing guided searching of a set of documents, the system
comprising:
an interface configured to present information to a user of the system and to accept input from the user;
a datastore storing (i) the set of documents or links to the documents and (ii) a data structure representing occurrences of words in the set of documents; and
a processor in communication with the datastore and the interface and configured to determine, based on the occurrences of words, at least three words, wherein (i) a first of the words is associated with a first portion of the documents, (ii) a second of the words is associated with a second portion of the documents, the second portion of the documents being a portion of remaining documents if the first portion of the documents were removed from the set of documents, (iii) a third of the words is associated with a third portion of the documents, the third portion of the documents being documents remaining if at least the first and second portions of the documents were removed from the set of documents, and (iv) the first, second, and third words having varying levels of abstraction with respect to the set of documents, the second word being broader than the first word, and the third word being broader than the first and second words;
the processor further configured to cause the interface to present the at least three words as options for narrowing the set of documents.
15. A system as in claim 14 wherein the processor is configured to, in response to the user selecting one of the words, cause the interface to provide the user with documents, or representations of documents, associated with the selected word.
16. A system as in claim 14 wherein the processor is configured to, in response to the user selecting one of the words, determine a subsequent set of words to present as options for further narrowing documents associated with the selected word, the subsequent set of words being determined based on the occurrences of words with respect to the documents associated with the selected word.
17. A system as in claim 14 wherein the processor is configured to, for each of the at least three words, determine a corresponding subsequent set of words to present as options for further narrowing documents associated with the word, the subsequent set of words being determined based on the occurrences of words with respect to the documents associated with the word.
18. A system as in claim 14 wherein the data structure representing the occurrences of words in the set of documents includes a matrix of document identifiers and words, wherein a value in the matrix for a given document identifier and word is set to a certain value if the word is present in the document identified by the document identifier.
19. A system as in claim 14 wherein the portions of the documents associated with the at least three words include a similar number of documents.
20. A system as in claim 14 wherein the processor is configured to cause the interface to present the number of documents associated with each of the first, second, and third words.
21. A method of guiding a search of a set of documents, the method comprising:
partitioning the set of documents into a plurality of portions based on occurrences of words or concepts in the set of documents;
determining, based on the occurrences of words or concepts, identifiers of words or concepts that are associated with each of the portions;
the identifiers having varying levels of abstraction with respect to the set of documents, each identifier being broader than a preceding identifier; and
presenting the identifiers as options for narrowing the set of documents.
22. A method of guiding a search of a dataset, the method comprising: partitioning the dataset into a plurality of portions based on occurrences of words or concepts in the dataset;
determining, based on the occurrences of words or concepts, identifiers of words or concepts that are associated with each of the portions;
the identifiers having varying levels of abstraction with respect to the dataset, each identifier being broader than a preceding identifier; and
presenting the identifiers as options for narrowing the dataset.