US10394851B2 - Methods and systems for mapping data items to sparse distributed representations - Google Patents

Methods and systems for mapping data items to sparse distributed representations

Info

Publication number
US10394851B2
Authority
US
United States
Prior art keywords
data
sdr
documents
term
computing device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US14/816,133
Other languages
English (en)
Other versions
US20160042053A1 (en)
Inventor
Francisco Eduardo De Sousa Webber
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cortical IO AG
Original Assignee
Cortical IO AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cortical IO AG
Priority to US14/816,133 (patent US10394851B2)
Assigned to CORTICAL.IO GMBH: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: DE SOUSA WEBBER, Francisco Eduardo
Publication of US20160042053A1
Assigned to CORTICAL.IO AG: CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: CORTICAL.IO GMBH
Priority to US16/510,108 (patent US20190332619A1)
Application granted
Publication of US10394851B2
Current legal status: Active
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • the disclosure relates to generating cross-lingual sparse distributed representations. More particularly, the methods and systems described herein relate to functionality for generating mappings of terms to sparse distributed representations (SDRs) and associating translations of the terms with the generated mappings.
  • a computer-implemented method of mapping data items to sparse distributed representations includes clustering in a two-dimensional metric space, by a reference map generator, executing on a computing device, a set of data documents selected according to at least one criterion.
  • the method includes associating, by the reference map generator, a coordinate pair with each of the set of data documents.
  • the method includes generating, by a parser executing on the computing device, an enumeration of data items occurring in the set of data documents.
  • the method includes determining, by a representation generator executing on the computing device, for each data item in the enumeration, occurrence information including: (i) a number of data documents in which the data item occurs, (ii) a number of occurrences of the data item in each data document, and (iii) the coordinate pair associated with each data document in which the data item occurs.
  • the method includes generating, by the representation generator, a distributed representation using the occurrence information.
  • the method includes receiving, by a sparsifying module executing on the computing device, an identification of a maximum level of sparsity.
  • the method includes reducing, by the sparsifying module, a total number of set bits within the distributed representation based on the maximum level of sparsity to generate a sparse distributed representation having a normative fillgrade.
  • FIG. 1A is a block diagram depicting an embodiment of a system for mapping data items to sparse distributed representations
  • FIG. 1B is a block diagram depicting one embodiment of a system for generating a semantic map for use in mapping data items to sparse distributed representations;
  • FIG. 1C is a block diagram depicting one embodiment of a system for generating a sparse distributed representation for a data item in a set of data documents;
  • FIG. 2 is a flow diagram depicting an embodiment of a method for mapping data items to sparse distributed representations
  • FIG. 3 is a block diagram depicting one embodiment of a system for performing arithmetic operations on sparse distributed representations of data items generated using data documents clustered on semantic maps;
  • FIG. 4 is a flow diagram depicting one embodiment of a method for identifying a level of semantic similarity between data items
  • FIG. 5 is a flow diagram depicting one embodiment of a method for identifying a level of semantic similarity between a user-provided data item and a data item within a set of data documents;
  • FIG. 6A is a block diagram depicting one embodiment of a system for expanding a query provided for use with a full-text search system
  • FIG. 6B is a flow diagram depicting one embodiment of a method for expanding a query provided for use with a full-text search system
  • FIG. 6C is a flow diagram depicting one embodiment of a method for expanding a query provided for use with a full-text search system
  • FIG. 7A is a block diagram depicting one embodiment of a system for providing topic-based documents to a full-text search system
  • FIG. 7B is a flow diagram depicting one embodiment of a method for providing topic-based documents to a full-text search system
  • FIG. 8A is a block diagram depicting one embodiment of a system for providing keywords associated with documents to a full-text search system for improved indexing
  • FIG. 8B is a flow diagram depicting one embodiment of a method for providing keywords associated with documents to a full-text search system for improved indexing
  • FIG. 9A is a block diagram depicting one embodiment of a system for providing search functionality for text documents
  • FIG. 9B is a flow diagram depicting one embodiment of a method for providing search functionality for text documents
  • FIG. 10A is a block diagram depicting one embodiment of a system providing user expertise matching within a full-text search system
  • FIG. 10B is a block diagram depicting one embodiment of a system providing user expertise matching within a full-text search system
  • FIG. 10C is a flow diagram depicting one embodiment of a method for matching user expertise with requests for user expertise
  • FIG. 10D is a flow diagram depicting one embodiment of a method for user profile-based semantic ranking of query results received from a full-text search system
  • FIG. 11A is a block diagram depicting one embodiment of a system for providing medical diagnosis support
  • FIG. 11B is a flow diagram depicting one embodiment of a method for providing medical diagnosis support
  • FIGS. 12A-12C are block diagrams depicting embodiments of computers useful in connection with the methods and systems described herein;
  • FIG. 12D is a block diagram depicting one embodiment of a system in which a plurality of networks provide data hosting and delivery services;
  • FIG. 13 is a block diagram depicting one embodiment of a system for generating cross-lingual sparse distributed representations
  • FIG. 14A is a flow diagram depicting an embodiment of a method for determining similarities between cross-lingual sparse distributed representations.
  • FIG. 14B is a flow diagram depicting an embodiment of a method for determining similarities between cross-lingual sparse distributed representations.
  • the methods and systems described herein provide functionality for generating mappings of terms to sparse distributed representations (SDRs) and associating translations of the terms with the generated mappings.
  • the system 100 includes an engine 101 , a machine 102 a , a set of data documents 104 , a reference map generator 106 , a semantic map 108 , a parser and preprocessing module 110 , an enumeration of data items 112 , a representation generator 114 , a sparsifying module 116 , one or more sparse distributed representations (SDRs) 118 , a sparse distributed representation (SDR) database 120 , and a full-text search system 122 .
  • the engine 101 refers to all of the components and functionality described in connection with FIGS. 1A-1C and 2 .
  • the system includes a set of data documents 104 .
  • the documents in the set of data documents 104 include text data.
  • the documents in the set of data documents 104 include variable values of a physical system.
  • the documents in the set of data documents 104 include medical records of patients.
  • the documents in the set of data documents 104 include chemistry-based information (e.g., DNA sequences, protein sequences, and chemical formulas).
  • each document in the set of data documents 104 includes musical scores.
  • the data items within data documents 104 may be words, numeric values, medical analyses, medical measurements, and musical notes.
  • the data items may be strings of any type (e.g., a string including one or more numbers).
  • the data items in a first set of data documents 104 may be in a different language than the data items in a second set of data documents 104 .
  • the set of data documents 104 includes historic log data.
  • a “document” as used herein may refer to a collection of data items each of which corresponds to a system variable originating from the same system. In some embodiments, system variables in such a document are sampled concurrently.
  • the term “data item” as used herein encompasses words as string data, scalar values as numerical data, medical diagnoses and analyses as numeric data or class-data, musical notes, and variables of any type, all coming from the same “system.”
  • the “system” may be any physical system, natural or artificial, such as a river, a technical device, or a biological entity such as a living cell or a human organism.
  • the system may also be a “conceptual system” such as a language or web server log-data.
  • the language can be a natural language such as English or Chinese, or an artificial language such as JAVA or C++ program code.
  • the term “data document” encompasses a set of “data items.” These data items may be interdependent by the semantics of the underlying “system.”
  • This grouping can be a time-based group, if all data item values are sampled at the same moment; for example, measurement data items coming from the engine of a car can be sampled every second and grouped into a single data document.
  • This grouping can also be done along a logical structure characterized by the “system” itself, for example in natural language, word data items can be grouped as sentences, while in music, data items corresponding to notes can be grouped by measures.
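As a hedged illustration of the time-based grouping described above, the following sketch collects sampled readings into one data document per sampling moment. The function name `group_by_timestamp`, the `(timestamp, variable, value)` tuple layout, and the sample values are assumptions made for illustration; they do not come from the disclosure.

```python
from collections import defaultdict

def group_by_timestamp(samples):
    """Group (timestamp, variable, value) samples into one data document per timestamp."""
    documents = defaultdict(dict)
    for timestamp, variable, value in samples:
        documents[timestamp][variable] = value
    return dict(documents)

# Hypothetical engine readings sampled once per second.
samples = [
    (0, "rpm", 900), (0, "oil_temp_c", 71),
    (1, "rpm", 1200), (1, "oil_temp_c", 72),
]
docs = group_by_timestamp(samples)
```

Each value of `docs` then plays the role of one data document whose data items were sampled concurrently.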
  • document vectors can be generated by the above methods (or according to other methods as understood by those of ordinary skill in the art) in order to generate a semantic map of the “system,” as will be described in greater detail below.
  • using the semantic map, data item SDRs can be generated, as will be described in greater detail below. All of the methods and systems described below may be applied to all types of data item SDRs.
  • a user selects the set of data documents 104 according to at least one criterion. For example, the user may select data documents for inclusion in the set of data documents 104 based upon whether the data documents relate to a particular subject matter. As another example, the set of data documents 104 represents a semantic universe in which the system 100 will be used. In one embodiment, the user is a human user of the system 100 . In another embodiment, the machine 102 executes functionality for selecting data documents in the set of data documents 104 .
  • the system 100 includes a reference map generator 106 .
  • the reference map generator 106 is a self-organizing map.
  • the reference map generator 106 is a generative topographic map.
  • the reference map generator 106 is an elastic map.
  • the reference map generator 106 is a neural gas type map.
  • the reference map generator 106 is any type of competitive, learning-based, unsupervised, dimensionality-reducing, machine-learning method.
  • the reference map generator 106 is any computational method that can receive the set of data documents 104 and generate a two-dimensional metric space on which are clustered points representing the documents from the set of data documents 104 .
  • the reference map generator 106 is any computer program that accesses the set of data documents 104 to generate a two-dimensional metric space on which every clustered point represents a data document from the set of data documents 104 . Although typically described herein as populating a two-dimensional metric space, in some embodiments, the reference map generator 106 populates an n-dimensional metric space. In some embodiments, the reference map generator 106 is implemented in software. In other embodiments, the reference map generator 106 is implemented in hardware.
  • the two-dimensional metric space may be referred to as a semantic map 108 .
  • the semantic map 108 may be any vector space with an associated distance measure.
  • the parser and preprocessing module 110 generates the enumeration of data items 112 . In another embodiment, the parser and preprocessing module 110 forms part of the representation generator 114 . In some embodiments, the parser and preprocessing module 110 is implemented at least in part as a software program. In other embodiments, the parser and preprocessing module 110 is implemented at least in part as a hardware module. In still other embodiments, the parser and preprocessing module 110 executes on the machine 102 . In some embodiments, a parser and preprocessing module 110 may be specialized for a type of data. In other embodiments, a plurality of parser and preprocessing modules 110 may be provided for a type of data.
  • the representation generator 114 generates distributed representations of data items. In some embodiments, the representation generator 114 is implemented at least in part as a software program. In other embodiments, the representation generator 114 is implemented at least in part as a hardware module. In still other embodiments, the representation generator 114 executes on the machine 102 .
  • the sparsifying module 116 generates a sparse distributed representation (SDR) of a data item.
  • an SDR may be a large numeric (binary) vector.
  • an SDR may have several thousand elements.
  • each element in an SDR generated by the sparsifying module 116 has a specific semantic meaning.
  • vector elements with similar semantic meaning are closer to each other than semantically dissimilar vector elements, measured by the associated distance metric.
  • the representation generator 114 provides the functionality of the sparsifying module 116 . In another embodiment, the representation generator 114 is in communication with a separate sparsifying module 116 . In some embodiments, the sparsifying module 116 is implemented at least in part as a software program. In other embodiments, the sparsifying module 116 is implemented at least in part as a hardware module. In still other embodiments, the sparsifying module 116 executes on the machine 102 .
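Because an SDR is described as a large binary vector in which only a few elements are set, one convenient way to hold it in memory is as the set of indices of its set bits. The sketch below is illustrative only: the names `to_dense` and `overlap` are hypothetical, and counting shared set bits is an assumed, simple way to compare two SDRs, not a comparison method quoted from the disclosure.

```python
def to_dense(set_bits, length):
    """Expand a set of set-bit indices into a dense list of 0/1 values."""
    dense = [0] * length
    for index in set_bits:
        dense[index] = 1
    return dense

def overlap(sdr_a, sdr_b):
    """Count the positions at which both SDRs have a set bit."""
    return len(sdr_a & sdr_b)

# Two small hypothetical SDRs over a 16-element vector.
sdr_a = frozenset({1, 5, 9})
sdr_b = frozenset({5, 9, 14})
shared = overlap(sdr_a, sdr_b)
```

Since each element carries a specific meaning, a larger overlap can be read as the two data items sharing more semantic features.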
  • the sparse distributed representation (SDR) database 120 stores sparse distributed representations 118 generated by the representation generator 114 .
  • the sparse distributed representation database 120 stores SDRs and the data item the SDRs represent.
  • the SDR database 120 stores metadata associated with the SDRs.
  • the SDR database 120 includes an index for identifying an SDR 118 .
  • the SDR database 120 has an index for identifying data items semantically close to a particular SDR 118 .
  • the SDR database 120 may store, by way of example and without limitation, any one or more of the following: a reference number for a data item, the data item itself, an identification of a data item frequency for the data item in the set of data documents 104 , a simplified version of the data item, a compressed binary representation of an SDR 118 for the data item, one or several tags for the data item, an indication of whether the data item identifies a location (e.g., “Vienna”), and an indication of whether the data item identifies a person (e.g., “Einstein”).
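The fields listed above could be modeled as a single record type. The sketch below is one possible layout, not the schema of the SDR database 120: the class name `SdrRecord`, the field names, the bit-packing scheme, and the use of `zlib` for the compressed binary representation are all assumptions, and the sample values are invented.

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class SdrRecord:
    reference_number: int
    data_item: str
    frequency: int            # data item frequency in the set of data documents
    simplified_item: str
    compressed_sdr: bytes     # compressed binary representation of the SDR
    tags: list = field(default_factory=list)
    is_location: bool = False
    is_person: bool = False

def compress_bits(set_bits, length):
    """Pack set-bit indices into a bit array and compress it."""
    raw = bytearray((length + 7) // 8)
    for i in set_bits:
        raw[i // 8] |= 1 << (i % 8)
    return zlib.compress(bytes(raw))

def decompress_bits(blob):
    """Recover the set of set-bit indices from a compressed bit array."""
    raw = zlib.decompress(blob)
    return {i for i in range(len(raw) * 8) if (raw[i // 8] >> (i % 8)) & 1}

record = SdrRecord(
    reference_number=1,
    data_item="Vienna",
    frequency=42,             # hypothetical value
    simplified_item="vienna",
    compressed_sdr=compress_bits({3, 17, 16383}, 16384),
    tags=["city"],
    is_location=True,
)
```

Sparse bit arrays are mostly zeros, so a general-purpose compressor shrinks them considerably; a production system might prefer storing the indices directly.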
  • the sparse distributed representation database 120 may be any type or form of database.
  • Examples of an SDR database 120 include, without limitation, structured storage (e.g., NoSQL-type databases and BigTable databases), HBase databases distributed by The Apache Software Foundation of Forest Hill, Md., MongoDB databases distributed by 10gen, Inc. of New York, N.Y., Cassandra databases distributed by The Apache Software Foundation, and document-based databases.
  • the SDR database 120 is an ODBC-compliant database.
  • the SDR database 120 may be provided as an ORACLE database manufactured by Oracle Corporation of Redwood City, Calif.
  • the SDR database 120 can be a Microsoft ACCESS database or a Microsoft SQL server database manufactured by Microsoft Corporation of Redmond, Wash.
  • the SDR database 120 may be a custom-designed database based on an open source database, such as the MYSQL family of freely available database products distributed by Oracle Corporation.
  • the method 200 includes clustering, by a reference map generator executing on a computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 202 ).
  • the method 200 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 204 ).
  • the method 200 includes generating, by a parser executing on the computing device, an enumeration of data items occurring in the set of data documents ( 206 ).
  • the method 200 includes determining, by a representation generator executing on the computing device, for each data item in the enumeration, occurrence information including: (i) a number of data documents in which the data item occurs, (ii) a number of occurrences of the data item in each data document, and (iii) the coordinate pair associated with each data document in which the data item occurs ( 208 ).
  • the method 200 includes generating, by the representation generator, a distributed representation using the occurrence information ( 210 ).
  • the method 200 includes receiving, by a sparsifying module executing on the computing device, an identification of a maximum level of sparsity ( 212 ).
  • the method 200 includes reducing, by the sparsifying module, a total number of set bits within the distributed representation based on the maximum level of sparsity to generate a sparse distributed representation having a normative fillgrade ( 214 ).
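Steps 208 through 214 can be sketched roughly as follows, assuming that documents have already been assigned coordinate pairs. This is a minimal sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the distributed representation is assumed to be a per-coordinate sum of occurrence counts, and sparsification is assumed to keep the highest-weighted coordinates up to the maximum fill.

```python
from collections import Counter, defaultdict

def occurrence_info(documents):
    """documents: list of (coordinate_pair, list_of_data_items).

    Returns, per data item, the number of documents containing it and a
    list of (coordinate_pair, occurrences_in_that_document) entries.
    """
    info = defaultdict(lambda: {"doc_count": 0, "per_doc": []})
    for coord, items in documents:
        for item, count in Counter(items).items():
            info[item]["doc_count"] += 1
            info[item]["per_doc"].append((coord, count))
    return info

def distributed_representation(entry):
    """Accumulate occurrence counts at each semantic-map coordinate."""
    weights = Counter()
    for coord, count in entry["per_doc"]:
        weights[coord] += count
    return weights

def sparsify(weights, max_fill, side):
    """Keep at most max_fill * side * side coordinates, highest weight first."""
    max_bits = int(max_fill * side * side)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return {coord for coord, _ in ranked[:max_bits]}

# Toy 4x4 map; several documents may share one coordinate pair.
documents = [
    ((0, 0), ["cat", "cat", "dog"]),
    ((1, 0), ["cat"]),
    ((3, 3), ["cat", "cat", "cat"]),
    ((0, 0), ["dog"]),
]
info = occurrence_info(documents)
cat_sdr = sparsify(distributed_representation(info["cat"]), max_fill=0.125, side=4)
```

With a fill of 0.125 on a 4x4 map, at most two bits survive, so the weakest coordinate for "cat" is dropped.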
  • the method 200 includes clustering, by a reference map generator executing on a computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 202 ).
  • the at least one criterion indicates that data items in the set of data documents 104 appear a threshold number of times.
  • the at least one criterion indicates that each data document in the set of data documents 104 should include descriptive information about the state of the system it was derived from.
  • the at least one criterion indicates that each data document should express a conceptual topic (e.g., an encyclopedic description). In another embodiment, the at least one criterion indicates that a list of characteristics of the set of data documents 104 should evenly fill out a desired information space. In another embodiment, the at least one criterion indicates that the set of data documents 104 originates from the same system. In the case of data documents, the at least one criterion indicates that the data documents are all in the same language. In still another embodiment, the at least one criterion indicates that the set of data documents 104 be in a natural (e.g., human) language.
  • the at least one criterion indicates that the set of data documents 104 be in a computer language (e.g., computer code of any type). In another embodiment, the at least one criterion indicates that the set of data documents 104 may include any type or form of jargon or other institutional rhetoric (e.g., medicine, law, science, automotive, military, etc.). In another embodiment, the at least one criterion indicates that the set of data documents 104 should have a threshold number of documents in the set.
  • a human user selects the set of data documents 104 and the machine 102 receives the selected set of data documents 104 from the human user (e.g., via a user interface to a repository, directory, document database, or other data structure storing one or more data documents, not shown).
  • the machine 102 preprocesses the set of data documents 104 .
  • the parser and preprocessing module 110 provides the preprocessing functionality for the machine 102 .
  • the machine 102 segments each of the set of data documents 104 into terms and sentences, standardizes punctuation, and eliminates or converts undesired characters.
  • the machine 102 executes a tagging module (not shown) to associate one or more meta-information tags to any data item or portion of a data item in the set of data documents 104 .
  • the machine 102 normalizes the text size of a basic conceptual unit, slicing each of the set of data documents 104 into equally sized text snippets.
  • the machine 102 may apply one or more constraints when slicing the set of data documents 104 into the snippets.
  • the constraints may indicate that documents in the set of data documents 104 should only contain complete sentences, should contain a fixed number of sentences, should have a limited data item count, should have a minimum number of distinct nouns per document, and that the slicing process should respect the natural paragraphs originating from a document author.
  • the application of constraints is optional.
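The slicing step with two of the listed constraints (complete sentences only, a fixed number of sentences per snippet) might look like the sketch below. The function name `slice_into_snippets`, the regular-expression sentence splitter, and the choice to drop a trailing incomplete group are assumptions for illustration; real sentence segmentation is considerably harder than splitting on punctuation.

```python
import re

def slice_into_snippets(text, sentences_per_snippet=2):
    """Split text into snippets that each hold a fixed number of complete sentences."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    snippets = []
    last_start = len(sentences) - sentences_per_snippet
    for start in range(0, last_start + 1, sentences_per_snippet):
        snippets.append(" ".join(sentences[start:start + sentences_per_snippet]))
    return snippets  # a trailing group with too few sentences is dropped

text = "Rivers carry water. Fish live in rivers. Cars have engines. Engines burn fuel. Roads connect towns."
snippets = slice_into_snippets(text, sentences_per_snippet=2)
```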
  • the system 100 provides functionality for identifying the most relevant data items, from a semantic perspective, of each document in a set of data documents 104 .
  • the parser and preprocessing module 110 provides this functionality.
  • the reference map generator 106 receives one or more document vectors and generates the semantic map 108 using the received one or more document vectors.
  • the system 100 may be configured to identify and select nouns (e.g., identifying based on a part-of-speech tag assigned to each data item in a document during preprocessing).
  • selected nouns may be stemmed to aggregate all morphologic variants behind one main data item instance (e.g., plurals and case variations).
  • a term frequency-inverse document frequency (“tf-idf”) statistic is calculated for selected nouns, reflecting how important a data item is to a data document given the specific set of data documents 104 ; a coefficient may be computed based on the data item count in the document and a data item count in the set of data documents 104 .
  • the system 100 identifies a predetermined number of the highest tf-idf indexed and stemmed nouns per document, generating an aggregate complete list of selected nouns to define document vectors (e.g., and as understood by one of ordinary skill in the art, vectors indicating whether a particular data item appears in a document) used in training the reference map generator 106 .
  • functionality for preprocessing and vectorization of the set of data documents 104 generates a vector for each document in the set of data documents 104 .
  • an identifier and an integer per data item on the list of selected nouns represent each document.
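The selection of top-scoring terms and the construction of document vectors can be sketched as follows. The tf-idf variant used here (relative term frequency multiplied by the natural log of the inverse document frequency) is one common formulation and is an assumption, as are the function names; the sketch also skips part-of-speech tagging and stemming and treats every token as an already selected, stemmed noun.

```python
import math
from collections import Counter

def top_terms_per_document(docs, k):
    """Return the k highest tf-idf terms for each tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    selected = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        selected.append(ranked[:k])
    return selected

def document_vectors(docs, selected):
    """Build one 0/1 vector per document over the aggregate list of selected terms."""
    vocabulary = sorted({term for terms in selected for term in terms})
    vectors = [[1 if term in doc else 0 for term in vocabulary] for doc in docs]
    return vocabulary, vectors

docs = [
    ["river", "water", "fish", "water"],
    ["engine", "car", "water"],
    ["car", "road", "engine", "car"],
]
selected = top_terms_per_document(docs, k=2)
vocabulary, vectors = document_vectors(docs, selected)
```

The resulting `vectors` are the kind of input a reference map generator could be trained on.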
  • the machine 102 provides the preprocessed documents to a full-text search system 122 .
  • the parser and preprocessing module 110 may provide this functionality.
  • use of the full-text search system 122 enables interactive selection of the documents.
  • the full-text search system 122 may provide functionality allowing for retrieval of all documents, or snippets of original documents, that contain a specific data item using, for example, literal exact matching.
  • each of the preprocessed documents is associated with at least one of the following: a document identifier, a snippet identifier, a document title, the text of the document, a count of data items in the document, a length in bytes of the document, and a classification identifier.
  • semantic map coordinate pairs are assigned to documents; such coordinate pairs may be associated with the preprocessed documents in the full-text search system 122 .
  • the full-text search system 122 may provide functionality for receiving a single or compound data item and for returning the coordinate pairs of all matching documents containing the received data item.
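The lookup described above, from a data item to the coordinate pairs of all matching documents, can be approximated with a small in-memory inverted index. This stands in for a real full-text search system and is only a sketch: the class name `CoordinateIndex` and its methods are hypothetical, and a compound data item is assumed to mean that every component term must occur in the document.

```python
from collections import defaultdict

class CoordinateIndex:
    """Toy inverted index mapping data items to document coordinate pairs."""

    def __init__(self):
        self._postings = defaultdict(set)   # data item -> document ids
        self._coords = {}                   # document id -> (x, y)

    def add(self, doc_id, coord, items):
        self._coords[doc_id] = coord
        for item in items:
            self._postings[item].add(doc_id)

    def coordinates_for(self, *items):
        """Return coordinate pairs of documents containing all given items."""
        if not items:
            return []
        matching = set.intersection(*(self._postings.get(i, set()) for i in items))
        return sorted(self._coords[doc_id] for doc_id in matching)

index = CoordinateIndex()
index.add("d1", (3, 4), ["river", "water"])
index.add("d2", (10, 2), ["water"])
water_coords = index.coordinates_for("water")
compound_coords = index.coordinates_for("river", "water")
```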
  • Full-text search systems 122 include, without limitation, Lucene-based systems (e.g., Apache SOLR distributed by The Apache Software Foundation, Forest Hill, Md., and ELASTICSEARCH distributed by Elasticsearch Global BV, Amsterdam, The Netherlands), open source systems (Indri distributed by The Lemur Project through SourceForge Lemur Project, owned and operated by Slashdot Media, San Francisco, Calif., a Dice Holdings, Inc.
  • the reference map generator 106 accesses the document vectors of the set of data documents 104 to distribute each of the documents across a two-dimensional metric space. In another embodiment, the reference map generator 106 accesses the preprocessed set of data documents 104 to distribute points representing each of the documents across the two-dimensional metric space. In still another embodiment, the distributed points are clustered. For example, the reference map generator 106 may calculate a position of a point representing a document based on semantic content of the document. The resulting distribution represents the semantic universe of a specific set of data documents 104 .
  • the reference map generator 106 is trained using the document vectors of the preprocessed set of data documents 104 . In another embodiment, the reference map generator 106 is trained using the document vectors of the set of data documents 104 (e.g., without preprocessing). Users of the system 100 may use training processes well understood by those skilled in the relevant arts to train the reference map generator 106 with the set of data documents 104 .
  • the training process leads to two results.
  • where the initial training set is large and descriptive enough, adding new training documents can extend the vocabulary.
  • new documents may be positioned on the map by transforming their document vectors with the trained weights.
  • the intended semantic map 108 can be refined and improved by analyzing the distribution of the points representing documents over the semantic map 108 . If there are topics that are under- or over-represented, the set of data documents 104 can be adapted accordingly and the semantic map 108 can then be recomputed.
  • the method 200 includes clustering, by a reference map generator executing on a computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map.
  • various techniques may be applied to cluster the data documents; for example, and without limitation, implementations may leverage generative topographic maps, growing self-organizing maps, elastic maps, neural gas, random mapping, latent semantic indexing, principal components analysis or any other dimensionality reduction-based mapping method.
  • in FIG. 1B , a block diagram depicts one embodiment of a system for generating a semantic map 108 for use in mapping data items to sparse distributed representations.
  • the set of data documents 104 received by the machine 102 may be referred to as a language definition corpus.
  • the documents may be referred to as a reference map generator training corpus.
  • the documents may also be referred to as a neural network training corpus.
  • the reference map generator 106 accesses the reference map generator training corpus to generate as output a semantic map 108 on which the set of data documents are positioned.
  • the semantic map 108 may extract the coordinates of each document.
  • the semantic map 108 may provide the coordinates to the full-text search system 122 .
  • corpuses may include those based on an application (e.g., a web application for content creation and management) allowing collaborative modification, extension, or deletion of its content and structure; such an application may be referred to as a “wiki” and the “Wikipedia” encyclopedia project supported and hosted by the Wikimedia Foundation of San Francisco, Calif., is one example of such an application.
  • Corpuses may also include knowledge bases of any kind or type.
  • any type or form of algorithm may be used to map high dimensional vectors into a low dimensional space (e.g., the semantic map 108 ) by, for example, clustering the input vectors such that similar vectors are located close to each other on the low dimensional space, resulting in a low dimensional map that is topologically clustered.
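One of the mapping methods named in this description is the self-organizing map. The sketch below is a deliberately small, pure-Python version, meant only to show how document vectors can be pulled toward nearby grid nodes so that similar vectors land close together. The function names, the linear decay schedules, the Gaussian neighborhood, and all default parameters are assumptions, not details from the disclosure.

```python
import math
import random

def best_matching_unit(weights, vector):
    """Return the (x, y) node whose weight vector is closest to `vector`."""
    best, best_dist = None, float("inf")
    for (x, y), w in weights.items():
        dist = sum((a - b) ** 2 for a, b in zip(w, vector))
        if dist < best_dist:
            best, best_dist = (x, y), dist
    return best

def train_som(vectors, side, epochs=20, learning_rate=0.5, seed=0):
    """Train a side x side self-organizing map on the given vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(side) for y in range(side)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = learning_rate * decay
        radius = max(side / 2.0 * decay, 0.5)
        order = list(vectors)
        rng.shuffle(order)
        for v in order:
            bx, by = best_matching_unit(weights, v)
            for (x, y), w in weights.items():
                d2 = (x - bx) ** 2 + (y - by) ** 2
                influence = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (v[i] - w[i])
    return weights

doc_vectors = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
som = train_som(doc_vectors, side=4)
coords = [best_matching_unit(som, v) for v in doc_vectors]
```

After training, `best_matching_unit` gives the coordinate pair for a document vector, including a vector that was not part of the training set.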
  • a size of a quadratic semantic map defines the “semantic resolution” with which patterns of sparse distributed representations (SDRs) of data items will be computed, as will be discussed in further detail below. For example, a side-length of 128 corresponds to a descriptiveness of 16K features per data item-SDR.
  • the size of the map can be chosen freely, considering that there are computational limits as bigger reference map generator sizes take longer to train and bigger SDRs take longer to be compared or processed by any means.
  • a data item SDR size of 128×128 has proven to be useful when applied on a “general English language” set of data documents 104 .
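  • as an illustrative, non-limiting sketch of the arithmetic above, the following Python fragment computes the number of features for a given side length and converts between a coordinate pair on the map and a bit position in an SDR vector. The function names and the row-major bit layout are assumptions made for illustration only and are not specified herein.

```python
def sdr_size(side_length: int) -> int:
    """Number of bit positions (features) in an SDR for a square map."""
    return side_length * side_length


def coord_to_index(x: int, y: int, side_length: int) -> int:
    """Map a coordinate pair to a bit position (row-major layout is assumed)."""
    if not (0 <= x < side_length and 0 <= y < side_length):
        raise ValueError("coordinate outside the map")
    return y * side_length + x


def index_to_coord(index: int, side_length: int) -> tuple:
    """Inverse of coord_to_index: recover the (x, y) pair from a bit position."""
    if not 0 <= index < side_length * side_length:
        raise ValueError("index outside the SDR")
    return index % side_length, index // side_length
```

  • with a side length of 128, `sdr_size` yields 16,384 positions, which is the “16K features” figure given above.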
  • the method 200 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 204 ).
  • the reference map generator 106 calculates a position of a point on the semantic map 108 , the point representing a document in the set of data documents 104 .
  • the semantic map 108 may then extract the coordinates of the point.
  • the semantic map 108 transmits the extracted coordinates to the full-text search system 122 .
  • in FIG. 1C, a block diagram depicts one embodiment of a system for generating a sparse distributed representation for each of a plurality of data items in the set of data documents 104 .
  • the representation generator 114 may transmit a query to the full-text search system 122 and receive one or more data items matching the query.
  • the representation generator 114 may generate sparse distributed representations of data items retrieved from the full-text search system 122 .
  • using data from a semantic map 108 to generate the SDRs may be said to involve “folding” semantic information into generated sparse distributed representations (e.g., sparsely populated vectors).
  • the method 200 includes generating, by a parser executing on the computing device, an enumeration of data items occurring in the set of data documents ( 206 ).
  • the parser and preprocessing module 110 generates the enumeration of data items 112 .
  • the parser and preprocessing module 110 accesses the set of data documents 104 directly to generate the enumeration of data items 112 .
  • the parser and preprocessing module 110 accesses the full-text search system 122 storing (as described above) a preprocessed version of the set of data documents 104 .
  • the parser and preprocessing module 110 extends the enumeration of data items 112 to include not just the data items explicitly included in the set of data documents 104 but common useful data item combinations; for example, the parser and preprocessing module 110 may access frequent combinations of data items (such as “bad weather” or “electronic commerce”) retrieved from publicly available collections.
  • the parser and preprocessing module 110 delimits the data items in the enumeration 112 using, for example, spaces, or punctuation.
  • data items appearing in the enumeration 112 multiple times under different parts of speech tags are treated as distinct (e.g., the data item “fish” will have a different SDR if it is used as a noun than if it is used as a verb and so two entries are included).
  • the parser and preprocessing module 110 provides the enumeration of data items 112 to the SDR database 120 .
  • the representation generator 114 will access the stored enumeration of data items 112 to generate an SDR for each data item in the enumeration 112 .
  • the method 200 includes determining, by a representation generator executing on the computing device, for each data item in the enumeration, occurrence information including: (i) a number of data documents in which the data item occurs, (ii) a number of occurrences of the data item in each data document, and (iii) the coordinate pair associated with each data document in which the data item occurs ( 208 ).
  • the representation generator 114 accesses the full-text search system 122 to retrieve data stored in the full-text search system 122 by the semantic map 108 and the parser and preprocessing module 110 and generates sparse distributed representations for data items enumerated by the parser and preprocessing module 110 using data from the semantic map 108 .
  • the representation generator 114 accesses the full-text search system 122 to retrieve coordinate pairs for each document that contains a particular string (e.g., words or numbers or combinations of words and numbers). The representation generator 114 may count the number of retrieved coordinate pairs to determine a number of documents in which the data item occurs. In another embodiment, the representation generator 114 retrieves, from the full-text search system 122 , a vector representing each document that contains the string. In such an embodiment, the representation generator 114 determines a number of set bits within the vector (e.g., the number of bits within the vector set to 1), which indicates how many times the data item occurred in a particular document. The representation generator 114 may add the number of set bits to determine the occurrence value.
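  • by way of a hedged example, the three kinds of occurrence information listed in step ( 208 ) can be gathered as in the following Python sketch. The in-memory dictionary standing in for the full-text search system 122 , the pre-tokenized documents, and the name `occurrence_info` are assumptions made for illustration; an actual embodiment may retrieve the same information by querying a search index.

```python
def occurrence_info(item, documents):
    """Collect occurrence information for one data item.

    documents maps a document id to a pair (coordinate_pair, tokens), where
    coordinate_pair is the document's (x, y) position on the semantic map
    and tokens is the list of data items found in that document.
    """
    occurrences = {}   # (ii) occurrences of the item in each document
    coordinates = {}   # (iii) coordinate pair of each matching document
    for doc_id, (coordinate_pair, tokens) in documents.items():
        count = tokens.count(item)
        if count:
            occurrences[doc_id] = count
            coordinates[doc_id] = coordinate_pair
    return {
        "document_count": len(occurrences),  # (i) number of matching documents
        "occurrences": occurrences,
        "coordinates": coordinates,
    }
```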
  • the method 200 includes generating, by the representation generator, a distributed representation using the occurrence information ( 210 ).
  • the representation generator 114 may use well-known processes for generating distributed representation.
  • the distributed representation may be used to determine a pattern representative of semantic contexts in which a data item in the set of data documents 104 occurs; the spatial distribution of coordinate pairs in the pattern reflects the semantic regions in the context of which the data item occurred.
  • the representation generator 114 may generate a two-way mapping between a data item and its distributed representation.
  • the SDR database 120 may be referred to as a pattern dictionary with which the system 100 may identify data items based on distributed representations and vice versa.
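  • one possible way to picture step ( 210 ) and the two-way mapping is sketched below in Python. The sketch accumulates per-document occurrence counts at each document's map position and keeps a small bidirectional lookup between data items and patterns. The names `distributed_representation` and `PatternDictionary`, and the choice to key the reverse lookup on a frozen set of positions, are illustrative assumptions and not a description of the SDR database 120 itself.

```python
from collections import Counter


def distributed_representation(occurrences, coordinates):
    """Sum occurrence counts per map position.

    occurrences maps document id -> count of the data item in that document;
    coordinates maps document id -> (x, y) position on the semantic map.
    Several documents may share a position, so counts are accumulated.
    """
    counts = Counter()
    for doc_id, count in occurrences.items():
        counts[coordinates[doc_id]] += count
    return counts


class PatternDictionary:
    """Two-way lookup between data items and their patterns (assumed design)."""

    def __init__(self):
        self._pattern_by_item = {}
        self._item_by_pattern = {}

    def add(self, item, positions):
        pattern = frozenset(positions)  # hashable, order-independent key
        self._pattern_by_item[item] = pattern
        self._item_by_pattern[pattern] = item

    def pattern_for(self, item):
        return self._pattern_by_item[item]

    def item_for(self, positions):
        return self._item_by_pattern[frozenset(positions)]
```

  • note that an exact-match reverse lookup of this kind only identifies a data item when the pattern is identical; identifying the closest data item for an arbitrary pattern would rely on a distance computation instead.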
  • the method 200 includes receiving, by a sparsifying module executing on the computing device, an identification of a maximum level of sparsity ( 212 ).
  • a human user provides the identification of the maximum level of sparsity.
  • the maximum level of sparsity is set to a predefined threshold.
  • the maximum level of sparsity depends on a resolution of the semantic map 108 .
  • the maximum level of sparsity depends on a type of the reference map generator 106 .
  • the method 200 includes reducing, by the sparsifying module, a total number of set bits within the distributed representation based on the maximum level of sparsity to generate a sparse distributed representation (SDR) having a normative fillgrade ( 214 ).
  • the sparsifying module 116 sparsifies the distributed representation by setting a count threshold (e.g., using the received identification of the maximum level of sparsity) that leads to a specific fillgrade of the final SDR 118 .
  • the sparsifying module 116 therefore generates an SDR 118 , which may be said to provide a binary fingerprint of the semantic meaning or the semantic value of a data item in the set of data documents 104 ; the SDR 118 may also be referred to as a semantic fingerprint.
  • the sparsifying module 116 stores the SDR 118 in the SDR database 120 .
  • the system 100 populates a vector with 1s and 0s—1 if a data document uses a data item, 0 if it doesn't, for example.
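  • the sparsification of steps ( 212 ) and ( 214 ) may be illustrated, without limitation, by the following Python sketch, which keeps only the highest-count positions so that the share of set bits does not exceed a requested fillgrade. Expressing the maximum level of sparsity as a fraction of the map, ranking positions by count, and breaking ties by position are all assumptions made for this example; the names `sparsify` and `fillgrade` are hypothetical.

```python
def sparsify(counts, side_length, max_fill):
    """Return the set of positions kept as set bits in the sparse SDR.

    counts maps (x, y) -> accumulated occurrence count.
    max_fill is the largest allowed fraction of set bits (e.g. 0.02).
    """
    max_bits = int(side_length * side_length * max_fill)
    # Rank positions by descending count; ties are broken by position so
    # the result is deterministic (an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {position for position, count in ranked[:max_bits] if count > 0}


def fillgrade(sdr, side_length):
    """Fraction of bit positions that are set."""
    return len(sdr) / (side_length * side_length)
```

  • keeping the top-ranked positions is equivalent to choosing a count threshold just high enough to reach the target fillgrade, apart from how ties at the threshold are resolved.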
  • a user may receive a graphical representation of the SDR showing points on a map reflective of the semantic meaning of the data item (the graphical representation being referred to either as an SDR, a semantic fingerprint, or a pattern), and although the description herein may also refer to points and patterns, one of ordinary skill in the art will understand that referring to “points” or “patterns” also refers to the set bits within the SDR vector, that is, to the data structure underlying any such graphical representation, which is optional.
  • the representation generator 114 and the sparsifying module 116 may combine a plurality of data items into a single SDR. For example, if a phrase, sentence, paragraph, or other combination of data items needs to be converted into a single SDR that reflects the “union property” of the individual SDRs, the system 100 may convert each individual data item into its SDR (by generating dynamically or by retrieving the previously generated SDR) and use a binary OR operation to form a single compound SDR from the individual SDRs. Continuing with this example, the number of set bits is added for every location within the compound SDR. In one embodiment, the sparsifying module 116 may proportionally reduce a total number of set bits using a threshold resulting in a normative fillgrade.
  • the sparsifying module 116 may apply a weighting scheme to reduce the total number of set bits, which may include evaluating a number of bits surrounding a particular set bit instead of simply counting the number of set bits per location in the SDR.
  • a locality weighting scheme may favor bits that are part of clusters within the SDR and are therefore semantically more important than single isolated bits (e.g., with no set bits surrounding them).
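  • the combination of several SDRs into a compound SDR, with and without a locality weighting, can be sketched in Python as follows. The sketch adds up how many of the individual SDRs set each position and then keeps a fixed number of positions. The 3×3 neighborhood used for the locality score, the `max_bits` parameter, and the function names are assumptions chosen for illustration; other weighting schemes are equally possible.

```python
from collections import Counter


def compound_counts(sdrs):
    """Count, for every position, how many of the given SDRs set it."""
    counts = Counter()
    for sdr in sdrs:
        counts.update(sdr)  # each set of positions contributes 1 per position
    return counts


def locality_score(position, counts):
    """Sum the counts in the 3x3 neighborhood around a position (assumed)."""
    x, y = position
    return sum(
        counts.get((x + dx, y + dy), 0)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
    )


def compound_sdr(sdrs, max_bits, use_locality=False):
    """Reduce the combined SDRs to at most max_bits set positions."""
    counts = compound_counts(sdrs)
    if use_locality:
        score = lambda p: locality_score(p, counts)
    else:
        score = lambda p: counts[p]
    ranked = sorted(counts, key=lambda p: (-score(p), p))
    return set(ranked[:max_bits])
```

  • with `use_locality=True`, an isolated position that several SDRs share can still rank below positions that belong to a dense cluster, which mirrors the preference for clustered bits described above.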
  • implementation of the methods and systems described herein provides a system that does not simply generate a map that clusters sets of data documents by context, but goes on to analyze the positions on the map representing clustered data documents, determine which data documents include a particular data item based on the analysis, and use the analysis to provide a specification for each data item in each data document.
  • the sparse distributed representations of the data items are generated based on data retrieved from the semantic map 108 .
  • the sparse distributed representations of the data items need not be limited to use in training other machine learning methods, but may be used to determine relationships between the data items (such as, for example, determining similarity between data items, ranking data items, or identifying data items that users did not previously know to be similar for use in searching and analysis in a variety of environments).
  • any data item becomes “semantically grounded” (e.g., within its semantic universe) and therefore explicitly comparable and computable even without using any machine learning, neural network, or cortical algorithm.
  • the system 300 includes functionality for determining semantic similarity between sparse distributed representations.
  • the system 300 includes functionality for determining relevance ranking of data items converted into SDRs by matching against a reference data item converted into an SDR.
  • the system 300 includes functionality for determining classifications of data items converted into SDRs by matching against a reference text element converted into an SDR.
  • the system 300 includes functionality for performing topic filtering of data items converted into SDRs by matching against a reference data item converted into an SDR.
  • the system 300 includes functionality for performing keyword extraction from data items converted into SDRs.
  • the system 300 includes the elements and provides the functionality described above in connection with FIGS. 1A-1C (shown in FIG. 3 as the engine 101 and the SDR database 120 ).
  • the system 300 also includes a machine 102 a , a machine 102 b , a fingerprinting module 302 , a similarity engine 304 , a disambiguation module 306 , a data item module 308 , and an expression engine 310 .
  • the engine 101 executes on the machine 102 a .
  • the fingerprinting module 302 , the similarity engine 304 , the disambiguation module 306 , the data item module 308 , and the expression engine 310 execute on the machine 102 b.
  • the system 300 includes a fingerprinting module 302 .
  • the fingerprinting module 302 includes the representation generator 114 and the sparsifying module 116 described above in connection with FIGS. 1A-1C and 2 .
  • the fingerprinting module 302 forms part of the engine 101 .
  • the fingerprinting module 302 is implemented at least in part as a hardware module.
  • the fingerprinting module 302 is implemented at least in part as a software program.
  • the fingerprinting module 302 executes on the machine 102 .
  • the fingerprinting module 302 performs a postproduction process to transform a data item SDR into a semantic fingerprint (e.g., via the sparsification process described herein) in real-time, with SDRs that are not part of the SDR database 120 but that are generated dynamically (e.g., to create document semantic fingerprints from word semantic fingerprints); however, such postproduction processing is optional.
  • the representation generator 114 may be accessed directly in order to generate sparsified SDRs for data items that do not yet have SDRs in the SDR database 120 ; in such an embodiment, the representation generator 114 may call the sparsifying module 116 automatically and automatically generate a sparsified SDR.
  • the terms “SDR,” “fingerprint,” and “semantic fingerprint” are used interchangeably herein and may be used to refer both to SDRs that have been generated by the fingerprinting module 302 and to SDRs that are generated by calling the representation generator 114 directly.
  • the system 300 includes a similarity engine 304 .
  • the similarity engine 304 may provide functionality for computing distances between SDRs and determining a level of similarity.
  • the similarity engine 304 is implemented at least in part as a hardware module.
  • the similarity engine 304 is implemented at least in part as a software program.
  • the similarity engine 304 executes on the machine 102 b.
  • the system 300 includes a disambiguation module 306 .
  • the disambiguation module 306 identifies contextual sub-spaces embodied within a single SDR of a data item. Therefore, the disambiguation module 306 may allow users to better understand different semantic contexts of a single data item.
  • the disambiguation module 306 is implemented at least in part as a hardware module.
  • the disambiguation module 306 is implemented at least in part as a software program. In other embodiments, the disambiguation module 306 executes on the machine 102 b.
  • the system 300 includes a data item module 308 .
  • the data item module 308 provides functionality for identifying the most characteristic data items from a set of received data items—that is, data items whose SDRs have less than a threshold distance from an SDR of the received set of data items, as will be discussed in greater detail below.
  • the data item module 308 may be used in conjunction with or instead of a keyword extraction module 802 discussed below in connection with FIG. 8A .
  • the data item module 308 is implemented at least in part as a hardware module.
  • the data item module 308 is implemented at least in part as a software program. In other embodiments, the data item module 308 executes on the machine 102 b.
  • the system 300 includes an expression engine 310 .
  • the expression engine 310 provides functionality for evaluating Boolean operators received with one or more data items from a user. Evaluating the Boolean operators provides users with flexibility in requesting analysis of one or more data items or combinations of data items.
  • the expression engine 310 is implemented at least in part as a hardware module.
  • the expression engine 310 is implemented at least in part as a software program.
  • the expression engine 310 executes on the machine 102 b.
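  • purely as a hypothetical illustration of evaluating Boolean operators over data items, the Python sketch below treats each SDR as a set of bit positions and maps operators onto set operations. The operator names `AND`, `OR`, and `SUB`, the nested-tuple expression format, and the dictionary standing in for the SDR database 120 are assumptions of this example and are not drawn from the description of the expression engine 310 .

```python
def evaluate(expression, sdr_db):
    """Evaluate a nested Boolean expression over SDRs.

    expression is either a data item (str) or a tuple (operator, left, right).
    sdr_db maps a data item to its SDR, given as a set of bit positions.
    """
    if isinstance(expression, str):
        return set(sdr_db[expression])
    operator, left, right = expression
    left_sdr = evaluate(left, sdr_db)
    right_sdr = evaluate(right, sdr_db)
    if operator == "AND":
        return left_sdr & right_sdr   # bits set in both operands
    if operator == "OR":
        return left_sdr | right_sdr   # bits set in either operand
    if operator == "SUB":
        return left_sdr - right_sdr   # bits of the left operand not in the right
    raise ValueError(f"unknown operator: {operator}")
```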
  • the method 400 includes clustering, by a reference map generator executing on a computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 402 ).
  • the method 400 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 404 ).
  • the method 400 includes generating, by a parser executing on the computing device, an enumeration of data items occurring in the set of data documents ( 406 ).
  • the method 400 includes determining, by a representation generator executing on the computing device, for each data item in the enumeration, occurrence information including: (i) a number of data documents in which the data item occurs, (ii) a number of occurrences of the data item in each data document, and (iii) the coordinate pair associated with each data document in which the data item occurs ( 408 ).
  • the method 400 includes generating, by the representation generator, a distributed representation using the occurrence information ( 410 ).
  • the method 400 includes receiving, by a sparsifying module executing on the computing device, an identification of a maximum level of sparsity ( 412 ).
  • the method 400 includes reducing, by the sparsifying module, a total number of set bits within the distributed representation based on the maximum level of sparsity to generate a sparse distributed representation (SDR) having a normative fillgrade ( 414 ).
  • the method 400 includes determining, by a similarity engine executing on the computing device, a distance between a first SDR of a first data item and a second SDR of a second data item ( 416 ).
  • the method 400 includes providing, by the similarity engine, an identification of a level of semantic similarity between the first data item and the second data item based upon the determined distance ( 418 ).
  • the method 400 includes clustering, by a reference map generator executing on a computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 402 ).
  • the clustering occurs as described above in connection with FIG. 2 ( 202 ).
  • the method 400 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 404 ). In one embodiment, the associating occurs as described above in connection with FIG. 2 ( 204 ).
  • the method 400 includes generating, by a parser executing on the computing device, an enumeration of data items occurring in the set of data documents ( 406 ). In one embodiment, the generating occurs as described above in connection with FIG. 2 ( 206 ).
  • the method 400 includes determining, by a representation generator executing on the computing device, for each data item in the enumeration, occurrence information including: (i) a number of data documents in which the data item occurs, (ii) a number of occurrences of the data item in each data document, and (iii) the coordinate pair associated with each data document in which the data item occurs ( 408 ).
  • the determining occurs as described above in connection with FIG. 2 ( 208 ).
  • the method 400 includes generating, by the representation generator, a distributed representation using the occurrence information ( 410 ). In one embodiment, the generating occurs as described above in connection with FIG. 2 ( 210 ).
  • the method 400 includes receiving, by a sparsifying module executing on the computing device, an identification of a maximum level of sparsity ( 412 ). In one embodiment, the receiving occurs as described above in connection with FIG. 2 ( 212 ).
  • the method 400 includes reducing, by the sparsifying module, a total number of set bits within the distributed representation based on the maximum level of sparsity to generate a sparse distributed representation (SDR) having a normative fillgrade ( 414 ).
  • the reducing occurs as described above in connection with FIG. 2 ( 214 ).
  • the method 400 includes determining, by a similarity engine executing on the computing device, a distance between a first SDR of a first data item and a second SDR of a second data item ( 416 ).
  • the similarity engine 304 computes the distance between at least two SDRs.
  • Distance measures may include, without limitation, Direct Overlap, Euclidean Distance (e.g., determining the ordinary distance between two points in an SDR in a similar manner as a human would measure with a ruler), Jaccard Distance, and Cosine-similarity. The smaller the distance between two SDRs, the greater the similarity and (with semantic folding SDRs) a higher similarity indicates a higher semantic relatedness of the data elements the SDRs represent.
  • the similarity engine 304 counts a number of bits that are set on both the first SDR and the second SDR (e.g., points at which both SDRs are set to 1). In another embodiment, the similarity engine 304 identifies a first point in the first SDR (e.g., an arbitrarily selected first bit that is set to 1), finds the same point within the second SDR and determines the closest set bit in the second SDR. By determining what the closest set bit in the second SDR is to a set bit in the first SDR—for each set bit in the first SDR—the similarity engine 304 is able to calculate a sum of the distances at each point and divide by the number of points to determine the total distance.
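  • the distance measures named above, together with the closest-set-bit computation, may be sketched in Python as follows, with each SDR held as a set of (x, y) positions of set bits. This is a minimal example under stated assumptions: the Euclidean variant shown treats the two SDRs as binary vectors, which is only one possible reading of the measure, and all function names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math


def overlap(a, b):
    """Direct overlap: number of positions set in both SDRs."""
    return len(a & b)


def jaccard_distance(a, b):
    """1 minus the ratio of shared set bits to all set bits."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cosine_similarity(a, b):
    """Cosine similarity of two binary vectors given as sets of set bits."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def euclidean_distance(a, b):
    """Euclidean distance between the SDRs viewed as binary vectors."""
    return math.sqrt(len(a ^ b))


def mean_nearest_bit_distance(a, b):
    """For each set bit in a, find the closest set bit in b; average the distances."""
    if not a or not b:
        raise ValueError("both SDRs need at least one set bit")
    total = sum(min(math.dist(p, q) for q in b) for p in a)
    return total / len(a)
```

  • the last function is not symmetric in its arguments; averaging the result in both directions is one way to obtain a symmetric value if that is desired.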
  • similarity is not an absolute measure but may vary depending on the different contexts that a data item might have.
  • the similarity engine 304 also analyzes the topography of the overlap between the two SDRs. For example, the topography of the overlap may be used to add a weighting function to the similarity computation. As another example, similarity measures may be used.
  • the method 400 includes providing, by the similarity engine, an identification of a level of semantic similarity between the first data item and the second data item based upon the determined distance ( 418 ).
  • the similarity engine 304 may determine that the distance between the two SDRs exceeds a maximum threshold for similarity and thus the represented data items are not similar. Alternatively, the similarity engine 304 may determine that the distance between the two SDRs does not exceed the maximum threshold and thus the represented data items are similar.
  • the similarity engine 304 may identify the level of similarity based upon a range, threshold, or other calculation. In one embodiment, because SDRs actually represent the semantic meaning (expressed by a large number of semantic features) of a data item, it is possible to determine the semantic closeness between two data items.
  • the system 100 provides a user interface (not shown) with which users may enter data items and receive an identification of the level of similarity.
  • the user interface may provide this functionality to users directly accessing the machine 100 .
  • the user interface may provide this functionality to users accessing the machine 100 across a computer network.
  • a user may enter a pair of data items such as “music” and “apple” into the user interface; the similarity engine 304 receives the data items and generates the SDRs for the data items as described above in connection with FIGS. 1A-1C and 2 .
  • the similarity engine 304 may then compare the two SDRs as described above.
  • the similarity engine 304 may provide a graphical representation of each of the SDRs to the user via the user interface, allowing the user to visually review the way in which each data item is semantically mapped (e.g., viewing the points that are clustered in a semantic map representing a use of the data item in the reference collection used for training the reference map generator 106 ).
  • the similarity engine 304 receives only one data item from a user.
  • the method 500 includes clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 502 ).
  • the method 500 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 504 ).
  • the method 500 includes generating, by a parser executing on the first computing device, an enumeration of data items occurring in the set of data documents ( 506 ).
  • the method 500 includes determining, by a representation generator executing on the first computing device, for each data item in the enumeration, occurrence information including: (i) a number of data documents in which the data item occurs, (ii) a number of occurrences of the data item in each data document, and (iii) the coordinate pair associated with each data document in which the data item occurs ( 508 ).
  • the method 500 includes generating, by the representation generator, for each data item in the enumeration, a distributed representation using the occurrence information ( 510 ).
  • the method 500 includes receiving, by a sparsifying module executing on the first computing device, an identification of a maximum level of sparsity ( 512 ).
  • the method 500 includes reducing, by the sparsifying module, for each distributed representation, a total number of set bits within the distributed representation based on the maximum level of sparsity to generate a sparse distributed representation (SDR) having a normative fillgrade ( 514 ).
  • the method 500 includes storing, in an SDR database, each of the generated SDRs ( 516 ).
  • the method 500 includes receiving, by a similarity engine executing on a second computing device, from a third computing device, a first data item ( 518 ).
  • the method 500 includes determining, by the similarity engine, a distance between a first SDR of the first data item and a second SDR of a second data item retrieved from the SDR database ( 520 ).
  • the method 500 includes providing, by the similarity engine, to the third computing device, an identification of the second data item and an identification of a level of semantic similarity between the first data item and the second data item, based on the determined distance ( 522 ).
  • ( 502 )-( 516 ) are performed as described above in connection with FIG. 2 ( 202 - 214 ).
  • the method 500 includes receiving, by a similarity engine executing on a second computing device, from a third computing device, a first data item ( 518 ).
  • the system 300 includes a user interface (not shown) with which a user may enter the first data item.
  • the fingerprinting module 302 generates an SDR of the first data item.
  • the representation generator 114 generates the SDR.
  • the method 500 includes determining, by the similarity engine, a distance between a first SDR of the first data item and a second SDR of a second data item retrieved from the SDR database ( 520 ). In one embodiment, the method 500 includes determining the distance between the first SDR of the first data item and the second SDR of the second data item as described above in connection with FIG. 4 , ( 416 ). In some embodiments, the similarity engine 304 retrieves the second data item from the SDR database 120 . In one of these embodiments, the similarity engine 304 examines each entry in the SDR database 120 to determine whether there is a level of similarity between the retrieved item and the received first data item.
  • the system 300 implements current text indexing techniques and text search libraries to perform efficient indexing of a semantic fingerprint (i.e., SDR) collection and to allow the similarity engine 304 to identify the second SDR of the second data item more efficiently than a “brute force” process such as iterating through each and every item in the database 120 .
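  • one way such indexing might avoid a brute-force scan is an inverted index from bit position to the data items whose SDRs set that bit, in the same way a text index maps terms to documents. The Python sketch below is an assumption about how this could be done and does not reproduce any particular text search library; the class name `SdrIndex` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class SdrIndex:
    """Inverted index: bit position -> data items whose SDR sets that bit."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, item, sdr):
        for bit in sdr:
            self._postings[bit].add(item)

    def most_overlapping(self, query_sdr, k):
        """Return up to k (item, overlap) pairs, highest overlap first.

        Only the posting lists of bits set in the query are visited, so items
        sharing no bit with the query are never touched.
        """
        hits = Counter()
        for bit in query_sdr:
            for item in self._postings.get(bit, ()):
                hits[item] += 1
        return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

  • because SDRs are sparse, each query visits only a small number of posting lists, which is what makes this kind of lookup cheaper than comparing against every stored SDR.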
  • the method 500 includes providing, by the similarity engine, to the third computing device, an identification of the second data item and an identification of a level of semantic similarity between the first data item and the second data item, based on the determined distance ( 522 ).
  • the similarity engine 304 provides the identifications via the user interface.
  • the similarity engine 304 provides an identification of a level of semantic similarity between the first data item and the second data item based upon the determined distance, as described above in connection with FIG. 4 , ( 418 ).
  • the similarity engine 304 retrieves a third SDR for a third data item from the SDR database and repeats the process of determining a distance between the first SDR of the first data item and the third SDR of the third data item and providing an identification of a level of semantic similarity between the first and third data items, based on the determined distance.
  • the similarity engine 304 may return an enumeration of other data items that are most similar to the received data item.
  • the similarity engine 304 may generate an SDR 118 for the received data item and then search the SDR database 120 for other SDRs that are similar to the SDR 118 .
  • the data item module 308 provides this functionality.
  • the similarity engine 304 may compare the SDR 118 for the received data item with each of a plurality of SDRs in the SDR database 120 as described above and return an enumeration of data items that satisfy a requirement for similarity (e.g., having a distance between the data items that falls below a predetermined threshold). In some embodiments, the similarity engine 304 returns the SDRs that are most similar to a particular SDR (as opposed to returning the data item itself).
  • a method for receiving a data item (which may be referred to as a keyword) and identifying similar data items performs as described above in connection with FIG. 2 ( 202 - 214 ).
  • the data item module 308 provides this functionality.
  • the method includes receiving a data item.
  • the method may include receiving a request for most similar data items that are not identical to the received data item.
  • the method includes generating a first SDR for the received data item.
  • the method includes determining a distance between the first SDR and each SDR in the SDR database 120 .
  • the method includes providing an enumeration of data items for which the distance between an SDR of an enumerated data item and the first SDR falls below a threshold.
  • the method includes providing an enumeration of data items with a level of similarity between each data item and the received data item above a threshold.
  • methods for identifying similar data items provide functionality for receiving a data item or an SDR of a data item and generating an enumeration of SDRs ordered by increasing distance (e.g., Euclidean distance).
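  • the steps of receiving a data item, comparing its SDR with each stored SDR, and returning the similar data items ordered by increasing distance may be sketched in Python as shown below. The sketch looks up the first SDR in an in-memory dictionary rather than generating it, uses Jaccard distance as the measure, and uses the hypothetical names `most_similar` and `max_distance`; each of these is an assumption made for illustration.

```python
def jaccard_distance(a, b):
    """1 minus the ratio of shared set bits to all set bits."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def most_similar(item, sdr_db, max_distance, exclude_self=True):
    """Return (data item, distance) pairs below max_distance, closest first.

    sdr_db maps each data item to its SDR, given as a set of set-bit positions.
    exclude_self drops the received data item from the enumeration.
    """
    query = sdr_db[item]
    results = []
    for other, sdr in sdr_db.items():
        if exclude_self and other == item:
            continue
        distance = jaccard_distance(query, sdr)
        if distance < max_distance:
            results.append((other, distance))
    # Order by increasing distance; ties are broken by item name.
    return sorted(results, key=lambda pair: (pair[1], pair[0]))
```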
  • the system 100 provides functionality for returning all contextual data items—that is, data items within the conceptual space in which the submitted data item occurs.
  • the data item module 308 may return similar data items either to a user providing the received data item or to another module or engine (e.g., the disambiguation module 306 ).
  • the system may generate an enumeration of similar data items and transmit the enumeration to a system for executing queries, which may be either a system within the system 300 or a third-party search system.
  • a user may enter a data item into a user interface for executing queries (e.g., a search engine) and the user interface may forward the data item to the query module 601 ; the query module 601 may automatically call components of the system (e.g., the similarity engine 304 ) to generate the enumeration of similar data items and provide the data items to the user interface for executing as queries in addition to the user's original query, thereby improving the comprehensiveness of the user's search results.
  • the similarity engine 304 may automatically call components of the system to generate the enumeration of similar data items and provide the data items to the user interface for executing as queries in addition to the user's original query, thereby improving the comprehensiveness of the user's search results.
  • the system may generate the enumeration of similar data items, provide the data items directly to a third-party search system, and return the results of the expanded search to the user via the user interface.
  • Third-party search systems (which may also be referred to herein as enterprise search systems) may be any type or form; as indicated above in connection with the full-text search system 122 , a wide variety of such systems are available and may be enhanced using the methods and systems described herein.
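  • the query expansion described above can be pictured with the following Python sketch, in which the user's original query term is supplemented by similar data items and the results of all queries are merged. The callables `similar_terms` and `search` are placeholders for, respectively, a component returning similar data items and a search system; they, the `max_extra` parameter, and the function names are assumptions and do not correspond to any specific interface.

```python
def expand_query(term, similar_terms, max_extra):
    """Return the original term followed by up to max_extra similar terms."""
    expanded = [term]
    for candidate in similar_terms(term):
        if len(expanded) - 1 >= max_extra:
            break
        if candidate not in expanded:
            expanded.append(candidate)
    return expanded


def expanded_search(term, similar_terms, search, max_extra=3):
    """Run every expanded query and merge results, preserving first-seen order."""
    seen = set()
    merged = []
    for query in expand_query(term, similar_terms, max_extra):
        for result in search(query):
            if result not in seen:
                seen.add(result)
                merged.append(result)
    return merged
```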
  • Referring to FIG. 6A, a block diagram depicts one embodiment of a system 300 for expanding a query of a full-text search system.
  • the system 300 includes the elements and provides the functionality described above in connection with FIGS. 1A-1C and FIG. 3 above.
  • the system 300 includes a machine 102 d executing a query module 601 .
  • the query module 601 executes a query expansion module 603 , a ranking module 605 , and a query input processing module 607 .
  • the query module 601 receives query terms, directs the generation of SDRs for the received terms, and directs the identification of similar query terms.
  • the query module 601 is in communication with an enterprise search system provided by a third party.
  • the query module 601 may include one or more interfaces (e.g., application programming interface) with which to communicate with the enterprise search system.
  • the query module 601 is implemented at least in part as a software program.
  • the query module 601 is implemented at least in part as a hardware module.
  • the query module 601 executes on the machine 102 d.
  • the query input processing module 607 receives query terms from a user of a client 102 c . In another embodiment, the query input processing module 607 identifies a type of query term (e.g., individual word, group of words, sentence, paragraph, document, SDR, or other expression to be used in identifying similar terms). In some embodiments, the query input processing module 607 is implemented at least in part as a software program. In other embodiments, the query input processing module 607 is implemented at least in part as a hardware module. In still other embodiments, the query input processing module 607 executes on the machine 102 d . In further embodiments, the query module 601 is in communication with or provides the functionality of the query input processing module 607 .
  • the query expansion module 603 receives query terms from a user of a client 102 c . In another embodiment, the query expansion module 603 receives query terms from the query input processing module 607 . In still another embodiment, the query expansion module 603 directs the generation of an SDR for a query term. In another embodiment, the query expansion module 603 directs the identification, by the similarity engine 304 , of one or more terms that are similar to the query term (based on a distance between the SDRs). In some embodiments, the query expansion module 603 is implemented at least in part as a software program. In other embodiments, the query expansion module 603 is implemented at least in part as a hardware module. In still other embodiments, the query expansion module 603 executes on the machine 102 d . In further embodiments, the query module 601 is in communication with or provides the functionality of the query expansion module 603 .
  • a flow diagram depicts one embodiment of a method 600 for expanding a query of a full-text search system.
  • the method 600 includes clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 602 ).
  • the method 600 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 604 ).
  • the method 600 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 606 ).
  • the method 600 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 608 ).
  • the method 600 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information for each term ( 610 ).
  • the method 600 includes storing, in an SDR database, each of the generated SDRs ( 612 ).
  • the method 600 includes receiving, by a query expansion module executing on a second computing device, from a third computing device, a first term ( 614 ).
  • the method 600 includes determining, by a similarity engine executing on a fourth computing device, a level of semantic similarity between a first SDR of the first term and a second SDR of a second term retrieved from the SDR database ( 616 ).
  • the method 600 includes transmitting, by the query expansion module, to a full-text search system, using the first term and the second term, a query for an identification of each of a set of documents containing at least one term similar to at least one of the first term and the second term ( 618 ).
  • the method 600 includes transmitting, by the query expansion module, to the third computing device, the identification of each of the set of documents ( 620 ).
  • ( 602 )-( 612 ) are performed as described above in connection with FIG. 2 ( 202 - 214 ).
  • the method 600 includes receiving, by a query expansion module executing on a second computing device, from a third computing device, a first term ( 614 ).
  • the query expansion module 603 receives the first data item as described above in connection with FIG. 5 ( 516 ).
  • the query input processing module 607 receives the first term.
  • the query input processing module 607 transmits the first term, with a request for generation of an SDR, to the fingerprinting module 302 .
  • the query input processing module 607 transmits the first term to the engine 101 for generation of an SDR by the representation generator 114 .
  • the method 600 includes determining, by a similarity engine executing on a fourth computing device, a level of semantic similarity between a first SDR of the first term and a second SDR of a second term retrieved from the SDR database ( 616 ).
  • the similarity engine 304 determines the level of semantic similarity as described above in connection with FIG. 5 ( 520 ).
  • the method 600 includes transmitting, by the query expansion module, to a full-text search system, using the first term and the second term, a query for an identification of each of a set of documents containing at least one term similar to at least one of the first term and the second term ( 618 ).
  • the similarity engine 304 provides the second term to the query module 601 . It will be understood that the similarity engine may provide a plurality of terms that have a level of similarity to the first term that exceeds a similarity threshold.
  • the query module 601 may include one or more application programming interfaces with which to transmit queries, including one or more search terms, to the third-party enterprise search system.
  • the method 600 includes transmitting, by the query expansion module, to the third computing device, the identification of each of the set of documents ( 620 ).
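  • Steps ( 614 )-( 618 ) can be sketched as follows. This is an illustration only: the overlap-based similarity measure, the threshold value, the function names, and the OR-joined query string are assumptions, and the interface of any particular full-text search system may differ.

```python
def overlap_similarity(a, b):
    """Fraction of shared set bits relative to the smaller SDR (an assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def expand_query(first_term, sdr_database, threshold=0.5):
    """Return the first term plus every stored term whose similarity exceeds the threshold."""
    first_sdr = sdr_database[first_term]
    similar = [
        term for term, sdr in sdr_database.items()
        if term != first_term and overlap_similarity(first_sdr, sdr) > threshold
    ]
    return [first_term] + sorted(similar)


def to_query_string(terms):
    """Join quoted terms with OR (hypothetical query syntax)."""
    return " OR ".join(f'"{term}"' for term in terms)


# Toy SDR database (hypothetical values).
sdr_database = {
    "car": frozenset({1, 2, 3, 4}),
    "automobile": frozenset({1, 2, 3, 5}),
    "banana": frozenset({20, 21, 22, 23}),
}
expanded_terms = expand_query("car", sdr_database)
query_string = to_query_string(expanded_terms)
```

  • The resulting string would then be transmitted to the full-text search system, and the returned document identifications forwarded to the requesting device.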
  • a flow diagram depicts one embodiment of a method 650 for expanding a query of a full-text search system.
  • the method 650 includes clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 652 ).
  • the method 650 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 654 ).
  • the method 650 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 656 ).
  • the method 650 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 658 ).
  • the method 650 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information for each term ( 660 ).
  • the method 650 includes storing, in an SDR database, each of the generated SDRs ( 662 ).
  • the method 650 includes receiving, by a query expansion module executing on a second computing device, from a third computing device, a first term ( 664 ).
  • the method 650 includes determining, by a similarity engine executing on a fourth computing device, a level of semantic similarity between a first SDR of the first term and a second SDR of a second term retrieved from the SDR database ( 666 ).
  • the method 650 includes transmitting, by the query expansion module, to the third computing device, the second term ( 668 ).
  • the method 650 includes transmitting, by the query expansion module, to the third computing device, the second term ( 668 ).
  • a user of the third computing device has the ability to review or modify the second term before the query is transmitted to the enterprise search system.
  • the user wants additional control over the query.
  • the user prefers to execute the queries herself.
  • the user wants the ability to modify a term identified by the system before transmission of the query.
  • providing the identified term to the user allows the system to request feedback from the user regarding the identified term.
  • the user may rate the accuracy of the similarity engine in identifying the second term.
  • the user provides an indication that the second term is a type of term in which the user has a level of interest (e.g., the second term relates to a subject the user is currently researching or in which the user is developing expertise).
  • a method for evaluating at least one Boolean expression includes receiving, by the expression engine 310 , at least one data item and at least one Boolean operator. The method includes performing the functionality described above in connection with FIG. 2 ( 202 - 214 ).
  • the expression engine 310 receives a plurality of data items that a user combined using Boolean operators and parentheses. For example, the user may submit a phrase such as “jaguar SUB porsche” and the expression engine 310 will evaluate the phrase and generate a modified version of an SDR for the expression. In another embodiment, therefore, the expression engine 310 generates a first SDR 118 for a first data item in the received phrase.
  • the expression engine 310 identifies the Boolean operator within the received phrase (e.g., by determining that the second data item in a three-data item phrase is the Boolean operator or by comparing each data item in the received phrase to an enumeration of Boolean operators to determine whether the data item is a Boolean operator or not). The expression engine 310 evaluates the identified Boolean operator to determine how to modify the first data item.
  • the expression engine 310 may determine that a Boolean operator “SUB” is included in the received phrase; the expression engine 310 may then determine to generate a second SDR for a data item following the Boolean operator (e.g., porsche, in the example phrase above) and generate a third SDR by removing the points from the first SDR that appear in the second SDR. The third SDR would then be the SDR of the first data item, not including the SDR of the second data item. Similarly, if the expression engine 310 determined that the Boolean operator was “AND,” the expression engine 310 would generate a third SDR using only the points common to the first and the second SDRs.
  • the expression engine 310 accepts data items, compound data items, and SDRs combined using Boolean operators and parentheses and returns an SDR that reflects the Boolean result of the formulated expression.
  • the resulting modified SDR may be returned to a user or provided to other engines within the system 200 (e.g., the similarity engine 304 ).
  • Boolean operators include, without limitation, AND, OR, XOR, NOT, and SUB.
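  • The evaluation of a Boolean expression over SDRs can be sketched with set operations, assuming each SDR is modeled as a set of set-bit positions. The sketch below handles only binary operators evaluated left to right; parentheses and the unary NOT operator (which would require a fixed SDR size) are omitted. All names and values are hypothetical.

```python
# Each binary operator maps two SDRs (sets of set-bit positions) to a result SDR.
OPERATORS = {
    "AND": lambda a, b: a & b,  # points common to both SDRs
    "OR": lambda a, b: a | b,   # points appearing in either SDR
    "XOR": lambda a, b: a ^ b,  # points appearing in exactly one SDR
    "SUB": lambda a, b: a - b,  # points of the first SDR that are absent from the second
}


def evaluate(expression, sdr_database):
    """Evaluate a flat expression such as 'jaguar SUB porsche' from left to right."""
    tokens = expression.split()
    result = sdr_database[tokens[0]]
    # The remaining tokens come in (operator, data item) pairs.
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        result = OPERATORS[operator](result, sdr_database[term])
    return result


# Toy SDRs (hypothetical values).
sdr_database = {
    "jaguar": frozenset({1, 2, 3, 10, 11, 12}),
    "porsche": frozenset({1, 2, 3, 4}),
}
without_car_sense = evaluate("jaguar SUB porsche", sdr_database)
shared_points = evaluate("jaguar AND porsche", sdr_database)
```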
  • a method for identifying a plurality of sub-contexts of a data item includes receiving, by the disambiguation module 306 , a data item.
  • the method includes performing the functionality described above in connection with FIG. 2 ( 202 - 214 ).
  • the method includes generating a first SDR for the received data item.
  • the method includes generating an enumeration of data items that have SDRs that are similar to the first SDR; for example, the method may include providing the first SDR to the similarity engine and requesting an enumeration of similar SDRs as described above.
  • the method includes analyzing one of the enumerated SDRs that is similar but not equal to the first SDR and removing from the first SDR the points (e.g., set bits) that also appear in the enumerated SDR (e.g., via binary subtraction) to generate a modified SDR.
  • the method includes repeating the process of removing points that appear in both the first SDR and the similar (but not identical) SDRs until the method has removed from the first SDR all the points that appear in each of the enumeration of similar SDRs.
  • for a data item such as “apple,” the system may return data items such as “macintosh,” “iphone,” and “operating system”; if a user provides the expression “apple SUB macintosh” and asks for similar data items from the remaining points, the system may return data items such as “fruit,” “plum,” “orange,” and “banana.” Continuing with this example, if the user then provides the expression “apple SUB macintosh SUB fruit” and repeats the request for similar data items, the system may return data items such as “records,” “beatles,” and “pop music.”
  • the method includes subtraction of the points of the similar SDRs from the largest clusters in the first SDR instead of from the entire SDR, providing a more optimized solution.
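  • The repeated subtraction described above can be sketched as a loop that picks the data item sharing the most points with what remains of the first SDR, records it as a sub-context, and removes its points. The selection rule (largest overlap), the `min_overlap` parameter, and the toy values are assumptions made for illustration.

```python
def sub_contexts(term, sdr_database, min_overlap=1):
    """Peel sub-contexts off a term's SDR by repeated subtraction of similar SDRs."""
    remaining = set(sdr_database[term])
    contexts = []
    while remaining:
        # Find the other data item sharing the most points with what is left.
        best_term, best_overlap = None, 0
        for other, sdr in sdr_database.items():
            if other == term:
                continue
            overlap = len(remaining & sdr)
            if overlap > best_overlap:
                best_term, best_overlap = other, overlap
        if best_term is None or best_overlap < min_overlap:
            break  # nothing similar enough is left to subtract
        contexts.append(best_term)
        remaining -= sdr_database[best_term]  # remove the shared points
    return contexts, remaining


# Toy SDRs (hypothetical values).
sdr_database = {
    "apple": frozenset({1, 2, 3, 4, 5, 6, 7, 8, 9}),
    "macintosh": frozenset({1, 2, 3, 4, 30}),
    "fruit": frozenset({5, 6, 7, 31}),
    "beatles": frozenset({8, 9, 32}),
    "car": frozenset({40, 41}),
}
contexts, leftover = sub_contexts("apple", sdr_database)
```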
  • data items may refer to items other than words.
  • the system may store an association between an SDR for the data item (100.1) and an identification of the data item as a reference data item for a patient with a fever. Determining that the data items have similar patterns provides functionality for identifying commonalities between dynamically generated SDRs and reference SDRs, enabling users to better understand the import of a particular data item.
  • the reference SDRs are linked to qualified diagnoses, making it possible to match a new patient's SDR profile against diagnosed patterns and deduce from it a mosaic of possible diagnoses for the new patient.
  • users may “see” where points (e.g., semantic features of a data item) overlap and/or match.
  • the diagnosis most similar to the new patient's SDR pattern is the predicted diagnosis.
  • the set of data documents 104 may include logs of captured flight data generated by airplane sensors (as opposed to, for example, encyclopedia entries on flight); the logs of captured data may include alphanumeric data items or may be primarily numeric.
  • the system 100 may provide functionality for generating SDRs of a variable (e.g., a variable associated with any type of flight data) and comparing the generated SDR with a reference SDR (e.g., an SDR of a data item used as a reference item known to have a particular characteristic such as a fact about the flight during which the data item was generated, for example, that the flight had a particular level of altitude or a characterization of the altitude such as too high or too low).
  • the system 100 may generate a first SDR for “500 (degrees)” and determine that the first SDR is similar to a second SDR for “28,000 (feet).” The system 100 may then determine that the second SDR is a reference SDR for data items indicating a characteristic of the flight (e.g., too high, too low, too fast, etc.), and thus provide a user who started with a data item “500” with an understanding of the import of the data item.
  • a method is provided for dividing a document into portions (also referred to herein as slices) while respecting the topical structure of the submitted text.
  • the data item module 308 receives a document to be divided into topical slices.
  • the data item module 308 identifies a location in the document that has a different semantic fingerprint than a second location and divides the document into two slices, one containing the first location and one containing the second.
  • the method includes performing the functionality described above in connection with FIG. 2 ( 202 - 214 ).
  • the method includes generating an SDR 118 for each sentence (e.g., strings delimited by periods) in the document.
  • the method includes comparing a first SDR 118 a of a first sentence with a second SDR 118 b of a second sentence.
  • the method may include transmitting the two SDRs to the similarity engine 304 for comparison.
  • the method includes inserting a break into the document after the first sentence when the distance between the two SDRs exceeds a predetermined threshold.
  • the method includes determining not to insert a break into the document when the distance between the two SDRs does not exceed the predetermined threshold.
  • the method includes repeating the comparison between the second sentence and a subsequent sentence.
  • the method includes iterating through the document, repeating comparisons between sentences until reaching the end of the document.
  • the method includes using the inserted breaks to generate slices of the document (e.g., returning a section of the document up through a first inserted break as a first slice).
  • having a plurality of smaller slices is preferred over a single document, but arbitrarily dividing a document (e.g., by length or word count) may be inefficient or less useful than a topic-based division.
  • the system 300 can determine where the topic of the document has changed, creating a logical dividing point.
  • the system 300 may provide a semantic fingerprint index in addition to a conventional index. Further examples of topic slicing are discussed in connection with FIGS. 7A-7B below.
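  • The sentence-by-sentence slicing described above can be sketched as follows. The sketch assumes SDRs modeled as sets of set-bit positions, a per-sentence SDR formed as the union of its word SDRs, a distance of one minus the Jaccard overlap, and a threshold of 0.8; each of these choices, and every name used, is hypothetical.

```python
import re


def sentence_sdr(sentence, sdr_database):
    """Union of the SDRs of the known words in a sentence."""
    sdr = set()
    for word in re.findall(r"[a-z]+", sentence.lower()):
        sdr |= sdr_database.get(word, frozenset())
    return sdr


def distance(a, b):
    """One minus the Jaccard overlap of two SDRs (an assumed distance measure)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def slice_document(text, sdr_database, threshold=0.8):
    """Split text into slices, breaking wherever adjacent sentences are too distant."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return []
    slices = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        gap = distance(sentence_sdr(previous, sdr_database),
                       sentence_sdr(current, sdr_database))
        if gap > threshold:
            slices.append([current])    # insert a break: start a new slice
        else:
            slices[-1].append(current)  # same topic: extend the current slice
    return [". ".join(parts) + "." for parts in slices]


# Toy word SDRs (hypothetical values).
sdr_database = {
    "cat": frozenset({1, 2, 3}), "purr": frozenset({1, 2}),
    "dog": frozenset({2, 3, 4}), "bark": frozenset({3, 4}),
    "stock": frozenset({50, 51, 52}), "market": frozenset({51, 52, 53}),
    "crash": frozenset({50, 53}),
}
text = "The cat will purr. The dog will bark. The stock market will crash."
slices = slice_document(text, sdr_database)
```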
  • a block diagram depicts one embodiment of a system 700 for providing topic-based documents to a full-text search system.
  • the system 700 includes the elements and provides the functionality described above in connection with FIGS. 1A-1C and FIG. 3 above.
  • the system 700 further includes a topic slicing module 702 .
  • the topic slicing module 702 receives documents, directs the generation of SDRs for the received documents, and directs the generation of sub-documents in which sentences having less than a threshold level of similarity are placed into different documents, or other data structures.
  • the topic slicing module 702 is in communication with an enterprise search system provided by a third party.
  • the topic slicing module 702 is implemented at least in part as a software program. In other embodiments, the topic slicing module 702 is implemented at least in part as a hardware module. In still other embodiments, the topic slicing module 702 executes on the machine 102 b.
  • a flow diagram depicts one embodiment of a method 750 for providing topic-based documents to a full-text search system.
  • the method 750 includes clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents selected according to at least one criterion, generating a semantic map ( 752 ).
  • the method 750 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 754 ).
  • the method 750 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 756 ).
  • the method 750 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 758 ).
  • the method 750 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information ( 760 ).
  • the method 750 includes storing, in an SDR database, each of the generated SDRs ( 762 ).
  • the method 750 includes receiving, by a topic slicing module executing on a second computing device, from a third computing device associated with an enterprise search system, a second set of documents ( 764 ).
  • the method 750 includes generating, by the representation generator, a compound SDR for each sentence in each of the second set of documents ( 766 ).
  • the method 750 includes determining, by a similarity engine executing on the second computing device, a distance between a first compound SDR of a first sentence and a second compound SDR of a second sentence ( 768 ).
  • the method 750 includes generating, by the topic slicing module, a second document including the first sentence and a third document including the second sentence, based on the determined distance ( 770 ).
  • the method 750 includes transmitting, by the topic slicing module, to the third computing device, the second document and the third document ( 772 ).
  • ( 752 )-( 762 ) are performed as described above in connection with FIG. 2 , ( 202 )-( 214 ).
  • the method 750 includes receiving, by a topic slicing module executing on a second computing device, from a third computing device associated with an enterprise search system, a second set of documents ( 764 ).
  • the topic slicing module 702 receives the second set of documents for processing to create a version of the second set of documents optimized for indexing by the enterprise search system, which may be a conventional search system.
  • the topic slicing module 702 receives the second set of documents for processing to create a version of the second set of documents optimized for indexing by a search system provided by the system 700 , as will be described in greater detail below in connection with FIGS. 9A-9B .
  • the received second set of documents includes one or more XML documents.
  • the third computing device may have converted one or more enterprise documents into XML documents for improved indexing.
  • the method 750 includes generating, by the representation generator, a compound SDR for each sentence in each of the second set of documents ( 766 ).
  • where a phrase, sentence, paragraph, or other combination of data items needs to be converted into a single SDR that reflects the “union property” of the individual SDRs (e.g., the combination of the SDRs of each word in a sentence), the system 100 may convert each individual data item into its SDR (by generating dynamically or by retrieving the previously generated SDR) and use a binary OR operation to form a single compound SDR from the individual SDRs; the result may be sparsified by the sparsifying module 116 .
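  • Forming a compound SDR with a binary OR can be sketched as a set union over the individual SDRs. The sparsification rule shown here (keeping the positions set by the most individual SDRs, with ties broken by lower position) is an assumption made for illustration and is not a description of the sparsifying module 116 ; the function name and `max_bits` parameter are likewise hypothetical.

```python
from collections import Counter


def compound_with_or(term_sdrs, max_bits=None):
    """OR together term SDRs (sets of set-bit positions), optionally sparsifying."""
    counts = Counter()
    for sdr in term_sdrs:
        counts.update(sdr)  # count how many term SDRs set each position
    union = set(counts)     # binary OR: every position set by at least one term
    if max_bits is None or len(union) <= max_bits:
        return union
    # Assumed sparsification: keep the positions shared by the most terms.
    ranked = sorted(counts, key=lambda position: (-counts[position], position))
    return set(ranked[:max_bits])


# Toy term SDRs (hypothetical values).
term_sdrs = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 5})]
full = compound_with_or(term_sdrs)
sparse = compound_with_or(term_sdrs, max_bits=3)
```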
  • the method 750 includes determining, by a similarity engine executing on the second computing device, a distance between a first compound SDR of a first sentence and a second compound SDR of a second sentence ( 768 ).
  • the similarity engine determines the distance as described above in connection with FIG. 4 ( 416 ).
  • the method 750 includes generating, by the topic slicing module, a second document including the first sentence and a third document including the second sentence, based on the determined distance ( 770 ).
  • the topic slicing module may determine that the distance determined by the similarity engine exceeds a threshold for similarity and that the second sentence therefore relates to a different topic than the first sentence and so should go into a different document (or other data structure).
  • the similarity engine provides the topic slicing module 702 with an identification of a level of similarity between the first sentence and the second sentence, based on the determined distance (as described above in connection with FIG.
  • the topic slicing module 702 determines that the level of similarity does not satisfy a threshold level of similarity and determines to put the second sentence in a different document than the first sentence. In contrast, in other embodiments, the topic slicing module 702 decides that the determined distance (and/or level of similarity) satisfies a similarity threshold and that the first sentence and the second sentence are topically similar and should remain together in a single document.
  • the method includes repeating the comparison between the second sentence and a subsequent sentence. In another embodiment, the method includes iterating through the document, repeating comparisons between sentences until reaching the end of the document.
  • the method 750 includes transmitting, by the topic slicing module, to the third computing device, the second document and the third document ( 772 ).
  • a flow diagram depicts one embodiment of a method 850 for extracting keywords from text documents.
  • the method 850 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion, generating a semantic map ( 852 ).
  • the method 850 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 854 ).
  • the method 850 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 856 ).
  • the method 850 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 858 ).
  • the method 850 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information ( 860 ).
  • the method 850 includes storing, in an SDR database, each of the generated SDRs ( 862 ).
  • the method 850 includes receiving, by a keyword extraction module executing on a second computing device, from a third computing device associated with a full-text search system, a document from a second set of documents ( 864 ).
  • the method 850 includes generating, by the representation generator, at least one SDR for each term in the received document ( 866 ).
  • the method 850 includes generating, by the representation generator, a compound SDR for the received document, based on the generated at least one SDR ( 868 ).
  • the method 850 includes selecting, by the keyword extraction module, a plurality of term SDRs that, when compounded, create a compound SDR that has a level of semantic similarity to the compound SDR for the document, the level of semantic similarity satisfying a threshold ( 870 ).
  • the method 850 includes modifying, by the keyword extraction module, a keyword field of the received document to include the plurality of terms ( 872 ).
  • the method 850 includes transmitting, by the keyword extraction module, to the third computing device, the modified document ( 874 ).
  • ( 852 )-( 862 ) are performed as described above in connection with FIG. 2 , ( 202 )-( 214 ).
  • the system 800 includes the elements and provides the functionality described above in connection with FIGS. 1A-1C and FIG. 3 above.
  • the system 800 further includes a keyword extraction module 802 .
  • the keyword extraction module 802 receives documents, directs the generation of SDRs for the received documents, identifies keywords for the received documents, and modifies the received documents to include the identified keywords.
  • the keyword extraction module 802 is in communication with an enterprise search system provided by a third party.
  • the keyword extraction module 802 is implemented at least in part as a software program.
  • the keyword extraction module 802 is implemented at least in part as a hardware module.
  • the keyword extraction module 802 executes on the machine 102 b.
  • the method 850 includes receiving, by a keyword extraction module executing on a second computing device, from a third computing device associated with a full-text search system, a document from a second set of documents ( 864 ).
  • the keyword extraction module 802 receives the documents as described at FIG. 7 ( 764 ), in connection with the topic slicing module 702 .
  • the method 850 includes generating, by the representation generator, at least one SDR for each term in the received document ( 866 ).
  • the keyword extraction module 802 transmits each term in the received document to the representation generator 114 to generate the at least one SDR.
  • the keyword extraction module 802 transmits each term in the received document to the fingerprinting module 302 for generation of the at least one SDR.
  • the keyword extraction module 802 transmits the document to the fingerprinting module 302 with a request for generation of compound SDRs for each sentence in the document. In other embodiments, the keyword extraction module 802 transmits the document to the representation generator 114 with a request for generation of compound SDRs for each sentence in the document.
  • the method 850 includes generating, by the representation generator, a compound SDR for the document ( 868 ).
  • the keyword extraction module 802 requests generation of the compound SDR from the representation generator 114 .
  • the keyword extraction module 802 requests generation of the compound SDR from the fingerprinting module 302 .
  • the method 850 includes selecting, by the keyword extraction module, a plurality of term SDRs that, when compounded, create a compound SDR that has a level of semantic similarity to the compound SDR for the document, the level of semantic similarity satisfying a threshold ( 870 ).
  • the keyword extraction module 802 directs the similarity engine 304 to compare the compound SDR for the document with the SDRs for a plurality of terms (“term SDRs”) and to generate an identification of a level of similarity between the plurality of terms and the document itself.
  • the keyword extraction module 802 identifies the plurality of terms that satisfies the threshold by having the similarity engine iterate through combinations of term SDRs, generate comparisons with the compound SDR for the document, and return an enumeration of a level of semantic similarity between the document and each combination of terms.
  • the keyword extraction module 802 identifies a plurality of terms having a level of semantic similarity to the document that satisfies the threshold and that also contains the fewest terms possible.
  • the method 850 includes modifying, by the keyword extraction module, a keyword field of the received document to include the plurality of terms ( 872 ).
  • the received document may be a structured document, such as an XML document, and may have a section within which the keyword extraction module 802 may insert the plurality of terms.
  • the method 850 includes transmitting, by the keyword extraction module, to the third computing device, the modified document ( 874 ).
  • enterprise search systems may include implementations of conventional search systems, including those described in connection with the full-text search system 122 described above (e.g., Lucene-based systems, open source systems such as Xapian, commercial systems such as Autonomy IDOL or COGITO, and the other systems listed in detail above).
  • the phrases “enterprise search system” and “full-text search system” may be used interchangeably herein.
  • the methods and systems described in FIGS. 6-8 describe enhancements to such enterprise systems; that is, by implementing the methods and systems described herein, an entity making such an enterprise system available may enhance the available functionality—making indexing more efficient by adding keywords, expanding query terms for users and automatically providing them to the existing system, etc.
  • entities making search systems available to their users may wish to go further than enhancing certain aspects of their existing systems by replacing the systems entirely, or seeking to implement an improved search system in the first instance. In some embodiments, therefore, an improved search system is provided.
  • Referring to FIG. 9A, a block diagram depicts one embodiment of a system 900 for implementing a full-text search system 902 .
  • the system 900 includes the functionality described above in connection with FIGS. 1A-1C, 3, 6A, 7A, and 8A .
  • the search system 902 includes the query module 601 , which may be provided as described above in connection with FIGS. 6A-6B .
  • the search system 902 includes a document fingerprint index 920 ; the document fingerprint index 920 may be a version of the SDR database 120 .
  • the document fingerprint index 920 may also include metadata (e.g., tags).
  • the search system 902 may include a document similarity engine 304 b ; for example, the document similarity engine 304 b may be a copy of the similarity engine 304 that is refined over time for working with the search system 902 .
  • the search system 902 includes an indexer 910 , which may be provided as either a hardware module or a software module.
  • a method 950 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion, generating a semantic map ( 952 ).
  • the method 950 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 954 ).
  • the method 950 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 956 ).
  • the method 950 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 958 ).
  • the method 950 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information ( 960 ).
  • the method 950 includes storing, in an SDR database, each of the generated SDRs ( 962 ).
  • the method 950 includes receiving, by a full-text search system executing on a second computing device, a second set of documents ( 964 ).
  • the method 950 includes generating, by the representation generator, at least one SDR for each document in the second set of documents ( 966 ).
  • the method 950 includes storing, by an indexer in the full-text search system, each generated SDR in a document fingerprint index ( 968 ).
  • the method 950 includes receiving, by a query module in the search system, from a third computing device, at least one search term ( 970 ).
  • the method 950 includes querying, by the query module, the document fingerprint index, for at least one term in the document fingerprint index having an SDR similar to an SDR of the received at least one search term ( 972 ).
  • the method 950 includes providing, by the query module, to the third computing device, a result of the query ( 974 ).
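  • by way of illustration only, steps ( 956 )-( 962 ) can be sketched as follows. The sketch assumes a small rectangular semantic map, whitespace tokenization, and a simple rule that sets one bit per map cell holding a document in which the term occurs often enough. The names `occurrence_info` and `term_sdr`, the 4x4 map, and the `min_count` parameter are hypothetical and do not reflect the thresholding actually used by the representation generator.

```python
from collections import Counter, defaultdict

def occurrence_info(docs, coords):
    """For each term, record per-document occurrence counts and coordinates.

    docs   -- {doc_id: text}
    coords -- {doc_id: (x, y)} coordinate pair assigned by the semantic map
    The number of documents containing a term is len(info[term]).
    """
    info = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            info[term][doc_id] = (count, coords[doc_id])
    return info

def term_sdr(term_info, width, min_count=1):
    """Assumed rule: set one bit per map cell holding a document in which
    the term occurs at least `min_count` times."""
    return frozenset(
        y * width + x
        for count, (x, y) in term_info.values()
        if count >= min_count
    )

docs = {
    "d1": "heart rate heart",
    "d2": "heart valve",
    "d3": "stock market",
}
coords = {"d1": (0, 0), "d2": (1, 0), "d3": (3, 3)}  # hypothetical 4x4 map
info = occurrence_info(docs, coords)
sdr_db = {term: term_sdr(ti, width=4) for term, ti in info.items()}
```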
  • the method 950 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion, generating a semantic map ( 952 ).
  • the set of data documents are selected and the clustering occurs as described above in connection with FIG. 2 ( 202 ).
  • a training process occurs.
  • the reference map generator 106 is trained using at least one set of data documents (more specifically, using the document vectors of each document in the set of data documents).
  • the semantic resolution of a set of documents refers to how many positions are available based on the training data, which in some aspects reflects the nature of the training data (colloquially, this might be referred to as how much “real estate” is available on the map).
  • different or additional training documents may be used. There are, therefore, several different approaches to training the reference map generator 106 .
  • a generic training corpus may be used when generating SDRs for each term received (e.g., terms within enterprise documents); one advantage to such an approach is that the corpus has likely been selected to satisfy one or more training criteria but a disadvantage is that the corpus may or may not have sufficient words to support a specialized enterprise corpus (e.g., a highly technical corpus including a number of terms that have particular meanings within a specialty or practice).
  • a set of enterprise documents may be used as the training corpus; one advantage to this approach is that the documents used for training will include any highly technical or otherwise specialized terms common within the enterprise but a disadvantage is that the enterprise documents may not satisfy the training criteria (e.g., there may not be enough documents, they may be of insufficient length or diversity, etc.).
  • a generic training corpus and an enterprise corpus are combined for training purposes.
  • a special set of technical documents is identified and processed for use as a training corpus; for example, these documents may include key medical treatises, engineering specifications, or other key reference materials in specialties relevant to the enterprise documents that will be used.
  • a reference corpus may be processed and used for training and then the resulting engine 101 may use the trained database, separately licensed to enterprises seeking to implement the methods and systems described herein.
  • These embodiments are equally applicable to the embodiments discussed in connection with FIGS. 6-8 as to those with FIGS. 9A-B .
  • the method 950 includes receiving, by a full-text search system executing on a second computing device, a second set of documents ( 964 ).
  • the second set of documents includes enterprise documents (e.g., documents generated by, maintained by, accessed by, or otherwise associated with an enterprise seeking to implement the full-text search system 902 ).
  • the search system 902 makes one or more enterprise documents searchable. To do so, the search system 902 indexes the one or more enterprise documents.
  • the search system 902 directs the preprocessing of the enterprise documents (e.g., by having the topic slicing module 702 and/or the keyword extraction module 802 process the documents as described above in connection with FIGS. 7B and 8B ).
  • the search system 902 directs the generation of an SDR for each of the documents based on the training corpus (as described above in connection with FIGS. 1-2 ).
  • the search system 902 has enabled a search process wherein a query term is received (e.g., by the query input processing module 607 ), an SDR is generated for the query term, and the query SDR is compared to the indexed SDRs.
  • the method 950 includes generating, by the representation generator, at least one SDR for each document in the second set of documents ( 966 ).
  • the search system 902 includes functionality for transmitting the documents to the fingerprinting module 302 for generation of the at least one SDR.
  • the search system 902 includes functionality for transmitting the documents to the representation generator 114 for generation of the at least one SDR.
  • the at least one SDR may include, by way of example, and without limitation, an SDR for each term in the document, a compound SDR for subsections of the document (e.g., sentences or paragraphs), and a compound SDR for the document itself.
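  • by way of illustration only, the three kinds of SDR listed above (per term, per sentence, and per document) can be sketched as follows. The sketch assumes that compounding counts how often each bit is set across the input SDRs and keeps only the most frequent bits so that the result stays sparse, and that the document SDR is compounded from the sentence SDRs. The names `compound_sdr` and `fingerprint_document` and the `max_bits` parameter are hypothetical.

```python
from collections import Counter

def compound_sdr(sdrs, max_bits):
    """Assumed compounding: count how often each bit is set across the
    inputs and keep the `max_bits` most frequent (ties broken by index)."""
    counts = Counter(bit for sdr in sdrs for bit in sdr)
    ranked = sorted(counts, key=lambda bit: (-counts[bit], bit))
    return frozenset(ranked[:max_bits])

def fingerprint_document(text, term_sdrs, max_bits=4):
    """Return term SDRs, sentence SDRs, and a document SDR for one document."""
    sentences = [s.split() for s in text.lower().split(".") if s.strip()]
    per_term = {t: term_sdrs[t] for s in sentences for t in s if t in term_sdrs}
    per_sentence = [
        compound_sdr([term_sdrs[t] for t in s if t in term_sdrs], max_bits)
        for s in sentences
    ]
    whole = compound_sdr(per_sentence, max_bits)
    return per_term, per_sentence, whole

term_sdrs = {  # hypothetical entries from the SDR database
    "heart": frozenset({0, 1, 2}),
    "valve": frozenset({1, 2, 3}),
    "surgery": frozenset({2, 3, 7}),
}
terms, sentences, doc = fingerprint_document(
    "Heart valve. Valve surgery.", term_sdrs, max_bits=3
)
```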
  • the method 950 includes storing, by an indexer in the full-text search system, each generated SDR in a document fingerprint index ( 968 ).
  • the generated SDRs are stored in the document fingerprint index 920 in a substantially similar manner as the manner in which SDRs were stored in the SDR database 120 , discussed above.
  • the method 950 includes receiving, by a query module in the search system, from a third computing device, at least one search term ( 970 ).
  • the query module receives the search term as described above in connection with FIGS. 6A-6B .
  • the method 950 includes querying, by the query module, the document fingerprint index, for at least one term in the document fingerprint index having an SDR similar to an SDR of the received at least one search term ( 972 ).
  • the query module 601 queries the document fingerprint index 920 .
  • the query module 601 directs the document similarity engine 304 b to identify the SDR of the at least one term in the document fingerprint index 920 .
  • the query module 601 directs the similarity engine 304 executing on the machine 102 b to identify the term.
  • the query module 601 executes the search as described above in connection with FIGS. 6A-6B , although instead of sending the query to an external enterprise search system, the query module 601 sends the query to components within the system 900 .
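  • by way of illustration only, the query in ( 972 ) can be sketched as a comparison of the query SDR against every SDR in the document fingerprint index. The sketch assumes that SDRs are sets of active bit positions and that similarity is the number of shared bits divided by the size of the smaller SDR. The names `overlap` and `query_index`, the example index contents, and the default threshold are hypothetical.

```python
def overlap(a, b):
    """Assumed similarity: shared active bits over the smaller SDR's size."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def query_index(index, query_sdr, threshold=0.5):
    """Return (doc_id, score) pairs at or above `threshold`, best first."""
    hits = [(doc_id, overlap(sdr, query_sdr)) for doc_id, sdr in index.items()]
    hits = [h for h in hits if h[1] >= threshold]
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical contents of a document fingerprint index.
fingerprint_index = {
    "cardiology.txt": frozenset({0, 1, 2, 3}),
    "finance.txt": frozenset({10, 11, 12}),
    "anatomy.txt": frozenset({2, 3, 4, 5}),
}
query_sdr = frozenset({1, 2, 3})  # SDR generated for the search term
results = query_index(fingerprint_index, query_sdr)
```

  • a linear scan is used here for clarity; an index serving many documents would likely need an inverted structure keyed by bit position.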
  • the method 950 includes providing, by the query module, to the third computing device, a result of the query ( 974 ).
  • the query module 601 first ranks the results or directs another module to rank the results. Ranking may implement conventional ranking techniques. Alternatively, ranking may include execution of the methods described in connection with FIGS. 11A-B below.
  • the full-text search system 902 provides a user interface (not shown) with which a user may provide feedback on the query results.
  • the user interface includes a user interface element with which the user may specify whether the result was useful.
  • the user interface includes a user interface element with which the user may provide an instruction to the query module 601 to execute a new search using one of the query results.
  • the user interface includes a user interface element with which the user may specify that they have an interest in a topic related to one of the query results and wish to store an identifier of the query result and/or the related topic for future reference by either the user or the system 900 .
  • a system may provide functionality for monitoring the types of searches a user executes and developing a profile for the user based on analysis of the SDRs of the search terms the user provided.
  • the profile may identify a level of expertise of the user and may be provided to other users.
  • Referring to FIGS. 10A and 10B, block diagrams depict embodiments of systems for matching user expertise with requests for user expertise, based on previous search results.
  • FIG. 10A depicts an embodiment in which functionality for developing user expertise profiles (e.g., user expertise profile module 1010 ) is provided in conjunction with a conventional full-text search system.
  • FIG. 10B depicts an embodiment in which functionality for developing user expertise profiles (e.g., user expertise profile module 1010 ) is provided in conjunction with the full-text search system 902 .
  • Each of the modules depicted in FIGS. 10A-B may be provided as either hardware modules or software modules.
  • a flow diagram depicts an embodiment of a method 1050 for matching user expertise with requests for user expertise, based on previous search results.
  • the method 1050 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion, generating a semantic map ( 1052 ).
  • the method 1050 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 1054 ).
  • the method 1050 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 1056 ).
  • the method 1050 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 1058 ).
  • the method 1050 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information ( 1060 ).
  • the method 1050 includes storing each of the generated SDRs in an SDR database ( 1062 ).
  • the method 1050 includes receiving, by a query module executing on a second computing device, from a third computing device, at least one term ( 1064 ).
  • the method 1050 includes storing, by a user expertise profile module executing on the second computing device, an identifier of a user of the third computing device and the at least one term ( 1066 ).
  • the method 1050 includes generating, by the representation generator, an SDR of the at least one term ( 1068 ).
  • the method 1050 includes receiving, by the user expertise profile module, from a fourth computing device, a second term and a request for an identification of a user who is associated with a similar term ( 1070 ).
  • the method 1050 includes identifying, by a similarity engine, a level of semantic similarity between the SDR of the at least one term and an SDR of the second term ( 1072 ).
  • the method 1050 includes providing, by the user expertise profile module, to the fourth computing device, the identifier of the user of the third computing device ( 1074 ).
  • ( 1052 )-( 1062 ) are performed as described above in connection with FIG. 2 , ( 202 )-( 214 ).
  • the method 1050 includes receiving, by a query module executing on a second computing device, from a third computing device, at least one term ( 1064 ).
  • the query module 601 receives the at least one term and executes the query as described above in connection with FIGS. 6A-C and 9 A-B.
  • the method 1050 includes storing, by a user expertise profile module executing on the second computing device, an identifier of a user of the third computing device and the at least one term ( 1066 ).
  • the user profile module 1002 receives the identifier of the user and the at least one term from the query input processing module 607 .
  • the user expertise profile module 1010 receives the identifier of the user and the at least one term from the query input processing module 607 .
  • the user expertise profile module 1010 stores the identifier of the user and the at least one term in a database.
  • the user expertise profile module 1010 stores the identifier of the user and the at least one term in the user expertise SDR database 1012 (e.g., with an SDR of the at least one term).
  • the method includes logging queries that are received from users with user identifiers and SDRs for each query term(s).
  • the user profile module 1002 also includes functionality for receiving an identification of search results that the querying user indicated were relevant or otherwise of interest to the querying user.
  • the method 1050 includes generating, by the representation generator, an SDR of the at least one term ( 1068 ).
  • the user expertise profile module 1010 transmits the at least one data item to the fingerprinting module 302 for generation of the SDR.
  • the user expertise profile module 1010 transmits the at least one term to the representation generator 114 for generation of the SDR.
  • the user expertise profile module 1010 receives a plurality of data items as the user continues to make queries over time.
  • the user expertise profile module 1010 directs the generation of a compound SDR that combines an SDR of a first query term with an SDR of a second query term; the resulting compound SDR more accurately reflects the types of queries that the user makes, and the more term SDRs that are added to the compound SDR over time, the more accurately the compound SDR will reflect an area of expertise of the user.
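  • by way of illustration only, logging queries per user and folding each query term's SDR into a growing compound SDR can be sketched as follows. The sketch assumes that the compound SDR keeps the bits set most often across the user's query terms. The class `ExpertiseProfiles` is a hypothetical stand-in for the user expertise SDR database, and its method names and `max_bits` parameter are likewise assumptions.

```python
from collections import Counter, defaultdict

class ExpertiseProfiles:
    """Hypothetical stand-in for the user expertise SDR database."""

    def __init__(self, max_bits=8):
        self.max_bits = max_bits
        self.bit_counts = defaultdict(Counter)  # user_id -> bit -> count
        self.terms = defaultdict(list)          # user_id -> logged query terms

    def log_query(self, user_id, term, term_sdr):
        """Store the user identifier and term, and fold the term's SDR
        into the user's running bit counts."""
        self.terms[user_id].append(term)
        self.bit_counts[user_id].update(term_sdr)

    def profile_sdr(self, user_id):
        """Assumed compounding: keep the user's most frequently set bits."""
        counts = self.bit_counts[user_id]
        ranked = sorted(counts, key=lambda bit: (-counts[bit], bit))
        return frozenset(ranked[: self.max_bits])

profiles = ExpertiseProfiles(max_bits=3)
profiles.log_query("alice", "valve", frozenset({1, 2, 3}))
profiles.log_query("alice", "stent", frozenset({2, 3, 4}))
profiles.log_query("alice", "bond", frozenset({9}))
alice_sdr = profiles.profile_sdr("alice")
```

  • in this sketch the single off-topic query ("bond") does not displace the bits shared by the two related queries, which is one way the compound SDR could come to reflect an area of expertise as queries accumulate.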
  • the method 1050 includes receiving, by the user expertise profile module, from a fourth computing device, a second term and a request for an identification of a user associated with a similar term ( 1070 ).
  • the request for the identification of the user associated with a similar data item is explicit.
  • the user expertise profile module 1010 automatically provides the identification as a service to the user of the fourth computing device.
  • a user of the fourth computing device performing a search for documents similar to query terms in a white paper the user is authoring may request (or be provided with an option to receive) an identification of other users who have developed an expertise in topics similar to the chosen query terms.
  • this functionality allows users to identify those who have developed an expertise in a particular topic, regardless of whether that expertise is part of their official title, job description, or role, making information readily available that was previously difficult to discern based only on official data or word of mouth or a personal connection. Since multiple areas of expertise (e.g., multiple SDRs based on one or more query terms) may be associated with a single user, information is available about primary as well as secondary areas of expertise; for example, although an individual may officially focus on a first area of research, the individual may perform a series of queries over the course of a week as they research a potential extension of their work into a second area of research and the expertise gained in even that limited period of time may be useful to another user. As another example, an individual seeking to build a team or structure (or restructure) an organization based on actual areas of interest may leverage the functionality of the user expertise profile module 1010 to identify users who have expertise relevant to the needs of the individual.
  • the method 1050 includes identifying, by a similarity engine, a level of semantic similarity between the SDR of the at least one term and an SDR of the second term ( 1072 ).
  • the similarity engine 304 executes on the second machine 102 b .
  • the similarity engine 304 is provided by and executes within a search system 902 . Having received the query term from the user seeking to identify an individual having an area of expertise, the user expertise profile module 1010 may direct the similarity engine 304 to identify other users from the user expertise SDR database 1012 that satisfy the request.
  • the method 1050 includes providing, by the user expertise profile module, to the fourth computing device, the identifier of the user of the third computing device ( 1074 ).
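  • by way of illustration only, steps ( 1070 )-( 1074 ) can be sketched as a comparison of the SDR of the second term against each stored user profile SDR, returning the identifiers of users whose profiles are sufficiently similar. The sketch assumes Jaccard overlap as the similarity measure. The names `jaccard` and `find_experts`, the example profiles, and the default threshold are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity measure: Jaccard overlap of active bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_experts(profile_sdrs, request_sdr, threshold=0.3):
    """Return (user_id, score) pairs at or above `threshold`, best first."""
    scored = [(uid, jaccard(sdr, request_sdr)) for uid, sdr in profile_sdrs.items()]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical contents of the user expertise SDR database.
profile_sdrs = {
    "alice": frozenset({1, 2, 3, 4}),
    "bob": frozenset({10, 11, 12}),
    "carol": frozenset({3, 4, 5, 6}),
}
request_sdr = frozenset({2, 3, 4})  # SDR of the second term
experts = find_experts(profile_sdrs, request_sdr)
```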
  • a user of the methods and systems described herein may provide an identification of a preference regarding query terms.
  • a first user seeking to do a search on a query term may be interested in documents that relate to legal aspects of the query term—for example, uses of the query term or terms like it in court cases, patent applications, published licenses, or other legal documents—while a second user seeking to do a search on the same query term may be interested in documents that relate to scientific aspects of the query term—for example, uses of the query term or of terms like it in white papers, research publications, grant applications or other scientific documents.
  • the systems described herein provide functionality for identifying such preferences and ranking search results according to which results are closest (based on SDR analyses) to the type of document preferred by the searcher.
  • Referring to FIGS. 10A and 10B, block diagrams depict embodiments of systems for semantic ranking of query results received from an enterprise search system based on user preferences.
  • FIG. 10A depicts an embodiment in which functionality for semantic ranking is provided in conjunction with results from a conventional enterprise search system.
  • FIG. 10B depicts an embodiment in which functionality for semantic ranking is provided in conjunction with results from a search system 902 .
  • a flow diagram depicts one embodiment of a method 1080 for user profile-based semantic ranking of query results received from a full-text search system.
  • the method 1080 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion, generating a semantic map ( 1081 ).
  • the method 1080 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 1082 ).
  • the method 1080 includes generating, by a parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 1083 ).
  • the method 1080 includes determining, by a representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 1084 ).
  • the method 1080 includes generating, by the representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information ( 1085 ).
  • the method 1080 includes storing each of the generated SDRs in an SDR database ( 1086 ).
  • the method 1080 includes receiving, by a query module executing on a second computing device, from a third computing device, a first term and a plurality of preference documents ( 1087 ).
  • the method 1080 includes generating, by the representation generator, a compound SDR using the plurality of preference documents ( 1088 ).
  • the method 1080 includes transmitting, by the query module, to a full-text search system, a query for an identification of each of a set of results documents similar to the first term ( 1089 ).
  • the method 1080 includes generating, by the representation generator, an SDR for each of the documents identified in the set of results documents ( 1090 ).
  • the method 1080 includes determining, by a similarity engine, a level of semantic similarity between each SDR generated for each of the set of results documents and the compound SDR ( 1091 ).
  • the method 1080 includes modifying, by a ranking module executing on the second computing device, an order of at least one document in the set of results documents, based on the determined level of semantic similarity ( 1092 ).
  • the method 1080 includes providing, by the query module, to the third computing device, the identification of each of the set of results documents in the modified order ( 1093 ).
  • ( 1081 )-( 1086 ) are performed as described above in connection with FIG. 2 , ( 202 )-( 214 ).
  • the method 1080 includes receiving, by a query module executing on a second computing device, from a third computing device, a first term and a plurality of preference documents ( 1087 ).
  • the query input processing module 607 receives the first term as described above in connection with FIGS. 6A-B and 9 A-B.
  • the query input processing module 607 provides a user interface element (not shown) allowing a user of the third computing device to provide (e.g., upload) one or more preference documents.
  • Preference documents may be any type or form of data structure including one or more data items representative of a type of document the searching user is interested in.
  • a scientific researcher could provide a number of research documents that reflect the style and/or content of the type of documents the scientific researcher would consider relevant or preferable given her search objectives.
  • a lawyer could provide a number of legal documents that reflect the style and/or content of the type of documents the lawyer would consider relevant or preferable given her search objectives.
  • the system may provide functionality allowing a user to provide different sets of preference documents with different searches, allowing the user to create different preference profiles for use with different searches at different times—for example, a different preference profile may be relevant for a scientific search focused on a first topic of research than would be relevant for a scientific search focused on a second, different topic.
  • the method 1080 includes generating, by the representation generator, a compound SDR using the plurality of preference documents ( 1088 ).
  • the user preference module 1004 directs the generation of the compound SDR.
  • the user preference module 1004 may transmit the preference documents to the fingerprinting module 302 for generation of the compound SDR.
  • the user preference module 1004 may transmit the preference documents to the representation generator 114 for generation of the compound SDR.
  • the compound SDR that combines the SDRs of individual preference documents may be generated in the same way that compound SDRs of individual documents are generated from term SDRs.
  • the user preference module 1004 may store the generated compound SDR in the user preference SDR database 1006 .
  • the method 1080 includes transmitting, by the query module, to a full-text search system, a query for an identification of each of a set of results documents similar to the first term ( 1089 ).
  • the query module 601 may transmit the query to an external enterprise search system as described in connection with FIGS. 6A-B .
  • the query module 601 may transmit the query to a search system 902 as described above in connection with FIGS. 9A-B .
  • the method 1080 includes generating, by the representation generator, an SDR for each of the documents identified in the set of results documents ( 1090 ).
  • the user preference module 1004 receives the set of results documents from the search system (either the search system 902 or the third-party enterprise search system).
  • the user preference module 1004 directs the similarity engine 304 to generate the SDRs for each of the received results documents.
  • the method 1080 includes determining, by a similarity engine, a level of semantic similarity between each SDR generated for each of the set of results documents and the compound SDR ( 1091 ).
  • the similarity engine executes on the second machine 102 b .
  • the similarity engine is provided by and executes within a search system 902 .
  • the user preference module 1004 directs the similarity engine 304 to identify the level of similarity.
  • the user preference module 1004 receives the level of similarity from the similarity engine 304 .
  • the method 1080 includes modifying, by a ranking module executing on the second computing device, an order of at least one document in the set of results documents, based on the determined level of semantic similarity ( 1092 ).
  • the similarity engine 304 may have indicated that a result included as the fifth document in the set of results documents has a higher level of similarity to the compound SDR of the plurality of preference documents than the first four documents.
  • the user preference module 1004 may then move the fifth document (or an identification of the fifth document) to the first position.
  • the method 1080 includes providing, by the query module, to the third computing device, the identification of each of the set of results documents in the modified order ( 1093 ).
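  • by way of illustration only, steps ( 1091 )-( 1093 ) can be sketched as a stable re-sort of the results documents by their similarity to the compound SDR of the preference documents. The sketch assumes Jaccard overlap as the similarity measure. The names `similarity` and `rerank` and the example SDRs are hypothetical.

```python
def similarity(a, b):
    """Assumed similarity measure: Jaccard overlap of active bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def rerank(results, result_sdrs, preference_sdr):
    """Order results by similarity to the preference SDR, most similar
    first; sorted() is stable, so ties keep their original order."""
    return sorted(
        results,
        key=lambda doc_id: -similarity(result_sdrs[doc_id], preference_sdr),
    )

results = ["r1", "r2", "r3", "r4", "r5"]  # order returned by the search system
result_sdrs = {                            # hypothetical document SDRs
    "r1": frozenset({1, 2, 3, 50}),
    "r2": frozenset({1, 2, 51}),
    "r3": frozenset({1, 52, 53}),
    "r4": frozenset({60}),
    "r5": frozenset({1, 2, 3, 4}),
}
preference_sdr = frozenset({1, 2, 3, 4})   # compound SDR of preference documents
reordered = rerank(results, result_sdrs, preference_sdr)
```

  • here the fifth result matches the preference SDR most closely and moves to the first position, while the remaining results keep their relative order.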
  • the system may personalize search results, taking into account the context of the search in order to select search results likely to be most important to the searcher.
  • instead of returning an arbitrary number of conventionally ranked results (e.g., the first ten results, the first page of results, or another arbitrary number of results), the system could analyze thousands of documents and provide only those that are semantically relevant to the searcher.
  • symptoms of a disease may occur in a patient at a very early phase and a medical professional may identify a clear medical diagnosis.
  • a patient may present with only a subset of symptoms and a medical diagnosis is not yet clearly identifiable; for example, a patient may provide a blood sample from which the values of ten different types of measurements are determined and only one of the measurement types has a pathological value while the other nine may be close to a threshold level but remain in a range of normal values. It may be challenging to identify a clear medical diagnosis in such a case and the patient may be subjected to further testing, additional monitoring, and delayed diagnosis while a medical professional waits to see if the remaining symptoms develop. In such an example, the inability to make an early diagnosis may result in slower treatment and potentially negative impacts on a health outcome for the patient.
  • the system described herein may generate and store SDRs for numerical data items as well as text-based items and identify a level of similarity between an SDR generated for a subsequently-received document and one of the stored SDRs.
  • the system may provide an identification of the data or metadata (e.g., identifying a medical diagnosis associated with a document containing numerical data items) as a result of identifying the level of similarity.
  • a flow diagram depicts one embodiment of a method 1150 for providing medical diagnosis support.
  • the method 1150 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion and associated with a medical diagnosis, generating a semantic map ( 1152 ).
  • the method 1150 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 1154 ).
  • the method 1150 includes generating, by a parser executing on the first computing device, an enumeration of measurements occurring in the set of data documents ( 1156 ).
  • the method 1150 includes determining, by a representation generator executing on the first computing device, for each measurement in the enumeration, occurrence information including: (i) a number of data documents in which the measurement occurs, (ii) a number of occurrences of the measurement in each data document, and (iii) the coordinate pair associated with each data document in which the measurement occurs ( 1158 ).
  • the method 1150 includes generating, by the representation generator, for each measurement in the enumeration a sparse distributed representation (SDR) using the occurrence information ( 1160 ).
  • the method 1150 includes storing, in an SDR database, each of the generated SDRs ( 1162 ).
  • the method 1150 includes receiving, by a diagnosis support module executing on a second computing device, from a third computing device, a document comprising a plurality of measurements, the document associated with a medical patient ( 1164 ).
  • the method 1150 includes generating, by the representation generator, at least one SDR for the plurality of measurements ( 1166 ).
  • the method 1150 includes generating, by the representation generator, a compound SDR for the document, based on the at least one SDR generated for the plurality of measurements ( 1168 ).
  • the method 1150 includes determining, by a similarity engine executing on the second computing device, a level of semantic similarity between the compound SDR generated for the document and an SDR retrieved from the SDR database ( 1170 ).
  • the method 1150 includes providing, by the diagnosis support module, to the third computing device, an identification of the medical diagnosis associated with the SDR retrieved from the SDR database, based on the determined level of semantic similarity ( 1172 ).
  • the method 1150 includes clustering in a two-dimensional metric space, by a reference map generator executing on a first computing device, a set of data documents selected according to at least one criterion and associated with a medical diagnosis, generating a semantic map ( 1152 ).
  • clustering occurs as described above in connection with FIG. 2 .
  • each document in the set of documents includes a plurality of data items, as above.
  • the plurality of data items is a set of lab values taken at one point in time from one sample (e.g., a blood sample from a medical patient); by way of example, the plurality of data items in the document may be provided as a comma-separated list of values.
  • the system may receive 500 documents, one for each of 500 patients, and each document may contain 5 measurements (e.g., 5 values of a type of measurement derived from a single blood sample provided by each patient) and be associated with a medical diagnosis.
  • the system may generate the document vectors as described above in connection with FIG. 2 , using the measurements as data items.
  • the system in FIG. 11A includes the functionality described in connection with FIGS. 1A-C and FIG. 3 . However, the system in FIG. 11A may have a different parser 110 (shown as the lab document parser and pre-processing module 110 b ), optimized for parsing documents containing lab values, and the system may include a binning module 150 for optimizing generation of an enumeration of measurements occurring in the set of data documents as will be discussed in greater detail below.
  • the method 1150 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 1154 ).
  • the generation of a semantic map 108 , the distribution of document vectors onto the semantic map 108 , and the association of coordinate pairs occur as described above in connection with FIG. 2 .
  • each point in the semantic map 108 may represent one or more documents containing lab values for a type of measurement, such as, without limitation, any type of measurement identified from a metabolic panel (e.g., calcium per liter).
  • the method 1150 includes generating, by a parser executing on the first computing device, an enumeration of measurements occurring in the set of data documents ( 1156 ).
  • the measurements are enumerated as described above in connection with FIG. 2 .
  • the system includes a binning module 150 that provides for an optimized process of generating the enumeration.
  • Each document received may contain a plurality of values, each value identifying a value of a type of measurement.
  • a document may contain a value for a level of calcium in blood—the value is a number in the document and “calcium” is the type of the measurement.
  • the values for each type may vary from one document to another.
  • the values for “calcium” type measurements may range from 0.0 to 5.2 mg/liter.
  • in a text document, by contrast, a given word is the same in each document; for example, if two documents contain the word “quick,” the text that forms that word “quick” is the same in each document.
  • two documents could each contain a value for the same type of measurement (e.g., a “calcium” type measurement or a “glucose” type measurement) but have very different values (e.g., 0.1 and 5.2) each of which is a valid value for the type of measurement.
  • the system may identify a range of values for each type of measurement included in the set of documents and provide a user with functionality for distributing the range substantially evenly into sub-groups; such a process may be referred to as binning. Performing the binning ensures a significant amount of overlap among the measurements in a bin.
  • the system may indicate that there are 5000 values for a “calcium” type measurement in a set of documents, indicate that the range of values is from 0.01-5.2, and provide a user with an option to specify how to distribute the values.
  • a user may, for example, specify that values from 0.01-0.3 should be grouped into a first sub-division (also referred to herein as a “bin”), that values from 0.3-3.1 should be grouped into a second sub-division, and that values from 3.1-5.2 should be grouped into a third sub-division.
  • the system may then enumerate how many of the 5000 values fall into each of the three bins and that occurrence information may be used in generating SDRs for each value.
  • the binning module 150 may provide this functionality.
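  • by way of a non-limiting sketch, the binning described above could be expressed as follows. The function names `assign_bin`, `count_per_bin`, and `bin_label`, the half-open treatment of bin boundaries, the token format, and the sample values are assumptions made for illustration; they are not taken from the binning module 150 .

```python
from bisect import bisect_right
from collections import Counter

# User-specified bin edges for a "calcium" type measurement, following the
# example above: 0.01-0.3, 0.3-3.1, and 3.1-5.2.
CALCIUM_EDGES = [0.01, 0.3, 3.1, 5.2]


def assign_bin(value, edges):
    """Return the zero-based index of the bin that contains value.

    Assumption: bins are half-open [low, high), except the last bin,
    which also includes its upper edge.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside the range {edges[0]}-{edges[-1]}")
    index = bisect_right(edges, value) - 1
    return min(index, len(edges) - 2)


def count_per_bin(values, edges):
    """Count how many values fall into each bin."""
    counts = Counter(assign_bin(v, edges) for v in values)
    return [counts.get(i, 0) for i in range(len(edges) - 1)]


def bin_label(measurement_type, value, edges):
    """Build a hypothetical token such as 'calcium#1' for use as a data item."""
    return f"{measurement_type}#{assign_bin(value, edges)}"


sample = [0.05, 0.1, 0.3, 1.7, 3.1, 5.2]
print(count_per_bin(sample, CALCIUM_EDGES))      # [2, 2, 2]
print(bin_label("calcium", 1.7, CALCIUM_EDGES))  # calcium#1
```

  • in this sketch, replacing each raw value by a bin token lets two documents with nearby values share the same data item, which is one way to obtain the overlap among measurements in a bin noted above.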
  • the method 1150 includes determining, by a representation generator executing on the first computing device, for each measurement in the enumeration, occurrence information including: (i) a number of data documents in which the measurement occurs, (ii) a number of occurrences of the measurement in each data document, and (iii) the coordinate pair associated with each data document in which the measurement occurs ( 1158 ).
  • the occurrence information is information as described above in connection with FIG. 2 .
  • the method 1150 includes generating, by the representation generator, for each measurement in the enumeration, a sparse distributed representation (SDR) using the occurrence information ( 1160 ).
  • the SDRs are generated as described above in connection with FIG. 2 .
  • the method 1150 includes storing, in an SDR database, each of the generated SDRs ( 1162 ).
  • the generated SDRs are stored in the SDR database 120 as described above in connection with FIG. 2 .
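  • the occurrence information (i)-(iii) and its use in forming an SDR may be illustrated with the following minimal sketch. The generation procedure described in connection with FIG. 2 is not reproduced in this section, so the data layout, the function names `occurrence_info` and `to_sdr`, the bin tokens, and the rule of keeping only the strongest positions are all assumptions made for illustration.

```python
from collections import Counter


def occurrence_info(documents, coordinates):
    """Collect occurrence information for each data item.

    documents:   dict of document id -> list of data items (e.g., bin tokens)
    coordinates: dict of document id -> (x, y) pair on the semantic map
    """
    info = {}
    for doc_id, items in documents.items():
        for item, count in Counter(items).items():
            entry = info.setdefault(item, {"doc_count": 0, "docs": {}})
            entry["doc_count"] += 1                    # (i) documents containing the item
            entry["docs"][doc_id] = {
                "count": count,                        # (ii) occurrences in this document
                "coord": coordinates[doc_id],          # (iii) coordinate pair of the document
            }
    return info


def to_sdr(entry, max_active=2):
    """Model an SDR as the set of the max_active strongest map positions.

    Assumption: a position's strength is the summed occurrence count of the
    documents located there; ties are broken by coordinate order.
    """
    weight = Counter()
    for doc in entry["docs"].values():
        weight[doc["coord"]] += doc["count"]
    ranked = sorted(weight, key=lambda xy: (-weight[xy], xy))
    return frozenset(ranked[:max_active])


documents = {
    "p1": ["calcium#1", "glucose#0"],
    "p2": ["calcium#1", "glucose#2"],
    "p3": ["calcium#2", "glucose#2"],
}
coordinates = {"p1": (0, 0), "p2": (0, 1), "p3": (5, 5)}

info = occurrence_info(documents, coordinates)
sdr_database = {item: to_sdr(entry) for item, entry in info.items()}
print(info["calcium#1"]["doc_count"])     # 2
print(sorted(sdr_database["calcium#1"]))  # [(0, 0), (0, 1)]
```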
  • the method 1150 includes receiving, by a diagnosis support module executing on a second computing device, from a third computing device, a document comprising a plurality of measurements, the document associated with a medical patient ( 1164 ).
  • the diagnosis support module 1100 receives the document from a client 102 c.
  • the method 1150 includes generating, by the representation generator, at least one SDR for the plurality of measurements ( 1166 ).
  • the diagnosis support module 1100 directs the fingerprinting module 302 to generate the SDR as described above in connection with FIGS. 1-3 .
  • the diagnosis support module 1100 directs the representation generator 114 to generate the SDR as described above in connection with FIGS. 1-3 .
  • the method 1150 includes generating, by the representation generator, a compound SDR for the document, based on the at least one SDR generated for the plurality of measurements ( 1168 ).
  • the diagnosis support module 1100 directs the fingerprinting module 302 to generate the compound SDR as described above in connection with FIGS. 1-3 .
  • the diagnosis support module 1100 directs the representation generator 114 to generate the compound SDR as described above in connection with FIGS. 1-3 .
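  • one hedged way to picture the compound SDR is as an aggregation of the SDRs generated for the individual measurements. The sketch below assumes that each SDR is a set of active map positions and that the compound SDR keeps the positions shared by the most inputs; this aggregation rule and the name `compound_sdr` are illustrative assumptions, not the procedure described in connection with FIGS. 1-3 .

```python
from collections import Counter


def compound_sdr(sdrs, max_active=3):
    """Aggregate several SDRs into one by keeping the most shared positions.

    Assumption: positions are ranked by how many input SDRs activate them,
    with ties broken by coordinate order so the result is deterministic.
    """
    votes = Counter(pos for sdr in sdrs for pos in sdr)
    ranked = sorted(votes, key=lambda pos: (-votes[pos], pos))
    return frozenset(ranked[:max_active])


# Hypothetical SDRs for three measurements found in one patient document.
calcium = {(0, 0), (0, 1), (2, 2)}
glucose = {(0, 1), (2, 2), (7, 7)}
sodium = {(0, 1), (4, 4)}

document_sdr = compound_sdr([calcium, glucose, sodium])
print(sorted(document_sdr))  # [(0, 0), (0, 1), (2, 2)]
```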
  • the method 1150 includes determining, by a similarity engine executing on the second computing device, a level of semantic similarity between the compound SDR generated for the document and an SDR retrieved from the SDR database ( 1170 ).
  • the diagnosis support module 1100 directs the similarity engine 304 to determine the level of semantic similarity as described above in connection with FIGS. 3-5 .
  • the method 1150 includes providing, by the diagnosis support module, to the third computing device, an identification of the medical diagnosis associated with the SDR retrieved from the SDR database, based on the determined level of semantic similarity ( 1172 ).
  • a system can detect an approaching medical diagnosis, even when the individual measurements have not yet reached pathological levels.
  • the system can identify changes in a patient's pattern, thus capturing even dynamic processes. For example, a pre-cancer detection system may observe only small changes in certain values; by comparing the resulting pattern to the SDRs of other patients and analyzing time-based sequences, the system can nonetheless identify medical diagnoses.
  • the diagnosis support module 1100 can direct the generation of an SDR for even an incomplete parameter vector—for example in a scenario in which the diagnosis support module 1100 receives a plurality of measurements in a document but the plurality of measurements is missing a measurement of a type relevant to a diagnosis—without degrading results.
  • a comparison between two SDRs can be made and a level of similarity identified, which may satisfy a threshold level of similarity even if the SDRs are not identical; so, even if the SDR generated for a document with an incomplete set of measurements is missing a point or two (e.g., a place on a semantic map 108 at which a more complete document would have had a value for a measurement), a comparison can still be made with a stored SDR.
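  • the tolerance to missing points described above can be sketched as follows. The overlap measure, the threshold value of 0.6, and the names `overlap_similarity` and `best_match` are assumptions chosen for illustration; the similarity engine 304 may use a different measure.

```python
def overlap_similarity(sdr_a, sdr_b):
    """Fraction of the smaller SDR's active positions shared with the other."""
    if not sdr_a or not sdr_b:
        return 0.0
    return len(sdr_a & sdr_b) / min(len(sdr_a), len(sdr_b))


def best_match(query, stored, threshold=0.6):
    """Return (label, similarity) for the most similar stored SDR, or None."""
    if not stored:
        return None
    label, sdr = max(stored.items(), key=lambda kv: overlap_similarity(query, kv[1]))
    score = overlap_similarity(query, sdr)
    return (label, score) if score >= threshold else None


# Hypothetical stored SDRs, each associated with a medical diagnosis.
stored = {
    "diagnosis A": frozenset({(0, 0), (0, 1), (1, 1), (2, 2), (3, 3)}),
    "diagnosis B": frozenset({(7, 7), (7, 8), (8, 8), (9, 8), (9, 9)}),
}

# The incoming SDR lacks two positions that a more complete document would set.
incomplete = frozenset({(0, 0), (0, 1), (2, 2), (9, 9)})
print(best_match(incomplete, stored))  # ('diagnosis A', 0.75)
```

  • because the comparison counts shared positions rather than requiring identity, the incomplete SDR in this sketch still satisfies the threshold for the first stored SDR.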
  • the diagnosis support module 1100 can identify the at least one parameter that is relevant to a medical diagnosis but for which a value was not received and recommend that the value be provided (e.g., recommending follow-up procedures or analyses for missing parameters).
  • the documents received may include associations to metadata in addition to a medical diagnosis.
  • a document may also be associated with an identification of patient gender.
  • metadata may be used to provide confirmation of a level of similarity between two SDRs and an identified medical diagnosis.
  • the diagnosis support module 1100 may determine that two SDRs are similar and identify a medical diagnosis associated with a document from which one of the SDRs was generated; the diagnosis support module 1100 may then apply a rule based on metadata to confirm the accuracy of the identification of the medical diagnosis.
  • a rule may specify that if metadata indicates a patient is male and the identified medical diagnosis indicates there is a danger of ovarian cancer, instead of providing a user of the client 102 c with the identified medical diagnosis, the diagnosis support module 1100 should instead report an error (since men do not have ovaries and cannot get ovarian cancer).
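  • a metadata rule of the kind described above might be sketched as follows. The rule representation (a tuple of metadata key, metadata value, and excluded diagnosis), the name `confirm_diagnosis`, and the choice to signal the error with an exception are assumptions made for illustration.

```python
# Each hypothetical rule: (metadata key, metadata value, diagnosis that the
# metadata rules out).
RULES = [("gender", "male", "ovarian cancer")]


def confirm_diagnosis(diagnosis, metadata, rules=RULES):
    """Return the diagnosis unchanged, or raise ValueError if a rule contradicts it."""
    for key, value, excluded in rules:
        if metadata.get(key) == value and diagnosis == excluded:
            raise ValueError(f"{diagnosis!r} is inconsistent with {key}={value!r}")
    return diagnosis


print(confirm_diagnosis("ovarian cancer", {"gender": "female"}))
try:
    confirm_diagnosis("ovarian cancer", {"gender": "male"})
except ValueError as error:
    print("error:", error)
```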
  • the system 1300 may receive translations of some or all of a set of documents from a first language into a second language and the translations may be used to identify corresponding SDRs in a second SDR database generated from the corpus of translated documents.
  • the system 1300 includes an engine 101 , including a second representation generator 114 b , a second parser and preprocessing module 110 c , a translated set of data documents 104 b , a second full-text search system 122 b , a second enumeration of data items 112 b , and a second SDR database 120 b .
  • the engine 101 may be an engine 101 as described above in connection with FIG. 1A .
  • the method 1400 includes clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents in a first language, generating a semantic map, the set of data documents selected according to at least one criterion ( 1402 ).
  • the method 1400 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 1404 ).
  • the method 1400 includes generating, by a first parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 1406 ).
  • the method 1400 includes determining, by a first representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 1408 ).
  • the method 1400 includes generating, by the first representation generator, a sparse distributed representation (SDR) for each term in the enumeration, using the occurrence information ( 1410 ).
  • the method 1400 includes storing, by the first representation generator, in a first SDR database, each of the generated SDRs ( 1412 ).
  • the method 1400 includes receiving, by the reference map generator, a translation, into a second language, of each of the set of data documents ( 1414 ).
  • the method 1400 includes associating, by the semantic map, the coordinate pair from each of the set of data documents with each corresponding document in the translated set of data documents ( 1416 ).
  • the method 1400 includes generating, by a second parser, a second enumeration of terms occurring in the translated set of data documents ( 1418 ).
  • the method 1400 includes determining, by a second representation generator, for each term in the second enumeration based on the translated set of data documents, occurrence information including: (i) a number of translated data documents in which the term occurs, (ii) a number of occurrences of the term in each translated data document, and (iii) the coordinate pair associated with each translated data document in which the term occurs ( 1420 ).
  • the method 1400 includes generating, by the second representation generator, for each term in the second enumeration, based on the translated set of data documents, an SDR ( 1422 ).
  • the method 1400 includes storing, by the second representation generator, in a second SDR database, each of the SDRs generated for each term in the second enumeration ( 1424 ).
  • the method 1400 includes generating, by the first representation generator, a first SDR of a first document in the first language ( 1426 ).
  • the method 1400 includes generating, by the second representation generator, a second SDR of a second document in the second language ( 1428 ).
  • the method 1400 includes determining a distance between the first SDR and the second SDR ( 1430 ).
  • the method 1400 includes providing an identification of a level of similarity between the first document and the second document ( 1432 ).
  • ( 1402 )-( 1412 ) are performed as described above in connection with FIG. 2 ( 202 - 214 ).
  • the method 1400 includes receiving, by the reference map generator, a translation, into a second language, of each of the set of data documents ( 1414 ).
  • a translation process executed by the machine 102 a provides the translation to the reference map generator 106 .
  • a human translator provides the translation to the engine 101 .
  • a machine translation process provides the translation to the engine 101 ; the machine translation process may be provided by a third party and may provide the translation to the engine 101 directly or across a network.
  • a user of the system 1300 uploads the translation to the machine 102 a.
  • the method 1400 includes associating, by the semantic map, the coordinate pair from each of the set of data documents with each corresponding document in the translated set of data documents ( 1416 ).
  • the semantic map 108 performs the association.
  • the association is performed as described above in connection with FIG. 2 ( 204 ).
  • the method 1400 includes generating, by a second parser, a second enumeration of terms occurring in the translated set of data documents ( 1418 ).
  • the generation is performed as described above in connection with FIG. 2 ( 206 ).
  • the second parser 110 c is configured (e.g., includes a configuration file) so as to be optimized for parsing documents in the second language.
  • the method 1400 includes determining, by a second representation generator, for each term in the second enumeration based on the translated set of data documents, occurrence information including: (i) a number of translated data documents in which the term occurs, (ii) a number of occurrences of the term in each translated data document, and (iii) the coordinate pair associated with each translated data document in which the term occurs ( 1420 ).
  • the determination of occurrence information is performed as described above in connection with FIG. 2 ( 208 ).
  • the method 1400 includes generating, by the second representation generator, for each term in the second enumeration, based on the translated set of data documents, an SDR ( 1422 ).
  • the generation of the term SDRs is performed as described above in connection with FIG. 2 ( 210 - 214 ).
  • the method 1400 includes storing, by the second representation generator, in a second SDR database, each of the SDRs generated for each term in the second enumeration ( 1424 ).
  • the storing of the SDRs in the second database is performed as described above in connection with FIG. 1A .
  • the method 1400 includes generating, by the first representation generator, a first SDR of a first document in the first language ( 1426 ). In one embodiment, the generation of the first SDR is performed as described above in connection with FIG. 2 .
  • the method 1400 includes generating, by the second representation generator, a second SDR of a second document in the second language ( 1428 ).
  • the generation of the second SDR is performed as described above in connection with FIG. 2 .
  • the method 1400 includes determining a distance between the first SDR and the second SDR ( 1430 ).
  • the method 1400 includes providing an identification of a level of similarity between the first document and the second document ( 1432 ). In one embodiment ( 1430 )-( 1432 ) are performed as described above in connection with FIGS. 3-4 .
  • a translation system may translate a text from a first language into a second language and both the text in the first language and the translation in the second language may be provided to the systems described herein; if the system determines that the SDR of the text in the first language is similar (e.g., exceeds a threshold level of similarity) to the SDR of the translated text (in the second language), then the translation may be said to have a high level of quality.
  • conversely, if the SDR of the text in the first language is insufficiently similar (e.g., does not exceed a predetermined threshold level of similarity) to the SDR of the translated text (in the second language), then the translation may be said to have a low level of quality.
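  • the translation-quality check can be sketched under simplifying assumptions. Because each translated document reuses the coordinate pair of its original, SDRs from both languages lie on the same semantic map and can be compared directly. In the sketch below, a term SDR is simply the set of coordinates of documents containing the term, a text SDR is the union of its term SDRs, similarity is the Jaccard index, and the threshold is 0.5; the tiny corpora and all function names are hypothetical.

```python
def term_sdrs(documents, coordinates):
    """Map each term to the set of map positions of documents containing it."""
    sdrs = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            sdrs.setdefault(term, set()).add(coordinates[doc_id])
    return sdrs


def text_sdr(text, sdrs):
    """Union of the SDRs of all known terms in the text (a simplification)."""
    active = set()
    for term in text.lower().split():
        active |= sdrs.get(term, set())
    return active


def jaccard(a, b):
    """Shared positions divided by all positions active in either SDR."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


coordinates = {"d1": (0, 0), "d2": (0, 1), "d3": (5, 5)}
english = {"d1": "dog barks", "d2": "dog runs", "d3": "bank loan"}
german = {"d1": "hund bellt", "d2": "hund rennt", "d3": "bank kredit"}

first_db = term_sdrs(english, coordinates)   # first SDR database
second_db = term_sdrs(german, coordinates)   # second SDR database, same coordinates


def translation_quality(source, translation, threshold=0.5):
    """Label a translation 'high' if its SDR similarity exceeds the threshold."""
    score = jaccard(text_sdr(source, first_db), text_sdr(translation, second_db))
    return ("high" if score > threshold else "low"), score


print(translation_quality("dog runs", "hund rennt"))   # ('high', 1.0)
print(translation_quality("dog runs", "bank kredit"))  # ('low', 0.0)
```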
  • the method 1450 includes clustering, by a reference map generator executing on a first computing device, in a two-dimensional metric space, a set of data documents in a first language, generating a semantic map, the set of data documents selected according to at least one criterion ( 1452 ).
  • the method 1450 includes associating, by the semantic map, a coordinate pair with each of the set of data documents ( 1454 ).
  • the method 1450 includes generating, by a first parser executing on the first computing device, an enumeration of terms occurring in the set of data documents ( 1456 ).
  • the method 1450 includes determining, by a first representation generator executing on the first computing device, for each term in the enumeration, occurrence information including: (i) a number of data documents in which the term occurs, (ii) a number of occurrences of the term in each data document, and (iii) the coordinate pair associated with each data document in which the term occurs ( 1458 ).
  • the method 1450 includes generating, by the first representation generator, for each term in the enumeration, a sparse distributed representation (SDR) using the occurrence information ( 1460 ).
  • the method 1450 includes storing, by the first representation generator, in a first SDR database, each of the generated SDRs ( 1462 ).
  • the method 1450 includes receiving, by the reference map generator, a translation, into a second language, of each of the set of data documents ( 1464 ).
  • the method 1450 includes associating, by the semantic map, the coordinate pair from each of the set of data documents with each of the translated data documents ( 1466 ).
  • the method 1450 includes generating, by a second parser, a second enumeration of terms occurring in the translated set of data documents ( 1468 ).
  • the method 1450 includes determining, by a second representation generator, for each term in the second enumeration based on the translated set of data documents, occurrence information including: (i) a number of translated data documents in which the term occurs, (ii) a number of occurrences of the term in each translated data document, and (iii) the coordinate pair associated with each translated data document in which the term occurs ( 1470 ).
  • the method 1450 includes generating, by the second representation generator, for each term in the second enumeration, based on the translated set of data documents, an SDR ( 1472 ).
  • the method 1450 includes storing, by the second representation generator, in a second SDR database, each of the SDRs generated for each term in the second enumeration ( 1474 ).
  • the method 1450 includes generating, by the first representation generator, a first SDR of a first term received in the first language ( 1476 ).
  • the method 1450 includes determining a distance between the first SDR and a second SDR of a second term in a second language, the second SDR retrieved from the second SDR database ( 1478 ).
  • the method 1450 includes providing an identification of the second term in the second language and an identification of a level of similarity between the first term and the second term, based upon the determined distance ( 1480 ).
  • ( 1452 )-( 1474 ) are performed as described above in connection with FIG. 14A ( 1402 )-( 1424 ).
  • the method 1450 includes generating, by the first representation generator, a first SDR of a first term received in the first language ( 1476 ). In one embodiment, the generation of the first SDR is performed as described above in connection with FIG. 2 .
  • the method 1450 includes determining a distance between the first SDR and a second SDR of a second term in a second language, the second SDR retrieved from the second SDR database ( 1478 ).
  • the method 1450 includes providing an identification of the second term in the second language and an identification of a level of similarity between the first term and the second term, based upon the determined distance ( 1480 ). In one embodiment ( 1478 )-( 1480 ) are performed as described above in connection with FIGS. 3-4 .
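  • steps ( 1476 )-( 1480 ) can be pictured with the following minimal sketch, which ranks terms in a second SDR database by their distance to the SDR of a first-language term. The Jaccard-based distance, the sample databases, and the name `nearest_terms` are assumptions made for illustration only.

```python
def nearest_terms(query_sdr, second_db, top_n=3):
    """Rank second-language terms by distance to the query SDR.

    Returns a list of (term, similarity) pairs, most similar first.
    Assumption: distance is 1 minus the Jaccard index of the two SDRs.
    """
    def distance(a, b):
        union = a | b
        return 1.0 - (len(a & b) / len(union) if union else 0.0)

    ranked = sorted(second_db.items(),
                    key=lambda kv: (distance(query_sdr, kv[1]), kv[0]))
    return [(term, round(1.0 - distance(query_sdr, sdr), 3))
            for term, sdr in ranked[:top_n]]


# Hypothetical SDR databases that share one semantic map.
first_db = {"dog": {(0, 0), (0, 1), (1, 1)}, "bank": {(5, 5), (5, 6)}}
second_db = {
    "hund": {(0, 0), (0, 1), (1, 1)},
    "katze": {(0, 1), (1, 1), (1, 2)},
    "kredit": {(5, 5), (6, 6)},
}

print(nearest_terms(first_db["dog"], second_db, top_n=2))
# [('hund', 1.0), ('katze', 0.5)]
```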
  • the methods and systems described herein may be used to provide an extension to a search system.
  • the system 1300 may receive a first term in a first language (e.g., a term a user wishes to use in a query of a search system).
  • the system 1300 may generate an SDR of the first term and use the generated first SDR to identify a second SDR in a second SDR database that satisfies a threshold level of similarity.
  • the system 1300 may then provide the first SDR, the second SDR, or both to a search system to enhance the user's search query, as described above in connection with FIGS. 6A-6C .
  • the components described herein may execute one or more functions automatically, that is, without human intervention.
  • the system 100 may receive a set of data documents 104 and automatically proceed to execute any one or more of the methods for preprocessing the data documents, training the reference map generator 106 , or generating SDRs 118 for each data item in the set of data documents 104 without human intervention.
  • the system 300 may receive at least one data item and automatically proceed to execute any one or more of the methods for identifying levels of similarity between the received data item and data items in the SDR database 120 , generating enumerations of similar data items, or performing other functions as described above.
  • system 300 may be part of, or include components that are part of, the so-called “Internet of Things” in which autonomous entities execute, communicate, and provide functionality such as that described herein; for instance an automated autonomous process may generate queries, receive responses from the system 300 , and provide responses to other users (human, computer, or otherwise).
  • speech-to-text or text-to-speech based interfaces are included so that, by way of example and without limitation, users may generate voice commands that the interfaces recognize and with which the interfaces generate computer-processable instructions.
  • the systems and methods described herein bring a semantic context into an individual representation; for example, even without knowing how a particular SDR was generated, the system can still compare the SDR with another SDR and use a semantic context of the two SDRs to provide insights to a user.
  • the systems and methods described herein use document-level context to provide semantic insights at the term level, enabling users to identify semantic meaning of individual terms within a corpus of documents.
  • engines, generators, modules, or components may each be provided as software, hardware, or a combination of the two, and may execute on one or more machines 100 .
  • certain components described herein are depicted as separate entities, for ease of discussion, it should be understood that this does not restrict the architecture to a particular implementation. For instance, the functionality of some or all of the described components may be encompassed by a single circuit or software function; as another example, the functionality of one or more components may be distributed across multiple components.
  • a machine 102 providing the functionality described herein may be any type of workstation, desktop computer, laptop or notebook computer, server, portable computer, mobile telephone, mobile smartphone, or other portable telecommunication device, media playing device, gaming system, mobile computing device, or any other type and/or form of computing, telecommunications or media device that is capable of communicating on any type and form of network and that has sufficient processor power and memory capacity to perform the operations described herein.
  • a machine 102 may execute, operate or otherwise provide an application, which can be any type and/or form of software, program, or executable instructions, including, without limitation, any type and/or form of web browser, web-based client, client-server application, an ActiveX control, a JAVA applet, or any other type and/or form of executable instructions capable of executing on machine 102 .
  • Machines 100 may communicate with each other via a network, which may be any type and/or form of network and may include any of the following: a point to point network, a broadcast network, a wide area network, a local area network, a telecommunications network, a data communication network, a computer network, an ATM (Asynchronous Transfer Mode) network, a SONET (Synchronous Optical Network) network, an SDH (Synchronous Digital Hierarchy) network, a wireless network, and a wireline network.
  • the network may comprise a wireless link, such as an infrared channel or satellite band.
  • the topology of the network may be a bus, star, or ring network topology.
  • the network may be of any such network topology as known to those ordinarily skilled in the art capable of supporting the operations described herein.
  • the network may comprise mobile telephone networks utilizing any protocol or protocols used to communicate among mobile devices (including tablets and handheld devices generally), including AMPS, TDMA, CDMA, GSM, GPRS, UMTS, or LTE.
  • the machine 102 may include a network interface to interface to a network through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (e.g., 802.11, T1, T3, 56 kb, X.25, SNA, DECNET), broadband connections (e.g., ISDN, Frame Relay, ATM, Gigabit Ethernet, Ethernet-over-SONET), wireless connections, or some combination of any or all of the above.
  • Connections can be established using a variety of communication protocols (e.g., TCP/IP, IPX, SPX, NetBIOS, Ethernet, ARCNET, SONET, SDH, Fiber Distributed Data Interface (FDDI), RS232, IEEE 802.11, IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, IEEE 802.11n, 802.15.4, Bluetooth, ZIGBEE, CDMA, GSM, WiMax, and direct asynchronous connections).
  • the computing device 100 communicates with other computing devices 100 ′ via any type and/or form of gateway or tunneling protocol such as Secure Sockets Layer (SSL) or Transport Layer Security (TLS).
  • the network interface may comprise a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem, or any other device suitable for interfacing the computing device 100 to any type of network capable of communication and performing the operations described herein.
  • the systems and methods described above may be implemented as a method, apparatus, or article of manufacture using programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof.
  • the techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor, a storage medium readable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • Program code may be applied to input entered using the input device to perform the functions described and to generate output.
  • the output may be provided to one or more output devices.
  • Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language.
  • the programming language may, for example, be LISP, PROLOG, PERL, C, C++, C#, JAVA, or any compiled or interpreted programming language.
  • Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor.
  • Method steps of the invention may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output.
  • Suitable processors include, by way of example, both general and special purpose microprocessors.
  • the processor receives instructions and data from a read-only memory and/or a random access memory.
  • Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of computer-readable devices, firmware, programmable logic, hardware (e.g., integrated circuit chip; electronic devices; a computer-readable non-volatile storage unit; non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs). Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays).
  • a computer can generally also receive programs and data from a storage medium such as an internal disk (not shown) or a removable disk.
  • a computer may also receive programs and data from a second computer providing access to the programs via a network transmission line, wireless transmission media, signals propagating through space, radio waves, infrared signals, etc.
  • the network environment comprises one or more clients 1202 a - 1202 n in communication with one or more remote machines 1206 a - 1206 n (also generally referred to as server(s) 1206 or computing device(s) 1206 ) via one or more networks 1204 .
  • the machine 102 described above may be provided as a machine 1202 , a machine 1206 , or any type of machine 1200 .
  • FIG. 12A shows a network 1204 between the clients 1202 and the remote machines 1206
  • the network 1204 can be a local area network (LAN), such as a company Intranet, a metropolitan area network (MAN), or a wide area network (WAN), such as the Internet or the World Wide Web.
  • a network 1204 ′ (not shown) may be a private network and a network 1204 may be a public network.
  • a network 1204 may be a private network and a network 1204 ′ a public network.
  • networks 1204 and 1204 ′ may both be private networks.
  • the network 1204 may be any type and/or form of network and may include any of the following: a point to point network, a broadcast network, a wide area network, a local area network, a telecommunications network, a data communication network, a computer network, an ATM (Asynchronous Transfer Mode) network, a SONET (Synchronous Optical Network) network, an SDH (Synchronous Digital Hierarchy) network, a wireless network, and a wireline network.
  • the network 1204 may comprise a wireless link, such as an infrared channel or satellite band.
  • the topology of the network 1204 may be a bus, star, or ring network topology.
  • the network 1204 may be of any such network topology as known to those ordinarily skilled in the art capable of supporting the operations described herein.
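The bus, star, and ring topologies named above can be pictured as adjacency maps. The following is a minimal illustrative sketch only; the function names and the modelling of a bus as "every node reaches every other node over the shared medium" are assumptions made for this example and do not come from the specification.

```python
from typing import Dict, List, Set

Adjacency = Dict[str, Set[str]]


def star_topology(hub: str, leaves: List[str]) -> Adjacency:
    """Every leaf links only to the central hub."""
    adj: Adjacency = {hub: set(leaves)}
    for leaf in leaves:
        adj[leaf] = {hub}
    return adj


def ring_topology(nodes: List[str]) -> Adjacency:
    """Each node links to its predecessor and successor (expects >= 3 nodes)."""
    adj: Adjacency = {node: set() for node in nodes}
    for i, node in enumerate(nodes):
        nxt = nodes[(i + 1) % len(nodes)]
        adj[node].add(nxt)
        adj[nxt].add(node)
    return adj


def bus_topology(nodes: List[str]) -> Adjacency:
    """Shared medium: modelled here as every node reaching every other node."""
    return {node: set(nodes) - {node} for node in nodes}
```

In this representation a ring gives every node exactly two neighbours, while a star concentrates all links on the hub, which is one way to see why the choice of topology affects fault tolerance.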
  • the network may comprise mobile telephone networks utilizing any protocol or protocols used to communicate among mobile devices, including AMPS, TDMA, CDMA, GSM, GPRS, or UMTS.
  • a client 1202 and a remote machine 1206 may be any workstation, desktop computer, laptop or notebook computer, server, portable computer, mobile telephone or other portable telecommunication device, media playing device, a gaming system, mobile computing device, or any other type and/or form of computing, telecommunications or media device that is capable of communication and that has sufficient processor power and memory capacity to perform the operations described herein.
  • the computing device 1200 may have different processors, operating systems, and input devices consistent with the device.
  • the computing device 1200 is a mobile device, digital audio player, digital media player, or a combination of such devices.
  • a computing device 1200 may execute, operate or otherwise provide an application, which can be any type and/or form of software, program, or executable instructions, including, without limitation, any type and/or form of web browser, web-based client, client-server application, an ActiveX control, or a JAVA applet, or any other type and/or form of executable instructions capable of executing on the computing device 1200 .
  • a computing device 1200 provides functionality of a web server.
  • a web server 1200 comprises an open-source web server, such as the APACHE servers maintained by the Apache Software Foundation of Delaware.
  • the web server 1200 executes proprietary software, such as the Internet Information Services products provided by Microsoft Corporation of Redmond, Wash., the Oracle iPlanet web server products provided by Oracle Corporation of Redwood Shores, Calif., or the BEA WEBLOGIC products provided by BEA Systems of Santa Clara, Calif.
  • the system may include multiple, logically grouped computing devices 1200 .
  • the logical group of computing devices 1200 may be referred to as a server farm.
  • the server farm may be administered as a single entity.
  • FIGS. 12B and 12C depict block diagrams of a computing device 1200 useful for practicing an embodiment of the client 1202 or a remote machine 1206 .
  • each computing device 1200 includes a central processing unit 1221 , and a main memory unit 1222 .
  • a computing device 1200 may include a storage device 1228 , an installation device 1216 , a network interface 1218 , an I/O controller 1223 , display devices 1224 a - n , a keyboard 1226 , a pointing device 1227 , such as a mouse, and one or more other I/O devices 1230 a - n .
  • the storage device 1228 may include, without limitation, an operating system and software. As shown in FIG. 12C , each computing device 1200 may also include additional optional elements, such as a memory port 1203 , a bridge 1270 , one or more input/output devices 1230 a - 1230 n (generally referred to using reference numeral 1230 ), and a cache memory 1240 in communication with the central processing unit 1221 .
  • the central processing unit 1221 is any logic circuitry that responds to and processes instructions fetched from the main memory unit 1222 .
  • the central processing unit 1221 is provided by a microprocessor unit, such as: those manufactured by Intel Corporation of Mountain View, Calif.; those manufactured by Motorola Corporation of Schaumburg, Ill.; those manufactured by Transmeta Corporation of Santa Clara, Calif.; those manufactured by International Business Machines of White Plains, N.Y.; or those manufactured by Advanced Micro Devices of Sunnyvale, Calif.
  • Other examples include SPARC processors, ARM processors, processors used to build UNIX/LINUX “white” boxes, and processors for mobile devices.
  • the computing device 1200 may be based on any of these processors, or any other processor capable of operating as described herein.
  • Main memory unit 1222 may be one or more memory chips capable of storing data and allowing any storage location to be directly accessed by the microprocessor 1221 .
  • the main memory 1222 may be based on any available memory chips capable of operating as described herein.
  • the processor 1221 communicates with main memory 1222 via a system bus 1250 .
  • FIG. 12C depicts an embodiment of a computing device 1200 in which the processor communicates directly with main memory 1222 via a memory port 1203 .
  • FIG. 12C also depicts an embodiment in which the main processor 1221 communicates directly with cache memory 1240 via a secondary bus, sometimes referred to as a backside bus. In other embodiments, the main processor 1221 communicates with cache memory 1240 using the system bus 1250 .
  • the processor 1221 communicates with various I/O devices 1230 via a local system bus 1250 .
  • Various buses may be used to connect the central processing unit 1221 to any of the I/O devices 1230 , including a VESA VL bus, an ISA bus, an EISA bus, a MicroChannel Architecture (MCA) bus, a PCI bus, a PCI-X bus, a PCI-Express bus, or a NuBus.
  • the processor 1221 may use an Advanced Graphics Port (AGP) to communicate with the display 1224 .
  • FIG. 12C depicts an embodiment of a computer 1200 in which the main processor 1221 also communicates directly with an I/O device 1230 b via, for example, HYPERTRANSPORT, RAPIDIO, or INFINIBAND communications.
  • I/O devices 1230 a - 1230 n may be present in the computing device 1200 .
  • Input devices include keyboards, mice, trackpads, trackballs, microphones, scanners, cameras, and drawing tablets.
  • Output devices include video displays, speakers, inkjet printers, laser printers, and dye-sublimation printers.
  • the I/O devices may be controlled by an I/O controller 1223 as shown in FIG. 12B .
  • an I/O device may also provide storage and/or an installation device 1216 for the computing device 1200 .
  • the computing device 1200 may provide USB connections (not shown) to receive handheld USB storage devices such as the USB Flash Drive line of devices manufactured by Twintech Industry, Inc. of Los Alamitos, Calif.
  • the computing device 1200 may support any suitable installation device 1216 , such as a floppy disk drive for receiving floppy disks such as 3.5-inch, 5.25-inch disks or ZIP disks; a CD-ROM drive; a CD-R/RW drive; a DVD-ROM drive; tape drives of various formats; a USB device; a hard-drive or any other device suitable for installing software and programs.
  • the computing device 1200 may provide functionality for installing software over a network 1204 .
  • the computing device 1200 may further comprise a storage device, such as one or more hard disk drives or redundant arrays of independent disks, for storing an operating system and other software.
  • the computing device 1200 may include a network interface 1218 to interface to the network 1204 through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (e.g., 802.11, T1, T3, 56 kb, X.25, SNA, DECNET), broadband connections (e.g., ISDN, Frame Relay, ATM, Gigabit Ethernet, Ethernet-over-SONET), wireless connections, or some combination of any or all of the above.
  • Connections can be established using a variety of communication protocols (e.g., TCP/IP, IPX, SPX, NetBIOS, Ethernet, ARCNET, SONET, SDH, Fiber Distributed Data Interface (FDDI), RS232, IEEE 802.11, IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, IEEE 802.11n, IEEE 802.15.4, Bluetooth, ZIGBEE, CDMA, GSM, WiMax, and direct asynchronous connections).
  • the computing device 1200 communicates with other computing devices 1200 ′ via any type and/or form of gateway or tunneling protocol such as Secure Socket Layer (SSL) or Transport Layer Security (TLS).
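As a rough illustration of one computing device opening a TLS-protected connection to another, the sketch below uses Python's standard `ssl` and `socket` modules. The helper names `make_client_context` and `open_tls_connection`, the default port, and the choice of TLS 1.2 as a minimum version are assumptions for this example; the specification does not prescribe any particular library or protocol version.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side context that verifies the peer certificate and hostname."""
    ctx = ssl.create_default_context()
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def open_tls_connection(host: str, port: int = 443, timeout: float = 5.0) -> ssl.SSLSocket:
    """Connect to host:port over TCP and wrap the socket in TLS."""
    ctx = make_client_context()
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        return ctx.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()  # do not leak the TCP socket if the handshake fails
        raise
```

A caller would typically use the returned socket as a context manager, for example `with open_tls_connection("example.org") as conn: ...`, so that the connection is closed when the exchange is finished.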
  • the network interface 1218 may comprise a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem, or any other device suitable for interfacing the computing device 1200 to any type of network capable of communication and performing the operations described herein.
  • an I/O device 1230 may be a bridge between the system bus 1250 and an external communication bus, such as a USB bus, an Apple Desktop Bus, an RS-232 serial connection, a SCSI bus, a FireWire bus, a FireWire 800 bus, an Ethernet bus, an AppleTalk bus, a Gigabit Ethernet bus, an Asynchronous Transfer Mode bus, a HIPPI bus, a Super HIPPI bus, a SerialPlus bus, a SCI/LAMP bus, a FibreChannel bus, or a Serial Attached small computer system interface bus.
  • a computing device 1200 of the sort depicted in FIGS. 12B and 12C typically operates under the control of operating systems, which control scheduling of tasks and access to system resources.
  • the computing device 1200 can be running any operating system such as any of the versions of the MICROSOFT WINDOWS operating systems, the different releases of the UNIX and LINUX operating systems, any version of the MAC OS for Macintosh computers, any embedded operating system, any real-time operating system, any open source operating system, any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the computing device and performing the operations described herein.
  • Typical operating systems include, but are not limited to: WINDOWS 3.x, WINDOWS 95, WINDOWS 98, WINDOWS 2000, WINDOWS NT 3.51, WINDOWS NT 4.0, WINDOWS CE, WINDOWS XP, WINDOWS 7, WINDOWS 8, and WINDOWS VISTA, all of which are manufactured by Microsoft Corporation of Redmond, Wash.; MAC OS manufactured by Apple Inc. of Cupertino, Calif.; OS/2 manufactured by International Business Machines of Armonk, N.Y.; LINUX, a freely-available operating system distributed by Caldera Corp.; Red Hat Enterprise Linux, a Linux-variant operating system distributed by Red Hat, Inc. of Raleigh, N.C.; Ubuntu, a freely-available operating system distributed by Canonical Ltd. of London, England; or any type and/or form of a Unix operating system, among others.
  • the computing device 1200 can be any type and/or form of computing, telecommunications or media device that is capable of communication and that has sufficient processor power and memory capacity to perform the operations described herein.
  • the computing device 1200 may be a mobile device such as those manufactured, by way of example and without limitation, by Apple Inc. of Cupertino, Calif.; Google/Motorola Div. of Ft. Worth, Tex.; Kyocera of Kyoto, Japan; Samsung Electronics Co., Ltd. of Seoul, Korea; Nokia of Finland; Hewlett-Packard Development Company, L.P. and/or Palm, Inc. of Sunnyvale, Calif.; Sony Ericsson Mobile Communications AB of Lund, Sweden; or Research In Motion Limited of Waterloo, Ontario, Canada.
  • the computing device 1200 is a smart phone, Pocket PC, Pocket PC Phone, or other portable mobile device supporting Microsoft Windows Mobile Software.
  • the computing device 1200 is a digital audio player.
  • the computing device 1200 is a digital audio player such as the Apple IPOD, IPOD Touch, IPOD NANO, and IPOD SHUFFLE lines of devices manufactured by Apple Inc.
  • the digital audio player may function as both a portable media player and as a mass storage device.
  • the computing device 1200 is a digital audio player such as those manufactured by, for example and without limitation, Samsung Electronics America of Ridgefield Park, N.J., or Creative Technologies Ltd. of Singapore.
  • the computing device 1200 is a portable media player or digital audio player supporting file formats including, but not limited to, MP3, WAV, M4A/AAC, WMA, Protected AAC, AIFF, Audible audiobook, and Apple Lossless audio file formats, and .mov, .m4v, and .mp4 MPEG-4 (H.264/MPEG-4 AVC) video file formats.
  • the computing device 1200 comprises a combination of devices, such as a mobile phone combined with a digital audio player or portable media player.
  • the computing device 1200 is a device in the Google/Motorola line of combination digital audio players and mobile phones.
  • the computing device 1200 is a device in the iPhone smartphone line of devices manufactured by Apple Inc.
  • the computing device 1200 is a device executing the Android open source mobile phone platform distributed by the Open Handset Alliance; for example, the device 1200 may be a device such as those provided by Samsung Electronics of Seoul, Korea, or HTC Headquarters of Taiwan, R.O.C.
  • the computing device 1200 is a tablet device such as, for example and without limitation, the iPad line of devices manufactured by Apple Inc.; the PlayBook manufactured by Research In Motion; the Cruz line of devices manufactured by Velocity Micro, Inc. of Richmond, Va.; the Folio and Thrive line of devices manufactured by Toshiba America Information Systems, Inc. of Irvine, Calif.; the Galaxy line of devices manufactured by Samsung; the HP Slate line of devices manufactured by Hewlett-Packard; and the Streak line of devices manufactured by Dell, Inc. of Round Rock, Tex.
  • Referring to FIG. 12D, a block diagram depicts one embodiment of a system in which a plurality of networks provide hosting and delivery services.
  • the system includes a cloud services and hosting infrastructure 1280 , a service provider data center 1282 , and an information technology (IT) network 1284 .
  • the data center 1282 includes computing devices such as, without limitation, servers (including, for example, application servers, file servers, databases, and backup servers), routers, switches, and telecommunications equipment.
  • the cloud services and hosting infrastructure 1280 provides access to, without limitation, storage systems, databases, application servers, desktop servers, directory services, web servers, as well as services for accessing remotely located hardware and software platforms.
  • the cloud services and hosting infrastructure 1280 includes a data center 1282 .
  • the cloud services and hosting infrastructure 1280 relies on services provided by a third-party data center 1282 .
  • the IT network 1204 c may provide local services, such as mail services and web services.
  • the IT network 1204 c may provide local versions of remotely located services, such as locally-cached versions of remotely-located print servers, databases, application servers, desktop servers, directory services, and web servers.
  • additional servers may reside in the cloud services and hosting infrastructure 1280 , the data center 1282 , or other networks altogether, such as those provided by third-party service providers including, without limitation, infrastructure service providers, application service providers, platform service providers, tools service providers, and desktop service providers.
  • a user of a client 1202 accesses services provided by a remotely located server 1206 a .
  • an administrator of an enterprise IT network 1284 may determine that a user of the client 1202 a will access an application executing on a virtual machine executing on a remote server 1206 a .
  • an individual user of a client 1202 b may use a resource provided to consumers by the remotely located server 1206 (such as email, fax, voice or other communications service, data backup services, or other service).
  • the data center 1282 and the cloud services and hosting infrastructure 1280 are remotely located from an individual or organization supported by the data center 1282 and the cloud services and hosting infrastructure 1280 ; for example, the data center 1282 may reside on a first network 1204 a and the cloud services and hosting infrastructure 1280 may reside on a second network 1204 b , while the IT network 1284 is a separate, third network 1204 c . In other embodiments, the data center 1282 and the cloud services and hosting infrastructure 1280 reside on a first network 1204 a and the IT network 1284 is a separate, second network 1204 c .
  • the cloud services and hosting infrastructure 1280 resides on a first network 1204 a while the data center 1282 and the IT network 1284 form a second network 1204 c .
  • Although FIG. 12D depicts only one server 1206 a , one server 1206 b , one server 1206 c , two clients 1202 , and three networks 1204 , it should be understood that the system may provide multiple ones of any or each of those components.
  • the servers 1206 , clients 1202 , and networks 1204 may be provided as described above in connection with FIGS. 12A-12C .
  • an IT infrastructure may extend from a first network—such as a network owned and managed by an individual or an enterprise—into a second network, which may be owned or managed by a separate entity than the entity owning or managing the first network.
  • Resources provided by the second network may be said to be “in a cloud.”
  • Cloud-resident elements may include, without limitation, storage devices, servers, databases, computing environments (including virtual machines, servers, and desktops), and applications.
  • the IT network 1284 may use a remotely located data center 1282 to store servers (including, for example, application servers, file servers, databases, and backup servers), routers, switches, and telecommunications equipment.
  • the data center 1282 may be owned and managed by the IT network 1284 , or a third-party service provider (including, for example, a cloud services and hosting infrastructure provider) may provide access to a separate data center 1282 .
  • the machine 102 a described in connection with FIG. 3 above may be owned or managed by a first entity (e.g., a cloud services and hosting infrastructure provider 1280 ) while the machine 102 b described in connection with FIG. 3 above may be owned or managed by a second entity (e.g., a service provider data center 1282 ) to which a client 1202 connects directly or indirectly (e.g., using resources provided by any of the entities 1280 , 1282 , or 1284 ).
  • one or more networks providing computing infrastructure on behalf of customers are referred to as a cloud.
  • a system in which users of a first network access at least a second network, including a pool of abstracted, scalable, and managed computing resources capable of hosting resources may be referred to as a cloud computing environment.
  • resources may include, without limitation, virtualization technology, data center resources, applications, and management tools.
  • Internet-based applications (which may be provided via a “software-as-a-service” model) may be referred to as cloud-based resources.
  • networks that provide users with computing resources may be referred to as compute clouds or “infrastructure-as-a-service” providers.
  • networks that provide storage resources, such as storage area networks, may be referred to as storage clouds.
  • a resource may be cached in a local network and stored in a cloud.
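One way to picture a resource that is stored in a cloud and cached in a local network is a read-through cache: the local copy is served when present, and the cloud is contacted only on a miss. The sketch below is a minimal illustration under that assumption; the class name `ReadThroughCache`, the `fetch_from_cloud` callable, and the `remote_fetches` counter are hypothetical and do not appear in the specification.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve resources from a local store, fetching from the cloud only on a miss."""

    def __init__(self, fetch_from_cloud: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_cloud       # hypothetical accessor for the cloud-resident copy
        self._local: Dict[str, bytes] = {}   # stands in for the local-network cache
        self.remote_fetches = 0              # number of times the cloud was contacted

    def get(self, key: str) -> bytes:
        if key not in self._local:
            self._local[key] = self._fetch(key)
            self.remote_fetches += 1
        return self._local[key]
```

For example, wrapping a dictionary lookup as the "cloud" (`ReadThroughCache(store.__getitem__)`) and requesting the same key twice contacts the backing store only once. The sketch omits invalidation and expiry, which a real deployment would need.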
  • some or all of a plurality of remote machines 1206 may be leased or rented from third-party companies such as, by way of example and without limitation, Amazon Web Services LLC of Seattle, Wash.; Rackspace US, Inc. of San Antonio, Tex.; Microsoft Corporation of Redmond, Wash.; and Google Inc. of Mountain View, Calif.
  • all the hosts 1206 are owned and managed by third-party companies including, without limitation, Amazon Web Services LLC, Rackspace US, Inc., Microsoft, and Google.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
US14/816,133 2014-08-07 2015-08-03 Methods and systems for mapping data items to sparse distributed representations Active 2037-10-02 US10394851B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/816,133 US10394851B2 (en) 2014-08-07 2015-08-03 Methods and systems for mapping data items to sparse distributed representations
US16/510,108 US20190332619A1 (en) 2014-08-07 2019-07-12 Methods and systems for mapping data items to sparse distributed representations

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201462034269P 2014-08-07 2014-08-07
US201562134202P 2015-03-17 2015-03-17
US14/816,133 US10394851B2 (en) 2014-08-07 2015-08-03 Methods and systems for mapping data items to sparse distributed representations

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/219,851 Continuation-In-Part US10885089B2 (en) 2014-08-07 2016-07-26 Methods and systems for identifying a level of similarity between a filtering criterion and a data item within a set of streamed documents

Publications (2)

Publication Number Publication Date
US20160042053A1 (en) 2016-02-11
US10394851B2 (en) 2019-08-27

Family

ID=53794218

Family Applications (1)

Application Number Title Priority Date Filing Date
US14/816,133 Active 2037-10-02 US10394851B2 (en) 2014-08-07 2015-08-03 Methods and systems for mapping data items to sparse distributed representations

Country Status (7)

Country Link
US (1) US10394851B2 (ko)
EP (1) EP3178018A1 (ko)
JP (1) JP6646650B2 (ko)
KR (1) KR102371167B1 (ko)
AU (1) AU2015299050B2 (ko)
CA (1) CA2950676C (ko)
WO (1) WO2016020368A1 (ko)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726211B1 (en) * 2018-06-25 2020-07-28 Amazon Technologies, Inc. Automated system for dynamically generating comprehensible linguistic constituents
US10885089B2 (en) 2015-08-21 2021-01-05 Cortical.Io Ag Methods and systems for identifying a level of similarity between a filtering criterion and a data item within a set of streamed documents
US11216248B2 (en) 2016-10-20 2022-01-04 Cortical.Io Ag Methods and systems for identifying a level of similarity between a plurality of data representations
US11734332B2 (en) 2020-11-19 2023-08-22 Cortical.Io Ag Methods and systems for reuse of data item fingerprints in generation of semantic maps

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430894B2 (en) 2013-03-21 2019-10-01 Khoros, Llc Gamification for online social communities
US10394851B2 (en) 2014-08-07 2019-08-27 Cortical.Io Ag Methods and systems for mapping data items to sparse distributed representations
KR102437689B1 (ko) * 2015-09-16 2022-08-30 삼성전자주식회사 음성 인식 서버 및 그 제어 방법
US9959271B1 (en) 2015-09-28 2018-05-01 Amazon Technologies, Inc. Optimized statistical machine translation system with rapid adaptation capability
US10185713B1 (en) * 2015-09-28 2019-01-22 Amazon Technologies, Inc. Optimized statistical machine translation system with rapid adaptation capability
US10268684B1 (en) 2015-09-28 2019-04-23 Amazon Technologies, Inc. Optimized statistical machine translation system with rapid adaptation capability
AU2017250467B2 (en) * 2016-04-15 2019-12-19 Solventum Intellectual Properties Company Query optimizer for combined structured and unstructured data records
JP6709997B2 (ja) * 2016-09-23 2020-06-17 パナソニックIpマネジメント株式会社 翻訳装置、翻訳システム、および評価サーバ
KR102630668B1 (ko) * 2016-12-06 2024-01-30 한국전자통신연구원 입력 텍스트를 자동으로 확장하는 시스템 및 방법
US10409839B2 (en) * 2017-03-30 2019-09-10 The Boeing Company Dimension optimization in singular value decomposition-based topic models
US10902462B2 (en) 2017-04-28 2021-01-26 Khoros, Llc System and method of providing a platform for managing data content campaign on social networks
US10785222B2 (en) 2018-10-11 2020-09-22 Spredfast, Inc. Credential and authentication management in scalable data networks
US11050704B2 (en) 2017-10-12 2021-06-29 Spredfast, Inc. Computerized tools to enhance speed and propagation of content in electronic messages among a system of networked computing devices
US11570128B2 (en) 2017-10-12 2023-01-31 Spredfast, Inc. Optimizing effectiveness of content in electronic messages among a system of networked computing device
US10999278B2 (en) 2018-10-11 2021-05-04 Spredfast, Inc. Proxied multi-factor authentication using credential and authentication management in scalable data networks
US11470161B2 (en) 2018-10-11 2022-10-11 Spredfast, Inc. Native activity tracking using credential and authentication management in scalable data networks
US10346449B2 (en) 2017-10-12 2019-07-09 Spredfast, Inc. Predicting performance of content and electronic messages among a system of networked computing devices
US10601937B2 (en) 2017-11-22 2020-03-24 Spredfast, Inc. Responsive action prediction based on electronic messages among a system of networked computing devices
US10963686B2 (en) * 2017-12-01 2021-03-30 International Business Machines Corporation Semantic normalization in document digitization
CN109902815B (zh) * 2017-12-11 2020-05-19 中科寒武纪科技股份有限公司 神经网络运算设备和方法
US10594773B2 (en) 2018-01-22 2020-03-17 Spredfast, Inc. Temporal optimization of data operations using distributed search and server management
US11061900B2 (en) 2018-01-22 2021-07-13 Spredfast, Inc. Temporal optimization of data operations using distributed search and server management
JP6963535B2 (ja) * 2018-06-05 2021-11-10 株式会社日立製作所 分析方法、分析装置及びプログラム
US10855657B2 (en) 2018-10-11 2020-12-01 Spredfast, Inc. Multiplexed data exchange portal interface in scalable data networks
US11132248B2 (en) * 2018-11-29 2021-09-28 Nec Corporation Automated information technology system failure recommendation and mitigation
US11397776B2 (en) 2019-01-31 2022-07-26 At&T Intellectual Property I, L.P. Systems and methods for automated information retrieval
US11556572B2 (en) * 2019-04-23 2023-01-17 Nice Ltd. Systems and methods for coverage analysis of textual queries
US10931540B2 (en) 2019-05-15 2021-02-23 Khoros, Llc Continuous data sensing of functional states of networked computing devices to determine efficiency metrics for servicing electronic messages asynchronously
US11409946B2 (en) * 2020-03-27 2022-08-09 Imp Partners Llc System and method for linking financial management accounts to source compliance documentation
US11704318B1 (en) * 2020-06-12 2023-07-18 A9.Com, Inc. Micro-partitioning based search
US11128589B1 (en) 2020-09-18 2021-09-21 Khoros, Llc Gesture-based community moderation
US11438289B2 (en) 2020-09-18 2022-09-06 Khoros, Llc Gesture-based community moderation
US12120078B2 (en) 2020-09-18 2024-10-15 Khoros, Llc Automated disposition of a community of electronic messages under moderation using a gesture-based computerized tool
US11924375B2 (en) 2021-10-27 2024-03-05 Khoros, Llc Automated response engine and flow configured to exchange responsive communication data via an omnichannel electronic communication channel independent of data source
US11438282B2 (en) 2020-11-06 2022-09-06 Khoros, Llc Synchronicity of electronic messages via a transferred secure messaging channel among a system of various networked computing devices
US11627100B1 (en) 2021-10-27 2023-04-11 Khoros, Llc Automated response engine implementing a universal data space based on communication interactions via an omnichannel electronic data channel
US11714629B2 (en) 2020-11-19 2023-08-01 Khoros, Llc Software dependency management
US12032911B2 (en) 2021-01-08 2024-07-09 Nice Ltd. Systems and methods for structured phrase embedding and use thereof
CN112732934B (zh) * 2021-01-11 2022-05-27 国网山东省电力公司电力科学研究院 电网设备分词词典和故障案例库构建方法
US12093648B2 (en) 2021-02-18 2024-09-17 Nice Ltd. Systems and methods for producing a semantic representation of a document
US11841914B2 (en) 2021-10-19 2023-12-12 Intelling Media Corp. System and method for topological representation of commentary

Citations (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5369764A (en) * 1990-04-25 1994-11-29 Blair; Gary L. Method for sharing access to database elements in a data processing system
US5787422A (en) * 1996-01-11 1998-07-28 Xerox Corporation Method and apparatus for information access employing overlapping clusters
EP1049030A1 (en) 1999-04-28 2000-11-02 SER Systeme AG Produkte und Anwendungen der Datenverarbeitung Classification method and apparatus
US6260036B1 (en) 1998-05-07 2001-07-10 Ibm Scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems
US20010027391A1 (en) 1996-11-07 2001-10-04 Matsushita Electric Industrial Co., Ltd. Excitation vector generator, speech coder and speech decoder
US6523026B1 (en) * 1999-02-08 2003-02-18 Huntsman International Llc Method for retrieving semantically distant analogies
US6687812B1 (en) * 1999-04-20 2004-02-03 Nec Corporation Parallel processing apparatus
US6721721B1 (en) 2000-06-15 2004-04-13 International Business Machines Corporation Virus checking and reporting for computer database search results
CN1489086A (zh) 2002-10-10 2004-04-14 莎 刘 一种语义约定全文翻译系统和方法
US20040103070A1 (en) 2002-11-21 2004-05-27 Honeywell International Inc. Supervised self organizing maps with fuzzy error correction
EP1426882A2 (en) 2002-11-27 2004-06-09 Sony United Kingdom Limited Information storage and retrieval
US6922700B1 (en) * 2000-05-16 2005-07-26 International Business Machines Corporation System and method for similarity indexing and searching in high dimensional space
US7068787B1 (en) * 1998-10-23 2006-06-27 Contentguard Holdings, Inc. System and method for protection of digital works
US20060155751A1 (en) * 2004-06-23 2006-07-13 Frank Geshwind System and method for document analysis, processing and information extraction
US20070276774A1 (en) 2006-02-10 2007-11-29 Subutai Ahmad Extensible hierarchical temporal memory based system
US20080005221A1 (en) 2000-09-29 2008-01-03 Postini, Inc. Electronic document policy compliance techniques
CN101251862A (zh) 2008-03-25 2008-08-27 北京百问百答网络技术有限公司 一种基于内容的问题自动分类方法及其系统
US20090049067A1 (en) 2004-05-10 2009-02-19 Kinetx, Inc. System and Method of Self-Learning Conceptual Mapping to Organize and Interpret Data
US20090307213A1 (en) 2008-05-07 2009-12-10 Xiaotie Deng Suffix Tree Similarity Measure for Document Clustering
US7707206B2 (en) 2005-09-21 2010-04-27 Praxeon, Inc. Document processing
US7739208B2 (en) 2005-06-06 2010-06-15 Numenta, Inc. Trainable hierarchical memory system and method
US7937342B2 (en) 2006-11-28 2011-05-03 Numenta, Inc. Method and apparatus for detecting spatial patterns
US20110225108A1 (en) 2010-03-15 2011-09-15 Numenta, Inc. Temporal memory using sparse distributed representation
US8037010B2 (en) 2007-02-28 2011-10-11 Numenta, Inc. Spatio-temporal learning algorithms in hierarchical temporal networks
US20120011124A1 (en) 2010-07-07 2012-01-12 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US20130054552A1 (en) 2011-08-25 2013-02-28 Numenta, Inc. Automated search for detecting patterns and sequences in data using a spatial and temporal memory system
EP2639749A1 (en) 2012-03-15 2013-09-18 CEPT Systems GmbH Methods, apparatus and products for semantic processing of text
US20140046985A1 (en) * 2012-08-09 2014-02-13 Alain Gauthier Storing hierarchical table as a markup language file
US20140279727A1 (en) * 2013-03-15 2014-09-18 William Marsh Rice University Sparse Factor Analysis for Analysis of User Content Preferences
WO2016020368A1 (en) 2014-08-07 2016-02-11 Cortical.Io Gmbh Methods and systems for mapping data items to sparse distributed representations

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7788592B2 (en) * 2005-01-12 2010-08-31 Microsoft Corporation Architecture and engine for time line based visualization of data

Patent Citations (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5369764A (en) * 1990-04-25 1994-11-29 Blair; Gary L. Method for sharing access to database elements in a data processing system
US5787422A (en) * 1996-01-11 1998-07-28 Xerox Corporation Method and apparatus for information access employing overlapping clusters
US20010027391A1 (en) 1996-11-07 2001-10-04 Matsushita Electric Industrial Co., Ltd. Excitation vector generator, speech coder and speech decoder
US6260036B1 (en) 1998-05-07 2001-07-10 Ibm Scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems
US7068787B1 (en) * 1998-10-23 2006-06-27 Contentguard Holdings, Inc. System and method for protection of digital works
US6523026B1 (en) * 1999-02-08 2003-02-18 Huntsman International Llc Method for retrieving semantically distant analogies
US6687812B1 (en) * 1999-04-20 2004-02-03 Nec Corporation Parallel processing apparatus
EP1049030A1 (en) 1999-04-28 2000-11-02 SER Systeme AG Produkte und Anwendungen der Datenverarbeitung Classification method and apparatus
US20060212413A1 (en) * 1999-04-28 2006-09-21 Pal Rujan Classification method and apparatus
US6922700B1 (en) * 2000-05-16 2005-07-26 International Business Machines Corporation System and method for similarity indexing and searching in high dimensional space
US6721721B1 (en) 2000-06-15 2004-04-13 International Business Machines Corporation Virus checking and reporting for computer database search results
US20080005221A1 (en) 2000-09-29 2008-01-03 Postini, Inc. Electronic document policy compliance techniques
CN1489086A (zh) 2002-10-10 2004-04-14 莎 刘 一种语义约定全文翻译系统和方法
US20040103070A1 (en) 2002-11-21 2004-05-27 Honeywell International Inc. Supervised self organizing maps with fuzzy error correction
EP1426882A2 (en) 2002-11-27 2004-06-09 Sony United Kingdom Limited Information storage and retrieval
US20090049067A1 (en) 2004-05-10 2009-02-19 Kinetx, Inc. System and Method of Self-Learning Conceptual Mapping to Organize and Interpret Data
US20060155751A1 (en) * 2004-06-23 2006-07-13 Frank Geshwind System and method for document analysis, processing and information extraction
US8103603B2 (en) 2005-06-06 2012-01-24 Numenta, Inc. Trainable hierarchical memory system and method
US7739208B2 (en) 2005-06-06 2010-06-15 Numenta, Inc. Trainable hierarchical memory system and method
US20100191684A1 (en) 2005-06-06 2010-07-29 Numenta, Inc. Trainable hierarchical memory system and method
US7707206B2 (en) 2005-09-21 2010-04-27 Praxeon, Inc. Document processing
US20080059389A1 (en) 2006-02-10 2008-03-06 Jaros Robert G Sequence learning in a hierarchical temporal memory based system
US20070276774A1 (en) 2006-02-10 2007-11-29 Subutai Ahmad Extensible hierarchical temporal memory based system
US7937342B2 (en) 2006-11-28 2011-05-03 Numenta, Inc. Method and apparatus for detecting spatial patterns
US8037010B2 (en) 2007-02-28 2011-10-11 Numenta, Inc. Spatio-temporal learning algorithms in hierarchical temporal networks
CN101251862A (zh) 2008-03-25 2008-08-27 北京百问百答网络技术有限公司 一种基于内容的问题自动分类方法及其系统
US20090307213A1 (en) 2008-05-07 2009-12-10 Xiaotie Deng Suffix Tree Similarity Measure for Document Clustering
US20110225108A1 (en) 2010-03-15 2011-09-15 Numenta, Inc. Temporal memory using sparse distributed representation
US20120011124A1 (en) 2010-07-07 2012-01-12 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US20130054552A1 (en) 2011-08-25 2013-02-28 Numenta, Inc. Automated search for detecting patterns and sequences in data using a spatial and temporal memory system
EP2639749A1 (en) 2012-03-15 2013-09-18 CEPT Systems GmbH Methods, apparatus and products for semantic processing of text
US20130246322A1 (en) 2012-03-15 2013-09-19 Cept Systems Gmbh Methods, Apparatus and Products for Semantic Processing of Text
US20140046985A1 (en) * 2012-08-09 2014-02-13 Alain Gauthier Storing hierarchical table as a markup language file
US20140279727A1 (en) * 2013-03-15 2014-09-18 William Marsh Rice University Sparse Factor Analysis for Analysis of User Content Preferences
WO2016020368A1 (en) 2014-08-07 2016-02-11 Cortical.Io Gmbh Methods and systems for mapping data items to sparse distributed representations

Non-Patent Citations (52)

* Cited by examiner, † Cited by third party
Title
Ampazis, N., et al., LSISOM—A Latent Semantic Indexing Approach to Self-Organizing Maps of Document Collections, Neural Processing Letters 00: 1-17, 2004 Kluwer Academic Publishers, 17 pages.
Annex to Result of Consultation by Telephone in connection with EP12159672, dated Apr. 28, 2015, 6 pages.
Australian Patent Office Examination report No. 1 for patent application 2013231564.
Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York, 1995/2010.
Communication Pursuant to Article 94(3) EPC and Annex to Communication in connection with EP12159672, dated Apr. 24, 2013, 7 pages.
Communication pursuant to Article 94(3) EPC in connection with EP12159672, dated Nov. 11, 2013, 9 pages.
Communication Under Rule 71(3) EPC Intention to Grant in connection with EP12159672, dated Jul. 20, 2016, 5 pages.
Examiner's Report received in Canadian Patent Application No. 2,864,946 dated Aug. 6, 2018, 4 pages.
Extended European Search Report and Written Opinion in connection with EP12159672, dated Jul. 10, 2012, 9 pages.
H. Ichiki et al.: "Multi-Layer Self-Organizing Semantic Maps," in A Publication of Electronics, Information and Systems Society (vol. 113-C, No. 1), 1993 Published by the Transaction of the Institute of Electrical Engineering of Japan, 15 pages.
Hawkins, J., et al, "Sequence Memory for Prediction, Inference and Behaviour," Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 364(1521), pp. 1203-1209, 2009.
Hawkins, J., et al., "Hierarchical Temporal Memory Concepts, Theory, and Terminology," [online], Mar. 27, 2007, [originally retrieved on Jul. 25, 2012], pp. 1-20. Retrieved from the Internet: <URL http://www.numenta.com/Numenta_HTM_Concepts.pdf>.
Hawkins, J., et al., "Hierarchical Temporal Memory Including HTM Cortical Learning Algorithms," Whitepaper of Numenta, Inc., Version 0.2.1, [online], Sep. 12, 2011, [retrieved on Jul. 11, 2016], pp. 1-68. Retrieved from the Internet: <URL http://numenta.com/assets/pdf/whitepapers/hierarchical-temporal-memory-cortical-learning-algorithm-0.2.1-en.pdf>.
Honkela, T., et al., Adaptive Translation: Finding Interlingual Mappings Using Self-Organizing Maps, Artificial Neural Networks—ICANN 2008, 18th International Conference, Prague, Czech Republic, Sep. 3-6, 2008, Proceedings, Part 1, vol. 5163 of the series Lecture Notes in Computer Science, pp. 603-612, Springer Berlin Heidelberg 2008.
International Preliminary Report on Patentability in connection with PCT/EP2013/053546, dated Jun. 16, 2014, 9 pages.
International Preliminary Report on Patentability in connection with PCT/EP2015/067918, dated Feb. 7, 2017, 7 pages.
International Search Report & Written Opinion dated Apr. 26, 2018 in International Patent Application PCT/EP2017/076681 filed Oct. 19, 2017, CORTICAL.IO AG, 15 pages.
International Search Report and Written Opinion for PCT/EP2015/067918, dated Sep. 10, 2015, 9 pages.
International Search Report and Written Opinion in connection with PCT/EP2015/067918, dated Oct. 9, 2015, 10 pages.
International Search Report and Written Opinion in connection with PCT/EP2016/068329, dated Oct. 20, 2016, 10 pages.
International Search Report and Written Opinion of the International Searching Authority in connection with PCT/EP2017/076681, dated Jan. 24, 2018, 16 pages.
International Search Report in connection with WO2016/020368, dated Feb. 11, 2016.
Invitation Pursuant to Article 94(3) and Rule 71(1) EPC in connection with EP12159672, dated Apr. 28, 2015, 6 pages.
Kaski, S., et al., "Websom-Self-Organizing Maps of Document Collections," Neurocomputing, 21(1-3), pp. 101-117, 1998.
Kohonen, T., et al., "Self-Organizing Maps of Symbol Strings," Neurocomputing, 21(1-3), pp. 19-30, 1998.
Kohonen, Teuvo, "The Self-Organizing Map," Proceedings of the IEEE, vol. 78, No. 9, Sep. 1990, 17 pages.
Li, Yuan et al., "Association Retrieve Based on Concept Semantic Space," Journal of Beijing Technology University, pp. 577-580, Dec. 25, 2011.
Merkl, D., "Text Classification with Self-Organizing Maps: Some Lessons Learned," Neurocomputing, 21(1-3), pp. 61-77, 1998.
Non-Final Rejection dated Jan. 3, 2019, in related U.S. Appl. No. 15/219,851, filed Jul. 26, 2016, De Sousa Webber, 23 pages.
Non-Final Rejection dated Mar. 10, 2014, in related U.S. Appl. No. 13/441,138, filed Apr. 6, 2012, De Sousa Webber, 20 pages.
Notice of Allowance dated Sep. 16, 2014, issued in related U.S. Appl. No. 13/441,138, filed Apr. 6, 2012, De Sousa Webber, 26 pages.
Notification of Reasons for Refusal issued by the Japanese Patent Office in patent application No. 2014-561350, dated Feb. 6, 2017, 9 pages.
Observations according to Article 34 PCT in connection with PCT/EP2013/053546, dated Mar. 27, 2014, 8 pages.
Polzlbauer, G., et al., "Advanced Visualization of Self-Organizing Maps with Vector Fields," IEEE Transactions on Neural Networks, 19, pp. 911-922, 2006.
Rachkovskij, et al., Similarity-based Retrieval with Structure-Sensitive Sparse Binary Distributed Representations, Computational Intelligence, Mar. 2012.
Result of Consultation by Telephone in connection with EP12159672, dated Apr. 28, 2015, 6 pages.
Rey, G.D., et al., "Neuronale Netze," 2nd edition, Hans Huber, Hogrefe AG, Bern, 2011.
Ritter, H., et al., "Self-Organizing Semantic Maps," Biological Cybernetics, Springer Verlag Heidelberg, vol. 61, No. 4, Aug. 1, 1989, pp. 241-254.
Rodriguez, A., et al., "Derivation and Analysis of Basic Computational Operations of Thalamocortical Circuits," Journal of Cognitive Neuroscience, 16:5, pp. 856-877, 2004.
Rodriguez, R.J., et al., "Towards a Hierarchical Temporary Memory Based Self-Managed Dynamic Trust Replication Mechanism in Cognitive Mobile Ad-hoc Networks," Proceedings of the 10th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, 2011.
Somervuo, P.J., "Online Algorithm for the Self-Organizing Map of Symbol Strings," Neural Networks, Elsevier Science Publishers, Barking, GB, vol. 17, No. 8-9, Oct. 1, 2004, pp. 1231-1239.
Starzyk, J.A., et al., "Spatio-Temporal Memories for Machine Learning: A Long-Term Memory Organization," IEEE Transactions of Neural Networks, 20(5), pp. 768-780, 2009.
State Intellectual Property Office of the People's Republic of China First Office Action in connection with PCT2013/80013771.8, dated May 19, 2016, 16 pages.
State Intellectual Property Office of the People's Republic of China Second Office Action for 201380013771.8, dated Feb. 4, 2017, 11 pages.
State Intellectual Property Office of the People's Republic of China Third Office Action for 201380013771.8, dated May 31, 2017.
Summons to Attend Oral Proceedings Pursuant to Rule 115(1) EPC in connection with EP12159672, dated Jan. 19, 2016, 6 pages.
Vesanto, J., et al., "Clustering of the Self-Organizing Map," IEEE Transactions on Neural Networks, 11(3), pp. 586-600, 2000.
Written Opinion of the International Preliminary Examining Authority in connection with PCT/EP2013/053546, dated Mar. 11, 2014, 8 pages.
Written Opinion of the International Searching Authority in connection with PCT/EP2013/053546, 8 pages.
Yang, H., et al. "Mining Multilingual Texts Using Growing Hierarchical Self-Organizing Maps," Machine Learning and Cybernetics, 2007 International Conference on Machine Learning and Cybernetics, Hong Kong, Aug. 2007, pp. 2263-2268.

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10885089B2 (en) 2015-08-21 2021-01-05 Cortical.Io Ag Methods and systems for identifying a level of similarity between a filtering criterion and a data item within a set of streamed documents
US11216248B2 (en) 2016-10-20 2022-01-04 Cortical.Io Ag Methods and systems for identifying a level of similarity between a plurality of data representations
US11714602B2 (en) 2016-10-20 2023-08-01 Cortical.Io Ag Methods and systems for identifying a level of similarity between a plurality of data representations
US10726211B1 (en) * 2018-06-25 2020-07-28 Amazon Technologies, Inc. Automated system for dynamically generating comprehensible linguistic constituents
US11734332B2 (en) 2020-11-19 2023-08-22 Cortical.Io Ag Methods and systems for reuse of data item fingerprints in generation of semantic maps

Also Published As

Publication number Publication date
US20160042053A1 (en) 2016-02-11
AU2015299050A1 (en) 2017-01-12
CA2950676A1 (en) 2016-02-11
CA2950676C (en) 2022-07-19
WO2016020368A1 (en) 2016-02-11
JP2017524200A (ja) 2017-08-24
AU2015299050B2 (en) 2020-09-24
KR20170040245A (ko) 2017-04-12
JP6646650B2 (ja) 2020-02-14
EP3178018A1 (en) 2017-06-14
KR102371167B1 (ko) 2022-03-04

Similar Documents

Publication Publication Date Title
US20210049198A1 (en) Methods and Systems for Identifying a Level of Similarity Between a Filtering Criterion and a Data Item within a Set of Streamed Documents
US11714602B2 (en) Methods and systems for identifying a level of similarity between a plurality of data representations
US10394851B2 (en) Methods and systems for mapping data items to sparse distributed representations
US20190332619A1 (en) Methods and systems for mapping data items to sparse distributed representations
US11734332B2 (en) Methods and systems for reuse of data item fingerprints in generation of semantic maps

Legal Events

Date Code Title Description
AS Assignment

Owner name: CORTICAL.IO GMBH, AUSTRIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DE SOUSA WEBBER, FRANCISCO EDUARDO;REEL/FRAME:036237/0574

Effective date: 20150731

AS Assignment

Owner name: CORTICAL.IO AG, AUSTRIA

Free format text: CHANGE OF NAME;ASSIGNOR:CORTICAL.IO GMBH;REEL/FRAME:048188/0543

Effective date: 20171221

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4