AU2014100238A4 - Search methods and systems - Google Patents

Info

Publication number
AU2014100238A4
Authority
AU
Australia
Prior art keywords
text
documents
query
term
terms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2014100238A
Inventor
Phillip Burns
Hamish Ogilvy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sajari Pty Ltd
Original Assignee
Sajari Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sajari Pty Ltd
Application granted
Publication of AU2014100238A4
Anticipated expiration
Ceased

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

SEARCH SYSTEMS AND METHODS

A method and systems for searching a plurality of documents and retrieving one or more documents relevant to a search query, the search query comprising one or more of a plurality of text terms, the method comprising: accepting an input comprising an input search query comprising one or more of a plurality of query text terms; identifying the one or plurality of query text terms in the input search query; for each identified query text term in the input search query: querying a database comprising a representation of each of the plurality of documents to identify one or more of the documents in the database of relevance to the query text term; wherein the representation of each of the plurality of documents in the database comprises an index of document text terms in each document and each of the documents identified as relevant to the text term comprises an intersection between the query text term and the document text terms; assigning a result weight to the one or more documents identified as being of relevance to the query text term; with respect to the result weight assigned to each document identified as relevant to the query text terms, combining the results of each database query to form a result document list comprising a plurality of result documents, each result document being of relevance to at least one of the query text terms; and outputting a representation of each of the result documents.

[Figure 2: flowchart. Search query; for each of the query terms in the search query, conduct a unique DB search to query the document database, retrieve a corresponding result list of documents relevant to that query term and assign a weight to each result list; combine multiple query results into a single result list; query the document database to retrieve a result list of documents of enhanced relevance; accept user input on the relevance of each result and construct a modified search query; refine query terms based on user input.]

Description

SEARCH METHODS AND SYSTEMS

TECHNICAL FIELD

[0001] The present invention relates generally to systems and methods for searching data comprising text portion(s) and in particular to systems and methods for compiling a search methodology and searching of documents in a database and/or corpus of documents (i.e. a large set of text documents, which may either be structured or unstructured) for desired information.

[0002] The present invention further relates to identifying and retrieving documents related to text. More specifically, the present invention relates to identifying and retrieving text portions (or text fragments) of interest from a larger corpus of textual material by generating a list of relevant terms from the textual material, weighting such terms so that they can be used to analyse a database of documents for information and/or documents related to the weighted terms, and conducting searches of related documents using the generated list of terms.

[0003] The invention has been developed primarily for use as a method and system for analysing a text portion, and as search methods and systems for relating such a text portion to related information and/or documents in a database. However, it will be appreciated that the invention is not limited to this particular field of use.

BACKGROUND

[0004] Any discussion of the background art throughout the specification should in no way be considered as an admission that such background art is prior art, nor that such background art is widely known or forms part of the common general knowledge in the field.

[0005] Traditional search works by looking for certain words or phrases (input text) in a collection of documents. Such searches (e.g. from a user seeking information in an internet search engine) also rely on a very small amount of input text, i.e.
a few words (keywords) and, based on that small amount of text, these traditional methods search huge databases and attempt to display as the search output a selection of documents which are (hopefully) relevant to the keywords in the context that the user is seeking. Such traditional search methods work by looking for certain words or phrases in a collection of documents that coincide with words in the initial search string. Typically the first process is to omit all of the documents that do not contain the input text. This is usually done with a reverse index (each term has a list of documents in which it appears) and is often called pruning. Pruning typically provides reasonable results very quickly, but it is also quite simplistic and less useful for more complex searches where the desired context of the results is not well captured by a few keywords.

[0006] This type of traditional search method is very useful when there is little subject matter available to the searcher and the aim is to find subject matter as fast as possible. This type of searching, i.e. text or keyword to relevant document(s), is the most commonly performed search on the internet and perhaps in the world today. Searching on www.google.com using, for example, the keywords "Britney Spears" is a typical example of this type of search. The user (searcher) performing such searches initiates the search with very little information, so the relevance of the documents returned to the user with respect to the keywords is often an estimated output based on the statistically most desired outcome, since the keywords themselves produce a huge number of document matches and yet there is not enough information in the input text to inherently order all these matches in terms of relevance to the particular desires of the user/searcher.
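The reverse-index pruning step described in paragraph [0005] can be illustrated with a minimal sketch. The function names, the whitespace tokenisation and the three sample documents are assumptions made purely for illustration; the specification does not prescribe any particular implementation.

```python
from collections import defaultdict


def build_reverse_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def pruned_search(index, query):
    """Return only the documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    # Pruning: any document missing a single query word is discarded.
    return set.intersection(*postings)


# Hypothetical sample corpus (not taken from the specification).
documents = {
    "d1": "jaguar is a big cat found in the americas",
    "d2": "the jaguar xk8 is a british sports car",
    "d3": "citrus fruit such as lemon and lime",
}
index = build_reverse_index(documents)
print(sorted(pruned_search(index, "jaguar car")))
print(sorted(pruned_search(index, "jaguar")))
```

Note how the single keyword "jaguar" matches documents from two unrelated contexts, while adding a second keyword removes every document that lacks it, which is the limitation the following paragraphs discuss.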
[0007] As the number of documents in the database to be searched becomes large and the amount of input text becomes small, the relevance of the documents in the search results becomes impossible to determine without additional information (i.e. information that is not contained in the initial input text or search query). In the case of internet search engines such as Google™, Yahoo™, Microsoft™ Bing™ and others, the developers of the search algorithms have found ways to improve the relevance of the search results, most notably through the 'PageRank' algorithm of Google™, which essentially uses hypertext link structures to form a popularity index of billions of documents and millions of search terms. Recency and locality have also been used to good effect.

[0008] Popularity works well for internet 'text-to-document' searching, since the popularity methodology finds appropriate information relevant to the input search query in the vast majority of cases. However, this type of searching is limited and often fails to return result documents that are relevant to the context that the user is searching, particularly when the keywords may be used in many unrelated contexts - one of the many examples where such issues may be noticed is a search for the keyword "jaguar". As the reader would appreciate, the term "jaguar" on its own, with no additional keywords providing further contextual information, may relate to a number of unrelated 'concept spaces' including, for example, the Jaguar automobile (e.g. the British luxury and sports car manufacturer or specific vehicles such as the Jaguar XK8); the jaguar which is a big cat - a feline in the Panthera genus, and the only Panthera species found in the Americas; the Jacksonville Jaguars professional American football (NFL) team based in Jacksonville, Florida; the Apple™ operating system OS X 10.2 (code-named Jaguar); and many others.
Such ambiguous terms often have dedicated pages on the internet whose sole purpose is to assist in providing contextual information - e.g. a list of possible contextual spaces relating to the term 'Jaguar' may be found at http://en.wikipedia.org/wiki/Jaguar_(disambiguation). To date, Google™ limits the number of input terms in the search query to 50 terms or 2048 characters. The nature of the Google™ search tends (not always, but generally this is the case) to find fewer results as more information is added to the search query, as additional input text terms are used to exclude (prune) as many documents as possible from the search results. This is not a useful approach when the desired concept about which the user is seeking information is not known. For instance, the user may actually be seeking information about the Icelandic funk band as opposed to the new wave British heavy metal band of the same name. However, due to the popularity rankings employed by most modern search engines, links relating to information about the unwanted British band may be much more popular than similar links to the Icelandic band, and consequently links to the desired context may not appear in the search results (or may appear on subsequent pages of the results). The user typically must therefore think of additional keywords to narrow the search results to the desired concept space, and many iterations of this process may be required until the desired information is found.

[0009] Other traditional search methods use technology based on matching meta information. The meta information is essentially a group of labels (or tags) applied to each document, which allow documents to be aligned in different concept spaces. To stick with the above example, meta data may be inserted in the many internet pages signifying the concept of the information in the pages as, for example, "automobiles", "big cats", "football", or "band".
Additional meta data tags may also be included in the meta information to further narrow or clarify the concept space, e.g. "heavy metal" or "funk". However, such meta tags, while useful in narrowing the search results to the user's desired context if this is known, are still of no use when the context cannot be determined from the keywords in the initial search string. Also, meta searching has several disadvantages, most notably that these tags must be created for every document in the database. This is usually done manually as part of the database input process, which is extremely time consuming and also prevents batch importing of data, although techniques such as Latent Semantic Indexing (LSI) are becoming more popular due to their ability to semantically determine appropriate tags. The second most notable issue is cross-compatibility between different databases. Often each database provider uses different conventions for each meta field, making searching across different platforms virtually impossible. In some cases meta tags are produced automatically, but in many cases this is either simply not practical, highly limiting, or results in large numbers of errors in the information assigned to meta tags for documents in the database. Known methods of concept analysis, including clustering and similarity analysis methods, are described by Rajaraman and Ullman in "Mining of Massive Datasets". Query extension is a method that has been proposed to improve the quality of search results. For example, US Patent No. 8,224,839 (Krupka, et al., assigned to Microsoft Corporation) discusses query extension, which generates a set of context words based on the query terms in the user's initial search query and requires that the user making the initial query select at least one of the extended query terms suggested by the query extension algorithm, the selected terms then being added to the user's initial query.
Also, US6,169,968 (Bowman et al., assigned to Amazon.com, Inc.) discusses a method for refining search queries by analysing historical search query data, suggesting terms which commonly appear together in previous searches, and presenting such potentially related terms to the user for selection as desired. In each case, the known methods for query extension and refinement defer back to the user making the initial query for confirmation that the proposed modified query is relevant. US6,633,868 (to Min et al.) discusses a method for document retrieval based on information contained in a document collection about the statistics of word relationships as a measure of context on a document by document basis. Furthermore, US7,765,855 discusses query refinement by replacement of search terms, for example to correct user errors in the initial query construction such as spelling mistakes, which again presents proposed replacement terms to the user for selection.

[0010] Therefore a need exists for a new approach to text searching, particularly involving searching applications capable of correctly discerning the context or "concept space" in which the user is seeking information without requiring many iterations of search query construction by the user to sufficiently narrow the search result to meet the user's need for information on a particular subject and/or concept.

DEFINITIONS

[0011] The following definitions are provided as general definitions and should in no way limit the scope of the present invention to those terms alone, but are put forth for a better understanding of the following description.

[0012] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein. For the purposes of the present invention, additional terms are defined below.

[0013] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular articles "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and thus are used herein to refer either to one or to more than one (i.e. to at least one) of the grammatical object of the article. By way of example, "an element" refers both to one element and to more than one element.

[0014] The term "about" is used herein to refer to quantities that vary by as much as 30%, preferably by as much as 20%, and more preferably by as much as 10% to a reference quantity. The use of the word "about" (and also terms such as "substantially" and "generally") to qualify a number is merely an express indication that the number is not to be construed as a precise value. Use of such terms as "about", "substantially" and "generally" indicates that some degree of variation from the specific or literal interpretation is intended.

[0015] Throughout this specification, unless the context requires otherwise, the words "comprise", "comprises" and "comprising" and related terms "include", "includes" and "including" will be understood to imply the inclusion of a stated step or element or group of steps or elements but not the exclusion of any other step or element or group of steps or elements.
[0016] The term "real-time", for example in "displaying real-time data", refers to the display of the data without intentional delay, given the processing limitations of the system and the time required to accurately measure and/or process the data.
[0017] As used herein, the term "exemplary" is used in the sense of providing examples, as opposed to indicating quality. That is, an "exemplary embodiment" is an embodiment provided as an example, as opposed to necessarily being an embodiment of exemplary quality.

[0018] The term "text" refers to the representation of written language, where such written language may be the written representation of a spoken language, a computer language or an alternate system of signs for encoding and decoding information, as would be appreciated by the skilled addressee.

[0019] The term "text portion" refers to a section of text comprising at least one 'word', wherein each word may be a written equivalent of a word in the language or alternatively a logical group of signs (e.g. letters, numbers, symbols or other graphemes) having a particular meaning in the language, i.e. the smallest element that in isolation comprises semantic or pragmatic content (with literal or practical meaning).

[0020] The term "text term" refers to an ordered sequence of one or more words, for example a grouping of words with at least one term. A "reference text term" refers to a text term that exists, or is located, in a text portion of a reference document, where a reference document is one of many documents stored in a database upon which searches are made. Each reference document is a potential result of a search. Similarly, an "input text term" refers to a text term located in a text portion of an input document or search string; alternately, it can be thought of as a string in programming terms. The term "global text term" refers to a text term that exists in the global index and hence has associated global weightings.

[0021] The terms "input query" or simply "query" refer to a text string of one or more words upon which a user wishes to base a search to find documents in the database that are of relevance to those words.
The words of the query comprise one or more "query terms", where a single query term may comprise one or more than one word; for example, a query term may comprise two, three, four or more words. In other arrangements, the input query may not be natively formed from text, but may be otherwise represented as a text string; for example, the input query may be a music or audio file, a DNA sequence, a location, a place, a latitude, a longitude, one or more numbers, etcetera.

[0022] The term "input document" refers to a document which contains the input text portion that a user wishes to base a search on to find documents in the database (reference documents) that are of relevance to the input document. In the case of document-to-document searching this is analogous to the input text for an internet search using a search engine such as either Google™ or Yahoo™.

[0023] The term "input text portion" is related to the input document, except in this case the input may be multiple documents, or simply a grouping of one or more text terms. Thus, it is in essence a generalization of the input text to be searched against.

[0024] The term "document" as used herein may relate to the traditional meaning of a document, i.e. comprising a collection of words and sentences (text) relating to a common concept or subject area. Additionally, a document as used herein may be any entity that can be represented as text (even if such textual representation is not common usage). For example, a document may be a text document, an audio file, an audio/visual file, a DNA sequence, a sequence, a location, an entity, a latitude, a longitude, a place, a number, a person, a skill, a job title, a name or a company or any other entity of the like which can be represented in a textual manner such as a sequence of terms.
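As a rough illustration of the "text term" and "query term" definitions in paragraphs [0020] and [0021], the sketch below enumerates every contiguous run of one, two or three words in a query. Treating text terms as contiguous word n-grams, the whitespace tokenisation and the name `extract_text_terms` are assumptions for illustration only; the definitions above do not fix how terms are identified.

```python
def extract_text_terms(text, max_words=3):
    """List every contiguous sequence of 1..max_words words in the text."""
    words = text.lower().split()
    terms = []
    for n in range(1, max_words + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            terms.append(" ".join(words[start:start + n]))
    return terms


terms = extract_text_terms("Icelandic funk band")
print(terms)
```

Under this assumption a three-word query yields three single-word terms, two double-word terms and one triple-word term, each of which could serve as a separate query term.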
[0025] The term "local document index" refers to a database representation of the text portion of a document, either a reference document stored in a database, or a document or text portion input to the system by a user for searching against the reference documents. In the presently described arrangements, the local document index generally comprises the text terms in the document determined from parsing of the document, and a text term weight associated with each term, although other information additional to the text, which is used for calculating the relevance of the results, may also be stored in the local document index. Specifically, a local document index referring to an input document or text portion is referred to as an "input local index" or "input local text term index", and the term "local reference text term index" refers to the local document index formed for each of the reference documents in the database. Similarly, the terms "global text term index" or "global term index" or "global index" refer to an index (different to the local index) stored in the database containing summary information, such as weightings, on each text term across the entire corpus of documents stored in the database.

[0026] The terms "local weighting" or "local text term weight" or similar terms refer to a numeric weighting value associated with a text term in the local text term index(es). Similarly, the terms "global weighting", "global text term weight" or similar terms refer to a numeric weighting value associated with a text term in the global text term index.

[0027] The term "augmented input local text term index" or "re-formed local text term index" refers to the input local text term index after it has been adjusted (re-formed) to reflect user interaction with results retrieved by search queries of the database. Alternatively, the index may be re-formed based on information received from additional or external data source(s).
Usually the augmented index is re-formed by an adjustment of the local text term weightings of text terms stored therein; however, other methods of re-forming the local index are discussed herein.

[0028] The terms "local text term weight" and "global text term weight" (and variations) refer to a numerical score given to each of the text terms in either a local text term index or global text term index respectively, and each weight may be determined from a number of parameters related to each term.

[0029] The term "representative text string" typically refers to a small portion (e.g. a "snippet") of a document which is used to identify the document in search results presented to the user. The representative text string may be a portion of text surrounding one or more text terms in the document which are found to be relevant to the user's query and thus may be useful for the user to be able to determine the relevance of the document without reviewing the whole document. Also, the representative text string may be meta information about a document, such as title, description, location, etc. Also, multiple representative strings and/or multiple snippets may be used for each document or entity.

[0030] The term "intersection" of terms generally refers to its standard meaning in the context of set analysis, where an intersection is found, for example, when two or more documents share a particular text term. Such an intersection may also comprise an intersection between text terms which may be synonyms or otherwise related terms.

[0031] The term "concept space" relates to a conceptual grouping of like ideas or related concepts. A representation of a concept space may comprise associations between every pair of terms and/or a measure of the strength of that association, e.g. a strong association or a weak association.
In a practical sense, a concept space may be a network of terms and their related associations, where the association between two terms is typically represented as a numerical quantity between 0 and 1, and may be computed from the co-occurrence of the terms in a given document collection (i.e. document database). If the association between two terms is zero, then the terms may be said to have no similarity or to be completely unrelated. Typically this implies that the two terms never appear in the same document. Alternatively, if the association between two terms is 1 or close to 1, then the terms are said to have a strong association and are highly related, thus it is likely that the two terms are related to a common concept.

[0032] The present invention is described below with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to aspects and arrangements of the invention. It is to be understood that several blocks of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
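The association measure used for a concept space in paragraph [0031] is described only as a number between 0 and 1 computed from term co-occurrence. One simple measure with those properties, chosen here purely as an assumption, is the Jaccard coefficient over the sets of documents containing each term; the function name and sample documents are likewise hypothetical.

```python
def association(term_a, term_b, documents):
    """Jaccard co-occurrence of two terms over a document collection.

    Returns 0.0 when the terms never share a document and 1.0 when they
    always appear together. This is an assumed measure, not one taken
    from the specification.
    """
    docs_a = {doc_id for doc_id, words in documents.items() if term_a in words}
    docs_b = {doc_id for doc_id, words in documents.items() if term_b in words}
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)


# Hypothetical collection: each document is reduced to its set of terms.
documents = {
    "d1": {"orange", "citrus", "fruit"},
    "d2": {"orange", "citrus", "juice"},
    "d3": {"jaguar", "car"},
}
print(association("orange", "citrus", documents))
print(association("orange", "juice", documents))
print(association("orange", "jaguar", documents))
```

Computing this value for every pair of terms would give the network of associations that the paragraph calls a concept space.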
[0033] These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function/act specified in the block diagrams and/or flowchart block or blocks.

[0034] The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.

[0035] Accordingly, the present invention may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present invention may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this document, a computer-usable or computer-readable medium may be any non-transitory medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

[0036] The system of the present invention may be implemented as a stand-alone system, or use a distributed system including the use of client-server, web-based and/or cloud computing based applications, which may reduce the computational burden on the local computing device and associated display device.
[0037] The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fibre, and a portable compact disc read-only memory (CD-ROM). Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

[0038] Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, preferred methods and materials are described. It will be appreciated that the methods, apparatus and systems described herein may be implemented in a variety of ways and for a variety of purposes. The description herein is provided by way of example only.

SUMMARY

[0039] It is an object of the present invention to substantially overcome or at least ameliorate one or more of the disadvantages of the prior art, or at least to provide a useful alternative.

[0040] It will be appreciated by those skilled in the art that the invention is not restricted in its use to the particular application described. Nor is the present invention restricted in its preferred embodiments or arrangements with regard to the specific features and/or elements described or depicted herein.
It will be appreciated that the invention is not limited to the arrangement or arrangements disclosed, but is capable of numerous re-arrangements, modifications and/or substitutions without departing from the scope of the invention as set forth and defined by the following statements and numbered items.

[0041] The search methodologies disclosed herein are referred to as "Shotgun" search methods, and systems using such Shotgun search methods are referred to as Shotgun search systems. Similarly, a query search initiated by a Shotgun search system, or which otherwise uses a Shotgun search method, is referred to as a "Shotgun search". The aim of the Shotgun search methods and systems disclosed herein is to improve search results by automatically extending user queries into related concept spaces. Additionally, they are designed to allow users to refine the context of their initial search results to produce better results. Such conceptual refinement is also a part of a machine learning process that allows the Shotgun search algorithms to make better decisions for future user searches without requiring as much, or potentially any, additional user input. Generally, Shotgun search methods and systems take a user input query with a finite number of query terms (which may range from as few as one or two terms to many hundreds or thousands of terms in the case of a document used as the input search query) and multiply the total number of search terms many times over using additional terms which are known to have some particular relationship to each of the input query terms. Then, for each of the input query terms and their associated additional terms added by the Shotgun search method, many parallel search queries are executed and the results of each of the parallel search queries are combined to form a final search result listing.
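The expansion-then-parallel-search flow outlined in paragraph [0041] might be sketched as follows. Everything named here is hypothetical: the `RELATED_TERMS` table stands in for whatever source of related terms a real system would use, `search_one_term` is a toy single-term search, and a thread pool is only one assumed way of running the sub-queries in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of related terms for each input query term.
RELATED_TERMS = {"orange": ["citrus", "mandarin"], "juice": ["drink"]}


def expand_query(query_terms, related_terms):
    """Multiply the query terms by appending their known related terms."""
    expanded = []
    for term in query_terms:
        for candidate in [term] + related_terms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search_one_term(term, documents):
    """Toy sub-query: ids of the documents containing the single term."""
    return [doc_id for doc_id, text in documents.items()
            if term in text.lower().split()]


def shotgun_search(query_terms, related_terms, documents):
    """Run one sub-query per expanded term and keep a result list for each."""
    sub_queries = expand_query(query_terms, related_terms)
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as sub_queries.
        result_lists = list(pool.map(
            lambda term: search_one_term(term, documents), sub_queries))
    return dict(zip(sub_queries, result_lists))


documents = {
    "d1": "fresh orange juice",
    "d2": "citrus fruit salad",
    "d3": "a cold drink",
}
results = shotgun_search(["orange", "juice"], RELATED_TERMS, documents)
print(results)
```

In this toy run a two-term query becomes five sub-queries, and documents "d2" and "d3" are reached even though they contain neither of the original query words.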
[0042] The Shotgun search methods and systems disclosed herein incorporate the following advantages:

[0043] Turn one query into many parallel queries and output combined results, comprising the results of each of the parallel searches combined into a single set. Results from each of the parallel searches may be weighted either collectively (i.e. all results from a particular search) or individually (i.e. each individual result from all the parallel searches is given a unique weighting). The Shotgun search methods are further adapted to accomplish this even for input search queries comprising many thousands of input search query terms and thus thousands of parallel sub-queries. In particular arrangements, each of the results from all the parallel searches is given a weighting which comprises a weight depending upon the particular search the result arose from, in combination with an individual weight depending on the context of the result itself.

[0044] Allow the parallel queries to be weighted differently, so the combined result set order can be dynamically changed by raising and reducing the importance of each query when combining. In this form, the input query can be conceptually transformed into an input array with a series of "key : value" pairs representing each "sub-query : dynamic importance score" combination respectively. This technique allows the context of the results to be adjusted on the fly.

[0045] Accept greater input information, such as using full documents to initiate a search.

[0046] Not rely on pruning techniques to throw documents out of the potential match pool, as this can typically omit useful results simply because they are missing one of the search terms, even though they may contain other equivalent ones, e.g. "orange" vs "citrus".

[0047] Combine full term and concept spaces.
Typically, term and concept spaces do not coexist in search methods and the subsequent results, as a full term space typically reduces the calculation complexity for smaller queries (i.e. a few words), but a concept space reduces the calculations for large input queries (i.e. full documents). The Shotgun search methods disclosed herein combine both term and concept spaces to provide an improved result set from a particular query.

[0048] According to a first aspect, there is provided a method for searching a plurality of documents and retrieving one or more documents relevant to a search query. The search query may comprise one or a plurality of text terms. The method may comprise accepting an input comprising an input search query. The input search query may comprise at least one or a plurality of query text terms. The query text terms may comprise either one or more single word text terms, one or more double word text terms, triple word text terms or text terms comprising four or more words.

[0049] The method may further comprise identifying said one or plurality of query text terms in the input search query.

[0050] For each identified query text term in the input search query, the method may further comprise querying a database comprising a representation of each of the plurality of documents to identify one or more of the documents in said database of relevance to the query text term. The representation of each of the plurality of documents in the database may comprise an index of document text terms in each document, and each document identified as relevant to the text term may comprise an intersection between the query text term and the document text terms.

[0051] The method may further comprise assigning a result weight to the one or more documents identified as being of relevance to the query text term.
[0052] With respect to the result weight assigned to each document identified as relevant to the query text terms, the method may further comprise combining the results of each database query to form a result document list. The result document list may comprise a plurality of result documents. Each result document may be of relevance to at least one of the query text terms.

[0053] The method may further comprise outputting a representation of each of said result documents.

[0054] According to an arrangement of the first aspect, there is provided a method for searching a plurality of documents and retrieving one or more documents relevant to a search query, said search query comprising one or a plurality of text terms, said method comprising: accepting an input comprising an input search query comprising one or more of a plurality of query text terms; identifying said one or plurality of query text terms in said input search query; for each identified query text term in said input search query: querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term; wherein said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms; assigning a result weight to said one or more documents identified as being of relevance to said query text term; with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and outputting a representation of each
of said result documents.

[0055] The outputting of a representation of each of the result documents may comprise ordering each result document on the basis of the relevance to each query text term and the result weight assigned to the identified documents. The method may further comprise outputting a representation of each of the result documents as an ordered result list of the plurality of result documents.

[0056] The representation of each result document may comprise a snippet of an associated one of the result documents. The document snippet may be adapted for identification of the associated result document.
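The steps of the first aspect (one query per term, an intersection with each document's term index, a result weight, a combined and ordered list, and a snippet per result) might be sketched as follows. This is a hedged illustration only: the names `build_index` and `search`, whitespace tokenisation, additive combination of weights and the fixed-length snippet are all assumptions, not details given in the specification.

```python
def build_index(documents):
    """Represent each document by the set of lower-cased terms it contains."""
    return {doc_id: set(text.lower().split()) for doc_id, text in documents.items()}


def search(weighted_terms, documents, snippet_length=40):
    """Query once per term, weight the hits, then combine and order them.

    weighted_terms maps each query term to an assumed importance score,
    i.e. the "sub-query : dynamic importance score" pairs of the description.
    """
    index = build_index(documents)
    scores = {}
    for term, weight in weighted_terms.items():
        for doc_id, doc_terms in index.items():
            if term in doc_terms:  # query term intersects the document's terms
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    # Each result is represented by its id, combined weight and a short snippet.
    return [(doc_id, scores[doc_id], documents[doc_id][:snippet_length]) for doc_id in ranked]
```

Because the weights live in the query rather than in the index, raising or lowering one value in `weighted_terms` reorders the combined list without re-indexing.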
[0057] The step of identifying said one or plurality of query text terms in said input search query may comprise identifying one or a plurality of query text terms in said input search query. The step of identifying said one or plurality of query text terms in said input search query may further comprise identifying one or more additional text terms of relevance to at least one of said identified query text terms, said additional text terms not being in said input query. The method may further comprise the step of forming one or a plurality of augmented term sets, each augmented term set comprising at least one identified query text term and one or more of the additional text terms of relevance.

[0058] For each identified query text term in the input search query and each additional text term in each augmented term set, the method may comprise: querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term or to said additional text term in each augmented term set; wherein the representation of each of the plurality of documents in the database may comprise an index of document text terms in each document and each of the documents identified as relevant to the text term may comprise an intersection between either the query text term or the additional text term in each augmented term set and the document text terms; and assigning a result weight to one or more of the documents identified as being of relevance to either the query text term or to the additional text term in each augmented term set.
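As a rough sketch of the augmented term sets described in paragraphs [0057] and [0058], the example below builds one set per query term and queries an index once for every term in every set. The related-term table, the inverted-index layout, the per-set weights and both function names are hypothetical; the specification does not prescribe any of them.

```python
def build_augmented_term_sets(query_terms, related_terms):
    """Form one term set per query term: the term plus related terms not in the query."""
    query = list(dict.fromkeys(query_terms))  # drop duplicates, keep order
    term_sets = []
    for term in query:
        extras = [t for t in related_terms.get(term, []) if t not in query]
        term_sets.append([term] + extras)
    return term_sets


def search_term_sets(term_sets, set_weights, inverted_index):
    """Query the index for every term of every set, weighting hits by their set."""
    scores = {}
    for terms, weight in zip(term_sets, set_weights):
        for term in terms:
            for doc_id in inverted_index.get(term, ()):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores
```

Here each set carries a single assumed weight, so lowering the weight of one concept demotes every document found through any term of that set.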
[0059] The step of combining the results of each database query may comprise: with respect to said weights assigned to each document identified as relevant to the query text terms or the additional text terms, combining the results of each database query to form a result document list comprising a plurality of result documents. Each result document may be of relevance to at least one of the query text terms and/or at least one of the additional text terms in each augmented term set.

[0060] The query text term and associated additional text terms forming each said augmented term set may relate to a common concept. The terms forming each said augmented term set may relate to a common concept.

[0061] The method may further comprise displaying said one or a plurality of augmented term sets to the user on a visual display apparatus. The visual display apparatus may provide one or more user input fields adapted to accept user input on the significance or relevance of each said augmented term set with respect to the user's concept or concepts of interest. The method may further comprise, with respect to said user input on the significance or relevance of each said augmented term set, assigning relevance weights to each said augmented term set. The result weights may be equivalent to or derived from said relevance weights.

[0062] For each augmented term set, the querying of the database may be conducted in parallel.

[0063] The additional text terms of relevance to at least one of the query text terms may be associated with a concept space with respect to the at least one query text term. The additional text terms of relevance may be related to at least one concept associated with one or more of the query text terms.

[0064] The input search query may comprise a textual representation of an entity. Each of the plurality of documents may be a textual representation of an entity.
The entity may be selected from the group comprising: a text document, an audio file, an audio/visual file, a DNA sequence, a sequence, a location, an entity, a latitude, a longitude, a place, a number, a person, a skill, a job title, a name or a company, etc. The entity may be selected from the group of: an individual; an organisation; or an object, wherein the entity is identifiable as something that exists by itself.

[0065] The input search query may be received from a user desirous of conducting a search based on the input search query. Alternatively, the input search query may be automatically generated on behalf of a user and a search conducted on the automatically generated input search query on behalf of the user.
[0066] The weight associated with or assigned to each document identified as relevant to the query text terms or the additional text terms may be either a positive or a negative weight. A negative weight is indicative of a result document that is not, or is likely not, related to the context of the input search query.

[0067] In a particular arrangement, the input search query may be generated by a computer process in response to a predetermined set of conditions to identify one or more documents of relevance to one or more desired outcomes of said computer process.

[0068] The method of the first aspect may further comprise accepting user input on the relevance of one or more of the documents in the result document list. The user input may comprise either a positive indication of relevance or a negative indication of relevance, wherein the positive and negative indications of relevance indicate respectively that the document is or is not relevant to the context of interest of the user.

[0069] Based on the user input on the relevance of one or more of the documents in the result document list, the method may further comprise: determining further text terms in documents receiving a positive indication of relevance and adding one or more of the further text terms to the input search query to form an augmented search query. The method may further comprise repeating the step of querying the database using the augmented search query as the input search query to identify one or more documents of enhanced relevance to the text terms in the augmented search query and assigning weights to the one or more identified documents of enhanced relevance.
[0070] With respect to said result weight assigned to each said document identified as relevant to said query text terms, the method may further comprise combining the results of each of the database queries to form an augmented result document list comprising a plurality of result documents, each result document being of enhanced relevance to at least one of the query text terms.

[0071] The method may further comprise outputting a representation of each of the result documents in the augmented result document list.

[0072] The method may further comprise, based on the user input on the relevance of one or more of the documents in the result document list, determining further text terms in documents receiving a negative indication of relevance and, where one or more of said further text terms are comprised in said input search query, subtracting one or more of said terms from the input search query corresponding to one or more of said further text terms, thereby forming an augmented search query.

[0073] The method may further comprise repeating the step of querying the database using the augmented search query as the input search query to identify one or more documents of enhanced relevance to the query text terms and assigning weights to one or more of the identified documents based on the weights assigned to each text term in said augmented search query.

[0074] The method may further comprise, with respect to the result weight assigned to each document identified as relevant to the query text terms, combining the results of each database query to form an augmented result document list. The augmented result document list may comprise a plurality of result documents. Each result document may be of enhanced relevance to at least one of the query text terms.

[0075] The method may further comprise outputting a representation of each of the result documents in the augmented result document list.
[0076] The method may further comprise, based on the user input on the relevance of one or more of the documents in the result document list: determining further text terms in documents receiving either a positive or a negative indication of relevance and, where one or more of the further text terms are comprised in the input search query, adjusting the result weights assigned to each of the one or more documents based on the positive or negative indications of relevance.

[0077] The method may further comprise forming an augmented result document list.
[0078] The method may further comprise outputting a representation of each of the documents in the augmented document list.

[0079] The method may further comprise, based on the user input on the relevance of one or more of the documents in the result document list, identifying one or more positive text terms in each document receiving a positive indication of relevance; assigning each of said positive text terms a positive weight; and either: where said input search query does not comprise said positive text term, adding said positive text term with its associated positive weight to said input search query to form an augmented search query; or, where said input search query does comprise said positive text term, increasing the weight assigned to said text term in said search query.

[0080] The method may further comprise identifying one or more negative text terms in each document receiving a negative indication of relevance; assigning each of the negative text terms a negative weight; and either: where the input search query does not comprise the negative text term, adding the negative text term with its associated negative weight to the input search query to form the augmented search query; or, where the input search query does comprise the negative text term, either decreasing the weight assigned to the text term in the search query or deleting the negative text term from the input search query.

[0081] The method may further comprise repeating the step of querying the database using the augmented search query as the input search query to identify one or more documents of enhanced relevance to text terms in the augmented search query and assigning weights to the one or more identified documents based on the weights assigned to each text term in the augmented search query.
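The feedback rules of paragraphs [0079] and [0080] can be illustrated with a small sketch. It assumes the query is held as a mapping of term to weight, that feedback is given as +1 or -1 per document, and that every term moves by a fixed `step`; these representations and the function name `apply_feedback` are assumptions for illustration. The alternative of deleting a negative term from the query is not shown.

```python
def apply_feedback(query, doc_terms, feedback, step=1.0):
    """Return an augmented query after positive and negative document feedback.

    query:     {term: weight} for the current search query
    doc_terms: {doc_id: iterable of terms found in that document}
    feedback:  {doc_id: +1 for relevant, -1 for not relevant}
    """
    augmented = dict(query)  # leave the caller's query untouched
    for doc_id, sign in feedback.items():
        for term in set(doc_terms[doc_id]):
            # A term missing from the query is added with a positive or negative
            # weight; a term already present has its weight raised or lowered.
            augmented[term] = augmented.get(term, 0.0) + sign * step
    return augmented
```

The returned mapping can then be used as the input search query for the repeated database query described in paragraph [0081].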
[0082] The method may further comprise, with respect to the result weight assigned to each document identified as relevant to the query text terms, combining the results of each database query to form an augmented result document list comprising a plurality of result documents. Each result document may be of enhanced relevance to at least one of the query text terms.

[0083] The method may further comprise outputting a representation of each of the result documents in the augmented result document list.

[0084] The input search query may comprise a plurality of words and each identified text term may comprise one or more words. The identified text terms may comprise either a single word, or two or more words adjacent in the search query.

[0085] The identified text terms may comprise one or more single word text terms, one or more double word text terms and/or one or more triple word text terms.

[0086] According to a second aspect, there is provided a system for searching a plurality of documents and retrieving one or more documents relevant to a search query. The system may comprise an input interface adapted to receive an input search query comprising at least one or a plurality of query text terms. The system may further comprise a parsing module adapted to identify said one or plurality of query text terms in said input search query.
The system may further comprise a query module adapted to, for each identified query text term in said input search query:
* query a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term;
* wherein said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms.

[0087] The system may further comprise an assignment module adapted for assigning a result weight to the one or more documents identified as being of relevance to the at least one or plurality of query text terms. The system may further comprise a collation module adapted to, with respect to the result weight assigned to each document identified as relevant to the query text terms, combine the results of each database query to form a result document list comprising a plurality of result documents, each result document being of relevance to at least one of the query text terms. The system may further comprise a visual display apparatus adapted to display a representation of each of the result documents.
[0088] According to an arrangement of the second aspect, there is provided a system for searching a plurality of documents and retrieving one or more documents relevant to a search query, said system comprising: an input interface adapted to receive an input search query comprising at least one or a plurality of query text terms; a parsing module adapted to identify said one or plurality of query text terms in said input search query; a query module adapted to, for each identified query text term in said input search query, query a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term, wherein said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms; an assignment module adapted for assigning a result weight to said one or more documents identified as being of relevance to said at least one or plurality of query text terms; a collation module adapted to, with respect to said result weight assigned to each said document identified as relevant to said query text terms, combine the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and a visual display apparatus adapted to display a representation of each of said result documents.
[0089] In the system of the second aspect, the parsing module may be further adapted to:
* identify the one or plurality of query text terms in the input search query;
* identify one or more additional text terms of relevance to at least one of the identified query text terms, the additional text terms not being in the input query; and
* form one or a plurality of augmented term sets, each term set comprising at least one identified query text term and one or more of the additional text terms of relevance.

[0090] In the system of the second aspect, the query module may be further adapted to, for each identified query text term in said input search query and each additional text term in said augmented term sets:
* query a database comprising a representation of each of the plurality of documents to identify one or more of the documents in said document database of relevance to said query text term or to the additional text term in each augmented term set;
* wherein the representation of each of the plurality of documents in the document database comprises an index of document text terms in each document and each document identified as relevant to said text term comprises an intersection between either the query text term or said additional text term in each said augmented term set and the document text terms.

[0091] The assignment module may be adapted to assign a result weight to said one or more documents identified as being of relevance to either said query text term or to said additional text term in each said augmented term set.
[0092] The system may further comprise a refinement module adapted to:
* accept user input on the relevance of one or more of the documents in the result document list, wherein the user input comprises either a positive indication of relevance or a negative indication of relevance, wherein the positive and negative indications of relevance indicate respectively that the document is or is not relevant to the context of interest of the user;
* based on the user input, reform the term sets to form at least one or a plurality of refined term sets;
* using the query module, query said document database to identify one or more of said documents in the document database of relevance to the query text term(s) or to the additional text term(s) in each refined term set;
* using the collation module, combine the results of each database query to form a refined result document list comprising a plurality of refined result documents, each result document being of relevance to at least one of the query text terms with respect to the result weight assigned to each document identified as relevant to the refined term sets; and
* output a representation of each of said refined result documents to said visual display apparatus.

[0100] According to a third aspect, there is provided a system adapted to search a plurality of documents and retrieve one or more documents relevant to a search query, said search query comprising one or a plurality of text terms. The system may comprise an input module adapted to accept an input comprising an input search query comprising one or more of a plurality of query text terms. The system may further comprise a parsing module for identifying said one or plurality of query text terms in said input search query.
The system may further comprise a document database comprising a representation of each of said plurality of documents to be searched, wherein said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document. The system may further comprise memory for storing said input search query and at least a representation of one or more documents relevant to said search query. The system may further comprise a processor module for executing a process comprising the steps of, for each identified query text term in said input search query:
* query the document database to identify said one or more of said documents in the document database of relevance to said query text term, wherein each document identified as relevant to each text term comprises an intersection between said query text term(s) and the document text terms;
* assign a result weight to the one or more documents identified as being of relevance to the query text term(s);
* with respect to the result weight assigned to each document identified as relevant to the query text term(s), combine the results of each database query to form a result document list comprising a plurality of result documents, each result document being of relevance to at least one of the query text term(s); and
* output a representation of each of the result documents to an output device or visual display apparatus.

[0101] The output device may be a visual display. The system may further comprise means for accepting user input. The user input may comprise a positive or negative indication of relevance with respect to each of the result documents.

[0102] The processor may be further adapted to form a refined search query based on the user input and query the database to identify one or more documents in the database of enhanced relevance to the text terms of the augmented search query.
[0103] According to a fourth aspect, there is provided a method for searching a plurality of documents and retrieving one or more documents relevant to a search query. The method may comprise accepting an input comprising an input search query comprising at least one, or a plurality of, query text terms. The method may further comprise querying a database to identify at least one additional text term that is conceptually related to at least one query text term. The method may further comprise adding said identified conceptually related terms to the original query text terms to form an augmented query. The method may further comprise querying a database with the augmented query to retrieve relevant documents to said query.

[0104] According to an arrangement of the fourth aspect, there is provided a method for searching a plurality of documents and retrieving one or more documents relevant to a search query, the method comprising the steps of:
a) accepting an input comprising an input search query comprising at least one, or a plurality of, query text terms;
b) querying a database to identify at least one additional text term that is conceptually related to at least one query text term;
c) adding said identified conceptually related terms to the original query text terms to form an augmented query;
d) querying a database with the augmented query to retrieve relevant documents to said query.

[0105] The method may further comprise assigning a weight to each of the conceptually related text terms. The weight may be based on how closely the query term and the conceptually related term are related. The weight may be either positive or negative. A negative weight may indicate that the conceptually related text term represents an undesirable related concept. The weight may be determined by the statistical probability of said input text term appearing in a reference document with said conceptually related text term.
A negative weight may be assigned to the additional terms to reduce the occurrence of result items with undesired context.

[0106] According to a fifth aspect there is provided a method for indexing a plurality of documents. Each document may comprise a text portion. The method may comprise the step of parsing the text portion of each of the plurality of documents to form a plurality of respective local document indexes. Each local document index may be associated with a respective document. The local document index may be stored in a database, or alternatively in a file, or set of files. Each local document index may comprise a plurality of local text terms contained in the respective document. Each local document index may further comprise a local weighting associated with each text term. The method may further comprise the step of forming a global document index. The global document index may be formed from the plurality of local document indexes. The global document index may comprise a plurality of global text terms contained in the plurality of documents. The global document index may further comprise a global weighting associated with each global text term. The global weighting associated with each of the global text terms may be determined with respect to a parameter associated with a reference global text term. The global weighting associated with each of the global text terms may be determined with respect to a plurality of parameters, each parameter being associated with a respective reference global text term.
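A minimal sketch of the two-level indexing of the fifth aspect follows. The specification leaves the weighting formulas open, so two assumptions are made purely for illustration: the local weighting is the raw count of a term in its document, and the global weighting is the document count of a chosen reference term divided by the document count of the term itself. The function names and the whitespace tokenisation are likewise hypothetical.

```python
from collections import Counter


def local_index(text):
    """Local document index: each term in one document with an assumed
    local weighting equal to its number of occurrences."""
    return Counter(text.lower().split())


def global_index(local_indexes, reference_term):
    """Global document index formed from the local indexes.

    Assumed global weighting: (documents containing the reference term)
    divided by (documents containing the term). Rarer terms score higher.
    """
    doc_freq = Counter()
    for index in local_indexes.values():
        doc_freq.update(index.keys())  # count each term once per document
    if reference_term not in doc_freq:
        raise ValueError("reference term does not occur in any document")
    reference_count = doc_freq[reference_term]
    return {term: reference_count / count for term, count in doc_freq.items()}
```

Under these assumptions a very common term such as "the" makes a convenient reference, since its global weighting is 1.0 and every rarer term is scaled relative to it.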
[0107] In an arrangement of the fifth aspect, there is provided a method for indexing a plurality of documents, each document comprising a text portion, the method comprising the steps of:
a) parsing the text portion of each of the plurality of documents to form a plurality of respective local document indexes each associated with a respective document, and storing the local document index in a database, wherein each local document index comprises a plurality of local text terms contained in the respective document and a local weighting associated with each text term; and
b) from the plurality of local document indexes, forming a global document index comprising a plurality of global text terms contained in the plurality of documents, and a global weighting associated with each global text term; wherein the global weighting associated with each of the global text terms is determined with respect to a parameter associated with a reference global text term.

[0108] The local document index may be stored as a single logical computer readable file comprising the text terms and associated weightings or alternatively as a related set of logical computer readable files, wherein each individual text term, or a group of text terms, is stored as a distinct logical computer readable file comprising associated details and/or weightings associated with the respective text term.

[0109] The global weighting associated with each of the global text terms may be further determined with respect to the number of documents in which each global text term appears across all the plurality of documents. The global weighting associated with each of the global text terms may be determined with respect to the number of documents in which the reference text term appears.

[0110] The global weighting associated with each of the global text terms may be further determined with respect to user interactions.
Additionally or alternatively, the global weighting associated with each of the global text terms may be further determined with respect to additional and/or external information sources. The local weighting associated with each text term may comprise a combination of a plurality of weightings, each associated with each local text term. One or more of the plurality of weightings may be determined with respect to one or more parameters selected from the group consisting of: the number of times the term appears in a single document; the number of times the term appears in all the plurality of documents; the position of the text term in a document; the capitalisation of the term; punctuation surrounding the term; words in the text portion adjacent to the term; word rarity; word sequence; combinations of text terms; the number of words in each text term; a user-defined weighting; or other suitable parameters as would be appreciated by the skilled addressee. In addition or alternatively, the one or more of the plurality of weightings may be selected from the group of: font size of a word or text term, font family, font weight, font style, font decoration, font colour, subscript, superscript, where the text term appears in the document structure (e.g. in a heading, comment, footnote, headers, footers, or in the document's meta information), and whether the term is identified as an entity and which type of entity. The global weighting associated with each global text term may be further determined with respect to the local weighting of each text term.
The local weighting for a particular text term may be different when associated with different documents, thus resulting in a plurality of term weightings for the particular text term, one for each document in which the text term appears, and the global weighting may be determined with respect to a combination of the plurality of local weightings for the particular text term.
[0111] In any one or any suitable combination of the above aspects and/or arrangements, the weighting may be a positive weighting or a negative weighting. Where one or more of the plurality of weightings is a negative weighting for a selected global text term, the selected global text term may be assigned a zero weighting. Alternatively, the weighting may be selected from a graduated scale of weightings from positive to negative, for example from a scale comprising for instance the graduated weightings: (very bad)--(bad)--(neutral)--(good)--(very good).
[0112] The user-defined weighting may be derived from a self-learning system comprising a plurality of user-defined weightings for a selected text term, either local text term(s) or global text term(s) or both.
[0113] In any one or any suitable combination of the above aspects and/or arrangements, a plurality of text terms may be identified in the input query. Each of the plurality of text terms may be assigned at least one associated term weight. The at least one or plurality of text terms may comprise single word terms within the input text portion. The relevant terms may comprise double-word terms within the input text portion. The at least one or plurality of text terms may comprise triple-word terms within the input text portion.
The text portion may comprise a large number of text terms, for example up to or more than 5 text terms, or up to or more than 10, up to or more than 20, up to or more than 50, up to or more than 100, up to or more than 500, up to or more than 1000, up to or more than 5000, up to or more than 10000, up to or more than 20000, up to or more than 50000, up to or more than 100000, up to or more than 250000, up to or more than 500000, or up to or more than 1000000 text terms, and may depend on available processing capabilities.
[0114] The input search query may be a text string comprising a plurality of text words. The input search query may be a text document. The input search query may be selected from one or more of the group of: a text string comprising one or more words; a text document; a book; an article; a text record; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a patent specification; a resume; a curriculum vitae; a social or other profile; a legal transcript; a legal document; a DNA sequence; or a news report. The search query may comprise a large number of words, for example up to or more than 5 words, up to or more than 10, up to or more than 20, up to or more than 50, up to or more than 100, up to or more than 500, up to or more than 1000, up to or more than 5000, up to or more than 10000, up to or more than 20000, up to or more than 50000, up to or more than 100000, up to or more than 250000, up to or more than 500000, or up to or more than 1000000 words.
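By way of illustration only, the single-, double- and triple-word terms of paragraph [0113] and the local and global document indexes of paragraphs [0107] to [0109] might be sketched as follows. The function names, the use of a raw term count as the local weighting, and the logarithmic ratio against the document count of the reference term are all assumptions made for this sketch; the specification does not prescribe any particular formula.

```python
import math
import re
from collections import Counter


def extract_terms(text, max_words=3):
    """Return every single-, double- and triple-word term in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms


def build_local_index(text):
    """Local index: term -> local weighting (assumed here to be a raw count)."""
    return dict(Counter(extract_terms(text)))


def build_global_index(local_indexes, reference_term):
    """Global index: term -> weighting relative to a reference term.

    Assumption: weighting = log(df(reference) / df(term)), where df is the
    number of documents containing the term, so rarer terms score higher.
    """
    doc_freq = Counter()
    for local in local_indexes.values():
        doc_freq.update(local.keys())
    ref_df = doc_freq[reference_term]
    if ref_df == 0:
        raise ValueError("reference term does not appear in any document")
    return {term: math.log(ref_df / df) for term, df in doc_freq.items()}


# Hypothetical three-document collection.
documents = {
    "doc1": "the search engine ranks the documents",
    "doc2": "the query terms select documents",
    "doc3": "the index stores weighted terms",
}
local_indexes = {doc_id: build_local_index(text) for doc_id, text in documents.items()}
global_index = build_global_index(local_indexes, reference_term="the")
```

In this sketch a very common word is chosen as the reference term, so that it receives a global weighting of zero and every rarer term receives a positive weighting. Other reference terms or other parameters could equally be used.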
[0115] The reference documents may be text documents representative of a document selected from one or more of the group of: a text string comprising one or more words; a text document; a book; an article; a text record; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a patent specification; a resume; a curriculum vitae; a social or other profile; a legal transcript; a legal document; a DNA sequence; or a news report.
[0116] A plurality of text terms may be identified in the input query. Each of the plurality of text terms may be assigned at least one associated term weight. The associated term weight(s) may be determined with reference to the global term index.
[0117] The at least one or plurality of text terms may comprise single word terms within the input text portion. The relevant terms may comprise double-word terms within the input text portion. The at least one or plurality of text terms may comprise triple-word terms within the input text portion.
[0118] The input query may be a text string comprising a plurality of text words. The input query may be a text document. The input query may be selected from one or more of the group of: a text string comprising one or more words; a text document; a book; an article; a text record; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a patent specification; a resume; a curriculum vitae; a legal transcript; a legal document; or a news report.
[0119] The reference documents may be text documents, or documents comprising a text portion, representative of a document selected from one or more of the group of: a book; an article; a text record; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a patent specification; an employment advertisement; a legal transcript; a legal document; or a news report.
As part of the searching process, each of the reference documents may be assigned a document relevance score representative of the relevance to the input text portion.
[0120] In any one of the above aspects and/or arrangements, further information may be obtained from, for example, external data source(s) with respect to one or more of the reference documents which form the result of a search query. The further information may be utilised to re-form the input query on the basis of the information from the external data source(s). The external data source(s) may comprise the internet including for example internet accessible database(s) and/or internet web page(s), or alternatively news information source(s), financial and/or stock information source(s), scientific information source(s), professional society information source(s), and the like, and may be primary, secondary and/or tertiary information source(s). Alternatively, the (potentially external) data source may comprise user usage data, i.e. user usage patterns as discussed below.
[0121] In particular arrangements, the external data source may be a log file comprising usage patterns of a particular user. The log file may not, in fact, be external. Indeed it may be local in the sense that it may be located on the same computer system being accessed by a user of the search systems & methods. Use of the term "external" in this context is meant to imply that the data source is not contained within the system or system architecture being used to carry out the search, i.e. the data source may be external to the search system, but may conceivably be located on the same computer system as the search system.
[0122] In a particular example, such a user activity log file may comprise data on the particular web sites a user visits and possibly even how long they stayed on each web site (such duration data may be indicative of a user being interested in the content of a particular web site, say if they spent more time looking at one site over other sites). Another example of data which may be included in such a user activity log file could be the particular programs on the user's computer that they use, and how much time is spent on each program (for example, if a user opens a photo editing software program such as for example Adobe™ Photoshop™, and spends many hours in that program, then this is likely to indicate that that user is interested in photo editing, or more simply just photography in general). The user activity log file may also include information about emails which the user receives (and the duration spent reading individual emails) or sends and to whom (for example if the user receives emails from and/or sends emails to individuals who are photographers and/or if the content of the emails received and/or sent is related to photography, then this is likely to indicate that that user is interested in photo editing, or more simply just photography in general). Such indication may be combined with information obtained from the user's other usage habits (such as using photo editing software packages) to enhance the confidence to be able to determine particular subject areas, and hence particular content, that the user may be interested in. In light of the use of such user activity log file(s), arrangements of the above aspects may comprise usage of such a user activity log file as an input search query to a search system (such as the Shotgun search systems & methods disclosed herein).
It will be appreciated that such a user activity log file would typically be generated by a computer process and may be generated either with or without direct input from the user. The user activity log file may then form the basis for the search system to identify one or more documents of relevance to the particular user based on their recent or historical usage patterns. The log file may comprise data from only a short time period on the order of one, two or three weeks to ensure that the documents retrieved by the search system are relevant to the user's most recent interests. Alternatively, the user activity log file may comprise all historical data for a particular user which would be able to, in conjunction with the search system, identify long term trends in the user's interests and to retrieve documents relevant to those interests at regular intervals, say every day, once a week, once a fortnight, once a month etc. depending upon the requirements of the user. A good example of usage of such user activity log files in relation to search methods disclosed herein would be in automatic news curation whereby the search system initiates a search of available news sources using the user activity log file as the input query to retrieve news items which are relevant or possibly relevant to the user's recent or historical activity and interaction data.
[0123] In a particular arrangement, there may be a dedicated computer process adapted to compile one or more data sources (such as, for example, user activity usage data), and periodically use one or more of such data sources as an input search query to identify one or more documents of relevance to the data comprised in the one or more data sources to achieve one or more desired outcomes of said computer process, such as, for example, to retrieve an article (e.g. news article, blog post, product review etc.) relevant to the recent interests and/or activities of the user.
It will be appreciated that the user may be given the ability to configure the computer process in terms of the type of data that is compiled and also the results obtained by the computer process. The data sources may be generated by the computer process in response to a predetermined set of conditions.
[0124] In any one of the above aspects and/or arrangements, the input search query may be a text string comprising a plurality of text words. The input search query may be a text document. The input search query may be selected from one or more of the group of: a text string comprising one or more words; a text document; a book; an article; a text record; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a patent specification; a resume; a curriculum vitae; a legal transcript; a legal document; or a news report. The text portion may comprise a large number of words, for example up to or more than 5 words, up to or more than 10, up to or more than 20, up to or more than 50, up to or more than 100, up to or more than 500, up to or more than 1000, up to or more than 5000, up to or more than 10000, up to or more than 20000, up to or more than 50000, up to or more than 100000, up to or more than 250000, up to or more than 500000, or up to or more than 1000000 words.
[0125] In alternate arrangements, the input search query may be any object that may be or is represented by a text string, for example: a text document, an audio file, an audio/visual file, a DNA sequence, a sequence, a location, an entity, a latitude, a longitude, or a number, or any other object capable of being represented by a text string.
[0126] In any one of the above aspects and/or arrangements, a plurality of text terms may be identified in the text portion. Each of the plurality of text terms may be assigned at least one associated term weight.
The at least one or plurality of text terms may comprise single word terms within the input text portion. The relevant terms may comprise double-word terms within the input text portion. The at least one or plurality of text terms may comprise triple-word terms within the input text portion. The text portion may comprise a large number of text terms, for example up to or more than 5 text terms, up to or more than 10, up to or more than 20, up to or more than 50, up to or more than 100, up to or more than 500, up to or more than 1000, up to or more than 5000, up to or more than 10000, up to or more than 20000, up to or more than 50000, up to or more than 100000, up to or more than 250000, up to or more than 500000, or up to or more than 1000000 text terms, and may depend on available processing capabilities.
[0127] In any one of the above aspects and/or arrangements, the search method may comprise breaking an input text portion into many smaller text portions and subsequently executing many substantially parallel queries based on the smaller text portions, assigning weights to these queries and, based on the weights, combining and ordering the query result sets into a single set of results. The breaking into text portions may further comprise the addition of additional terms that do not exist in the input text portion. The additional terms may represent a concept space (i.e. related to the concept of terms in the input text).
[0128] In further arrangements, additional non-text objects may be represented as text, for example DNA sequences, locations, places, latitudes, longitudes, numbers, music, or audio may form the search query and/or the result 'documents', where a document in this sense is a text-based representation of the non-text object(s). Essentially, this extends to anything that can be stored on a computer readable medium such that an input can be represented by groups of characters.
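The arrangement of paragraph [0127] may be sketched, purely as an example, as one small query per text term executed through a thread pool, with each document scored by the sum of the weights of the queries that returned it. The in-memory index, the function names and the use of a thread pool to stand in for "substantially parallel" queries are assumptions of this sketch and are not taken from the specification.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory inverted index standing in for the document database:
# term -> set of ids of the documents containing that term.
INDEX = {
    "search": {"doc1", "doc2"},
    "engine": {"doc1"},
    "patent": {"doc2", "doc3"},
}


def run_sub_query(term):
    """One small query: the documents whose index contains the term."""
    return INDEX.get(term, set())


def parallel_term_search(weighted_terms):
    """Run one sub-query per term in parallel and merge the weighted results.

    weighted_terms maps each smaller text portion (here a single term) to the
    weight assigned to its sub-query. A document's score is the sum of the
    weights of the sub-queries whose result set it appears in.
    """
    terms = list(weighted_terms)
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(run_sub_query, terms))

    scores = defaultdict(float)
    for term, docs in zip(terms, result_sets):
        for doc_id in docs:
            scores[doc_id] += weighted_terms[term]
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = parallel_term_search({"search": 1.0, "engine": 0.5, "patent": 2.0})
```

Because each sub-query is independent of the others, the combining step only needs the result sets and their weights, which is what allows the queries to be executed concurrently.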
[0129] In any one of the above aspects and/or arrangements, a user may initiate a search. In alternate arrangements, a system may initiate a search on behalf of a user.
[0130] In any one of the above aspects and/or arrangements, the weighting may be positive or negative, where a negative weighting represents a term with negative context (i.e. indicative of what is expected to be a bad result).
[0131] In any one of the above aspects and/or arrangements, the search method may comprise making an existing list of items and, based on a user's interaction with an item in the item set, adding text terms to an array of terms, then, based on the terms, processing many queries in parallel and returning an augmented set of items.
[0132] In any one of the above aspects and/or arrangements, the search method may comprise using an array of text terms with associated weightings to run multiple queries for each text term in the array, based on the associated weighting, and subsequently adding the result sets together to form a combined result set with a score for each potential result based on the addition of the weightings for each set of results it appeared in.
[0133] In any one of the above aspects and/or arrangements, the search method may comprise accepting user feedback and, based on the user feedback, the array weightings may be adjusted and multiple queries again executed in parallel and subsequently combined to form an augmented set of result items. The augmented set of items may be of improved context with regards to the user's desires. User feedback may indicate a negative context and items in the array may have negative weighting. The terms chosen for addition may be conceptually related, where conceptual relation can be defined as appearing in a high percentage of documents together. The additional terms' weightings in the array may be scaled based on a factor of how closely they are related to terms already in the array.
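As a non-limiting sketch of paragraphs [0130] to [0133], the array of weighted terms may be adjusted in response to user feedback and the queries run again, with a negative weighting lowering the score of every document returned for the negatively weighted term. The index contents, the function names and the fixed adjustment step of 1.0 are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> ids of the documents containing the term.
INDEX = {
    "jaguar": {"car_review", "zoo_guide", "dealer_ad"},
    "speed": {"car_review", "zoo_guide"},
    "engine": {"car_review", "dealer_ad"},
    "habitat": {"zoo_guide"},
}


def score_documents(term_weights):
    """Sum, per document, the weights of every term query that returned it."""
    scores = defaultdict(float)
    for term, weight in term_weights.items():
        for doc_id in INDEX.get(term, set()):
            scores[doc_id] += weight
    return dict(scores)


def apply_feedback(term_weights, doc_terms, liked, step=1.0):
    """Return a new term array adjusted by feedback on one result document.

    Terms of a liked document gain `step`; terms of a disliked document lose
    `step`, so a weight may become negative. The step size is an assumption.
    """
    adjusted = dict(term_weights)
    for term in doc_terms:
        adjusted[term] = adjusted.get(term, 0.0) + (step if liked else -step)
    return adjusted


term_weights = {"jaguar": 1.0, "speed": 1.0}
before = score_documents(term_weights)

# The user marks the car review as not relevant; "engine" characterises it.
term_weights = apply_feedback(term_weights, ["engine"], liked=False)
after = score_documents(term_weights)
```

In this example the two leading documents are tied before the feedback. After the term "engine" receives a negative weighting, the documents containing it drop below the document that does not.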
[0134] In further arrangements of the above aspects, the additional terms may represent a non-related context.
[0135] In any one of the above aspects and/or arrangements, a negative weight may be assigned to the additional terms to reduce the occurrence of result items with undesired context. The negative weighting is assigned to the sub query associated with the negative term, which consequently reduces the score for all results appearing in that sub query's results when the sub queries are combined.
Numbered Statements of Invention Summary
1. A method for searching a plurality of documents and retrieving one or more documents relevant to a search query, said search query comprising one or more of a plurality of text terms, said method comprising:
a) accepting an input comprising an input search query comprising at least one or a plurality of query text terms;
b) identifying said one or plurality of query text terms in said input search query;
c) for each identified query text term in said input search query:
i) querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term;
ii) wherein, said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms;
iii) assigning a result weight to said one or more documents identified as being of relevance to said query text term;
d) with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and
e)
outputting a representation of each of said result documents.
2. A method as Itemed in Item 1 wherein in step e), the outputting of a representation of each of said result documents comprises the steps of:
e1) ordering each said result document on the basis of the relevance to each said query text term and said result weight assigned to said identified documents; and
e2) outputting a representation of each of said result documents as an ordered result list of said plurality of result documents.
3. A method as Itemed in Item 1 wherein said representation of each said result document comprises a snippet of an associated one of said result documents, said document snippet being adapted for identification of said associated result document.
4. A method as Itemed in any one of the preceding Items wherein step b) comprises:
b1) identifying said one or plurality of query text terms in said input search query;
b2) identifying one or more additional text terms of relevance to at least one of said identified query text terms, said additional text terms not being in said input query; and
b3) forming one or a plurality of augmented term sets, each term set comprising at least one identified query text term and one or more of said additional text terms of relevance;
step c) comprises:
c1) for each identified query text term in said input search query and each additional text term in said augmented term sets:
i) querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term or to said additional text term in each said augmented term set;
ii) wherein, said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between either said query text term or said additional text term
in each said augmented term set and said document text terms;
iii) assigning a result weight to said one or more documents identified as being of relevance to either said query text term or to said additional text term in each said augmented term set; and
step d) comprises:
d1) with respect to said result weights assigned to each said document identified as relevant to said query text terms or said additional text term, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms and/or at least one of said additional text terms in each said augmented term set.
5. A method as Itemed in Item 4, wherein the terms forming each said augmented term set relate to a common concept.
6. A method as Itemed in Item 4, wherein step b3) further comprises displaying said one or a plurality of augmented term sets to the user on a visual display apparatus.
7. A method as Itemed in Item 6 wherein said visual display apparatus provides one or more user input fields adapted to accept user input on the significance or relevance of each said augmented term set with respect to the user's concept or concepts of interest.
8. A method as Itemed in Item 7 wherein step b3) further comprises, with respect to said user input on the significance or relevance of each said augmented term set, assigning relevance weights to each said augmented term set.
9. A method as Itemed in Item 8 wherein in step c1)(iii) said result weights are equivalent to or derived from said relevance weights.
10. A method as Itemed in any one of the preceding Items wherein for each said query text term and each said additional text term, said querying of said database is conducted in parallel.
11.
A method as Itemed in any one of Items 4 to 10 wherein said additional text terms of relevance to at least one of said query text terms define or are associated with a concept space with respect to said at least one query text term.
12. A method as Itemed in any one of Items 4 to 11 wherein said additional text terms of relevance are related to at least one concept associated with one or more of said query text terms.
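The formation of an augmented term set as in Items 4 to 12 might, as one possible sketch, rely on the notion of conceptual relation given in paragraph [0133], namely terms appearing together in a high percentage of documents. In the example below the share of co-occurring documents also scales the weight of each additional term. The document contents, the function names and the 0.5 threshold are illustrative assumptions only.

```python
from collections import Counter

# Hypothetical document index: document id -> set of terms it contains.
DOC_TERMS = {
    "d1": {"camera", "lens", "aperture"},
    "d2": {"camera", "lens", "tripod"},
    "d3": {"camera", "battery"},
    "d4": {"lens", "aperture"},
}


def related_terms(query_term, min_share=0.5):
    """Terms that co-occur with query_term in at least min_share of its documents.

    Returns term -> share, where share is the fraction of the documents
    containing query_term that also contain the related term. The 0.5
    threshold is an arbitrary illustrative choice.
    """
    docs_with_query = [terms for terms in DOC_TERMS.values() if query_term in terms]
    if not docs_with_query:
        return {}
    counts = Counter()
    for terms in docs_with_query:
        counts.update(terms - {query_term})
    total = len(docs_with_query)
    return {t: c / total for t, c in counts.items() if c / total >= min_share}


def augmented_term_set(query_term, query_weight=1.0):
    """Query term plus related terms, each weighted by its co-occurrence share."""
    term_set = {query_term: query_weight}
    for term, share in related_terms(query_term).items():
        term_set[term] = query_weight * share
    return term_set


camera_set = augmented_term_set("camera")
```

Here "lens" appears in two of the three documents containing "camera" and is therefore added to the term set with a proportionally reduced weight, while terms that co-occur in only one of the three documents fall below the threshold and are left out.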
13. A method as Itemed in any one of the preceding Items wherein said input search query comprises a textual representation of an entity.
14. A method as Itemed in Item 13 wherein each of said plurality of documents is a textual representation of an entity.
15. A method as Itemed in either Item 13 or Item 14 wherein said entity is selected from the group comprising: a text document, an audio file, an audio/visual file, a DNA sequence, a sequence, a location, an entity, a latitude, a longitude, or a number.
16. A method as Itemed in Item 15 wherein said entity is selected from the group of: an individual; an organisation; or an object wherein said entity is identifiable as something that exists by itself.
17. A method as Itemed in any one of the preceding Items wherein said input search query is received from a user desirous of conducting a search based on said input search query.
18. A method as Itemed in any one of the preceding Items wherein said input search query is automatically generated on behalf of a user and a search conducted on said automatically generated input search query on behalf of said user.
19. A method as Itemed in any one of the preceding Items wherein the weight may be either a positive or negative weight.
20. A method as Itemed in any one of the preceding Items wherein said input search query is generated by a computer process in response to a predetermined set of conditions to identify one or more documents of relevance to one or more desired outcomes of said computer process.
21. A method as Itemed in Item 19 wherein a negative weight is indicative of a result not related to the context of the input search query.
22.
A method as Itemed in any one of the preceding Items further comprising the steps of: accepting user input on the relevance of one or more of the documents in the result document list, wherein said user input comprises either a positive indication of relevance or a negative indication of relevance wherein said positive and negative indications of relevance indicate respectively that the document is or is not relevant to the context of interest of said user.
23. A method as Itemed in Item 22 further comprising, based on the user input on the relevance of one or more of the documents in the result document list:
determining further text terms in documents receiving a positive indication of relevance and adding one or more of said further text terms to the input search query to form an augmented search query;
repeating step c) using said augmented search query as the input search query to identify one or more documents of enhanced relevance to said query text terms and assigning weights to said one or more identified documents;
d1) with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form an augmented result document list comprising a plurality of result documents, each said result document being of enhanced relevance to at least one of said query text terms; and
e1) outputting a representation of each of said result documents in said augmented result document list.
24.
A method as Itemed in either Item 22 or Item 23 further comprising, based on the user input on the relevance of one or more of the documents in the result document list:
determining further text terms in documents receiving a negative indication of relevance and, where one or more of said further text terms are comprised in said input search query, subtracting one or more of said terms from the input search query corresponding to one or more of said further text terms, thereby forming an augmented search query;
repeating step c) using said augmented search query as the input search query to identify one or more documents of enhanced relevance to said query text terms and assigning weights to said one or more identified documents based on said weights assigned to each said text term in said augmented search query;
wherein step d) comprises:
d2) with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form an augmented result document list comprising a plurality of result documents, each said result document being of enhanced relevance to at least one of said query text terms; and
step e) comprises:
e2) outputting a representation of each of said result documents in said augmented result document list.
25. A method as Itemed in any one of Items 22 to 24 further comprising, based on the user input on the relevance of one or more of the documents in the result document list:
determining further text terms in documents receiving either a positive or a negative indication of relevance and, where one or more of said further text terms are comprised in said input search query, adjusting said result weights assigned to each of said one or more documents based on said positive or negative indications of relevance;
forming an augmented result document list; and
outputting a representation of each of said documents in said augmented document list.
26. A method as Itemed in any one of Items 22 to 25 further comprising, based on the user input on the relevance of one or more of the documents in the result document list:
identifying one or more positive text terms in each said document receiving a positive indication of relevance; assigning each of said positive text terms a positive weight, and either: where said input search query does not comprise said positive text term, adding said positive text terms with associated positive weight to said input search query to form an augmented search query; or, where said input search query does comprise said positive text term, increasing the weight assigned to said text term in said search query;
identifying one or more negative text terms in each said document receiving a negative indication of relevance; assigning each of said negative text terms a negative weight, and either: where said input search query does not comprise said negative text term, adding said negative text terms with associated negative weight to said input search query to form said augmented search query; or, where said input search query does comprise said negative text term, either decreasing the weight assigned to said text term in said search query or deleting said negative text terms from said input search query;
repeating step c) using said augmented search query as the input search query to identify one or more documents of enhanced relevance to said query text terms and assigning weights to said one or more identified documents based on said weights assigned to each said text term in said augmented search query;
d2) with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form an augmented result document list comprising a plurality of result documents, each said result document being of enhanced relevance to at least one of said query text terms; and
e2)
outputting a representation of each of said result documents in said augmented result document list.
27. A method as Itemed in any one of the preceding Items wherein the input search query comprises a plurality of words and each identified text term comprises one or more words.
28. A method as Itemed in any Item 49 wherein the identified text term comprises either a single word, or two or more adjacent words.
29. A method as Itemed in Item 50 wherein said identified text term comprises either a single word text term, a double word text term or a triple word text term.
30. A method for searching a plurality of documents and retrieving one or more documents relevant to a search query, said method comprising the steps of:
a) accepting an input comprising an input search query comprising at least one, or a plurality of query text terms;
b) querying a database to identify at least one additional text term that is conceptually related to at least one query text term;
c) adding said identified conceptually related terms to the original query text terms to form an augmented query;
d) querying a database with the augmented query to retrieve relevant documents to said query.
31. A method as Itemed in Item 30 wherein step c) further comprises assigning a weight to each of the conceptually related text terms.
32. A method as Itemed in Item 31 wherein the weight is based on how closely said query term and said conceptually related term are related.
33. A method as Itemed in Item 32 wherein the weight can be either positive or negative.
34. A method as Itemed in Item 33 wherein a negative weight indicates said conceptually related text term represents an undesirable related concept.
35. A method as Itemed in any one of Items 30 to 32 wherein the weight is determined by the statistical probability of said input text term appearing in a reference document with said conceptually related text term.
36.
A method as Itemed in Item 34 wherein a negative weight is assigned to the additional terms to reduce the occurrence of result items with undesired context.
37. A system for searching a plurality of documents and retrieving one or more documents relevant to a search query, said system comprising:
a) Input interface adapted to receive input search query comprising at least one or a plurality of query text terms;
b) Parsing module adapted to identify said one or plurality of query text terms in said input search query;
c) Query module adapted to, for each identified query text term in said input search query:
i) querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term;
ii) wherein, said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms;
d) Assignment module adapted for assigning a result weight to said one or more documents identified as being of relevance to said at least one or plurality of query text terms;
e) Collation module adapted to, with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and
f) Visual display apparatus adapted to display a representation of each of said result documents.
38.
A system as Itemed in Item 37 wherein:
a) said parsing module is further adapted to:
i) identify said one or plurality of query text terms in said input search query;
ii) identify one or more additional text terms of relevance to at least one of said identified query text terms, said additional text terms not being in said input query; and
iii) form one or a plurality of augmented term sets, each term set comprising at least one identified query text term and one or more of said additional text terms of relevance.
39. A system as Itemed in either Item 37 or Item 38 wherein:
a) said query module is further adapted to, for each identified query text term in said input search query and each additional text term in said augmented term sets:
iv) query a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said document database of relevance to said query text term or to said additional text term in each said augmented term set;
v) wherein, said representation of each of said plurality of documents in said document database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between either said query text term or said additional text term in each said augmented term set and said document text terms; and
b) said assignment module is adapted to assign a result weight to said one or more documents identified as being of relevance to either said query text term or to said additional text term in each said augmented term set.
40.
A system as Itemed in any one of Items 37 to 39 further comprising:
a) Refinement module adapted to:
i) accept user input on the relevance of one or more of the documents in the result document list, wherein said user input comprises either a positive indication of relevance or a negative indication of relevance, wherein said positive and negative indications of relevance indicate respectively that the document is or is not relevant to the context of interest of said user;
ii) based on the user input, re-form said term sets to form at least one or a plurality of refined term sets;
iii) using said query module, query said document database to identify one or more of said documents in said document database of relevance to said query text term or to said additional text term in each said refined term set;
iv) using said collation module, combine the results of each said database query to form a refined result document list comprising a plurality of refined result documents, each said result document being of relevance to at least one of said query text terms with respect to said result weight assigned to each said document identified as relevant to said refined term sets; and
v) output a representation of each of said refined result documents to said visual display apparatus.
41.
A search system adapted to search a plurality of documents and retrieve one or more documents relevant to a search query, said search query comprising at least one or a plurality of text terms, said search system comprising:
An input module adapted to accept an input comprising an input search query comprising said at least one or plurality of query text terms;
Parsing module for identifying said one or plurality of query text terms in said input search query;
Document database comprising a representation of each of said plurality of documents to be searched, wherein, said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document;
Memory for storing said input search query and at least a representation of one or more documents relevant to said search query;
Processor module for executing a process comprising the steps of, for each identified query text term in said input search query:
vi) query said database to identify said one or more of said documents in said database of relevance to each said query text term; wherein, each said document identified as relevant to each said text term comprises an intersection between said query text term and said document text terms;
vii) assign a result weight to said one or more documents identified as being of relevance to said query text term;
viii) with respect to said result weight assigned to each said document identified as relevant to said query text terms, combine the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and
ix) output a representation of each of said result documents to an output device.
42. A system as Itemed in Item 41 further comprising a term database comprising a plurality of additional text terms which may be added to said at least one or plurality of query text terms in said search query to form at least one or a plurality of augmented search queries, wherein said augmented search queries comprise at least one or a plurality of sets of text terms, each set of text terms comprising an augmented list of text terms associated with each of said at least one or plurality of query text terms respectively.
43. A system as Itemed in Item 41 wherein said output device is a visual display and wherein said system further comprises means for accepting user input comprising a positive or negative indication of relevance with respect to each of the result documents.
44. A system as Itemed in Item 41 or Item 43 wherein said processor is further adapted to form an augmented search query based on said user input and query said database to identify said one or more of said documents in said database of enhanced relevance to said text terms of said augmented search query.
45.
A computer readable medium comprising a program for searching a plurality of documents and retrieving one or more documents relevant to a search query, said search query comprising one or a plurality of text terms, said program comprising computer readable instructions to configure a computer system to carry out a method comprising the steps of:
accepting an input comprising an input search query comprising at least one or a plurality of query text terms;
identifying said one or plurality of query text terms in said input search query;
for each identified query text term in said input search query:
i) querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term;
ii) wherein, said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms;
iii) assigning a result weight to said one or more documents identified as being of relevance to said query text term;
with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and
outputting a representation of each of said result documents.
46.
The computer readable medium of Item 45 wherein step b) of said method to be carried out by said computer system comprises:
b1) identifying said one or plurality of query text terms in said input search query;
b2) identifying one or more additional text terms of relevance to at least one of said identified query text terms, said additional text terms not being in said input query;
step c) of said method to be carried out by said computer system comprises:
c1) for each identified query text term in said input search query and each additional text term:
i) querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term or to said additional text term;
ii) wherein, said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between either said query text term or said additional text term and said document text terms;
iii) assigning a result weight to said one or more documents identified as being of relevance to either said query text term or to said additional text term; and
step d) of said method to be carried out by said computer system comprises:
d1) with respect to said result weights assigned to each said document identified as relevant to said query text terms or said additional text term, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms and/or at least one of said additional text terms.
47.
A method for indexing a plurality of documents, each document comprising a text portion, the method comprising:
a) parsing the text portion of each of the plurality of documents to form a plurality of respective local document indexes each associated with a respective document, and storing the local document index in a database, wherein each local document index comprises a plurality of local text terms contained in the respective document and a local weighting associated with each text term;
b) from the plurality of local document indexes, forming a global document index comprising a plurality of global text terms contained in the plurality of documents, and a global weighting associated with each global text term; wherein the global weighting associated with each of the global text terms is determined with respect to a parameter associated with a reference global text term.
48. A method as Itemed in Item 47 wherein the global weighting associated with each of the global text terms is further determined with respect to the number of documents in which each global text term appears across all the plurality of documents.
49. A method as Itemed in Item 47 or Item 48 wherein the global weighting associated with each of the global text terms is determined with respect to the number of documents in which the reference text term appears.
50. A method as Itemed in any one of Items 47 to 49 wherein the global weighting associated with each of the global text terms is further determined with respect to user interactions.
51. A method as Itemed in Item 50 wherein the weighting associated with each text term comprises a combination of a plurality of weightings, each associated with each global text term.
52.
A method as Itemed in Item 51 wherein one or more of the plurality of weightings are selected from the group of: the number of times the term appears in a single document; the number of times the term appears in all the plurality of documents; the position of the text term in a document; the capitalisation of the term; punctuation surrounding the term; words in the text portion adjacent to the term; word rarity; word sequence; combinations of text terms; the number of words in each text term; or a user-defined weighting.
53. A method as Itemed in either Item 51 or Item 52 wherein the weighting may be a positive weighting or a negative weighting, or selected from a graduated scale of weightings ranging from positive to negative.
54. A method as Itemed in Item 53 wherein where one or more of the plurality of weightings is a negative weighting for a selected global text term, the selected global text term is assigned a zero weighting.
55. A method as Itemed in Item 52 wherein the user-defined weighting is derived from a self-learning system comprising a plurality of user-defined weightings for a selected global text term.
56. A system for indexing a plurality of documents, each document comprising a text portion, the system comprising:
a parsing module for parsing the text portion of each of the plurality of documents to form a plurality of respective local document indexes each associated with a respective document, wherein each local document index comprises a plurality of local text terms contained in the respective document and a local weighting associated with each text term;
a database adapted for storing each of the local document indexes in a memory;
a processor for analysing the plurality of local document indexes and forming a global document index from the plurality of local document indexes, the global document index comprising a plurality of global text terms contained in the plurality of documents, and a global weighting associated with each global text term; wherein the global weighting associated with each of the global text terms is determined with respect to a parameter associated with a reference global text term; and
wherein the global document index is stored in the database and related to each of the local document indexes.
57.
A method for analysing a text portion and retrieving documents relevant to the text portion, the method comprising:
a) receiving an input comprising an input text portion;
b) identifying at least one text term in the text portion;
c) assigning at least one weight associated with the at least one text term;
d) forming an input local index of the at least one text term and at least one associated term weight, wherein the at least one associated local term weight is determined with reference to a global term index stored in a database, the global term index comprising a plurality of global text terms and associated global text term weights, and being formed from a plurality of reference documents, wherein a representation of each of the reference documents is stored in the database;
e) querying the database to identify one or more of the reference documents of relevance with respect to the input text portion; and
f) outputting a representation of the identified relevant reference documents.
58. A method as Itemed in Item 57 wherein the representation of each of the plurality of reference documents stored in the database comprises either the reference document or a link thereto, and the representation further comprises a respective local reference term index for each reference document.
59. A method as Itemed in Item 57 wherein the representation of each of the plurality of reference documents stored in a database comprises a representative text string derived from the text portion of each reference document and a respective local reference term index.
60. A method as Itemed in any one of Items 57 to 59 wherein a plurality of text terms are identified in the text portion, each of the plurality of text terms being assigned at least one associated local term weight determined with reference to the global term index.
61. A method as Itemed in any one of Items 57 to 60 wherein step (57.b) comprises parsing of the text portion to identify the at least one or plurality of text terms.
62. A method as Itemed in any one of Items 57 to 61 wherein in step (57.e) the relevant reference documents are determined from a comparison of the input local index with each of the plurality of reference local indexes associated with each respective reference document.
63. A method as Itemed in any one of Items 58 to 61 wherein in step (57.e) the relevant reference documents are determined from an intersection of the at least one or plurality of text terms of the input text portion with one or more terms in the local reference term index associated with each reference document.
64. A method as Itemed in either Item 57 or Item 60 wherein the at least one or plurality of text terms comprise single word terms within the input text portion.
65. A method as Itemed in any one of Items 57, 60 or 64 wherein the at least one or plurality of text terms comprise double-word terms within the input text portion.
66. A method as Itemed in any one of Items 57, 60, 64 or 65 wherein the at least one or plurality of text terms comprise triple-word terms within the input text portion.
67. A method as Itemed in any one of Items 57, 60, 64, 65 or 66 wherein the local weights are assigned to each of the at least one or plurality of terms in accordance with one or more parameters selected from the group of: word rarity; punctuation; capitalisation; word sequence; combinations of terms; or the number of words in each term.
68.
A method as Itemed in Item 57, wherein the representation of the identified relevant reference documents comprises a representative text string derived from the text of each of the identified relevant reference documents.
69. A method as Itemed in Item 68 wherein the representative text string from each document comprises a selected number of text words before and/or after one or more selected relevant text terms with significant weights.
70. A method as Itemed in Item 57 further comprising the steps of:
a) displaying the relevant reference documents on a user interface, the user interface comprising input means for receiving user input with respect to each of the displayed reference documents;
b) accepting user-input with respect to one or more of the displayed documents;
c) re-forming the input local term index on the basis of the user input;
d) on the basis of the re-formed input local term index, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion; and
e) outputting a representation of the further identified reference documents of enhanced relevance.
71. A method as Itemed in Item 70 wherein in step (70.c), re-forming the input local term index comprises:
i.1) re-assigning the input local text term weights of the input text terms which also appear in each of the reference documents for which user-determined input is received; and
wherein step (70.d) comprises:
j.1) on the basis of the re-assigned input local text term weights, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion.
72. A method as Itemed in Item 70 wherein in step (70.a) the user input means is a means for assigning positive and negative relevance weights with respect to each displayed reference document.
73.
A method as Itemed in Item 70 further comprising repeating steps (70.b) to (70.e) thereby identifying and outputting one or more relevant further documents of increased enhanced relevance in respect of the relevant text terms in the text portion.
74. A method as Itemed in Item 70 further comprising repeating steps (70.b) to (70.e) on the reference documents with enhanced relevance to identify and output reference documents with additional enhancement of relevance.
75. A method as Itemed in Item 70 wherein the additional relevance information comprises either a positive indication of relevance of a particular document or a negative indication of relevance of a particular document.
76. A method as Itemed in Item 75 wherein for each reference document for which a positive indication of relevance is received, the associated weighting of each of the input text terms in the input local index which also appear in the local text term index of the positively-identified reference document is increased by a predetermined amount.
77. A method as Itemed in Item 75 wherein for each reference document for which a negative indication of relevance is received, the associated weighting of each of the input text terms in the input local index which also appear in the local text term index of the negatively-identified reference document is decreased by a predetermined amount.
78. A method as Itemed in either Item 75 or 76 wherein the predetermined amount may be a multiplier applied to the index term weight.
79. A method as Itemed in Item 78 wherein the multiplier may be zero such that a selected term has no relevance to subsequent interactions.
80.
A method as Itemed in Item 75 wherein where a selected text term appears in one or more documents which receive a positive indication of relevance, and the selected text term also appears in one or more documents which receive a negative indication of relevance, the associated weighting of the selected text term in the input local index is updated based on a combination of the positive and negative indications.
81. A method as Itemed in Item 80 wherein where the selected text term appears in one or more documents which receive a positive indication of relevance, and the selected text term also appears in an equal number of documents which receive a negative indication of relevance, the associated weighting of the selected text term in the input local index is unchanged.
82. A method as Itemed in Item 76 wherein in step (70.c), re-forming the input local term index comprises:
i.2) forming an augmented input local term index on the basis of text terms in the local term index of documents receiving a positive indication of relevance; and
wherein step (70.d) comprises:
j.2) on the basis of the input local text term weights in the augmented input local text term index, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion.
83. A method as Itemed in Item 82 wherein for each reference document for which a positive indication of relevance is received, terms in the positively identified reference document which do not appear in the input local term index are added thereto to form the augmented local text term index together with associated local index text term weights determined.
84. A method as Itemed in any one of Items 57 to 83 wherein the text portion is a text string comprising a plurality of text words.
85. A method as Itemed in any one of Items 57 to 83 wherein the text portion is a text document.
86.
A method as Itemed in any one of Items 57 to 83 wherein each of the reference documents is assigned a document relevance score representative of the relevance to the input text portion.
87. A method as Itemed in any one of Items 57 to 83 wherein the reference documents are text documents representative of a document selected from one or more of the group of: a book; an article; a text record; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a resume; a patent specification; an employment advertisement; a legal transcript; a legal document; or a news report.
88. A method as Itemed in any one of Items 57 to 83 wherein the text portion is selected from one or more of the group of: a text string comprising one or more words; a text document; a book; an article; a text record; a resume; a certificate; an agreement; a contract; a manuscript; a paper; a scientific paper; a patent specification; a resume; a curriculum vitae; a legal transcript; a legal document; or a news report.
89.
A method for refining the results of a search, the search results comprising a representation of a selected plurality of reference documents, such reference documents displayed being of relevance to an input text portion comprising one or more search terms, the selected plurality of reference documents comprising a subset of a plurality of documents in a database, the method comprising the steps of:
a) forming a local term index from the search terms, the local term index comprising one or more text terms, each local text term associated with a local text term weight;
b) receiving and displaying the search results on a user interface, the user interface comprising input means for receiving user input with respect to one or more of the plurality of the displayed reference documents;
c) accepting user input on one or more of the displayed reference documents;
d) re-forming the local term index on the basis of the user input;
e) on the basis of the re-formed input local term index, querying the database to identify one or more documents of enhanced relevance to the input text portion; and
f) outputting a representation of the further identified reference documents of enhanced relevance.
90. A method as Itemed in Item 89 wherein in step (89.d), re-forming the input local term index comprises:
d.1) re-assigning the input local text term weights of the input text terms which also appear in each of the reference documents for which user-determined input is received; and
wherein step (89.e) comprises:
e.1) on the basis of the re-assigned input local text term weights, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion.
91. A method as Itemed in Item 89 wherein in step (89.a) the local text term weights for each of the local text terms are equal.
92. A method as Itemed in Item 89 wherein in step (89.a) the local text term weights for each of the local text terms are derived from a global text term index, the global text term index comprising a plurality of text terms associated with global text term weights, wherein the global text term weights are derived from text term analysis of a plurality of documents.
93. A method as Itemed in Item 89 wherein in step (89.d), re-forming the input local term index comprises:
d.2) forming an augmented input local term index on the basis of text terms in the local term index of documents receiving a positive indication of relevance; and
wherein step (89.e) comprises:
e.2) on the basis of the input local text term weights in the augmented input local text term index, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion.
94. A method as Itemed in Item 93 wherein for each reference document for which a positive indication of relevance is received, new terms in the positively identified reference document which do not appear in the local term index are added thereto to form the augmented local text term index and associated local index text term weights for the new terms are determined.
95. A method as Itemed in either Item 93 or Item 94 wherein for each reference document for which a negative indication of relevance is received, terms in the negatively identified reference documents which do not appear in the local term index are subtracted therefrom to form the augmented local text term index.
96.
A system for refining the results of a search, the search results comprising a representation of a selected plurality of documents of relevance to one or more search terms, the selected plurality of documents comprising a subset of a plurality of documents in a database, the system comprising:
a) A processor adapted for forming a local term index from the search terms, the local term index comprising one or more text terms, each local text term associated with a local text term weight;
b) Query module adapted for receiving and displaying the search results on a user interface, the user interface comprising input means for receiving user input with respect to each of the displayed reference documents;
c) User input interface adapted for accepting user input on one or more of the displayed documents;
d) refinement module for analysing the user input, re-forming the input local term index on the basis of the user input and, using said query module, querying the database on the basis of the re-formed input local term index to identify one or more documents of enhanced relevance to the input text portion; and
e) output interface adapted for outputting a representation of the further identified reference documents of enhanced relevance.
97.
A system for analysing an input text portion and retrieving documents relevant to the text portion, the system comprising:
a) input interface adapted for receiving an input comprising an input text portion;
b) parsing module adapted to identify at least one text term in the text portion;
c) assignment module adapted for assigning at least one weight associated with the at least one text term;
d) indexing module adapted for forming an input local term index of the at least one text term and at least one associated local term weight, wherein the at least one associated local text term weight is determined with reference to a global term index stored in a database, the global term index comprising a plurality of global text terms and associated global text term weights, and being formed from a plurality of reference documents, wherein a representation of each of the reference documents is stored in the database;
e) query module adapted for querying the database to identify one or more relevant reference documents with respect to the input text portion; and
f) output interface adapted for outputting a representation of the identified relevant reference documents.
98. A system as Itemed in Item 97 wherein the representation of each of the plurality of reference documents stored in the database comprises either the reference document or a link thereto, and the representation further comprises a respective local reference term index for each reference document.
99. A system as Itemed in Item 97 wherein the representation of each of the plurality of reference documents stored in a database comprises a representative text string derived from the text portion of each reference document and a respective local reference term index.
100.
A system as Itemed in any one of Items 97 to 99 wherein a plurality of text terms are identified in the text portion, each of the plurality of text terms being assigned at least one associated local term weight determined with reference to the global term index.
101. A system as Itemed in Item 97 further comprising:
a) display interface adapted for displaying the relevant reference documents on a user interface, the user interface comprising input means for receiving user input with respect to each of the displayed reference documents;
b) user input interface adapted for accepting user-input on one or more of the displayed documents;
c) processor adapted for analysing the user input and re-forming the input local text term index;
d) query module adapted for querying the database on the basis of the re-formed input local text term index to identify one or more relevant reference documents of enhanced relevance to the input text portion; and
output module adapted for outputting a representation of the further identified reference documents of enhanced relevance to said display interface.
102. A method as Itemed in Item 101 wherein the re-forming of the input local term index comprises:
re-assigning the input local text term weights of the input text terms which also appear in each of the reference documents for which user-determined input is received; and
the querying of the database on the basis of the re-formed input local text term index comprises:
on the basis of the re-assigned input local text term weights, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion.
103.
A method as Itemed in Item 101 wherein the re-forming of the input local term index comprises:
forming an augmented input local term index on the basis of text terms in the local term indexes of documents receiving a positive indication of relevance; and
the querying of the database on the basis of the re-formed input local text term index comprises:
on the basis of the input local text term weights in the augmented input local text term index, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion.
104. A computer readable medium comprising a program for analysing a text portion and retrieving documents relevant to the text portion, said program controlling the operation of a data processing apparatus on which the program executes to perform the steps of:
a) receiving an input comprising an input text portion;
b) identifying at least one text term in the text portion;
c) assigning at least one weight associated with the at least one text term;
d) forming an input local index of the at least one text term and at least one associated local term weight, wherein the at least one associated local term weight is determined with reference to a global term index stored in a database, the global term index comprising a plurality of global text terms and associated global text term weights, and being formed from a plurality of reference documents, wherein a representation of each of the reference documents is stored in the database;
e) querying the database to identify one or more of the reference documents of relevance with respect to the input text portion; and
f) outputting a representation of the identified relevant reference documents.
105.
A computer readable medium comprising a program according to Item 104, wherein the program executes to perform the further steps of
g) displaying the relevant reference documents on a user interface, the user interface comprising input means for receiving user input with respect to each of the displayed reference documents;
h) accepting user-input with respect to one or more of the displayed documents;
i) re-forming the input local term index on the basis of the user input;
j) on the basis of the re-formed input local term index, querying the database to identify one or more relevant reference documents of enhanced relevance to the input text portion; and
k) outputting a representation of the further identified reference documents of enhanced relevance.

106. A computer readable medium comprising a program for refining the results of a search, the search results comprising a representation of a selected plurality of reference documents, such reference documents displayed being of relevance to an input text portion comprising one or more search terms, the selected plurality of documents comprising a subset of a plurality of documents in a database, said program controlling the operation of a data processing apparatus on which the program executes to perform the steps of
a) forming a local term index from the search terms, the local term index comprising one or more text terms, each local text term associated with a local text term weight;
b) receiving and displaying the search results on a user interface, the user interface comprising input means for receiving user input with respect to one or more of the plurality of the displayed reference documents;
c) accepting user input on one or more of the displayed documents;
d) re-forming the input local term index on the basis of the user input;
e) on the basis of the re-formed input local term index, querying the database to identify one or more documents of enhanced relevance to the input text portion; and
f) outputting a representation of the further identified reference documents of enhanced relevance.

BRIEF DESCRIPTION OF THE DRAWINGS

[0136] Arrangements of the invention will now be described, by way of an example only, with reference to the accompanying drawings wherein:

[0137] Figure 1 depicts an example prior art (LSA) calculation;

[0138] Figure 2 is a schematic representation of a method for conducting a search from an input query according to arrangements of the present invention disclosed herein;

[0139] Figure 3 is a schematic representation of a method for conducting an enhanced concept space search from an input query according to arrangements of the present invention disclosed herein;

[0140] Figure 4 depicts a schematic of a general purpose computer which may be configured as a special purpose computer adapted to implement the methods for conducting a search from an input query as disclosed herein.

DETAILED DESCRIPTION

[0141] Aspects and arrangements of the present invention will be described more fully hereinafter with reference to the accompanying drawings, in which example arrangements of the invention are shown. The aspects and/or arrangements of the present invention may, however, be arranged in many different forms and should not be construed as limited to the arrangements set forth herein. Rather, the presently described arrangements are provided to provide a thorough disclosure to convey the scope of the invention envisaged to the skilled addressee in the field of the art. In the description of the figures herein, like numbers refer to like elements and/or features.

[0142] The methods and systems disclosed herein provide a new approach to document to document based searching. The search methods solve several key issues surrounding result relevance, database compatibility, speed and efficiency of the search.
[0143] The Shotgun search methods and systems disclosed herein are so named because they are configured to run many searches in parallel and then to stitch the results of these searches together. Historically most search engine queries can be thought of in a Boolean sense, e.g.
* An AND search system where all query terms must appear in every result;
* An OR search system where at least one of the query terms must appear in every result; or
* A Mixed search system where both AND and OR methods are used in combination.

[0144] The Sajari search method and a system utilising the Shotgun method is a Mixed search engine. It runs many searches in parallel and every query's results can be included, so in that sense it is an OR operator, but each query also supports locations, number ranges and meta data, so it is actually more complex than a standard OR query. In addition, each of the queries is weighted, so unlike a normal OR search where multiple result sets are combined with equal weighting, Sajari Shotgun search will weight each search independently. A simple example is as follows:
Query: "Investing for retirement years"
Q1 = "Investing"
Q2 = "retirement"
Q3 = "years"
Q4 = "retirement years"
Result = Q1*5 + Q2*3 + Q3*0.8 + Q4*4

[0145] In the above case, after the stop word "for" was removed, the other terms in the input search query become individual queries. When combining their results, boost weightings are then applied (such weightings are algorithmically determined, see the discussion of dynamic weighting below). In this particular example, looking at the weightings, the term "investing" is more important than the term "retirement", but obviously a document which appears in the results of each of the queries (i.e. the results for the separate queries using terms "investing" and "retirement") will score higher in the final, combined, result set as it will receive a positively weighted score from both sub queries. This is a simple example without scaling or meta filtering (e.g. based on location or other factors), but it illustrates how each query effectively turns a string into a sub query array, where some queries are more important than others.

Dynamically weighting parallel queries

[0146] Dynamic weighting of queries is one of the core concepts of the Sajari search methods and systems. It allows the context of results to be adjusted on the fly and allows the system to self-learn how to weight future queries. Historically, Google™ used to allow the "+" operator to boost the value of a term in a given search, and nowadays even open source technologies such as Lucene offer the ability to "boost terms" (i.e. to give the terms greater weight in determining the results) in a given search, but this concept differs to Sajari because:

[0147] Typically boosting is designed to change the order of results matching a single query, whereas Sajari Shotgun is designed to run many queries in parallel and boost their weightings as they are combined. One main difference here is that a non-occurrence of a term in a potential Sajari search result will not prune it from the result set. Boosting is actually applied to many queries as their results are combined, whereas in Lucene and Google's old "+" operator, the boost applies to a sub component of a single query. In Sajari search systems and methods, all queries already have a "boosting" factor inherently, which is determined by algorithms designed to favour terms providing more relevance to results. These factors are always changing as users interact with results. So while existing query refinement methods allow the user to define what to boost, Sajari does this implicitly for the user and also allows for user adjustment.
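By way of illustration only, the weighted combination shown for the query "Investing for retirement years" might be sketched in Python as follows. The function name `combine_weighted_results`, the document identifiers and the per-document match scores of 1.0 are hypothetical and are not part of the arrangements described herein; only the four sub queries and their boost weightings are taken from the example.

```python
from collections import defaultdict


def combine_weighted_results(sub_query_results, boosts):
    """Sum each document's per-sub-query score, scaled by that sub query's boost."""
    combined = defaultdict(float)
    for query, doc_scores in sub_query_results.items():
        boost = boosts[query]
        for doc_id, score in doc_scores.items():
            # A document missing from one sub query is not pruned; it simply
            # receives no contribution from that sub query.
            combined[doc_id] += boost * score
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: document id -> match score for that sub query.
sub_query_results = {
    "investing": {"doc_a": 1.0, "doc_b": 1.0},
    "retirement": {"doc_a": 1.0, "doc_c": 1.0},
    "years": {"doc_c": 1.0},
    "retirement years": {"doc_c": 1.0},
}
# Boost weightings from the example: Q1*5 + Q2*3 + Q3*0.8 + Q4*4.
boosts = {"investing": 5, "retirement": 3, "years": 0.8, "retirement years": 4}

ranked = combine_weighted_results(sub_query_results, boosts)
for doc_id, score in ranked:
    print(doc_id, round(score, 2))
```

In this sketch `doc_a`, which appears in the results for both "investing" and "retirement", ranks first because it collects a weighted contribution from each of those sub queries.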
It does this while learning for all users.

[0148] Sajari search systems and methods are designed to allow dynamic adjustment of the boosting factors, typically via user feedback. For example, if a user does not like a particular result, then all the queries including that result item will have their boost weighting reduced. Conversely, if the user indicates it is a good result, the boost weighting for all the queries containing that result will be increased.

[0149] The key thing to note here is that Sajari search systems and methods are a parallel query search engine, with dynamic query boosting. Adjustment of the boost weightings can completely change the context of the search results, particularly when the input text is large.

Large vs small search queries

[0150] Small queries, i.e. those having a small number of query search terms, e.g. "movie schedule tonight", need a radically different approach to someone with a full document looking for similar documents. Sajari Shotgun search systems and methods are actually designed to do both, but lean more towards the latter. For small queries, the difficulty is how to order the many results: the search query itself cannot be used to order the results, as all the results include the query, so other techniques such as recency and popularity are then used to differentiate the results. For larger searches, such as using whole documents as the input search query, the results differ greatly in their degree of overlap with the input text, so it is possible to differentiate on the degree of overlap and therefore order them by contextual similarity to the input query.

Term space vs Concept space

[0151] Term space is effectively a reverse index, much like would be included in the back of most books where you can look up a given term and then see which pages contain that particular term.
Most search engines work on the same principle except they expand this technique to many documents, web pages, etc. For any given word or phrase, a list of documents with that phrase can be assembled and then ordered (popularity, recency, etc) to produce a set of results. The key thing to note is that when looking at "term space", the exact term exists in all the results.

[0152] Concept space is a little different to term space, as input term(s) do not necessarily need to exist in all the results. So, for example, when searching for the term "investment", it may be possible to get an article about "stocks" or "real estate", even though the term "investment" may not have appeared in the article returned in the results. For keyword searches this is interesting, but not always useful, as typically there are so many exact matches that conceptual matches are not required and can also degrade results. But beyond simple keyword searches there are a host of applications where conceptual matching is very important, such as similar document searches, clustering, categorisation, etc. These applications typically exhibit greater input data and hence their queries have a higher level of conceptual complexity. It should also be noted that reverse indexes based solely on a term space struggle to produce results as the input text volume becomes significant. See the example below for Google™, where random words were sequentially added to a search query and the query was re-run. The number of results continued to reduce for almost every keyword addition, until when reaching 17 words of query input, there were zero results. Note: Google™ is not a pure reverse index "AND" query based engine, it is much, much more complex, but this example does illustrate how the core principles of an "AND" based query engine break down as the query length becomes large.
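The shrinking of an "AND" result set as terms are added can be illustrated with a minimal reverse index sketch in Python. The four documents, the function names `build_reverse_index`, `and_search` and `or_search`, and the whitespace tokenisation are assumptions made for illustration and do not represent any particular search engine.

```python
from collections import defaultdict


def build_reverse_index(documents):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def and_search(index, terms):
    """Documents containing every term (term space, 'AND' semantics)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def or_search(index, terms):
    """Documents containing at least one term ('OR' semantics)."""
    return set().union(*(index.get(term, set()) for term in terms))


# Hypothetical document collection.
documents = {
    1: "investing for retirement years",
    2: "retirement planning and investing basics",
    3: "movie schedule tonight",
    4: "investing in real estate",
}
index = build_reverse_index(documents)

query = ["investing", "retirement", "years", "tonight"]
# Number of matches as the query grows one term at a time.
and_counts = [len(and_search(index, query[:n])) for n in range(1, len(query) + 1)]
or_counts = [len(or_search(index, query[:n])) for n in range(1, len(query) + 1)]
print(and_counts)  # [3, 2, 1, 0]
print(or_counts)   # [3, 3, 3, 4]
```

Each added term can only shrink the "AND" result set, whereas the "OR" result set can only grow, which is why a long input text needs some form of weighting to order the "OR" results.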
This is completely expected given Google™ is designed and optimised for searches containing a few keywords, but illustrates how other types of searching can benefit from alternative approaches.

The typical concept space

[0153] In many cases term-to-term overlap of the input search text and the top documents in the results will be very small, even when searching against millions of documents. In terms of overlap in relation to the Sajari search methods and systems, this relates to direct intersection of the same terms in both documents, i.e. the input search query (document) and the result document(s). When this overlapping level of intersection is low, it can be noisy and produce relatively poor results. The common approach to improving this is the same as the approach to reducing the calculation complexity, which is to reduce the term-document space to a concept-document space.

[0154] One method for doing this is known as Latent Semantic Analysis (LSA). "Latent" essentially means concealed or not evident, so in this case it refers to information that is inferred but does not explicitly exist in the document. LSA works well because a term-document matrix is sparse (i.e. most documents do not include most terms), so the term-document matrix can be reduced to something much more manageable with significantly fewer dimensions. Effectively a large number of terms are grouped into a smaller number of concepts, which should mostly be able to explain the same document set. Apart from the ease of calculation when querying, this also has the added advantage that concepts are more likely to overlap, so the intersection of concept spaces can be more effective at finding generally similar information, but the downside is that the term level detail is lost. Polysemy (words with multiple meanings) and synonymy (different words with the same meaning) are known problems for LSA.
A typical example LSA calculation, which shows how LSA can be used to turn documents into concept vectors for grouping related concepts, is shown in Figure 1.
The Sajari Shotgun concept space

[0155] Sajari retains the entire term-space information and supplements it with concept space information, so it does not suffer the disadvantage of losing the term level information as with LSA. Unlike LSA, it adds latent information that expands into the concept space. This helps to increase the potential intersection overlap between documents, which improves the quality of the search results.

[0156] Obviously LSA had the advantage of reducing the sparseness of the term-document matrix, which reduces storage and calculation complexity. Sajari does this as well, but in a slightly different way - the term space is no longer stored as a matrix, but instead as a reverse-index column-based data store for each term. This means that each term takes up very little space (e.g. computer memory) and yet no information is lost.

[0157] The creation of the concept space works somewhat similarly to LSA in that it looks for terms with similar document distributions: the higher the correlation between two terms in all the documents they are in, the more likely they are to be related conceptually. The key difference between LSA and Sajari search methods is that Sajari does not combine terms and reduce them to the more compact "concept"; instead, the terms with the highest concept correlation are stored with each respective term. The level of correlation is also stored, so some correlations are stronger than others. The correlations may or may not be reversible. For instance the term "instrument" may appear in many documents where the word "trombone" appears, but "trombone" may appear in a very small number of documents where the term "instrument" appears. In this sense the conceptual connection strength is similar to Bayes' Theorem (http://en.wikipedia.org/wiki/Bayes'_theorem).
The probability of "trombone" being relevant to the input query "instrument" is used to determine if "trombone" should be added to the query and, if so, how much its results should be boosted by appropriately setting its sub query weighting. In practice we typically use a variation on this formula that is reversible, mainly because otherwise we would see that every term is conceptually correlated to popular terms like "the". Using a reversible formula, popular terms are quickly excluded with very low correlation scores. The alternative is to instead use a threshold score to omit terms that are too popular from the calculation. In that case a variant of a bidirectional probability calculation like Bayes' Theorem can work well. It is also possible to add conceptual terms manually, which can be used for synonyms, negative concepts and other corner cases.

[0158] Once concept terms are stored with each term, searches will include expansion into their concept space. The actual search itself involves turning the input text into a list of terms. These input text terms then form the basis for queries to be run in parallel. For each of these parallel queries, the conceptual terms for each term are added as additional sub-queries to those being run in parallel. This effectively means a single term can be translated into multiple terms for parallel search. This process can also vary in size (the number of conceptual terms added per term). It can also occur multiple times, so for example a Shotgun coefficient of 5 would add up to 5 additional terms for each term (if they exist with a strong enough concept coefficient). If a coefficient of 5 and a depth of 2 was used, then the number of terms added per term is up to $5^2 = 25$. This means we can effectively turn just a few keywords into a small document. In practice the usage of concept expansion can be varied depending on the volume of text input, the initial result intersection, the initial term score, etc.
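The expansion of a single term by a Shotgun coefficient and depth can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `expand_term`, the `concept_store` contents and its correlation values, the `min_strength` threshold parameter, and in particular the choice to multiply strengths along an expansion path at depths greater than 1 are all hypothetical and are not specified in the description above.

```python
def expand_term(term, concept_store, coefficient, depth, min_strength=0.0):
    """Return {term: weight} for `term` plus its concept-space neighbours.

    `coefficient` caps how many related terms are added per term and
    `depth` is how many times the expansion is repeated.
    """
    expanded = {term: 1.0}
    frontier = [(term, 1.0)]
    for _ in range(depth):
        next_frontier = []
        for parent, parent_weight in frontier:
            # Strongest correlations first, capped at `coefficient` per term.
            related = sorted(
                concept_store.get(parent, {}).items(),
                key=lambda item: -item[1],
            )[:coefficient]
            for child, strength in related:
                # Skip weak correlations and terms already reached
                # (the first path found to a term is the one kept).
                if strength < min_strength or child in expanded:
                    continue
                # Assumption: strengths multiply along the expansion path.
                weight = parent_weight * strength
                expanded[child] = weight
                next_frontier.append((child, weight))
        frontier = next_frontier
    return expanded


# Hypothetical stored concept terms: term -> {related term: correlation}.
concept_store = {
    "investment": {"stocks": 0.6, "real estate": 0.5, "bonds": 0.4},
    "stocks": {"investment": 0.6, "dividends": 0.5},
}

shallow = expand_term("investment", concept_store, coefficient=2, depth=1)
deep = expand_term("investment", concept_store, coefficient=2, depth=2)
print(shallow)
print(deep)
```

With a coefficient of 2 and a depth of 1 the term "investment" gains its two strongest related terms; at a depth of 2 the term "dividends" is also reached through "stocks", with a correspondingly smaller weight.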
[0159] The calculation would be done as follows: the strength of the relationship can be expressed as a number between 0 and 1, where 0 is no correlation and 1 is perfectly correlated. It is roughly calculated by:

$$S_{a \cap b} = \frac{N_{a \cap b}}{\max(N_a, N_b)},$$

[0160] where:
$S_{a \cap b}$ is the relationship strength between term "a" and "b";
$N_a$ is the total number of documents that the term "a" appears in;
$N_b$ is the total number of documents that the term "b" appears in; and
$N_{a \cap b}$ is the total number of documents that both "a" and "b" appear in.

[0161] Various modifications of the above formula are used in the calculations to handle edge cases and produce a better overall measure of correlation.

[0162] The strength of the relationship in the above calculation is bi-directional, so $S_{a \cap b} = S_{b \cap a}$. The rationale behind the bi-direction can be explained with an example. Consider the three terms "the", "car", and "automotive". The word "the" is very likely to be in many of the same documents as "car" and "automotive". However it is very unlikely that this is reciprocated. The words "car" and "automotive" on the other hand are far more likely to both share a high percentage of each other's documents. The calculation can also be done uni-directionally, which can be more appropriate if $N_a \gg N_b$ or vice versa.
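The relationship strength described above can be computed directly from the sets of documents containing each term. In the sketch below the document sets are invented for illustration, the function names are hypothetical, and `directional_strength` is only one possible reading of the uni-directional calculation (the share of the documents containing "a" that also contain "b"), which the description does not define.

```python
def relationship_strength(docs_with_a, docs_with_b):
    """Bi-directional strength: shared documents / max(N_a, N_b)."""
    n_a, n_b = len(docs_with_a), len(docs_with_b)
    if n_a == 0 or n_b == 0:
        return 0.0
    return len(docs_with_a & docs_with_b) / max(n_a, n_b)


def directional_strength(docs_with_a, docs_with_b):
    """Assumed uni-directional variant: shared documents / N_a."""
    if not docs_with_a:
        return 0.0
    return len(docs_with_a & docs_with_b) / len(docs_with_a)


# Hypothetical collection of 100 documents, identified by number.
docs_the = set(range(100))            # "the" appears everywhere
docs_car = set(range(0, 10))          # "car" appears in 10 documents
docs_automotive = set(range(2, 12))   # "automotive" shares 8 of them

s_the_car = relationship_strength(docs_the, docs_car)
s_car_auto = relationship_strength(docs_car, docs_automotive)
print(s_the_car)   # 0.1
print(s_car_auto)  # 0.8

# The uni-directional variant is not symmetric.
print(directional_strength(docs_car, docs_the))  # 1.0
print(directional_strength(docs_the, docs_car))  # 0.1
```

Dividing by the larger of the two document counts is what pushes the score for a very popular term such as "the" towards zero, while terms with similar distributions such as "car" and "automotive" retain a high score.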
EXAMPLE

[0163] A particularly useful example of the Sajari Shotgun search method can be seen by working through the example mentioned above where the user enters an input search query comprising only the word "jaguar". Using a shotgun coefficient of 4 and a depth of 1, "jaguar" becomes an array of input terms with their associated conceptual relation coefficient, CR:
{"jaguar": 1, "car": 0.5, "software": 0.1, "cat": 0.4, "jacksonville": 0.2}

[0164] This means that "car" is the most conceptually correlated concept term to the term "jaguar" with a score of 0.5; "jaguar" is indeed a type of car, but it is also an Apple operating system, a big cat and the Jacksonville football team. Note: the single word "jaguar" does not have enough context to determine what the user wants.

[0165] When the above search runs, documents with only the term "car" in them will be worth half the score of those with only the term "jaguar". Those with both will be worth 1 + 0.5 = 1.5 times the term score associated with the term "jaguar". In this case the results are likely to be similar to the single search "jaguar", but the additional information helps to sort the results in the order of those most likely to be conceptually close, in this case those with the word "car". Obviously these dynamics become much more complicated when multiple terms or even a full document is used as the input search query instead.

[0166] Another unique aspect of the Sajari Shotgun search method is the use of negative concept space, which can help to reduce the inclusion of results with incorrect context. Negative shotgun terms have a negative conceptual relation coefficient CR, which means results that include them will actually have their result score reduced. These typically work well with double and triple terms, so going back to the "jaguar" example, we will expand on a similar example where the user is searching for "jacksonville jaguar".
In this case we have 3 sub-queries (Q1, Q2 & Q3) in our search, where:
Q1 = "jacksonville"
Q2 = "jaguar"
Q3 = "jacksonville jaguar"

[0167] Each sub-query would have an associated Shotgun array for the concept space, but we will look in particular at the double term sub-query concept space for Q3 "jacksonville jaguar":
{"jacksonville jaguar": 1, "football": 0.7, "nfl": 0.4, "schedule": 0.4, "car": -0.5, "cat": -0.3}

[0168] In the above example there are some positive shotgun terms and some negative ones. The negative terms are essentially identifying terms which will likely cause an incorrect concept space association. In this case, "car" and "cat", having negative weightings, are identified to be incorrectly associated with Q3. So essentially, these negative coefficients help to dampen or even fully cancel out the inclusion of other shotgun terms and associated results. One added advantage for Sajari is that because users are constantly indicating whether they like or dislike search results as part of the refinement process, when a sub-query like Q3 above is the combination of other sub-queries, in this case Q1 and Q2, negative and positive interactions can be used to detect both correct and incorrect concept space associations for double word terms. For the above example, previously "car" and "cat" had both consistently contributed to the score for results which were marked as not so good when "jacksonville jaguar" was in the query (Q3) and both terms had been added as concepts positively related to Q1. Although this seems trivial for a simple query such as this, with larger queries like documents, the ability to infer and auto correct context via concept space is extremely strong.

[0169] An obvious question when looking at the above examples is to ask why you would want to turn a few words into many words, as this seemingly complicates the calculation and adds little value to the results.
Although it does increase the calculation processing cost, the added information actually allows even a single keyword search to extend into the concept space to best order the matching results (statistically the highest chance of being related conceptually). More importantly, when used with Sajari search methods, it also gives much more additional information to allow the user to refine their results to get even better results as there are more inherent connections between the query and the results.

Shotgun term addition and dynamic term boosting combined

[0170] So term boosting is the adjustment of sub-query importance weighting and Shotgun term addition is the expansion of the query terms into the concept space. Together these become a powerful way to bend the context of a set of search results to whatever the user is looking for. For the above query example of "jaguar", in traditional search there is nothing to dynamically boost by adjusting weightings, as there is only one term to adjust. With LSA methods, a single keyword only translates to a concept, so again there is nothing to boost and also there is no guarantee the top result will even include the term "jaguar" as it is now only part of a concept. With Shotgun, the results will be focussed on both the term "jaguar" and its surrounding concepts - the concept space (and term space for multiple term input queries) can then be adjusted via user feedback, which is used to dynamically boost the user's desired context of the entered text.

[0171] To expand on the previous example of a single search term "jaguar": in that case the top result has the context of "jaguar" being related to a car (i.e. it includes both "jaguar" and "car" terms), but if the user was to indicate this is not what they are looking for (via a click or some other interaction), then dynamic boosting would recognise the terms "car" and "jaguar" are both intersecting (between the query and negative result) and the interaction is negative, hence both would be reduced in value (for this example we will reduce by half). Because shotgun terms are relative to their parent, reducing "jaguar" has no effect as it reduces all the terms relatively, but reducing "car" has a much different effect. In this case the new query array (normalised) becomes:
{"jaguar": 1, "car": 0.25, "software": 0.1, "cat": 0.4, "jacksonville": 0.2}

[0172] The concept space has moved away from the concept of jaguar referring to a "car" and closer to jaguar referring to a "cat" and this is now what the user sees in their top results. This process can be repeated and each time the results adjust the concept space accordingly. The interactions can also be positive, in which case the intersecting concept spaces are boosted positively. Obviously with multiple input terms, this effect can become much more complicated, but the benefits remain.

[0173] Overall Shotgun is a unique and interesting way to extend search from the term space into a combined term and concept space hybrid. Combined with Sajari technology on search refinement (boosting) as disclosed in US 2012-0278341 A1 (the contents of which are incorporated wholly herein by cross-reference), this is a powerful way to search large volumes of information by arranging results in order of the user's desired context.

[0174] Figure 2 depicts a basic arrangement of the Shotgun search method 200 as disclosed herein for searching a plurality of documents and retrieving one or more documents relevant to a search query 210.
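As a rough illustration of the feedback step in the "jaguar" example of paragraphs [0171] and [0172], the following Python sketch scores two hypothetical documents against the concept array, applies a negative interaction to the top result, and re-scores. The function names, the two documents and the simplification of leaving the parent term at 1 (rather than reducing every term and then normalising) are assumptions; only the coefficients and the halving factor come from the example.

```python
def score_document(doc_terms, concept_weights):
    """Sum the coefficients of the query/concept terms the document contains."""
    return sum(w for term, w in concept_weights.items() if term in doc_terms)


def apply_negative_feedback(concept_weights, doc_terms, parent, factor=0.5):
    """Scale down each concept term that intersects the disliked document.

    The parent term is left unchanged, which is equivalent to reducing it
    and then normalising the array so that the parent is 1 again.
    """
    return {
        term: (w * factor if term in doc_terms and term != parent else w)
        for term, w in concept_weights.items()
    }


# Concept array for "jaguar" from the example (coefficient 4, depth 1).
weights = {"jaguar": 1.0, "car": 0.5, "software": 0.1, "cat": 0.4, "jacksonville": 0.2}

# Hypothetical documents, represented by the set of terms they contain.
car_doc = {"jaguar", "car"}
cat_doc = {"jaguar", "cat"}

before = (score_document(car_doc, weights), score_document(cat_doc, weights))

# The user marks the car document as not relevant.
updated = apply_negative_feedback(weights, car_doc, parent="jaguar")
after = (score_document(car_doc, updated), score_document(cat_doc, updated))

print(before)  # car document ranks above cat document
print(after)   # cat document now ranks above car document
```

After the negative interaction the coefficient for "car" falls from 0.5 to 0.25, so the document about the cat overtakes the document about the car in the ordering.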
Referring to the Figures, the method 200 comprises accepting an input comprising an input search query 210 comprising at least one or a plurality of query text terms 201 to 205. The identified query text terms 201 to 205 may each comprise one or more than one word or term of the input query, for example a query term 201 to 205 may comprise two, three, four or more words. Each multiple word term identified in the input query is treated in the same manner as each identified single word term. The query text terms 201 to 205 are identified and, for each identified query text term 201 to 205 in said input search query 210, the method comprises querying 220 a document database 221 comprising a representation of each of the plurality of documents 215 to be searched to identify one or more of said documents 215 in said database 221 of relevance to said query text term 201 to 205. The representation of each of the plurality of documents 215 in the database 221 comprises an index of document text terms (i.e. a listing of each of the text terms in each of the documents 215). Each of the documents that are identified as relevant to the particular associated text term 201 to 205 respectively necessarily comprises an intersection between the query text term and the document text terms. The result of the querying 220 of the database 221 is a plurality of term result lists 201a, 202a, 203a, 204a and 205a. Each of the term result lists 201a, 202a, 203a, 204a and 205a comprises a listing of one or more documents 215 in the database 221 which are deemed by the search to be of relevance to the corresponding query text terms 201 to 205 respectively.

[0175] Each of the term result lists 201a, 202a, 203a, 204a and 205a further comprises a result weight 201b, 202b, 203b, 204b and 205b respectively, relating the documents identified in each result list to the strength of the association to its associated query text term.
With respect to said result weight assigned to each said document identified as relevant to said query text terms, the plurality of term result lists 201a, 202a, 203a, 204a and 205a are combined 230 to form a final result document list 240 comprising a listing of a plurality of result documents, each said result document being of relevance to at least one of said query text terms and the listing of result documents being weighted having regard to the result weightings for each term result list 201a, 202a, 203a, 204a and 205a. Finally, a representation of each of said result documents in result document list 240 is output (not shown) to the user via a visual display device (typically a computer monitor). Method 200 may further comprise an optional refinement loop 250 whereby the visual display apparatus is adapted to receive feedback 260 from the user on the relevance of each of the documents in result list 240. For example the visual display apparatus may comprise checkboxes 245 where the user can provide either a positive indication of relevance (by checking the 'tick' checkbox associated with the result document) or a negative indication of relevance (by checking the 'cross' checkbox associated with the result document). The positive and negative indications of relevance indicate respectively that the document is or is not relevant to the context of interest of said user. The refinement loop 250 may be implemented in real time; for example, each time the user makes a positive or negative indication of relevance of any one of the result documents, the method may, in real time, update the search query based on the user feedback and display a refined listing of result documents.
Alternatively, the method, and systems for enacting the method, may be configured to accept user relevance input on a plurality of the result documents and the refinement step may be initiated by the user once they have completed giving feedback on a plurality of result documents. In this manner it will be appreciated that the user does not need to give feedback on every one of the result documents before applying the refinement loop 250; refinement may be based on feedback on only a single or a small number of the documents in result list 240.

[0176] Based on the user input on the relevance of one or more of the documents in the result document list 240, the method may refine 270 the search query terms by determining further text terms. For example, terms that appear in documents receiving a positive indication of relevance may be added to the search query terms and/or terms that appear in documents receiving a negative indication of relevance may be subtracted from the search query terms. Once the refined 'augmented' search query terms have been determined, the method may comprise querying 280 the document database 221 to identify one or more documents 215 in the document database 221 of relevance to said terms of the augmented search query. Again multiple result lists are generated and combined to form a refined result list which is output to a visual display apparatus for display to the user. It will be appreciated that refinement loop 250 may be repeated many times.

[0177] Referring now to Figure 3 there is depicted an enhanced concept space search method 300 (where like elements and steps with respect to method 200 of Figure 2 are represented with like numerals).
In method 300, prior to querying the document database for documents of relevance to each of the query terms 201 to 205 in input query 210, for each query term identified in the search query, the method comprises querying 310 a term database 311 to identify one or more additional text terms of (contextual) relevance to the particular identified query term, where such additional terms do not appear in the original search query. From the identified additional text terms of relevance, the method comprises forming a plurality of augmented term sets 301, 302, 303, 304 and 305, each augmented term set comprising at least one query text term 201 to 205 as identified in the search query 210 and one or more of said additional text terms of relevance identified in and retrieved from term database 311. Then, for each of the plurality of augmented term sets 301 to 305, the method comprises querying document database 221 to identify one or more of said documents in said database of relevance to said query text term or to said additional text term in each said augmented term set 301 to 305. The result of the querying 320 of the database 221 is a plurality of augmented term result lists 301a, 302a, 303a, 304a and 305a. Again, result weights 301b, 302b, 303b, 304b and 305b are assigned to the augmented term result lists and, with respect to the result weights assigned to each augmented term result list identified as relevant to the augmented term sets, the plurality of augmented term result lists 301a, 302a, 303a, 304a and 305a are combined 330 to form a final result document list 340 comprising a listing of a plurality of result documents, each said result document being of relevance to at least one of said query text terms and the listing of result documents being weighted having regard to the result weightings for each augmented term result list 301a, 302a, 303a, 304a and 305a.
Finally, a representation of each of said result documents in result document list 340 is output (not shown) to the user via a visual display device (typically a computer monitor). Further enhancement of the result list 340 may be achieved through enactment of refinement loop 250, similarly as discussed above in relation to method 200 in Figure 2.

[0178] The methods of searching a plurality of documents and retrieving one or more documents relevant to a search query (and associated sub-methods, e.g. methods 200 and 300 depicted in Figures 2 and 3) may be implemented using a computing device / computer system 400, such as that shown in Figure 4, wherein the processes of Figures 2 and 3 may be implemented as software, such as one or more application programs executable within the computing device 400. In particular, the steps of methods 200 and 300 are effected by instructions in the software that are carried out within the computer system 400. The instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user. The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 400 from the computer readable medium, and then executed by the computer system 400. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 400 preferably effects an advantageous apparatus for searching a plurality of documents and retrieving one or more documents relevant to a search query.
[0179] With reference to Figure 4, an exemplary computing device 400 is illustrated. The exemplary computing device 400 can include, but is not limited to, one or more central processing units (CPUs) 401 comprising one or more processors 402, a system memory 403, and a system bus 404 that couples various system components including the system memory 403 to the processing unit 401. The system bus 404 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
[0180] The computing device 400 also typically includes computer readable media, which can include any available media that can be accessed by computing device 400 and includes both volatile and non-volatile media and removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computing device 400. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

[0181] The system memory 403 includes computer storage media in the form of volatile and/or non-volatile memory such as read only memory (ROM) 405 and random access memory (RAM) 406. A basic input/output system 407 (BIOS), containing the basic routines that help to transfer information between elements within computing device 400, such as during start-up, is typically stored in ROM 405.
RAM 406 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 401. By way of example, and not limitation, Figure 4 illustrates an operating system 408, other program modules 409, and program data 410.

[0182] The computing device 400 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example only, Figure 4 illustrates a hard disk drive 411 that reads from or writes to non-removable, non-volatile magnetic media. Other removable/non-removable, volatile/non-volatile computer storage media that can be used with the exemplary computing device include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 411 is typically connected to the system bus 404 through a non-removable memory interface such as interface 412.

[0183] The drives and their associated computer storage media discussed above and illustrated in Figure 4 provide storage of computer readable instructions, data structures, program modules and other data for the computing device 400. In Figure 4, for example, hard disk drive 411 is illustrated as storing an operating system 413, other program modules 414, and program data 415. Note that these components can either be the same as or different from operating system 408, other program modules 409 and program data 410. Operating system 413, other program modules 414 and program data 415 are given different numbers here to illustrate that, at a minimum, they are different copies.

[0184] The computing device also includes one or more input/output (I/O) interfaces 430 connected to the system bus 404 including an audio-video interface that couples to output devices including one or more of a video display 434 and loudspeakers 435.
Input/output interface(s) 430 also couple(s) to one or more input devices including, for example, a mouse 431, keyboard 432 or touch sensitive device 433 such as, for example, a smart phone or tablet device.

[0185] Of relevance to the descriptions below, the computing device 400 may operate in a networked environment using logical connections to one or more remote computers. For simplicity of illustration, the computing device 400 is shown in Figure 4 to be connected to a network 420 that is not limited to any particular network or networking protocols, but which may include, for example, Ethernet, Bluetooth or IEEE 802.X wireless protocols. The logical connection depicted in Figure 4 is a general network connection 421 that can be a local area network (LAN), a wide area network (WAN) or other network, for example, the internet. The computing device 400 is connected to the general network connection 421 through a network interface or adapter 422 which is, in turn, connected to the system bus 404. In a networked environment, program modules depicted relative to the computing device 400, or portions or peripherals thereof, may be stored in the memory of one or more other computing devices that are communicatively coupled to the computing device 400 through the general network connection 421. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between computing devices may be used.

[0186] It will be appreciated that the methods/apparatus/devices/systems described/illustrated herein at least substantially provide methods and systems for searching a plurality of documents and retrieving one or more documents relevant to a search query.
[0187] The methods and systems for searching a plurality of documents and retrieving one or more documents relevant to a search query as described herein, and/or shown in the drawings, are presented by way of example only and are not limiting as to the scope of the invention. Unless otherwise specifically stated, individual aspects and components of the methods and systems may be modified, or may have substituted therefor known equivalents, or as yet unknown substitutes such as may be developed in the future or such as may be found to be acceptable substitutes in the future. The methods and systems may also be modified for a variety of applications while remaining within the scope and spirit of the claimed invention, since the range of potential applications is great, and since it is intended that the presently disclosed methods and systems be adaptable to many such variations.

Claims (5)

1. A method for searching a plurality of documents and retrieving one or more documents relevant to a search query, said search query comprising one or more of a plurality of text terms, said method comprising:
a) accepting an input comprising an input search query comprising at least one or a plurality of query text terms;
b) identifying said one or plurality of query text terms in said input search query;
c) for each identified query text term in said input search query:
i) querying a database comprising a representation of each of said plurality of documents to identify one or more of said documents in said database of relevance to said query text term;
ii) wherein said representation of each of said plurality of documents in said database comprises an index of document text terms in each said document and each said document identified as relevant to said text term comprises an intersection between said query text term and said document text terms;
iii) assigning a result weight to said one or more documents identified as being of relevance to said query text term;
d) with respect to said result weight assigned to each said document identified as relevant to said query text terms, combining the results of each said database query to form a result document list comprising a plurality of result documents, each said result document being of relevance to at least one of said query text terms; and
e) outputting a representation of each of said result documents.
2. A method as claimed in claim 1 wherein in step e), the outputting of a representation of each of said result documents comprises the steps of:
e1) ordering each said result document on the basis of the relevance to each said query text term and said result weight assigned to said identified documents; and
e2) outputting a representation of each of said result documents as an ordered result list of said plurality of result documents.
3. A method for refining the results of a search, the search results comprising a representation of a selected plurality of reference documents, such reference documents displayed being of relevance to an input text portion comprising one or more search terms, the selected plurality of reference documents comprising a subset of a plurality of documents in a database, the method comprising the steps of:
a) forming a local term index from the search terms, the local term index comprising one or more text terms, each local text term associated with a local text term weight;
b) receiving and displaying the search results on a user interface, the user interface comprising input means for receiving user input with respect to one or more of the plurality of the displayed reference documents;
c) accepting user input on one or more of the displayed reference documents;
d) re-forming the local term index on the basis of the user input;
e) on the basis of the re-formed input local term index, querying the database to identify one or more documents of enhanced relevance to the input text portion; and
f) outputting a representation of the further identified reference documents of enhanced relevance.
4. A system for refining the results of a search, the search results comprising a representation of a selected plurality of documents of relevance to one or more search terms, the selected plurality of documents comprising a subset of a plurality of documents in a database, the system comprising:
a processor adapted for forming a local term index from the search terms, the local term index comprising one or more text terms, each local text term associated with a local text term weight;
query module adapted for receiving and displaying the search results on a user interface, the user interface comprising input means for receiving user input with respect to each of the displayed reference documents;
user input interface adapted for accepting user input on one or more of the displayed documents;
refinement module for analysing the user input, re-forming the input local term index on the basis of the user input and, using said query module, querying the database on the basis of the re-formed input local term index to identify one or more documents of enhanced relevance to the input text portion; and
output interface adapted for outputting a representation of the further identified reference documents of enhanced relevance.
5. A system for analysing an input text portion and retrieving documents relevant to the text portion, the system comprising:
input interface adapted for receiving an input comprising an input text portion;
parsing module adapted to identify at least one text term in the text portion;
assignment module adapted for assigning at least one weight associated with the at least one text term;
indexing module adapted for forming an input local term index of the at least one text term and at least one associated local term weight, wherein the at least one associated local text term weight is determined with reference to a global term index stored in a database, the global term index comprising a plurality of global text terms and associated global text term weights, and being formed from a plurality of reference documents, wherein a representation of each of the reference documents is stored in the database;
query module adapted for querying the database to identify one or more relevant reference documents with respect to the input text portion; and
output interface adapted for outputting a representation of the identified relevant reference documents, wherein the outputting of a representation of each of said result documents comprises the steps of:
e1) ordering each said result document on the basis of the relevance to each said query text term and said result weight assigned to said identified documents; and
e2) outputting a representation of each of said result documents as an ordered result list of said plurality of result documents.
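For illustration only, the local term index of claims 3 to 5 might be sketched as below. The function names `build_local_term_index`, `reform_local_term_index` and `rank_documents` are hypothetical. The claims state only that local weights are determined with reference to a global term index and that the local index is re-formed on the basis of user input; the specific formulas here (term count multiplied by global weight, and a fixed `step` adjustment per feedback vote) are assumptions made for the sketch, as is the whitespace tokenisation.

```python
from collections import Counter


def build_local_term_index(text, global_index):
    """Form a local term index: term -> local weight.

    Assumed weighting: occurrences in the input text times the global weight.
    Terms absent from the global term index are ignored.
    """
    counts = Counter(text.lower().split())
    return {term: count * global_index[term]
            for term, count in counts.items() if term in global_index}


def reform_local_term_index(local_index, doc_terms, vote, step=0.5):
    """Re-form the local index from feedback on one displayed document.

    vote is +1 (relevant) or -1 (not relevant); each term of the rated
    document has its weight moved by vote * step, and terms whose weight
    falls to zero or below are dropped.
    """
    updated = dict(local_index)
    for term in set(doc_terms):
        updated[term] = updated.get(term, 0.0) + vote * step
    return {term: w for term, w in updated.items() if w > 0}


def rank_documents(local_index, documents):
    """Order documents by the summed local weights of the terms they contain."""
    scored = []
    for doc_id, terms in documents.items():
        score = sum(local_index.get(term, 0.0) for term in set(terms))
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


global_index = {"solar": 2.0, "panel": 1.5, "roof": 1.0}
local_index = build_local_term_index("Solar panel panel", global_index)
local_index = reform_local_term_index(local_index, ["panel", "roof"], vote=+1)
documents = {"d1": ["solar", "roof"], "d2": ["panel"], "d3": ["garden"]}
print(rank_documents(local_index, documents))
```

Re-forming the index after positive feedback introduces the term "roof", so a subsequent query can reach documents that the original input text portion would not have matched on that term.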
AU2014100238A 2013-03-15 2014-03-14 Search methods and systems Ceased AU2014100238A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361786784P 2013-03-15 2013-03-15
US61/786,784 2013-03-15

Publications (1)

Publication Number Publication Date
AU2014100238A4 (en) 2014-04-17

Family

ID=50479219

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2014100238A Ceased AU2014100238A4 (en) 2013-03-15 2014-03-14 Search methods and systems

Country Status (1)

Country Link
AU (1) AU2014100238A4 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858210A (en) * 2018-08-17 2020-03-03 阿里巴巴集团控股有限公司 Data query method and device
CN110858210B (en) * 2018-08-17 2023-11-21 阿里巴巴集团控股有限公司 Data query method and device

Similar Documents

Publication Publication Date Title
Esteva et al. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization
Hu et al. Improved lexically constrained decoding for translation and monolingual rewriting
JP7282940B2 (en) System and method for contextual retrieval of electronic records
Singh et al. Relevance feedback-based query expansion model using ranks combining and Word2Vec approach
EP2482204B1 (en) System and method for information retrieval from object collections with complex interrelationships
Jones et al. Generating query substitutions
Sawant et al. Neural architecture for question answering using a knowledge graph and web corpus
US8666994B2 (en) Document analysis and association system and method
US8903825B2 (en) Semiotic indexing of digital resources
Sarkar et al. A new approach to keyphrase extraction using neural networks
US20070208726A1 (en) Enhancing search results using ontologies
Wang et al. Targeted disambiguation of ad-hoc, homogeneous sets of named entities
US20160005196A1 (en) Constructing a graph that facilitates provision of exploratory suggestions
JP5616444B2 (en) Method and system for document indexing and data querying
Trillo et al. Using semantic techniques to access web data
Khoo et al. Augmenting Dublin core digital library metadata with Dewey decimal classification
US20230111911A1 (en) Generation and use of content briefs for network content authoring
Lynn et al. An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms
Alaofi et al. Where Do Queries Come From?
AlJadda et al. Crowdsourced query augmentation through semantic discovery of domain-specific jargon
Mandreoli et al. Dealing with data heterogeneity in a data fusion perspective: models, methodologies, and algorithms
Nashipudimath et al. An efficient integration and indexing method based on feature patterns and semantic analysis for big data
Collarana et al. A question answering system on regulatory documents
Gupta et al. Frequent item-set mining and clustering based ranked biomedical text summarization
Haque et al. Approaches and trends of automatic bangla text summarization: challenges and opportunities

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry