WO2011072172A1 - System and method for quickly determining a subset of irrelevant data from large data content - Google Patents

System and method for quickly determining a subset of irrelevant data from large data content

Info

Publication number
WO2011072172A1
Authority
WO
WIPO (PCT)
Prior art keywords
documents
data
search
subset
relevant
Prior art date
Application number
PCT/US2010/059775
Other languages
English (en)
Inventor
Andrew Kraftsow
Mary K. O'Brien
Original Assignee
Renew Data Corp.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renew Data Corp.
Publication of WO2011072172A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion

Definitions

  • Provisional Patent Application No. 61/285,168 filed on December 9, 2009 and entitled "System And Method For Quickly Determining A Subset Of Irrelevant Data From Large Data Content", the contents of which are incorporated herein by reference and are relied upon here.
  • the Provisional Patent Application No. 61/285,168 describes a system and method that operates independently or in conjunction with systems and methods described in related applications set forth below, the contents of each of which were incorporated by reference in the provisional application and therefore constitute a part of the technical description in the specification.
  • the present invention relates to systems and methods involving techniques for review and analysis of content data (in paper or electronic form) such as a collection of documents.
  • paper form must be converted and represented in electronic form (e.g., by well-known optical character recognition (OCR) techniques for capturing paper and portable document format (PDF created by Adobe Systems) form that is searchable).
  • the present invention relates to a system and method for searching indexed content data to quickly and efficiently locate subsets of data that are either relevant or irrelevant to an issue of interest to a user.
  • this application relates to a system and method for quickly determining a subset of data that is irrelevant to an issue of interest in order to isolate only data that is of possible interest to the issue of interest.
  • the present system and method also relates to utilizing advanced organizing, searching, tagging, and highlighting techniques for identifying and isolating the irrelevant data and/or relevant data with a high degree of confidence or certainty from large quantities of content data.
  • search engine technology is used to make the document review process more manageable.
  • quality and completeness of search results resulting from such conventional search engine techniques are often indefinite and therefore, unreliable. For example, one does not know whether the search engine used has indeed found every relevant document, at least not with any certainty.
  • the main search engine technique currently used is a keyword or a free-text search coupled with indexing of terms in the documents.
  • a user enters a search query consisting of one or more words or phrases and the search system uncovers all of the documents that have been indexed as having one or more of those words or phrases in the search query. As the search system indexes more documents that contain the specified search terms, they are revealed to the user.
  • such a search technique only marginally reduces the number of documents to be reviewed, and the large quantities of documents returned cannot be usefully examined by the user. There is absolutely no guarantee that the desired information is contained in any of the documents that are uncovered.
  • search queries are typically developed with the object of finding every relevant document regardless of the specific nomenclature used in the document. This makes it necessary to develop lists of synonyms and phrases that encompass every imaginable word usage combination. In practice, the total number of documents retrieved by these queries is very large.
  • the present system and method quickly and efficiently determines a "subset" of data that is irrelevant to an issue of interest in order to isolate only data that is of "possible” interest to the issue of interest.
  • This approach and technology significantly reduces the manpower and resources necessary to isolate any data that is relevant, by quickly eliminating data of little or no interest and spending any effort only on data that may be possibly relevant.
  • the system and methods of the present invention perform an advanced search of vast amounts of content data, believed to contain a relatively low amount of relevant data (less than 50%), based on query terms, in order to retrieve a subset of responsive content data and thereby isolate the data that is irrelevant. Documents are searched to show absolute irrelevance with respect to the query terms.
  • the system and method considers all content data in a thread of correspondence (for example, an e-mail) and includes it in the subset of irrelevant data that is isolated.
  • the system also scans the content data in the thread and automatically identifies other data of interest, for example, contained in attachments and includes that data as well in the responsive set.
  • the system has an architecture, which incorporates an automatic query-builder.
  • human operators simply highlight the parts of the content data or document that seem pertinent to an issue(s) and the intelligent capabilities of the system, such as the software components of the system architecture, automatically formulate precise Boolean queries utilizing the highlighted parts of the text and then utilize those precise Boolean queries to search for irrelevant data.
  • the highlighted text that is identified need not be contiguous.
  • the system architecture utilizes functionality to construct the query. To construct the query, the system runs the highlighted text through a part-of-speech tagger module, which eliminates various parts of speech and eliminates stop-words.
  • the system has a capability, which executes some rules about the operator "within” and then builds the query.
  • the automatic query builder hardware and software of the system architecture also permits expert users to make some "AND” or “OR” decisions about non-contiguous highlights, for example, by holding down the CONTROL key on the computer keyboard, while executing the highlighting function.
  • This automatic query builder module significantly reduces the need for human operators.
  • users read the document, highlighting whatever language strings relate to the issues that they seek to address. The user associates each highlighted text to an issue (or multiple issues).
  • the automated query builder forms the queries, runs them in the background, and bulk tags the search result documents.
  • the system also displays a sample of randomly selected results so that the user can test the statistical certainty that the query was precise.
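The query-builder steps described above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the stop-word list, the function names `terms_from_highlight` and `build_query`, the default window of five words, and the output syntax are all assumptions, and a plain stop-word filter stands in for the part-of-speech tagger mentioned in the text.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is", "was", "that", "be"}


def terms_from_highlight(text):
    """Lowercase a highlighted span, split it into words, and drop stop-words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_query(highlights, joiner="AND", window=5):
    """Join the terms of each highlight with W/<window>, then join the
    (possibly non-contiguous) highlights with AND or OR."""
    if joiner not in ("AND", "OR"):
        raise ValueError("joiner must be 'AND' or 'OR'")
    clauses = []
    for text in highlights:
        terms = terms_from_highlight(text)
        if terms:
            clauses.append("(" + f" W/{window} ".join(terms) + ")")
    return f" {joiner} ".join(clauses)


# Two non-contiguous highlights combined with AND.
query = build_query(["the merger of Acme", "destroy the records"], joiner="AND")
print(query)
```

The `joiner` argument plays the role of the expert user's "AND" or "OR" decision about non-contiguous highlights.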
  • the system takes the input query, whether generated through an automated or manual means and generates every possible synonym of the query terms and generates synonym rings for each term and its synonyms.
  • the system then performs an AND Boolean operation taking every possible combination of the synonym rings and generates a query from the combinations.
  • a document is then determined to be irrelevant when it is not responsive to any of the queries that are posed.
  • the remaining set of "possibly" relevant data is much smaller than the original, entire set of data, as a result of which the remaining set can be more easily and efficiently searched for relevant data.
  • the system and method described here are used for isolating relevant data from a subset of data. It should be understood that any of the functionalities described here for isolating relevant data from a given set, can also be used to isolate irrelevant data.
  • the present invention also relates to a system and method for utilizing advanced searching, tagging, and highlighting techniques for identifying and isolating irrelevant or relevant data with a high degree of certainty from large quantities of content data (in paper or electronic form).
  • system and methods of the present invention can perform an advanced search of vast amounts of content data based on query terms, in order to retrieve a subset of responsive content data.
  • a probability of relevancy or degree of certainty is determined for a unit of content data or document in the returned subset, and the content data or document is removed from the subset if it does not reach a threshold probability of relevancy.
  • a statistical technique can be applied to determine whether remaining documents (that is, not in the responsive documents subset) in the collection meet a predetermined acceptance level.
  • the system considers all content data in a thread of correspondence (for example, an e-mail) and includes it in the subset of relevant data, when the user desires to find relevant data from a subset of documents.
  • the system also scans the content data in the thread and automatically identifies other data of interest, for example, contained in attachments and includes that as well.
  • the system assures greater efficiency, by taking the following steps: (a) randomly selecting a predetermined number of documents from remaining content data; (b) reviewing the randomly selected documents to determine whether the randomly selected documents include additional relevant documents; (c) if additional relevant documents are retrieved, identifying one or more specific terms in the additional content data that renders the data relevant and expanding the query terms with those specific terms, and running the search again with the expanded query terms.
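Steps (a) through (c) can be pictured as a loop. The sketch below is only one possible reading: the `review` and `find_new_terms` callbacks stand in for the human reviewer, matching is a simple substring test, and every name and default value is an assumption introduced for illustration.

```python
import random


def search(documents, terms):
    """Return the ids of documents whose text contains any query term."""
    return {d for d, text in documents.items() if any(t in text.lower() for t in terms)}


def feedback_search(documents, query_terms, review, find_new_terms,
                    sample_size=3, seed=0, max_rounds=10):
    """documents: dict of id -> text.
    review(text) -> bool is a stand-in for the human relevance judgment.
    find_new_terms(text) -> set of terms that make the text relevant."""
    rng = random.Random(seed)
    terms = set(query_terms)
    for _ in range(max_rounds):
        remaining = sorted(set(documents) - search(documents, terms))
        if not remaining:
            break
        # (a) randomly select documents from the remaining content data
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        # (b) review the sample for additional relevant documents
        new_terms = set()
        for doc_id in sample:
            if review(documents[doc_id]):
                new_terms |= find_new_terms(documents[doc_id])
        new_terms -= terms
        if not new_terms:
            break
        # (c) expand the query terms; the search runs again next round
        terms |= new_terms
    return terms, search(documents, terms)


docs = {
    1: "bribe paid to official",
    2: "the gift was delivered quietly",
    3: "lunch menu for friday",
}
terms, responsive = feedback_search(
    docs, {"bribe"},
    review=lambda text: "gift" in text,
    find_new_terms=lambda text: {"gift"},
    sample_size=5,
)
```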
  • feedback loop criteria ensure that content data that is relevant with a high degree of certainty and probability is shown early on to human reviewers.
  • content data that is isolated and queued up for consideration is usually ordered by custodian and chronology. Even if some other method is used, the order generally remains fixed throughout the isolating process.
  • the system and methods here use a heuristic algorithm for selecting the next content data unit or document that takes into account the disposition of the content data or documents previously seen by the reviewers. The algorithm operates in both an inclusive and an exclusive direction. Content data and documents are excluded from the isolating process if they contain any previously seen relevant language strings.
  • the database must be continuously updated during the isolating process to reflect the strings that human reviewers may discover.
  • the system described here permits modification of search routines based on human input of attributes contained in content data found to be relevant. Hence, content data in a queue for consideration may be moved up. For example, attributes such as author, date, subject (if email), size, document type and social network may be used.
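One way to picture moving queued content data up is a stable re-sort of the review queue by how many attribute values a document shares with documents already found relevant. The scoring rule, the function name `reprioritize`, and the attribute names below are assumptions; the source does not specify the heuristic.

```python
def reprioritize(queue, relevant_docs, attributes=("author", "doc_type")):
    """Stable-sort the queue so that documents sharing more attribute values
    with known relevant documents come first. Ties keep their original order."""
    seen = {attr: {d[attr] for d in relevant_docs if attr in d} for attr in attributes}

    def score(doc):
        return sum(1 for attr in attributes if doc.get(attr) in seen[attr])

    return sorted(queue, key=score, reverse=True)


queue = [
    {"id": 1, "author": "lee", "doc_type": "memo"},
    {"id": 2, "author": "kim", "doc_type": "email"},
    {"id": 3, "author": "kim", "doc_type": "memo"},
]
relevant = [{"id": 9, "author": "kim", "doc_type": "email"}]
ordered = reprioritize(queue, relevant)
```

Re-running `reprioritize` after each reviewed batch would let the queue order follow the reviewers' findings rather than staying fixed by custodian and chronology.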
  • instead of finding all content data relevant to an issue with a high degree of certainty, the system can search and isolate certain key content data of particular interest (e.g. "privileged" or "hot" documents).
  • Poisson distribution criteria demands that the relevance of object A has no impact on the relevance of object B.
  • To isolate "hot" data content, the system considers not only the text but also the author and recipient of the text. Therefore, the system searches for privileged or "hot" documents. The system has to remove duplicate documents at a different level and then has to recalculate the formulas based on the expected density of the subject matter that is being searched to determine sample size. To isolate select privileged data, the system uses precise and rigorous string identifications such as the topic in conjunction with noun, verb, or object sets.
  • the system incorporates an automatic query-builder.
  • human operators simply highlight the parts of the content data or document that seem relevant to an issue(s) and the software components of the system automatically formulate precise boolean queries utilizing the highlighted parts of the text.
  • the highlighted text need not be contiguous.
  • the system runs the highlighted text through a part-of-speech tagger, which eliminates various parts of speech and eliminates stop- words.
  • the system executes some rules about the operator "within” and then builds the query.
  • the automatic query builder aspect of the system also permits expert users to make some "AND" or "OR" decisions about non-contiguous highlights by holding down the CONTROL key while executing the highlighting function. This automatic query builder significantly reduces the need for human operators.
  • the Poisson distribution is a discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event.
  • users read the document, highlighting whatever language strings relate to the issues that they seek to address. The user associates each highlighted text to an issue (or multiple issues).
  • the automated query builder forms the queries, runs them in the background and bulk tags the search result documents.
  • the system also displays a sample of randomly selected results so that the user can test the statistical certainty that the query was precise.
  • FIG. 1 is a block diagram of a computer system or information terminal on which programs operate to implement the methods of the inventions described here.
  • FIG. 2 is a flow chart showing the information flow of the overall system and method.
  • FIG. 3 is a flow chart showing the setup procedure for the system and method described here.
  • FIG. 4 is a flow chart showing the searching and query generation procedure.
  • Fig. 5 is a block diagram of a computer system or information terminal on which programs can run to implement certain methods of the inventions described here.
  • Fig. 6 is a flow chart of an exemplary method of reviewing either vast collections of content data or data that is separated from data determined to be irrelevant in order to identify a subset of relevant content data.
  • Fig. 7 is a flow chart of an exemplary method for reviewing either vast collections of content data or data separated from data determined to be irrelevant from which to further identify relevant content data.
  • Fig. 8 is a flow chart of a method for reviewing a collection of content data or documents (either an entire collection or a subset separated from data determined to be irrelevant), in order to identify relevant documents from the collection, according to another exemplary embodiment.
  • Fig. 9 is a flow chart of a method for reviewing a collection of content data or documents to identify relevant documents from the collection (a subset of the entire data, or data separated from data determined to be irrelevant), according to another exemplary embodiment.
  • Fig. 10 is a flow chart of a method for reviewing a collection of content data or documents (either entire set of data or a subset of data isolated from data determined to be irrelevant data) to identify relevant documents from the collection, according to another exemplary embodiment.
  • Figs. 11A and 11B represent a flow chart for a workflow of a process including application of some of the techniques discussed here.
  • Fig. 12 is a flow chart of an automated query builder feature of the present system and method.
  • Fig. 13 is a flow chart of an example illustrating a database containing emails, attachments, and stand alone files from a corporate network, all which constitute the content data for review.
  • Fig. 14 is a flow chart of an exemplary embodiment of a "smart highlighter" feature of the present system and method.
  • the computer system referenced as "Anagram” has an architecture with functionalities that are configured to identify documents that are not relevant to a "query” posed, which is assembled from a "logical expression of the issue,” in order to quickly identify a data set of documents with little or no relevance to the query.
  • this invention will be described in the general context of computer-executable instructions, such as program modules within a system architecture comprising hardware and software.
  • program modules include routines, programs, objects, scripts, components, data structures, etc., that perform particular tasks or implement particular abstract data types.
  • program modules may be located in both local and remote memory storage devices.
  • the present invention may also be practiced in and/ or with personal computers (PCs), hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • FIG. 1 is a schematic diagram of an exemplary computing environment in which the present invention may be implemented.
  • the present invention may be
  • One or more computer programs may be included in the implementation of the system and method described in this application.
  • the computer programs may be stored in a machine-readable program storage device or medium and/ or transmitted via a computer network or other transmission medium.
  • Computer 10 includes CPU 11, program and data storage 12, hard disk (and controller) 13, removable media drive (and controller) 14, network communications controller 15 (for communications through a wired or wireless network, LAN or WAN; see 15A and 15B), display (and controller) 16 and I/O controller 17, all of which are connected through system bus 19.
  • a hard disk e.g. a removable magnetic disk or a removable optical disk
  • other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may also be used in the exemplary operating environment.
  • a number of program modules may be stored on the hard disk 13, magnetic disk, and optical disk, ROM or RAM, including an operating system, one or more application programs, other program modules, and program data.
  • a user may enter commands and information into the computing system 10 through input devices such as a keyboard (shown at 19), mouse (shown at 19) and pointing devices.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • serial port interface that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 21 or other type of display device is also connected to the system bus via an interface, such as a video adapter.
  • computers typically include other peripheral output devices (not shown), such as speakers and printers.
  • the program modules may be practiced using any computer languages including C, C++, assembly language, and the like.
  • CPU 11 indexes all data using a Boolean engine and extracts vocabulary words, which are all of the words contained in the data, and tracks the location and frequency of the word. This indexing operation is performed by an indexing module. All dictionary words are then processed through a different flow from the non-dictionary words. Dictionary words are words that are known in advance, words that are likely found in a dictionary, or by some other means, where words are defined and classified. A dictionary may be a specialized dictionary, such as the IEEE dictionary or any other such technical dictionary. The lemma of each dictionary word is identified. For example, the words “running,” “runs,” and “ran” would all be identified as "run” and then all dictionary words are hierarchically organized by the part of speech to which they belong.
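As a rough illustration of the indexing operation, the sketch below builds an inverted index keyed by lemma that records where each word occurs, from which frequency follows. The tiny `LEMMAS` table and the function names are hypothetical; a real system would rely on a full dictionary or lemmatizer, and part-of-speech grouping is omitted here.

```python
import re
from collections import defaultdict

# Hypothetical lemma table covering only the example words from the text.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def build_index(documents):
    """Map each lemma to {doc_id: [word positions]} so that both the
    location and the frequency of every vocabulary word are tracked."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
            index[LEMMAS.get(word, word)][doc_id].append(position)
    return index


def frequency(index, lemma):
    """Total number of occurrences of a lemma across all documents."""
    return sum(len(positions) for positions in index.get(lemma, {}).values())


docs = {"d1": "She ran home and runs daily", "d2": "Running is fun"}
index = build_index(docs)
```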
  • All non-dictionary words are verified with a spell checker and all words that are identified by the system to be possible proper spellings for the non-dictionary words are linked with the improper spellings.
  • the system can take the improperly spelled words into consideration as well.
  • This operation performed on the dictionary words by the system is then performed on those dictionary words that have been spell-checked.
  • the system sends the words to a manual operator interface by which an operator can properly classify the words.
  • These words for example, are trade terms or trade names that do not appear in a dictionary or are words that are improperly spelled to such a degree that the spell checker cannot select a proper spelling. If they are words not recognized by a dictionary, the words are simply hierarchically organized by the parts of speech that they belong to.
  • the next step of the process is the actual identification process (performed by an identification module), which begins with the building of various queries from an input Logical Expression of Issue (LEI).
  • the system identifies all synonyms of each of the LEI's concepts and creates and builds synonym rings from each, which represent every synonym for each concept of the LEI.
  • the system then creates every possible Boolean combination of the synonym rings for each query term preserving concept proximity expressions in the LEI.
  • the system takes an input LEI A && W/5 B && P/1 C, where A has two synonyms (A1 and A2), B has one synonym (B1), and C has no synonyms, and then generates the following additional queries: A && W/5 B1 && P/1 C, A1 && W/5 B && P/1 C, A1 && W/5 B1 && P/1 C, A2 && W/5 B && P/1 C and A2 && W/5 B1 && P/1 C.
  • W/5 means that the following word is within five words of the preceding word
  • P/1 means that the following word is within one paragraph of the preceding word.
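The expansion of an LEI into queries can be reproduced with a Cartesian product over the synonym rings. The sketch assumes the LEI has already been parsed into a list of concepts and the proximity operators between them; the function name `expand_lei` and that input shape are illustrative choices, not part of the source.

```python
from itertools import product


def expand_lei(concepts, operators, synonyms):
    """Build one query per combination of ring members, keeping the
    proximity operator between each pair of neighbouring concepts.

    concepts:  ordered concept names, e.g. ["A", "B", "C"]
    operators: proximity operators between neighbours, e.g. ["W/5", "P/1"]
    synonyms:  dict of concept -> list of synonyms (may omit concepts)
    """
    rings = [[c] + synonyms.get(c, []) for c in concepts]
    queries = []
    for combo in product(*rings):
        parts = [combo[0]]
        for op, term in zip(operators, combo[1:]):
            parts.append(f"&& {op} {term}")
        queries.append(" ".join(parts))
    return queries


queries = expand_lei(["A", "B", "C"], ["W/5", "P/1"], {"A": ["A1", "A2"], "B": ["B1"]})
for q in queries:
    print(q)
```

The first query is the original LEI and the remaining five are the additional queries. The count is the product of the ring sizes (here 3 × 2 × 1 = 6), which is why the number of generated queries grows quickly as synonyms are added.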
  • the entire index is then searched and the system tags all of the results with their appropriate issue code, which defines why they are results and optionally highlights the query terms contained in them.
  • the system tags all of the non-responsive documents as irrelevant.
  • the system by this operation identifies every item that could be matched to the query terms, regardless of the possible relevance of the various synonyms to provide a high level of confidence in the non-responsive documents being irrelevant to the initial query. There would then be left from the results, items that might be irrelevant based on the definition of the synonyms or items that do not reach some adequate level of relevancy, but do not have a near zero probability of relevancy as the non-responsive set have after the search is performed.
  • FIG. 2 illustrates the general flow of data through the systems and methods as recited in this specification.
  • Block 200 illustrates the step of acquiring the data set, which may be any collection of data. It may be a "Mail" database file or it may simply be a large collection of documents and files.
  • the system checks to see if the data set acquired is already indexed. If the data set is already indexed, the system advances to step 230; if it is not, the system first takes the necessary steps to index the data, as illustrated at step 220. After step 230, the system advances to step 240, where the system takes an input of a "Logical Expression Unit" and creates synonym rings, which are used to generate queries.
  • the system searches the set using the generated queries to identify irrelevant data based on the queries; the search returns a responsive set of data items.
  • the system proceeds to step 270, where a data set of irrelevant data is created from the data items that were not responsive to the search in step 250.
  • a data set of possibly relevant data items is created from the responsive set of data items returned by step 250, which is then fed into a search engine to find the relevant data at step 290.
  • FIG. 3 shows the general flow for the setup of the input acquired data set.
  • the system indexes and extracts all vocabulary words from within the acquired data set. This step includes tracking the frequency and location of all of the vocabulary words.
  • the system reviews the index created in step 310.
  • the system encodes the words in every single data item.
  • the system examines the structure of the encoded words to identify vocabulary words.
  • the system extracts all vocabulary words.
  • the system uses the communications between social networks and attempts to identify whether or not words are being used outside of established statistical norms.
  • words that are identified to have occurrence frequencies beyond statistical norms are sent to a point in the operations, illustrated at step 335 where the word locations are identified.
  • the system (at step 340) identifies the words against a dictionary of words.
  • the dictionary may be a standard dictionary, or any other source of words associated with definitions and characterizations.
  • the system takes the words that are identified as dictionary words and identifies the Lemma of the words (at step 355). For example, the system modifies "running" to "run,” and then organizes the words hierarchically by the part of speech that the word belongs to (illustrated at step 390).
  • the system forwards the words to a point in the system's operations, illustrated at step 360, where the system performs a spell check on the words. If no possible spellings for the words are found, the system enables review of the words by a human operator (at block 375) so the human operator can allow the words. These words are most likely to be trade terms or names and as such would not be in a dictionary. They, however, may also be misspelled words, in which case the human operator can correct these misspellings. If the words were not dictionary words, but rather trade terms, the words are forwarded to the next step in the operations, from block 377 to block 390.
  • the system forwards the words to step 380 of the operations, where the proper spellings are linked to the improper spellings. This linking means that when the system searches for the proper spelling of a misspelled word, the search automatically references the misspelled occurrences. If at step 370, there were possible proper spellings, the words are passed on to step 380 to associate all of the possible proper spellings with their misspellings. After step 380, the system continues onto step 355. After step 390, the setup operation ends at step 395.
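The routing of non-dictionary words can be sketched as below. Python's standard `difflib.get_close_matches` is used here as a stand-in for the spell checker, which the source does not name; the function name `link_spellings`, the 0.8 similarity cutoff, and the sample word lists are assumptions.

```python
from difflib import get_close_matches


def link_spellings(words, dictionary, cutoff=0.8):
    """Split words into dictionary words, misspellings linked to their
    candidate proper spellings, and words left for a human operator."""
    dictionary_words = []
    links = {}          # proper spelling -> list of misspellings found
    needs_review = []   # no plausible spelling: trade terms, names, etc.
    for word in words:
        if word in dictionary:
            dictionary_words.append(word)
            continue
        candidates = get_close_matches(word, dictionary, n=3, cutoff=cutoff)
        if candidates:
            for proper in candidates:
                links.setdefault(proper, []).append(word)
        else:
            needs_review.append(word)
    return dictionary_words, links, needs_review


dictionary = ["contract", "payment", "invoice"]
known, links, needs_review = link_spellings(["contract", "paymnet", "zyxqor"], dictionary)
```

With the links in place, a search for a proper spelling can also look up every misspelling recorded under it.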
  • FIG. 4 is the basic flow for the query generation and the searching of the indexed acquired data set.
  • a LEI is input into the system at step 410.
  • the system takes the LEI and breaks it up into its individual concepts. Each concept is then used to create a synonym ring, which contains the concept and all of its possible synonyms.
  • the system performs a Boolean operation (e.g. AND) and adds a single concept from each synonym ring together and maintains all proximity characters between concepts in the LEI, such as maintaining (e.g. WITHIN) character ranges, paragraph ranges, etc. This creates every possible combination of the terms in the synonym rings.
  • the system would generate 125 queries.
  • the system places all of the Queries into an ordered list. This could also be a linked list or any means of organizing all of the queries in some logical order and sets the first query as the current query.
  • the system searches the data set using the current query.
  • all of the results that are responsive to the query are tagged with their appropriate issue code and are optionally highlighted. These items can also be removed as being possibly relevant items, however this yields mixed results, as it would speed up the remaining queries, but it would also leave items not necessarily entirely tagged or highlighted based on what could have been highlighted had subsequent queries been run on them.
  • the current query is checked to see if it is also the last query. If the query is not the last query, the system advances the query list and makes the subsequent query the current query and returns to step 450. If at step 470, the query is the last query, the system advances to step 480. At step 480, the system tags as irrelevant all of the items that were not returned as responsive by any of the queries. These items are then segregated from the possibly relevant items and the system operations end at step 490.
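The loop over the ordered query list and the final segregation step can be sketched as follows. For brevity each query is reduced to an issue code plus a list of terms that must all appear in a document; proximity operators are left out, and the function name `run_queries` and the sample data are assumptions.

```python
def run_queries(documents, queries):
    """documents: dict of id -> text.
    queries: ordered list of (issue_code, [terms]); a document is responsive
    to a query when every term appears in it (a plain AND).
    Returns (possibly_relevant, irrelevant)."""
    tags = {doc_id: [] for doc_id in documents}
    for issue_code, terms in queries:           # advance through the query list
        for doc_id, text in documents.items():  # search the data set
            words = set(text.lower().split())
            if all(t in words for t in terms) and issue_code not in tags[doc_id]:
                tags[doc_id].append(issue_code)  # tag with the issue code
    possibly_relevant = {d: codes for d, codes in tags.items() if codes}
    # Items not responsive to any query are tagged irrelevant and segregated.
    irrelevant = [d for d, codes in tags.items() if not codes]
    return possibly_relevant, irrelevant


docs = {
    1: "price fixing agreement",
    2: "holiday party schedule",
    3: "agreement on price",
}
queries = [("ISSUE-1", ["price", "agreement"]), ("ISSUE-1", ["cost", "agreement"])]
possibly_relevant, irrelevant = run_queries(docs, queries)
```

Every query is run against every document here, which matches the option of leaving responsive items in the set so that later queries can still tag them.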
  • the systems and methods used involve techniques for organization, review and analysis of content data (in paper or electronic form), such as a collection of documents.
  • the systems and methods described here utilize advanced searching, tagging, and highlighting techniques for identifying and isolating relevant content data with a high degree of confidence or certainty from large quantities of content data. It should be understood that any of the operations described below can also be used to first isolate the irrelevant data.
  • the system search techniques used here search the content data based on language "strings.”
  • the system uses Poisson-based mathematics to predict how much content data or how many documents would need to be reviewed before finding every relevant language string in the collection of content data. This is based on the principle that relevant language strings are distributed in content data in accordance with the theory of
  • the number of relevant strings in a given amount of content data or document is a function of the number of issues addressed, not a function of the size of the content data.
  • the number of relevant language strings does not exceed 50 per issue regardless of the size of the collection of content data. Because the system uses Poisson-based mathematics, the system retrieves content data with relevant language strings quickly and efficiently, thereby saving unnecessary review of irrelevant data by skilled humans. Review of irrelevant data without use of this system was inevitable because the data presented was organized by custodian and chronology.
  • the system and techniques here additionally use Poisson-based statistical sampling to prove that isolation of relevant content data is accomplished with a stated degree of certainty. In other words, the sampling shows that all content data with relevant language strings has been retrieved.
  • the system uses a defined set of rules and a Boolean search engine to find every occurrence of relevant language strings.
  • the system marks the relevant documents in a manner that is auditable. This way of tagging yields two benefits: 1) a user knows exactly why each document was tagged as relevant; and 2) a user can "undo" the tagging if a language string is re-classified as non-relevant at a later date.
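One way to picture auditable tagging is a ledger that records, for each document, the language strings that caused the tag. The sketch below is an assumption about how such a record could be kept; the class `TagLedger` and its method names are hypothetical and do not come from the source.

```python
from collections import defaultdict


class TagLedger:
    """Hypothetical audit trail: document id -> language strings that tagged it."""

    def __init__(self):
        self._hits = defaultdict(set)

    def tag(self, doc_id, language_string):
        """Record that language_string caused doc_id to be tagged relevant."""
        self._hits[doc_id].add(language_string)

    def why(self, doc_id):
        """Benefit 1: report exactly why a document was tagged."""
        return sorted(self._hits.get(doc_id, ()))

    def reclassify_non_relevant(self, language_string):
        """Benefit 2: undo tagging caused by a string later judged non-relevant.

        Returns the documents that no longer have any reason to be tagged.
        """
        untagged = []
        for doc_id in list(self._hits):
            self._hits[doc_id].discard(language_string)
            if not self._hits[doc_id]:
                del self._hits[doc_id]
                untagged.append(doc_id)
        return sorted(untagged)

    def relevant(self):
        return sorted(self._hits)


ledger = TagLedger()
ledger.tag("d1", "side letter")
ledger.tag("d1", "off the books")
ledger.tag("d2", "side letter")
reasons_d1 = ledger.why("d1")
undone = ledger.reclassify_non_relevant("side letter")
```

A document tagged by several strings stays tagged when only one of them is re-classified, which is why the ledger stores every reason rather than a single flag.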
  • documents are delivered to an assembly line of skilled humans to review documents in batches (the most common situation). Identifying relevant language strings in prior batches significantly decreases the time to review documents in future batches.
  • system architecture is described in the general context of computer-executable instructions, such as program modules.
  • program modules include routines, programs, objects, scripts, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the invention may be practiced with any number of computer system configurations including, but not limited to, distributed computing environments where tasks are performed by remote processing devices that are linked through a
  • program modules may be located in both local and remote memory storage devices.
  • the present invention may also be practiced in personal computers (PCs), hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • FIG. 5 is a schematic diagram of an exemplary computing environment in which the present system is implemented and operated.
  • the present system architecture may be implemented within a general purpose computing device 10A in the form of a
  • One or more computer programs may be included in the implementation of the system and method described in this application.
  • the computer programs may be stored in a machine-readable program storage device or medium and/or transmitted via a computer network or other transmission medium.
  • Computer 10A includes CPU 11A, program and data storage 12A, hard disk (and controller) 13A, removable media drive (and controller) 14A, network communications controller 15A (for communications through a wired or wireless network (LAN or WAN), see 15AA and 15BA), display (and controller) 16A and I/O controller 17A, all of which are connected through system bus 19A.
  • a hard disk e.g.
  • a number of program modules may be stored on the hard disk 13, magnetic disk, and optical disk, ROM or RAM, including an operating system, one or more application programs, other program modules, and program data.
  • a user may enter commands and information into the computing system 10A through input devices such as a keyboard (shown at 19A), mouse (shown at 19A) and pointing devices.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • serial port interface that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 21A or other type of display device is also connected to the system bus via an interface, such as a video adapter.
  • computers typically include other peripheral output devices (not shown), such as speakers and printers.
  • the program modules may be practiced using any computer languages including C, C++, assembly language, and the like.
  • a method for reviewing content data or a vast collection of documents to identify relevant documents from the collection can entail a) running a search of the collection of documents based on a plurality of query terms, the search retrieving a subset of responsive documents from the collection (step S21A), b) determining a corresponding probability of relevancy for each document in the responsive documents subset (step S23A), and c) removing from the responsive documents subset documents that do not reach a threshold probability of relevancy (step S25A).
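The removal step (step S25A) amounts to a threshold filter over per-document probabilities. The sketch below assumes the probabilities have already been computed; the function name `remove_below_threshold`, the sample scores, and the threshold value are hypothetical.

```python
def remove_below_threshold(probabilities, threshold):
    """Keep documents whose probability of relevancy reaches the threshold.

    probabilities -- mapping of document id to probability of relevancy
    Returns (kept, removed_ids).
    """
    kept = {doc_id: p for doc_id, p in probabilities.items() if p >= threshold}
    removed = sorted(set(probabilities) - set(kept))
    return kept, removed


# Hypothetical scores for three responsive documents.
scores = {"d1": 0.92, "d2": 0.35, "d3": 0.60}
kept, removed = remove_below_threshold(scores, threshold=0.60)
```

Here a document exactly at the threshold is treated as reaching it; whether the boundary is inclusive is an assumption, since the source only says documents that "do not reach" the threshold are removed.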
  • the search techniques discussed in this disclosure are preferably automated as much as possible. Therefore, the search is preferably applied through a search engine.
  • the search can include a concept search, and the concept search is applied through a concept search engine.
  • Such searches and other automated steps or actions can be coordinated through appropriate programming, as would be appreciated by one skilled in the art.
  • the probability of relevancy of a document can be scaled according to a measure of obscurity of the search terms found in the document.
  • the method can further comprise a) randomly selecting a predetermined amount of content data or a sample number of documents from the remaining content data found to be not relevant, and b) determining whether the randomly selected documents include additional relevant documents, and in addition, optionally, identifying one or more specific terms in the additional relevant documents that render the documents relevant, expanding the query terms with the specific terms, and re-running at least the search with the expanded query terms.
  • the randomly selected content data or documents include one or more additional relevant items of content data
  • the query terms can be expanded and the search run again with the expanded query terms.
  • the method additionally comprises comparing a ratio of the additional relevant documents and the randomly selected documents to a predetermined acceptance level, to determine whether to apply a refined set of query terms.
  • the method further comprises the step of selecting two or more search terms, identifying synonyms of the search terms, and forming the query terms based on the search terms and synonyms.
  • the method further comprises the step of identifying a correspondence between a sender and a recipient, in the responsive documents subset, automatically determining one or more additional documents which are in a thread of the correspondence, the additional documents not being in the responsive documents subset, and adding the additional documents to the responsive documents subset.
  • the term "correspondence" is used herein to refer to a written or electronic communication (for example, letter, memo, e-mail, text message, etc.) between a sender and a recipient, and optionally with copies going to one or more copy recipients.
  • the method further comprises the step of determining whether any of the documents in the responsive documents subset includes an attachment that is not in the responsive documents subset, and adding the attachment to the responsive documents subset.
  • the method further comprises the step of applying a statistical technique (for example, zero-defect testing) to determine whether remaining documents not in the responsive documents set meet a predetermined acceptance level.
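As an illustration of zero-defect testing, the sketch below computes how many randomly selected documents must be reviewed, with none found relevant, to support a stated confidence that relevant documents make up less than a given share of the remainder. It uses the standard binomial zero-acceptance calculation, which is an assumption here: the source does not give its formula. The function name and parameters are hypothetical.

```python
import math


def zero_defect_sample_size(max_relevant_rate, confidence):
    """Smallest n such that (1 - max_relevant_rate) ** n <= 1 - confidence.

    If the remainder really contained relevant documents at max_relevant_rate,
    a random sample of n documents would contain none of them with probability
    (1 - max_relevant_rate) ** n, treating draws as independent.
    """
    if not (0 < max_relevant_rate < 1 and 0 < confidence < 1):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_relevant_rate))


# 95% confidence that fewer than 1% of the remaining documents are relevant.
n_docs = zero_defect_sample_size(max_relevant_rate=0.01, confidence=0.95)
```

If even one relevant document turns up in the sample, the test fails and the acceptance level has not been demonstrated.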
  • the search includes (a) a Boolean search of the collection of documents based on the plurality of query terms, the Boolean search returning a first subset of responsive documents from the collection, and (b) a second search by applying a recall query based on the plurality of query terms to remaining ones of the collection of documents which were not returned by the Boolean search, the second search returning a second subset of responsive documents in the collection, and wherein the responsive documents subset is constituted by the first and second subsets.
  • the first Boolean search may apply a measurable precision query based on the plurality of query terms.
  • the method can optionally further include automatically tagging each document in the first subset with a precision tag, reviewing the document bearing the precision tag to determine whether the document is properly tagged with the precision tag, and determining whether to narrow the precision query and rerun the Boolean search with the narrowed query terms.
  • the method can optionally further comprise automatically tagging each document in the second subset with a recall tag, reviewing the document bearing the recall tag to determine whether the document is properly tagged with the recall tag, and determining whether to narrow the recall query and rerun the second search with the narrowed query terms.
  • the method can optionally further include reviewing the first and second subsets to determine whether to modify the query terms and rerun the Boolean search and second search with modified query terms.
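The two-stage search with precision and recall tags can be sketched as follows. The predicates stand in for the Boolean engine and the recall query; the function name `two_stage_search`, the sample documents, and both queries are hypothetical.

```python
def two_stage_search(collection, precision_query, recall_query):
    """Run a precision query, then a recall query over what it did not return.

    collection -- mapping of document id to text
    Returns a mapping of document id to "precision" or "recall"; the
    responsive documents subset is the union of both tagged groups.
    """
    first = {d for d, text in collection.items() if precision_query(text)}
    remaining = {d: text for d, text in collection.items() if d not in first}
    second = {d for d, text in remaining.items() if recall_query(text)}
    tags = {d: "precision" for d in first}
    tags.update({d: "recall" for d in second})
    return tags


docs = {
    "d1": "We discussed price fixing with the distributor",
    "d2": "New pricing sheet attached",
    "d3": "Holiday party invite",
}
# Hypothetical queries: a narrow AND query and a broader OR query.
precise = lambda t: "price" in t.lower() and "fixing" in t.lower()
broad = lambda t: "price" in t.lower() or "pricing" in t.lower()
tags = two_stage_search(docs, precise, broad)
```

Keeping the tag alongside each document lets a reviewer check precision-tagged and recall-tagged documents separately before deciding whether to narrow either query.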
  • a method for reviewing a collection of documents to identify relevant documents from the collection includes running a search of the collection of documents, based on a plurality of query terms, the search returning a subset of responsive documents in the collection (step S31A), automatically identifying a correspondence between a sender and a recipient, in the responsive documents subset (step S33A), automatically determining one or more additional documents which are in a thread of the correspondence, the additional documents not being in the responsive documents subset (step S35A), and adding the additional documents to the responsive documents subset (step S37A).
  • the method can further comprise determining for each document in the responsive documents subset, a corresponding probability of relevancy, and removing from the responsive documents subset documents that do not reach a threshold probability of relevancy.
  • the probability of relevancy of a document can be scaled according to a measure of obscurity of the search terms found in the document.
  • the system and method further comprises applying a statistical technique to determine whether a remaining subset of the collection of documents not in the responsive documents subset meets a predetermined acceptance level.
  • the method additionally comprises the steps of a) randomly selecting a predetermined number of documents from a remainder of the collection of documents not in the responsive documents subset, b) determining whether the randomly selected documents include additional relevant documents, c) identifying one or more specific terms in the additional relevant documents that render the documents relevant, d) expanding the query terms with the specific terms, and e) running the search again with the expanded query terms.
  • the method further includes the steps of a) randomly selecting a predetermined amount of content data or number of documents from a remainder of the collection of documents not in the responsive documents subset, b) determining whether the randomly selected documents include additional relevant documents, c) comparing a ratio of the additional relevant documents and the randomly selected documents to a predetermined acceptance level, and d) expanding the query terms and running the search with the expanded query terms, if the ratio does not meet the predetermined acceptance level.
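The sampling and ratio check can be sketched as below. It assumes that "does not meet the acceptance level" means the share of relevant documents found in the sample exceeds the acceptable level; that reading, the function names, and the sample figures are all hypothetical.

```python
import random


def sample_remainder(remainder_ids, sample_size, seed=None):
    """Randomly select a predetermined number of documents from the remainder."""
    rng = random.Random(seed)
    ids = sorted(remainder_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def needs_query_expansion(sampled_ids, is_relevant, acceptance_level):
    """Compare the ratio of relevant documents in the sample to the level.

    is_relevant -- callable standing in for the reviewer's judgment
    Returns True when the query terms should be expanded and the search rerun.
    """
    if not sampled_ids:
        return False
    found = sum(1 for doc_id in sampled_ids if is_relevant(doc_id))
    return found / len(sampled_ids) > acceptance_level


sample = sample_remainder(range(1000), sample_size=50, seed=1)
# Hypothetical review outcome: documents 0, 1 and 2 of a 100-document sample are relevant.
reviewed = list(range(100))
expand_strict = needs_query_expansion(reviewed, lambda d: d < 3, acceptance_level=0.01)
expand_loose = needs_query_expansion(reviewed, lambda d: d < 3, acceptance_level=0.05)
```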
  • the method further comprises the step of selecting two or more search terms, identifying synonyms of the search terms, and forming the query terms based on the search terms and synonyms.
  • the method additionally includes the step of determining whether any of the responsive content data or documents in the responsive documents subset includes an attachment that is not in the subset, and adding the attachment to the subset.
  • a method for reviewing a collection of documents to identify relevant documents from the collection can comprise running a search of the collection of documents, based on a plurality of query terms, the search returning a subset of responsive documents in the collection (step S41A), automatically determining whether any of the responsive documents in the responsive documents subset includes an attachment that is not in the subset (step S43A), and adding the attachment to the responsive documents subset (step S45A).
  • the method further comprises determining for each document in the responsive documents subset, a corresponding probability of relevancy, and removing from the responsive documents subset documents that do not reach a threshold probability of relevancy.
  • the probability of relevancy of a document is preferably scaled according to a measure of obscurity of the search terms found in the document.
  • the method additionally comprises applying a statistical technique to determine whether a remaining subset of the collection of documents not in the responsive documents subset meets a predetermined acceptance level.
  • the method further includes randomly selecting a predetermined number of documents from a remainder of the collection of documents not in the responsive documents subset, determining whether the randomly selected documents include additional relevant documents, identifying one or more specific terms in the additional responsive documents that render the documents relevant, expanding the query terms with the specific terms, and running the search again with the expanded query terms.
  • the method further includes selecting two or more search terms, identifying synonyms of the search terms, and forming the query terms based on the search terms and synonyms.
  • the method further comprises identifying a correspondence between a sender and a recipient, in the responsive documents subset, automatically determining one or more additional documents which are in a thread of the correspondence, the additional documents not being in the responsive documents subset, and adding the additional documents to the responsive documents subset.
  • a method for reviewing a collection of documents to identify relevant documents from the collection comprises running a search of the collection of documents, based on a plurality of query terms, the search returning a subset of responsive documents from the collection (step S51A), randomly selecting a predetermined number of documents from a remainder of the collection of documents not in the responsive documents subset (step S52A), determining whether the randomly selected documents include additional relevant documents (step S53A), identifying one or more specific terms in the additional responsive documents that render the documents relevant (step S54A), expanding the query terms with the specific terms (step S55A), and re-running the search with the expanded query terms (step S56A).
  • a method for reviewing a collection of documents to identify relevant documents from the collection can comprise specifying a set of tagging rules to extend query results to include attachments and email threads (step S61A), expanding search query terms based on synonyms (step S62A), running a precision Boolean search of the collection of documents, based on two or more search terms and returning a first subset of potentially relevant documents in the collection (step S63A), calculating the probability that the results of each Boolean query are relevant by multiplying the probability of relevancy of each search term, where those individual probabilities are determined using an algorithm constructed from the proportion of relevant synonyms for each search term (step S64A), applying a recall query based on the two or more search terms to run a second concept search of remaining ones of the collection of documents which were not returned by the first Boolean search, the second search returning a second subset of potentially relevant documents in the collection (step S65A), calculating the probability that each search result in the recall query is relevant to a
  • step S67A randomly selecting a predetermined number of documents from the remaining subset of the collection and determining whether the randomly selected documents include additional relevant documents (step S68A), if additional relevant documents are found (step S69A, yes), identifying the specific language that causes relevancy, and expanding that language into a set of queries (step S70A), constructing and running precision Boolean queries of the entire document collection above (step S71A).
  • a social networking approach can be taken to measure obscurity.
  • the following method is consistent with the procedure generally used in the legal field currently for constructing query lists: (i) a list of potential query terms (keywords) is developed by the attorney team; (ii) for each word, a corresponding list of synonyms is created using a thesaurus; (iii) a social network is drawn (using software) between all synonyms and keywords;
  • an obscurity factor is determined as the ratio between the number of ties at any word node and the greatest number of ties at any word node, or alternatively their respective z scores; and (vi) this obscurity factor is applied to the definitional probability calculated above.
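The ratio form of the obscurity factor can be sketched by treating keywords and synonyms as nodes and each keyword-synonym pair as a tie. The thesaurus contents and the function name `obscurity_factors` are hypothetical, and the z-score alternative mentioned above is not shown.

```python
from collections import defaultdict


def obscurity_factors(thesaurus):
    """Ratio of ties at each word node to the greatest number of ties at any node.

    thesaurus -- mapping of keyword to its list of synonyms
    """
    ties = defaultdict(set)
    for keyword, synonyms in thesaurus.items():
        for synonym in synonyms:
            if synonym != keyword:
                ties[keyword].add(synonym)  # ties are undirected
                ties[synonym].add(keyword)
    if not ties:
        return {}
    most = max(len(neighbors) for neighbors in ties.values())
    return {word: len(neighbors) / most for word, neighbors in ties.items()}


# Hypothetical keyword list with thesaurus synonyms.
factors = obscurity_factors({
    "bribe": ["payoff", "kickback", "gift"],
    "gift": ["present", "payoff"],
})
```

In this example "bribe" and "gift" each have three ties and score 1.0, while "kickback" and "present" have one tie each and score one third, marking them as the more obscure terms.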
  • the simplest complex queries consist of query terms separated by the Boolean operators AND and/or OR.
  • for query terms separated by an AND operator, the individual probabilities of each word in the query are multiplied together to yield the probability that the complex query will return responsive results.
  • for query terms separated by an OR operator, the probability of the query yielding relevant results is equal to the probability of the lowest ranked search term in the query string.
  • Query words strung together within quotation marks are typically treated as a single phrase in Boolean engines (i.e. they are treated as if the string is one word).
  • a document is returned as a result if and only if the entire phrase exists within the document.
  • the phrase is translated to its closest synonym and the probability of that word is assigned to the phrase.
  • a phrase generally has a defined part of speech (noun, verb, adjective, etc.)
  • when calculating probability one considers only the total number of possible definitions for that part of speech, thereby reducing the denominator of the equation and increasing the probability of a responsive result.
  • A and B are query terms and X is the number of words separating them in a document, which is usually a small number.
  • the purpose of this type of query, called a proximity query, is to define the terms in relation to one another. This increases the probability that the words will be used responsively. The probability that a proximity query will return responsive documents equals the probability that the highest-ranked query term in the query will be responsive.
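The three combination rules stated above (multiply for AND, take the lowest for OR, take the highest for a proximity query) can be written out directly. The per-term probabilities below are hypothetical inputs, and reading "lowest ranked" and "highest" as the minimum and maximum probability is an assumption.

```python
import math


def and_probability(term_probabilities):
    """AND: the individual term probabilities are multiplied together."""
    return math.prod(term_probabilities)


def or_probability(term_probabilities):
    """OR: the probability of the lowest ranked term in the query string."""
    return min(term_probabilities)


def proximity_probability(term_probabilities):
    """Proximity (A within X words of B): the probability of the highest term."""
    return max(term_probabilities)


# Hypothetical probabilities for two query terms A and B.
p_terms = [0.5, 0.4]
p_and = and_probability(p_terms)
p_or = or_probability(p_terms)
p_near = proximity_probability(p_terms)
```

With these inputs the AND query scores lowest and the proximity query highest, which matches the stated purpose of a proximity query: placing the terms near one another makes responsive usage more likely.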
  • Fig. 12 is a flow chart of the automated query builder feature of the present system and method.
  • This aspect includes operations whereby content data or documents are loaded into a database, illustrated by block 80A.
  • the content data or documents may be displayed on the user's screen (shown at 82A).
  • the user may use a computer mouse or other method to highlight the relevant text in the content data or document, as illustrated by reference numeral 84A.
  • the highlighted text is forwarded to the automatic query builder routine in the system (see block 86A).
  • the automatic query builder routine tallies the words between the highlighted terms. The system ensures that the highlighting is contiguous (see 90A).
  • the system connects all contiguous and noncontiguous highlights within a connector using the previously tallied word counts (see block 92A). If it is not, the system replaces the within connector for the next segment with an AND connector (see 94A). Following these operations, the user designates that the highlighting is complete (see 96). The highlighted section is passed to the automatic query builder, at 98A.
  • the automatic query builder identifies sequential nouns and designated phrases. These are treated as a single word for the purpose of the word count tally (indicated by reference numeral 100A). Following this operation, the text is run through the case phrase analyzer, where known phrases are identified and appropriately designated (see 102A). The language is run through the idiom checker (see 104A) where idioms are identified and excluded from the query construction process. After this operation, the text is run through a parts-of-speech tagger routine (106A). This routine identifies parts of speech and
  • the text is run through the system query builder rules (shown at 108A) and a query is constructed (see step 110A). Once a query is constructed, the system submits the query to the Boolean search engine at 112A.
  • Fig. 13 illustrates the way related content data is identified and ultimately tagged.
  • the system considers all content data in a thread of correspondence (for example, an e-mail) and includes it in the subset of relevant data.
  • the system also scans the content data in the thread and automatically identifies other data of interest, for example, contained in attachments and includes that as well.
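Extending the relevant subset to whole threads and their attachments can be sketched as follows. The `Document` record, its fields, and the function name `expand_responsive` are a hypothetical data model; the source does not say how threads or attachments are represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Document:
    doc_id: str
    thread_id: Optional[str] = None  # correspondence thread, if any
    attachment_ids: List[str] = field(default_factory=list)


def expand_responsive(collection: Dict[str, Document], responsive: Set[str]) -> Set[str]:
    """Add thread members and attachments to the responsive subset."""
    expanded = set(responsive)
    threads = {collection[d].thread_id for d in responsive if collection[d].thread_id}
    for doc in collection.values():
        if doc.thread_id in threads:  # every item in a responsive thread
            expanded.add(doc.doc_id)
    for doc_id in list(expanded):  # attachments of all items gathered so far
        expanded.update(a for a in collection[doc_id].attachment_ids if a in collection)
    return expanded


collection = {
    "m1": Document("m1", thread_id="t1"),
    "m2": Document("m2", thread_id="t1", attachment_ids=["a1"]),
    "a1": Document("a1"),
    "m3": Document("m3", thread_id="t2"),
}
expanded = expand_responsive(collection, {"m1"})
```

Attachments are gathered after thread expansion so that an attachment on a message pulled in only through its thread is also included.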
  • Fig. 14 illustrates a flow chart representing the steps used in a "smart highlighter" routine of the system.
  • This routine is launched (106A) allowing the user to select either a query tool (see 108A) or a bookmark tool (see 110A).
  • the user can use it to highlight any text of interest (see 112A).
  • the highlighted text is run through an automated query builder (see 114A) and the resulting query is submitted to the Boolean-based search engine (116A).
  • the user highlights any text of interest with the bookmark tool (see 118A).
  • the system takes the highlighted text and stores it on the user's computer machine in a database file (see 120A).
  • the system stores the document name, document URL, any notes added by the user, and folder names (tags) added by the user.
  • the system indexes the highlighted text (124A), the user notes (126A) and saves updates to the index file (130A).
  • the user may navigate the database via a user interface (132A) as the system allows a word search of the highlighted text, user notes, URL or folder name etc. (134A).
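The bookmark tool's storage and word search can be sketched with an in-memory store and a simple word index. This is a simplification under stated assumptions: the source describes a database file and an index file on the user's machine, whereas the hypothetical `BookmarkStore` below keeps everything in memory, and its field names are illustrative.

```python
import re
from collections import defaultdict


class BookmarkStore:
    """Hypothetical store for highlighted text with a word-level index."""

    def __init__(self):
        self._records = []
        self._index = defaultdict(set)  # word -> record positions

    def add(self, highlighted_text, document_name, document_url, notes="", folders=()):
        record_id = len(self._records)
        self._records.append({
            "highlighted_text": highlighted_text,
            "document_name": document_name,
            "document_url": document_url,
            "notes": notes,
            "folders": list(folders),
        })
        # Index the highlighted text, user notes, URL and folder names.
        searchable = " ".join([highlighted_text, notes, document_url, *folders])
        for word in re.findall(r"\w+", searchable.lower()):
            self._index[word].add(record_id)
        return record_id

    def search(self, word):
        """Word search across everything that was indexed."""
        return [self._records[i] for i in sorted(self._index.get(word.lower(), ()))]


store = BookmarkStore()
store.add(
    "indemnification clause",
    "contract.pdf",
    "https://example.com/contract",
    notes="check with counsel",
    folders=["Indemnity"],
)
by_note = store.search("counsel")
by_folder = store.search("indemnity")
```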

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention concerns a system and method for quickly finding all irrelevant data items in a data set based on a logical expression of an issue of interest to a user, which describes what the user is looking for. Performing this search considerably reduces the size of the data set that must be searched to find relevant documents. The system can also be used to identify terms that are used in a manner possibly inconsistent with their normally accepted meanings, so that code words can be discovered. In addition, the system can search for business terms that are not contained in dictionaries. In some embodiments, the system and method of the invention can use advanced automated search techniques, which include a highlighting capability, to determine subsets of relevant or irrelevant content data (in paper or electronic form). These techniques are advantageous for reviewing vast collections of content data or documents and ultimately identifying relevant data or documents from the collections. The advanced search techniques run on search terms, which isolate relevant or irrelevant content data responsive to the search terms. If the search is for relevant data, a probability of relevancy can be determined for a unit of content data or a document in the returned subset, to facilitate excluding a document from the subset if it does not reach a threshold probability of relevancy. Documents in a thread of a correspondence (for example, an e-mail) in the responsive documents subset can be added to the responsive documents subset.
In addition, an attachment to a document in the responsive documents subset can be added to the responsive documents subset. A statistical technique is applied to determine whether remaining documents in the collection meet a predetermined acceptance level.
PCT/US2010/059775 2009-12-09 2010-12-09 System and method for quickly determining a subset of irrelevant data from large data content WO2011072172A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US28516809P 2009-12-09 2009-12-09
US61/285,168 2009-12-09

Publications (1)

Publication Number Publication Date
WO2011072172A1 true WO2011072172A1 (fr) 2011-06-16

Family

ID=43629215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2010/059775 WO2011072172A1 (fr) 2009-12-09 2010-12-09 System and method for quickly determining a subset of irrelevant data from large data content

Country Status (2)

Country Link
US (1) US20110145269A1 (fr)
WO (1) WO2011072172A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11971937B2 (en) 2011-06-17 2024-04-30 Robert Osann, Jr. Internet search results annotation, filtering, and advertising with respect to search term elements

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
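JP5542398B2 (ja) * 2009-09-30 2014-07-09 株式会社日立製作所 Method, apparatus, and system for displaying root cause analysis results of failures
JP5552448B2 (ja) * 2011-01-28 2014-07-16 株式会社日立製作所 Search expression generation device, search system, and search expression generation method
CN102323942B (zh) * 2011-09-01 2013-04-10 北京中创信测科技股份有限公司 Statistical query method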
US9785628B2 (en) 2011-09-29 2017-10-10 Microsoft Technology Licensing, Llc System, method and computer-readable storage device for providing cloud-based shared vocabulary/typing history for efficient social communication
US8881007B2 (en) * 2011-10-17 2014-11-04 Xerox Corporation Method and system for visual cues to facilitate navigation through an ordered set of documents
US9405822B2 (en) * 2013-06-06 2016-08-02 Sheer Data, LLC Queries of a topic-based-source-specific search system
US10497042B2 (en) 2016-08-29 2019-12-03 BloomReach, Inc. Search ranking
US11341414B2 (en) * 2018-10-15 2022-05-24 Sas Institute Inc. Intelligent data curation
US11170017B2 (en) 2019-02-22 2021-11-09 Robert Michael DESSAU Method of facilitating queries of a topic-based-source-specific search system using entity mention filters and search tools
WO2021076862A1 (fr) * 2019-10-18 2021-04-22 Ul Llc Technologies for dynamically creating representations for regulations

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2571508A (en) 1948-09-01 1951-10-16 Rosenbaum Q B Kl Parking Co Machine for tier parking motor vehicles
US20070013483A1 (en) 2005-07-15 2007-01-18 Allflex U.S.A. Inc. Passive dynamic antenna tuning circuit for a radio frequency identification reader
US20070288445A1 (en) * 2006-06-07 2007-12-13 Digital Mandate Llc Methods for enhancing efficiency and cost effectiveness of first pass review of documents
US20080189273A1 (en) * 2006-06-07 2008-08-07 Digital Mandate, Llc System and method for utilizing advanced search and highlighting techniques for isolating subsets of relevant content data
EP1967972A1 (fr) * 2007-03-07 2008-09-10 The Boeing Company Procédés et systèmes pour rétroaction pertinente de recherche non obstructive
US20090032990A1 (en) 2005-07-04 2009-02-05 Maillefer S.A. Extrusion Method and Extrusion Apparatus

Family Cites Families (101)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5107419A (en) * 1987-12-23 1992-04-21 International Business Machines Corporation Method of assigning retention and deletion criteria to electronic documents stored in an interactive information handling system
US5278980A (en) * 1991-08-16 1994-01-11 Xerox Corporation Iterative technique for phrase query formation and an information retrieval system employing same
US5535381A (en) * 1993-07-22 1996-07-09 Data General Corporation Apparatus and method for copying and restoring disk files
US5617566A (en) * 1993-12-10 1997-04-01 Cheyenne Advanced Technology Ltd. File portion logging and arching by means of an auxilary database
JP3377290B2 (ja) * 1994-04-27 2003-02-17 シャープ株式会社 Machine translation device with idiom processing function
US5535121A (en) * 1994-06-01 1996-07-09 Mitsubishi Electric Research Laboratories, Inc. System for correcting auxiliary verb sequences
US5717913A (en) * 1995-01-03 1998-02-10 University Of Central Florida Method for detecting and extracting text data using database schemas
US7069451B1 (en) * 1995-02-13 2006-06-27 Intertrust Technologies Corp. Systems and methods for secure transaction management and electronic rights protection
US5742807A (en) * 1995-05-31 1998-04-21 Xerox Corporation Indexing system using one-way hash for document service
US5778395A (en) * 1995-10-23 1998-07-07 Stac, Inc. System for backing up files from disk volumes on multiple nodes of a computer network
US5732265A (en) * 1995-11-02 1998-03-24 Microsoft Corporation Storage optimizing encoder and method
US5926811A (en) * 1996-03-15 1999-07-20 Lexis-Nexis Statistical thesaurus, method of forming same, and use thereof in query expansion in automated text searching
US20030093790A1 (en) * 2000-03-28 2003-05-15 Logan James D. Audio and video program recording, editing and playback systems using metadata
US6182029B1 (en) * 1996-10-28 2001-01-30 The Trustees Of Columbia University In The City Of New York System and method for language extraction and encoding utilizing the parsing of text data in accordance with domain parameters
EP0972254A1 (fr) * 1997-04-01 2000-01-19 Yeong Kuang Oon Didactic and content-oriented word processing method with an incrementally modified belief system
US6023710A (en) * 1997-12-23 2000-02-08 Microsoft Corporation System and method for long-term administration of archival storage
US6047294A (en) * 1998-03-31 2000-04-04 Emc Corp Logical restore from a physical backup in a computer storage system
DE69916272D1 (de) * 1998-06-08 2004-05-13 Kcsl Inc Method and process for finding relevant documents in a database
US6216123B1 (en) * 1998-06-24 2001-04-10 Novell, Inc. Method and system for rapid retrieval in a full text indexing system
US6256633B1 (en) * 1998-06-25 2001-07-03 U.S. Philips Corporation Context-based and user-profile driven information retrieval
US6199081B1 (en) * 1998-06-30 2001-03-06 Microsoft Corporation Automatic tagging of documents and exclusion by content
US6226630B1 (en) * 1998-07-22 2001-05-01 Compaq Computer Corporation Method and apparatus for filtering incoming information using a search engine and stored queries defining user folders
US6240409B1 (en) * 1998-07-31 2001-05-29 The Regents Of The University Of California Method and apparatus for detecting and summarizing document similarity within large document sets
US6389403B1 (en) * 1998-08-13 2002-05-14 International Business Machines Corporation Method and apparatus for uniquely identifying a customer purchase in an electronic distribution system
US6611812B2 (en) * 1998-08-13 2003-08-26 International Business Machines Corporation Secure electronic content distribution on CDS and DVDs
US7228437B2 (en) * 1998-08-13 2007-06-05 International Business Machines Corporation Method and system for securing local database file of local content stored on end-user system
US6243713B1 (en) * 1998-08-24 2001-06-05 Excalibur Technologies Corp. Multimedia document retrieval by application of multimedia queries to a unified index of multimedia data for a plurality of multimedia data types
US6269382B1 (en) * 1998-08-31 2001-07-31 Microsoft Corporation Systems and methods for migration and recall of data from local and remote storage
US6226759B1 (en) * 1998-09-28 2001-05-01 International Business Machines Corporation Method and apparatus for immediate data backup by duplicating pointers and freezing pointer/data counterparts
US6189002B1 (en) * 1998-12-14 2001-02-13 Dolphin Search Process and system for retrieval of documents using context-relevant semantic profiles
US6199067B1 (en) * 1999-01-20 2001-03-06 Mightiest Logicon Unisearch, Inc. System and method for generating personalized user profiles and for utilizing the generated user profiles to perform adaptive internet searches
US20020019814A1 (en) * 2001-03-01 2002-02-14 Krishnamurthy Ganesan Specifying rights in a digital rights license according to events
US6493711B1 (en) * 1999-05-05 2002-12-10 H5 Technologies, Inc. Wide-spectrum information search engine
US6393389B1 (en) * 1999-09-23 2002-05-21 Xerox Corporation Using ranked translation choices to obtain sequences indicating meaning of multi-token expressions
US7213005B2 (en) * 1999-12-09 2007-05-01 International Business Machines Corporation Digital content distribution using web broadcasting services
US6915435B1 (en) * 2000-02-09 2005-07-05 Sun Microsystems, Inc. Method and system for managing information retention
US7412462B2 (en) * 2000-02-18 2008-08-12 Burnside Acquisition, Llc Data repository and method for promoting network storage of data
US6421767B1 (en) * 2000-02-23 2002-07-16 Storage Technology Corporation Method and apparatus for managing a storage system using snapshot copy operations with snap groups
US6859800B1 (en) * 2000-04-26 2005-02-22 Global Information Research And Technologies Llc System for fulfilling an information need
US6636848B1 (en) * 2000-05-31 2003-10-21 International Business Machines Corporation Information search using knowledge agents
EP1314290B1 (en) * 2000-08-31 2006-09-27 Ontrack Data International, Inc. Data management system and corresponding method
US6678679B1 (en) * 2000-10-10 2004-01-13 Science Applications International Corporation Method and system for facilitating the refinement of data queries
US6804662B1 (en) * 2000-10-27 2004-10-12 Plumtree Software, Inc. Method and apparatus for query and analysis
US20020156756A1 (en) * 2000-12-06 2002-10-24 Biosentients, Inc. Intelligent molecular object data structure and method for application in heterogeneous data environments with high data density and dynamic application needs
US6751628B2 (en) * 2001-01-11 2004-06-15 Dolphin Search Process and system for sparse vector and matrix representation of document indexing and retrieval
US6745197B2 (en) * 2001-03-19 2004-06-01 Preston Gates Ellis Llp System and method for efficiently processing messages stored in multiple message stores
US7174368B2 (en) * 2001-03-27 2007-02-06 Xante Corporation Encrypted e-mail reader and responder system, method, and computer program product
US6976016B2 (en) * 2001-04-02 2005-12-13 Vima Technologies, Inc. Maximizing expected generalization for learning complex query concepts
US7047386B1 (en) * 2001-05-31 2006-05-16 Oracle International Corporation Dynamic partitioning of a reusable resource
US6996580B2 (en) * 2001-06-22 2006-02-07 International Business Machines Corporation System and method for granular control of message logging
US7188085B2 (en) * 2001-07-20 2007-03-06 International Business Machines Corporation Method and system for delivering encrypted content with associated geographical-based advertisements
US7793326B2 (en) * 2001-08-03 2010-09-07 Comcast Ip Holdings I, Llc Video and digital multimedia aggregator
US6778979B2 (en) * 2001-08-13 2004-08-17 Xerox Corporation System for automatically generating queries
US7284191B2 (en) * 2001-08-13 2007-10-16 Xerox Corporation Meta-document management system with document identifiers
AUPR797501A0 (en) * 2001-09-28 2001-10-25 BlastMedia Pty Limited A method of displaying content
US7363425B2 (en) * 2001-12-28 2008-04-22 Hewlett-Packard Development Company, L.P. System and method for securing drive access to media based on medium identification numbers
US20030126247A1 (en) * 2002-01-02 2003-07-03 Exanet Ltd. Apparatus and method for file backup using multiple backup devices
WO2003060771A1 (en) * 2002-01-14 2003-07-24 Jerzy Lewak Identifier vocabulary data access method and system
US7134020B2 (en) * 2002-01-31 2006-11-07 Peraogulne Corp. System and method for securely duplicating digital documents
US7693830B2 (en) * 2005-08-10 2010-04-06 Google Inc. Programmable search engine
US7028967B2 (en) * 2002-05-01 2006-04-18 Xerxes Corporation Tank retaining system
GB2390445B (en) * 2002-07-02 2005-08-31 Hewlett Packard Co Improvements in and relating to document storage
US6941297B2 (en) * 2002-07-31 2005-09-06 International Business Machines Corporation Automatic query refinement
US7523505B2 (en) * 2002-08-16 2009-04-21 Hx Technologies, Inc. Methods and systems for managing distributed digital medical data
US20040064447A1 (en) * 2002-09-27 2004-04-01 Simske Steven J. System and method for management of synonymic searching
US7188173B2 (en) * 2002-09-30 2007-03-06 Intel Corporation Method and apparatus to enable efficient processing and transmission of network communications
US6920523B2 (en) * 2002-10-07 2005-07-19 Infineon Technologies Ag Bank address mapping according to bank retention time in dynamic random access memories
US20040143609A1 (en) * 2003-01-17 2004-07-22 Gardner Daniel John System and method for data extraction in a non-native environment
US7478096B2 (en) * 2003-02-26 2009-01-13 Burnside Acquisition, Llc History preservation in a computer storage system
JP4265245B2 (en) * 2003-03-17 2009-05-20 Hitachi, Ltd. Computer system
US20050097081A1 (en) * 2003-10-31 2005-05-05 Hewlett-Packard Development Company, L.P. Apparatus and methods for compiling digital communications
US7536368B2 (en) * 2003-11-26 2009-05-19 Invention Machine Corporation Method for problem formulation and for obtaining solutions from a database
US7412437B2 (en) * 2003-12-29 2008-08-12 International Business Machines Corporation System and method for searching and retrieving related messages
US7249251B2 (en) * 2004-01-21 2007-07-24 Emc Corporation Methods and apparatus for secure modification of a retention period for data in a storage system
US20060074980A1 (en) * 2004-09-29 2006-04-06 Sarkar Pte. Ltd. System for semantically disambiguating text information
EP1825395A4 (en) * 2004-10-25 2010-07-07 Yuanhua Tang Full text query and search systems and methods of use
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
US7640488B2 (en) * 2004-12-04 2009-12-29 International Business Machines Corporation System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages
TWI314271B (en) * 2005-01-27 2009-09-01 Delta Electronics Inc Vocabulary generating apparatus and method thereof and speech recognition system with the vocabulary generating apparatus
WO2006110684A2 (en) * 2005-04-11 2006-10-19 Textdigger, Inc. System and method for searching for a query
US20070022134A1 (en) * 2005-07-22 2007-01-25 Microsoft Corporation Cross-language related keyword suggestion
EP1910949A4 (en) * 2005-07-29 2012-05-30 Cataphora Inc Improved method and apparatus for sociological data analysis
US7526478B2 (en) * 2005-08-03 2009-04-28 Novell, Inc. System and method of searching for organizing and displaying search results
US7487146B2 (en) * 2005-08-03 2009-02-03 Novell, Inc. System and method of searching for providing dynamic search results with temporary visual display
US7707146B2 (en) * 2005-08-03 2010-04-27 Novell, Inc. System and method of searching for providing clue-based context searching
US7844599B2 (en) * 2005-08-24 2010-11-30 Yahoo! Inc. Biasing queries to determine suggested queries
US7747639B2 (en) * 2005-08-24 2010-06-29 Yahoo! Inc. Alternative search query prediction
US20070061335A1 (en) * 2005-09-14 2007-03-15 Jorey Ramer Multimodal search query processing
US20080215623A1 (en) * 2005-09-14 2008-09-04 Jorey Ramer Mobile communication facility usage and social network creation
US7730081B2 (en) * 2005-10-18 2010-06-01 Microsoft Corporation Searching based on messages
US7650341B1 (en) * 2005-12-23 2010-01-19 Hewlett-Packard Development Company, L.P. Data backup/recovery
CN101000610B (en) * 2006-01-11 2010-09-29 Hong Fu Jin Precision Industry (Shenzhen) Co., Ltd. Distributed file storage system and method
US7478113B1 (en) * 2006-04-13 2009-01-13 Symantec Operating Corporation Boundaries
BRPI0713114A2 (en) * 2006-05-19 2012-04-17 My Virtual Model Inc Simulation-assisted search
US8401841B2 (en) * 2006-08-31 2013-03-19 Orcatec Llc Retrieval of documents using language models
US8010534B2 (en) * 2006-08-31 2011-08-30 Orcatec Llc Identifying related objects using quantum clustering
US8442972B2 (en) * 2006-10-11 2013-05-14 Collarity, Inc. Negative associations for search results ranking and refinement
US10031938B2 (en) * 2006-12-04 2018-07-24 International Business Machines Corporation Determining Boolean logic and operator precedence of query conditions
US20080222513A1 (en) * 2007-03-07 2008-09-11 Altep, Inc. Method and System for Rules-Based Tag Management in a Document Review System
US8396838B2 (en) * 2007-10-17 2013-03-12 Commvault Systems, Inc. Legal compliance, electronic discovery and electronic document handling of online and offline copies of data
US9063923B2 (en) * 2009-03-18 2015-06-23 Iqintell, Inc. Method for identifying the integrity of information

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2571508A (en) 1948-09-01 1951-10-16 Rosenbaum Q B Kl Parking Co Machine for tier parking motor vehicles
US20090032990A1 (en) 2005-07-04 2009-02-05 Maillefer S.A. Extrusion Method and Extrusion Apparatus
US20070013483A1 (en) 2005-07-15 2007-01-18 Allflex U.S.A. Inc. Passive dynamic antenna tuning circuit for a radio frequency identification reader
US20070288445A1 (en) * 2006-06-07 2007-12-13 Digital Mandate Llc Methods for enhancing efficiency and cost effectiveness of first pass review of documents
WO2007146107A2 (en) 2006-06-07 2007-12-21 Digital Mandate Llc Methods for enhancing efficiency and cost effectiveness of first pass review of documents
US20080189273A1 (en) * 2006-06-07 2008-08-07 Digital Mandate, Llc System and method for utilizing advanced search and highlighting techniques for isolating subsets of relevant content data
EP1967972A1 (en) * 2007-03-07 2008-09-10 The Boeing Company Methods and systems for unobtrusive search relevance feedback
WO2009100081A1 (en) 2008-02-04 2009-08-13 Digital Mandate Llc System and method for utilizing advanced search and highlighting techniques for isolating subsets of relevant content data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HERBERT L. ROITBLAT, DOCUMENT RETRIEVAL, 2005
HERBERT L. ROITBLAT, ELECTRONIC DATA ARE INCREASINGLY IMPORTANT TO SUCCESSFUL LITIGATION, November 2004 (2004-11-01)
THE SEDONA PRINCIPLES: BEST PRACTICES RECOMMENDATIONS & PRINCIPLES FOR ADDRESSING ELECTRONIC DOCUMENT PRODUCTION, July 2005 (2005-07-01)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11971937B2 (en) 2011-06-17 2024-04-30 Robert Osann, Jr. Internet search results annotation, filtering, and advertising with respect to search term elements

Also Published As

Publication number Publication date
US20110145269A1 (en) 2011-06-16

Similar Documents

Publication Publication Date Title
US20080189273A1 (en) System and method for utilizing advanced search and highlighting techniques for isolating subsets of relevant content data
US20110145269A1 (en) System and method for quickly determining a subset of irrelevant data from large data content
US10795922B2 (en) Authorship enhanced corpus ingestion for natural language processing
US9558264B2 (en) Identifying and displaying relationships between candidate answers
US8150827B2 (en) Methods for enhancing efficiency and cost effectiveness of first pass review of documents
WO2019091026A1 (en) Method for rapid document search in a knowledge base, application server, and computer-readable storage medium
US20100198802A1 (en) System and method for optimizing search objects submitted to a data resource
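JP4701292B2 (en) Computer system for creating a term dictionary from named entities or technical terms contained in text data, and method and computer program therefor
KR101109236B1 (en) Related term suggestion for multi-sense query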
US8051080B2 (en) Contextual ranking of keywords using click data
US9715531B2 (en) Weighting search criteria based on similarities to an ingested corpus in a question and answer (QA) system
US8965894B2 (en) Automated web page classification
US20090070322A1 (en) Browsing knowledge on the basis of semantic relations
KR101524889B1 (en) Identification of semantic relationships within reported speech
US20160292153A1 (en) Identification of examples in documents
US20120246100A1 (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
WO2014100459A2 (en) Systems and methods for using non-textual information in analyzing patent matters
US20090112845A1 (en) System and method for language sensitive contextual searching
CN113544689A (en) Generating and providing additional content for source perspectives of a document
Trabelsi et al. Bridging folksonomies and domain ontologies: Getting out non-taxonomic relations
US8600972B2 (en) Systems and methods for document searching
Joshi et al. Auto-grouping emails for faster e-discovery
WO2009035871A1 (en) Browsing knowledge on the basis of semantic relations
US9507855B2 (en) System and method for searching index content data using multiple proximity keyword searches
TW201822031A (en) Method for building a chart index from text information and computer program product thereof

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 10800805; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 EP: PCT application non-entry in European phase (Ref document number: 10800805; Country of ref document: EP; Kind code of ref document: A1)