WO2023244505A1 - Method for filtering search results based on search terms in context - Google Patents

Method for filtering search results based on search terms in context

Info

Publication number
WO2023244505A1
Authority
WO
WIPO (PCT)
Prior art keywords
search term
search
results
documents
result
Application number
PCT/US2023/024931
Other languages
French (fr)
Inventor
Michael Alistair WILL
Neil Andrew BRADLEY
Original Assignee
Esi Laboratory, Llc
Application filed by Esi Laboratory, Llc
Publication of WO2023244505A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/101 Collaborative creation, e.g. joint development of products or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Definitions

  • the invention is related to a search platform, such as a document review platform, in which a plurality of documents may be reviewed to determine if individual documents are relevant to a search query.
  • Second Level Responsive Review (which is often repeated multiple times) is typically conducted by senior associates.
  • Technology-Assisted Review is also occasionally referred to as Artificial Intelligence or Predictive Coding based review.
  • Responsive Search Terms are Proceeding-specific and are designed to identify Responsive Documents.
  • Responsive Search Terms are developed collaboratively by attorneys and their clients and may also involve input from third parties.
  • Confidentiality Search Terms are typically Proceeding- or Individual-specific and are designed to identify Confidential Documents.
  • Confidentiality Search Terms may originate as a result of legislation (such as, e.g., the California Consumer Privacy Act, General Data Protection Regulation, Health Insurance Portability and Accountability Act, etc.) and/or as a result of collaboration between law firms and their clients.
  • search terms in isolation will reveal little (if anything) about a document’s categorization as a true positive document or false positive document as they provide insufficient textual context to make such a determination.
  • a computer-based method for filtering search results. The method includes first identifying a plurality of documents for review and then identifying an initial plurality of search term results potentially relevant to a search query, each search term result of the initial plurality containing an occurrence of a primary search term of a plurality of primary search terms. Each search term result of the plurality of search term results is drawn from one of the plurality of documents for review, and each search term result of the plurality of search term results is smaller than the one of the plurality of documents from which it is drawn.
  • the method then proceeds to group search term results of the initial plurality of search term results that are identical to each other to define a plurality of groups of search term results.
  • a first group of search term results of the plurality of groups of search term results comprises search term results different than those of a second group of search term results of the plurality of groups of search term results.
  • upon receiving an indication from a user that the representative search term result is to be removed, the method then removes all search term results of the first group from the initial plurality of search term results to define a modified plurality of search term results.
  • each search term result is either a sentence containing the corresponding search term or is the corresponding search term combined with a previously defined number of leading or following characters.
  • upon receiving an indication from the user that the representative search term result is ambiguous, the method displays a document component containing the search term result to the user.
  • the component is larger than the corresponding search term result but smaller than the document from which the search term result is drawn.
  • the method proceeds with removing the corresponding representative search term result from the at least one group of search term results and defining a different search term result from the group of search term results as the representative search term result for display to the user.
  • multiple search term results are drawn from a single document of the plurality of documents for review, such that the document is defined as potentially relevant to the search query so long as any single search term result
  • the method further includes displaying a representative search term result for each group of the plurality of groups of search term results and receiving an indication from the user that the corresponding representative search term result is relevant to the search query or that the corresponding representative search term result is to be removed.
  • the method further includes displaying a representative search term result for each group of the plurality of groups of search term results and receiving an indication from the user that the corresponding representative search term result is relevant to the search query, that the corresponding representative search term result is to be removed, or that the user cannot make a determination based on the representative search term result.
  • the method further comprises identifying at least one form paragraph prior to identifying the initial plurality of search term results and preemptively indicating to a user that a particular representative search term result presented for evaluation is drawn from the at least one form paragraph.
  • the method further comprises displaying each potentially relevant document to the user for further review and receiving an indication from the user that the corresponding document is either relevant or irrelevant.
  • upon defining a group of potentially relevant documents, the method comprises identifying at least one document of the group of
  • the further review identifies the document as at least partially privileged or confidential and implements a redaction process.
  • names or contact information are extracted from a load file accompanying batches of documents or the names or contact information are extracted from body text using Regular Expressions, Email Parsing Libraries, Natural Language Processing (NLP), Machine Learning, or third-party APIs.
  • the method further includes processing the documents of the plurality of documents for review prior to identifying the initial plurality of search term results. Processing includes lemmatization of the corresponding documents, and at least one primary search term is a lemma.
  • the method further includes processing the documents of the plurality of documents for review prior to identifying the initial plurality of search term results, wherein processing includes normalization, sentence tokenization, or removal of stop words.
  • the identification of the initial plurality of search term results potentially responsive to the search query is based on a Boolean comparison of each of the plurality of primary search terms to each of the plurality of documents.
  • Figures 1A and 1B schematically illustrate an exemplary review process in accordance with this disclosure compared with a traditional review process.
  • Figure 2 is a flowchart illustrating an exemplary computer-based method of filtering search results in accordance with this disclosure.
  • Figures 3A-3C illustrate an exemplary document processed in accordance with the exemplary method of FIG. 2.
  • Figures 4A and 4B illustrate individual search term results in accordance with the exemplary method of FIG. 2.
  • Figure 5 illustrates the classification of groups of documents based on exemplary search term results in the context of the exemplary method of FIG. 2.
  • Figures 7A-7B illustrate the presentation of a full document to a user in the context of the exemplary method of FIG. 2.
  • Figure 9 illustrates the identification of domains associated with email addresses extracted in FIG. 8.
  • Figure 11 illustrates the population of the presentation of FIG. 10 with additional context.
  • Figure 13 illustrates lemmatization of document text in accordance with this disclosure.
  • Figure 14 illustrates the expansion of search terms for use in generating search results in accordance with this disclosure.
  • Figures 16A-D illustrate a process for classifying email legends in accordance with this disclosure.
  • Figures 17A-17B illustrate the automatic classification of search term results originating from legends in accordance with this disclosure.
  • Figures 18A-18B illustrate an interface for bulk classification of documents in accordance with this disclosure.
  • a system and method described herein allow a user to quickly and efficiently filter search results and/or documents by viewing grouped search results in context. As discussed in more detail below, such grouping may follow a deduplication process, such that the grouped search results have already been deduplicated. By grouping search results in situations where context is related or identical, users may be able to quickly eliminate false positives from bulk search results, thereby drastically
  • a plurality of documents is identified (200) for review. This may be by loading a database or receiving a batch of documents for review. In some cases, such a batch of documents may be accompanied by a file manifest, or a load file, containing extracted metadata. Alternatively, the plurality of documents may be initially created or compiled by scraping underlying sources of documents, such as email accounts, or by receiving access to files. Further, the documents may be retrieved from hard copies, and as such may be the result of a scanning process.
  • because a search term result typically includes context, it would be longer than the search term contained therein.
  • a search term result may then be, for example, a sentence containing the corresponding search term.
  • the search term result may be a corresponding search term combined with a previously defined number of leading and/or following characters.
  • the search term result may be the search term combined with 80 preceding and 80 following characters.
  • SUBSTITUTE SHEET (RULE 26) sentences appear across a large number of documents. If such a fragment does not indicate a positive search result, then any identical fragment would similarly not indicate a positive search term result. As such, by grouping, or deduplicating, such identical search term results, the results may be reviewed more efficiently.
  • the user may then indicate that the representative search result is to be removed from the search results, or that it should be retained in the results.
  • the user may indicate that the representative search term result is to be removed because, for example, it represents a false positive.
  • the method may then receive an indication (290) from the user that the representative search term result is to be removed or retained.
  • the method then removes (300) all search term results of the corresponding group from the initial plurality of search term results to define a modified plurality of search term results (310).
  • the method may instead receive an indication (at 290) from the user that the representative search term result is to be retained, and in such a scenario, the
  • FIG. 4A shows an example of a term that was included in the plurality of search term results potentially relevant to a search query because it included the term “attorney work product,” which implies that the document may be privileged.
  • the search term result then includes the context, which appears to assert privilege.
  • a user may indicate that the result is a true positive result by, for example, ticking a box that indicates that the result should be “pinned” and thereby retained in the modified plurality of search results.
  • the user indicates a true positive response by clicking a check icon indicating that the result should be retained in the modified plurality of search results.
  • FIG. 4B shows an example of a term that was included in the plurality of search term results potentially relevant to a search query because it included the term “confidential and privilege.”
  • this language was part of boilerplate, such as a legend, and merely indicated that documents may contain privileged information. As such, this language alone does not indicate relevance to the search query, and the user may indicate that the representative search term result is therefore a false positive and can be removed by ticking the box corresponding to the trash icon, or selecting an “X” icon.
  • the result in FIG. 4A may be shaded in, e.g., green, while the result in FIG. 4B may be shaded in, e.g., red.
  • the representative search term result shown in FIG. 4B was part of a group that included 14 identical search term results. Accordingly, upon receiving the indication from the user (at 290), the method removes all 14 results (at 300).
  • This process may then be repeated for each group of the plurality of groups until all groups have been evaluated. For any group for which the method receives an indication from a user that the representative search term result is to be removed, all search term results of the corresponding group are then removed. Accordingly, once the method receives the indication (at 290), a representative search term result of a different group may be presented to a user (at 270). In some embodiments, multiple such representative search term results are presented simultaneously, such that a user can continue down a list. This is shown in interface examples discussed in more detail below.
  • Figure 5 illustrates the classification of groups of documents based on exemplary search term results in the context of the exemplary method of FIG. 2. As shown, several groups are presented, and for each group, the user can provide an indication that the representative search term should be removed or retained (at 290). The indication for each line item, or search result, may then be shown on the right, while the results themselves may be shaded in e.g., green, red, or a neutral color to further visually represent the result.
  • Figures 6A-6B illustrate the presentation of a document component, or snippet, to a user in the context of the exemplary method of FIG. 2.
  • the user may choose not to classify a representative search result, thereby leaving a box unchecked.
  • the user may choose to proactively indicate that a representative search result is ambiguous (320) and that further context is required in order to classify the document.
  • the method may then present (330) a document component containing the search term result to the user.
  • the component is typically larger than the corresponding search term result, but smaller than the document from which the search term result is drawn.
  • Such a component example is illustrated in FIGS. 3A-3C.
  • term result is removed from the corresponding group of search term results (350). However, because additional context was required for the user to make the determination, the remaining search term results of the group remain in the modified plurality of search term results until they are separately evaluated.
  • a different representative search term result is typically defined for the corresponding group (at 260) such that the replacement representative result can be presented to a user (at 270) for evaluation and only the specific search term result removed (at 350) is removed from the modified plurality defined (at 310). The process then repeats until all such results are reviewed. Because the user had previously indicated that the search term result did not provide sufficient context, in some such scenarios, the user would continue to review all search term results of the corresponding group consecutively.
  • such components may be presented to the user as a group for evaluation in a similar manner to the review of the search term results themselves.
  • Figures 7A-7B illustrate the presentation of a full document to a user in the context of the exemplary method of FIG. 2. If, after reviewing a component, the user still cannot make a determination with respect to the relevant search term result, the method may proceed to present the corresponding full document to the user for review. As in the case of the component review discussed above, if a user indicates that a full document is to be removed from the search results, such a decision would remove the corresponding search term result from a group, but would not impact a classification associated with the group as a whole.
  • the user may simply indicate that the document is ambiguous or otherwise quarantine the document rather than classifying it. In this way, any full document review can be deferred to later in the process.
  • search term results may be initially reviewed independently, outside of the larger context of the documents from which they are drawn. If the search term results without any broader context are sufficient to determine that a particular result should not be included in the modified plurality of search results, it can thereby be removed.
  • because search results are reviewed by users in the context of search term results, rather than as complete documents, multiple search term results may be drawn from a single document.
  • any single true positive search term result would be sufficient for inclusion of the corresponding document as a potentially relevant document, and any document is defined as potentially relevant to the search query so long as any single search term result associated with the corresponding document remains in the modified plurality of search term results. For example, if a document included three distinct sentences or clauses that appeared as search term results and one of them was indicated by the user for removal, the remaining two search term results drawn from that document would remain in the modified plurality of search term results, and may thereby lead to inclusion of the corresponding document as potentially relevant.
  • the complete process is repeated in sequence for different types of searches, as noted above. Accordingly, once a set of documents is defined as potentially relevant, the method may utilize that set as the identified plurality of documents for review (at 200) for a second search.
  • the first search may be for responsiveness while the second search may be for privilege.
  • Figure 8 illustrates the extraction of email addresses from the body of a document in accordance with this disclosure.
  • Figure 9 illustrates the identification of domains associated with email addresses extracted in FIG. 8.
  • Figure 10 illustrates the presentation of domains identified in FIG. 9 to a user.
  • Figure 11 illustrates the population of the presentation of FIG. 10 with additional context.
  • a privilege check is implemented independent of, or in addition to, a main search.
  • the method may first implement a first pass of filtering the search results to identify a plurality of potentially relevant documents. The method may then proceed to search documents for secondary search terms.
  • the secondary search terms may be different than the primary search terms, and in the example discussed herein, the secondary search terms may comprise names or contact information of relevant parties.
  • where a document is identified as containing a secondary search term, such a document may be retained for further review in order to determine if the document is privileged and therefore should not be disclosed.
  • the further review may then identify a document that is at least partially privileged or confidential.
  • a document may then be removed from the defined set of potentially relevant documents.
  • the method may proceed to implement a redaction process while retaining the corresponding document.
  • This may be, for example, by extracting email addresses from a load file accompanying a batch, or it may be by extracting email addresses from body text using, for example, Regular Expressions, Email Parsing Libraries, Natural Language Processing
  • the method may then proceed to parse out a domain for each email address extracted from the load file and email body.
  • the method may similarly extract names and roles of such relevant parties.
  • Figures 10 and 11 then illustrate the use of the method to extract additional information related to the parties.
  • Such information may be used by humans, rule-based algorithms, or artificial intelligence algorithms, among other methods, to determine if or confirm that the domain name corresponds to a law firm or any other party that may indicate or imply a privilege claim.
  • the retrieval of secondary information may be from a data source external to the plurality of documents, and the secondary information may then be used to determine whether to include the corresponding name or contact information as a secondary search term.
  • the external data source may be the internet or some other external source, such as, e.g., a database or directory of law firms.
  • Figure 12 illustrates a visualization of relationships in accordance with this disclosure.
  • the method proceeds to generate a map illustrating communications between parties associated with each name or contact information extracted.
  • a map may be based on relationships illustrated in email transmissions, such as email transmissions between parties or that certain parties are copied on the same communications.
  • Figure 13 illustrates lemmatization of document text in accordance with this disclosure.
  • the plurality of documents may be processed (210) prior to executing any search operation. This may include, for example, normalization, sentence tokenization, removal of stop words/punctuation/numbers/emojis and others, lemmatization of terms, or other processing steps.
  • processing (210) may further include text embedding or transformers.
  • the method reviews the initial plurality of search term results prior to displaying the representative search term results to the user. Where the method determines that any particular group was drawn from an identified form paragraph, the method may then preemptively mark the corresponding group as slated for removal. Accordingly, a user may modify the indication, but would not be required to proactively further indicate that a particular representative search term result is to be removed.
  • method described above for classifying search term results generally.
  • the method may initially utilize keywords, or search terms, expected to appear in a legend or form paragraph.
  • keywords may include, for example, “privileged” or “confidential” among others.
  • Figures 17A-17B illustrate the automatic classification of legends in accordance with this disclosure. As shown, if a search result during a later search is drawn from a verified legend, the corresponding group may be preemptively marked for removal from the initial plurality of search term results. Alternatively, the corresponding group may not be presented to the user at all.
  • Figures 18A-18B illustrate an interface for bulk classification of documents in accordance with this disclosure. As discussed above, a user may be presented with a large number of groups of search term results in a single interface in which all such results may be quickly classified.
  • data processing apparatus and like terms encompass all kinds of apparatus, devices, and machines for processing data, including by way of example a
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
  • Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Computers may further be provided in other forms, such as in the form of handheld devices or smartphones, as well as in the form of tablet devices.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s client device in response to requests received from the web browser.
  • a device for providing interaction with a user may be referred to as a user interface device.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network.
  • Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device).
  • Data generated at the client device e.g., a result of the user interaction

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Computational Linguistics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Technology Law (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-based method is provided for filtering search results, including first identifying a plurality of documents for review and then identifying an initial plurality of search term results potentially relevant to a search query, each search term result of the initial plurality containing an occurrence of a primary search term of a plurality of primary search terms. The method then proceeds to group identical search term results to define a plurality of groups of search term results. A representative search term result of each group is presented for evaluation. Upon receiving an indication from a user that the representative search term result is to be removed, the method then removes all search term results of the corresponding group to define a modified plurality of search term results. Each document from which any search term result of the modified plurality of search term results is drawn is defined as a potentially relevant document.

Description

METHOD FOR FILTERING SEARCH RESULTS BASED ON SEARCH TERMS IN CONTEXT
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/352,733, filed June 16, 2022, the contents of which are incorporated by reference herein in their entirety.
FIELD OF THE INVENTION
[0002] The invention is related to a search platform, such as a document review platform, in which a plurality of documents may be reviewed to determine if individual documents are relevant to a search query.
BACKGROUND
[0003] THE GROWTH OF ELECTRONICALLY STORED INFORMATION
[0004] The exponential growth in the volume of Electronically Stored Information presents a formidable challenge to Proceeding Participants. In particular, the costs associated with Document Review are already significant and can only climb higher as the volume of Electronically Stored Information continues its inexorable rise.
[0005] A study published by the RAND Corporation in 2012 revealed that on average 73% of Document Production costs were attributable to Document Review - a process requiring an average of 23 attorneys that can take weeks, months or even years (in the case of rolling Document Productions) to complete.
[0006] DOCUMENT REVIEW TYPES
[0007] There are three types of Document Review:
[0008] Responsive Review
[0009] There are two sub-types of Responsive Review:
[0010] First Level Responsive Review
[0011] The objective of First Level Responsive Review is to separate Responsive Documents from Non-Responsive Documents.
[0012] First Level Responsive Review is typically conducted by paralegals, contract attorneys, and/or junior associates.
[0013] Failure to conduct First Level Responsive Review properly will increase the volume of Documents requiring Second Level Responsive Review (resulting in increased costs). There is also an increased risk that embarrassing Non-Responsive Documents will be disclosed.
[0014] Second Level Responsive Review
[0015] The objective of Second Level Responsive Review is to perform a quality control check of the Responsive Documents and Non-Responsive Documents identified during the First Level Responsive Review.
[0016] Second Level Responsive Review (which is often repeated multiple times) is typically conducted by senior associates.
[0017] Failure to conduct Second Level Responsive Review increases the risk that embarrassing Non-Responsive Documents will be disclosed.
[0018] Privilege Review
[0019] Privilege Review involves identifying and extracting Privileged Documents from the corpus of Responsive Documents so they can be redacted (in whole or in part) and included in the privilege log.
[0020] Like Second Level Responsive Review, Privilege Review is often repeated multiple times in large part because previously unknown attorneys/law firms often come to light as Document Review progresses.
[0021] Privilege Review is typically conducted by senior associates and is the most time consuming and expensive part of Document Review.
[0022] Failure to conduct Privilege Review properly will inevitably result in serious consequences and may result in a finding that privilege has been waived.
[0023] Confidential Review
[0024] Confidential Review involves identifying and extracting Confidential Documents from the corpus of Responsive Documents so they can be redacted (in whole or in part) or designated as “Attorney Eyes Only.” Like Second Level Responsive Review, Confidential Review is often repeated multiple times. Confidential Documents may include, for example, Personally Identifiable Information or other personal data, as well as commercially sensitive information or trade secrets.
[0025] Confidential Review is typically conducted by senior associates and is just as expensive and time consuming as Privilege Review.
[0026] Failure to conduct Confidential Review properly also has serious consequences and may result in contractual liability, court sanctions and regulatory financial penalties.
[0027] IDENTIFYING REVIEW DOCUMENTS
[0028] Existing Review Platforms make use of the following functionality to identify Documents:
[0029] Search Terms; and
[0030] Technology-Assisted Review, also occasionally referred to as Artificial Intelligence or Predictive Coding based review.
[0031] Search Terms
[0032] There are three types of Search Terms:
[0033] Responsive Search Terms
[0034] Responsive Search Terms are Proceeding specific and are designed to identify Responsive Documents.
[0035] Responsive Search Terms are developed collaboratively by attorneys and their clients and may also involve input from third parties.
[0036] Privilege Search Terms
[0037] There are two sub-types of Privilege Search Terms, each of which is designed to identify Privileged Documents:
[0038] General Privilege Search Terms
[0039] General Privilege Search Terms are created by attorneys and law firms and evolve over a period of time based on experience obtained in prior Proceedings.
[0040] Attorney Privilege Search Terms
[0041] Attorney Privilege Search Terms are specific to the Proceedings and are designed to identify (via individual name, firm name, internet domain and/or email address) the authors/recipients of Privileged Documents (i.e. outside and in-house counsel).
[0042] Confidentiality Search Terms
[0043] Confidentiality Search Terms are typically Proceeding/Individual specific and are designed to identify Confidential Documents.
[0044] Confidentiality Search Terms may originate as a result of legislation (such as, e.g., the California Consumer Privacy Act, General Data Protection Regulation, Health Insurance Portability and Accountability Act, etc.) and/or as a result of collaboration between law firms and their clients.
[0045] It is important to note, however, that search terms in isolation will reveal little (if anything) about a document’s categorization as a true positive document or false positive document as they provide insufficient textual context to make such a determination.
[0046] Technology-Assisted Review
[0047] Technology-Assisted Review is also used to identify Responsive Documents although its use is not as prevalent as Search Terms.
[0048] Currently, there is a reluctance on the part of Proceeding Participants to rely on Technology-Assisted Review to identify Privileged Documents and Confidential Documents due to the unique characteristics of each type of Document. For example, an attachment to an email exchanged between two non-attorneys will not be a Privileged Document and yet the same attachment sent to a law firm requesting legal advice on its content will be a Privileged Document.
[0049] MEASURING DOCUMENT REVIEW EFFECTIVENESS
[0050] Document Review effectiveness is measured by Precision (a measure of efficiency) and Recall (a measure of completeness).
[0051] It is important to note that Recall and Precision have an inverse relationship so that improving one tends to degrade the other.
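By way of non-limiting illustration, the two measures can be computed as follows under the standard information-retrieval definitions: Precision is the share of flagged Documents that are true positives, and Recall is the share of truly relevant Documents that were flagged. The function name and the counts below are hypothetical and are not taken from this disclosure.

```python
from typing import Tuple


def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) for a completed review."""
    flagged = true_positives + false_positives    # documents the search returned
    relevant = true_positives + false_negatives   # documents that truly matter
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical broad search: 1,000 documents flagged, 200 of them truly
# responsive, and 50 responsive documents missed entirely.
broad = precision_recall(true_positives=200, false_positives=800, false_negatives=50)
```

In this hypothetical, Recall is high (200 of 250 relevant Documents found) while Precision is low (200 of 1,000 flagged Documents relevant), which corresponds to a review burdened by a large number of False Positive Documents.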
[0052] Conducting a high Recall / low Precision Document Review is extremely expensive because of the additional time required to identify, classify and categorize the high number of False Positive Documents.
[0053] Conducting a low Recall / high Precision Document Review is less expensive but raises the specter of important Documents being excluded from the Proceedings (in the case of Responsive Documents) or inadvertently disclosed (in the case of Confidential Documents and Privileged Documents).
[0054] For the reasons outlined above, Proceeding Participants have attempted to strike a difficult balance between Recall and Precision.
SUMMARY
[0055] A computer-based method is provided for filtering search results. The method includes first identifying a plurality of documents for review and then identifying an initial plurality of search term results potentially relevant to a search query, each search term result of the initial plurality containing an occurrence of a primary search term of a plurality of primary search terms. Each search term result of the plurality of search term results is drawn from one of the plurality of documents for review, and each search term result of the plurality of search term results is smaller than the one of the plurality of documents from which it is drawn.
[0056] The method then proceeds to group search term results of the initial plurality of search term results that are identical to each other to define a plurality of groups of search term results. A first group of search term results of the plurality of groups of search term results comprises search term results different than those of a second group of search term results of the plurality of groups of search term results.
[0057] The method then proceeds to display a representative search term result of the first group of search term results for evaluation.
[0058] Upon receiving an indication from a user that the representative search term result is to be removed, the method then removes all search term results of the first group from the initial plurality of search term results to define a modified plurality of search term results.
[0059] The method then defines each document from which any search result of the modified plurality of search term results is drawn as a potentially relevant document.
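The grouping, removal, and document-definition steps described above may be sketched, purely as an illustrative example, as follows. The data structure (a list of document identifier and fragment pairs), the function names, and the sample fragments are assumptions made for this sketch and do not represent the claimed implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Each search term result is modeled as (document_id, fragment_text).
Result = Tuple[str, str]


def group_identical(results: List[Result]) -> Dict[str, List[Result]]:
    """Group search term results whose fragment text is identical."""
    groups: Dict[str, List[Result]] = defaultdict(list)
    for doc_id, fragment in results:
        groups[fragment].append((doc_id, fragment))
    return dict(groups)


def potentially_relevant(groups: Dict[str, List[Result]],
                         removed: Set[str]) -> Set[str]:
    """Drop every result of each removed group; keep any document
    from which at least one surviving result is drawn."""
    return {
        doc_id
        for fragment, members in groups.items()
        if fragment not in removed
        for doc_id, _ in members
    }


results = [
    ("doc1", "please call the attorney today"),
    ("doc2", "please call the attorney today"),
    ("doc2", "attorney general announces new rule"),
    ("doc3", "attorney general announces new rule"),
]
groups = group_identical(results)
# The user views one representative per group and removes the second group.
remaining = potentially_relevant(
    groups, removed={"attorney general announces new rule"}
)
```

In this sketch a single user decision removes two search term results at once. The document "doc2" remains potentially relevant because one of its search term results survives, while "doc3" does not.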
[0060] In some embodiments, each search term result of the plurality of search term results is larger than the search term, such that search term results are defined as identical only if contents of the search term results other than the search term are identical.
[0061] In some such embodiments, each search term result is either a sentence containing the corresponding search term or is the corresponding search term combined with a previously defined number of leading or following characters.
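As a minimal sketch of the two options described above, a search term result may be formed either as the sentence containing the search term or as the search term with a fixed number of surrounding characters. The function names, the naive sentence splitting rule, and the sample text are illustrative assumptions only.

```python
import re
from typing import List


def context_windows(text: str, term: str, width: int = 20) -> List[str]:
    """Return each occurrence of `term` with up to `width` characters
    of leading and following context."""
    windows = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        windows.append(text[start:end])
    return windows


def sentences_containing(text: str, term: str) -> List[str]:
    """Return each sentence containing `term`, splitting naively
    on whitespace that follows '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if term.lower() in s.lower()]


sample = "We met on Monday. Our attorney reviewed the draft. Nothing else happened."
by_sentence = sentences_containing(sample, "attorney")
by_window = context_windows(sample, "attorney", width=4)
```

Either form yields a fragment that is larger than the search term but smaller than the document from which it is drawn.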
[0062] Alternatively, in some such embodiments, upon receiving an indication from the user that the representative search term result is ambiguous, the method displays a document component containing the search term result to the user. The component is larger than the corresponding search term result but smaller than the document from which the search term result is drawn. Upon receiving a further indication from the user that the snippet is to be removed, the method proceeds with removing the corresponding representative search term result from the at least one group of search term results and defining a different search term result from the group of search term results as the representative search term result for display to the user.
[0063] In some embodiments, multiple search term results are drawn from a single document of the plurality of documents for review, such that the document is defined as potentially relevant to the search query so long as any single search term result associated with the corresponding document remains in the modified plurality of search term results.
[0064] In some embodiments, the method further includes displaying a representative search term result for each group of the plurality of groups of search term results and receiving an indication from the user that the corresponding representative search term result is relevant to the search query or that the corresponding representative search term result is to be removed.
[0065] In some embodiments, the method further includes displaying a representative search term result for each group of the plurality of groups of search term results and receiving an indication from the user that the corresponding representative search term result is relevant to the search query, that the corresponding representative search term result is to be removed, or that the user cannot make a determination based on the representative search term result.
[0066] In some embodiments, the method further comprises identifying at least one form paragraph prior to identifying the initial plurality of search term results and defining a search term result as part of the initial plurality of search term results only upon determining that the corresponding search term result is not drawn from the at least one form paragraph.
[0067] In some embodiments, the method further comprises identifying at least one form paragraph prior to identifying the initial plurality of search term results and preemptively indicating to a user that a particular representative search term result presented for evaluation is drawn from the at least one form paragraph.
[0068] In some embodiments, the method further comprises displaying each potentially relevant document to the user for further review and receiving an indication from the user that the corresponding document is either relevant or irrelevant.
[0069] In some embodiments, upon defining a group of potentially relevant documents, the method comprises identifying at least one document of the group of potentially relevant documents as containing a secondary search term, and retaining the at least one document for further review.
[0070] In some such embodiments, the further review identifies the document as at least partially privileged or confidential and implements a redaction process.
[0071] In some embodiments in which secondary search terms are identified, such secondary search terms comprise names or contact information extracted from the plurality of documents for review.
[0072] In some such embodiments, the method includes extracting the names or contact information from header information of the plurality of documents for review. For each name extracted, the method then retrieves secondary information from a data source external to the plurality of documents. The method then determines, based on the secondary information, whether to include the corresponding name or contact information as a secondary search term.
[0073] In some such embodiments, the data source external to the plurality of documents is the internet or an external database.
[0074] In some embodiments in which names or contact information are extracted, such names or contact information are extracted from a load file accompanying batches of documents, or the names or contact information are extracted from body text using Regular Expressions, Email Parsing Libraries, Natural Language Processing (NLP), Machine Learning, or third-party APIs.
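As one hedged illustration of extraction from body text using Regular Expressions, email addresses and their domains may be collected as shown below. The pattern is deliberately simple and does not cover every address form permitted by the email standards; the pattern, function names, and sample addresses are assumptions for this sketch.

```python
import re
from typing import List, Set

# Simplified pattern: local part, "@", then a dotted domain name.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(body: str) -> List[str]:
    """Return unique addresses, lowercased, in order of first appearance."""
    seen: List[str] = []
    for address in EMAIL_PATTERN.findall(body):
        address = address.lower()
        if address not in seen:
            seen.append(address)
    return seen


def extract_domains(addresses: List[str]) -> Set[str]:
    """Return the set of domains appearing in the given addresses."""
    return {address.split("@", 1)[1] for address in addresses}


body = ("Contact Jane.Doe@lawfirm.example or j.smith@client.example for details. "
        "Jane.Doe@lawfirm.example replied yesterday.")
addresses = extract_emails(body)
found_domains = extract_domains(addresses)
```

The resulting domains could then be presented to a user as candidate secondary search terms.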
[0075] In some embodiments in which names or contact information are extracted, the method includes extracting names or contact information from email header information of the plurality of documents for review. The method then generates a map illustrating communications between parties associated with each name or contact information extracted, the map based on indications that emails were transmitted between those parties or that those parties were copied on the same communications. The method then displays the map to the user and receives an indication from the user that at least one name or contact information visualized in the map should be included as a secondary search term.
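The data underlying such a map may be sketched, as a non-limiting example, by counting how often each pair of addresses appears in the From, To, or Cc headers of the same message. The sketch uses the Python standard library email package; the function name and the sample messages are hypothetical, and rendering of the map for display is omitted.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses
from itertools import combinations
from typing import List


def communication_map(raw_emails: List[str]) -> Counter:
    """Count, for each pair of addresses, the messages on which both appear."""
    edges: Counter = Counter()
    for raw in raw_emails:
        msg = message_from_string(raw)
        headers = (msg.get_all("From", [])
                   + msg.get_all("To", [])
                   + msg.get_all("Cc", []))
        participants = sorted(
            {addr.lower() for _, addr in getaddresses(headers) if addr}
        )
        for pair in combinations(participants, 2):
            edges[pair] += 1
    return edges


raw_emails = [
    "From: alice@corp.example\nTo: bob@lawfirm.example\n"
    "Cc: carol@corp.example\nSubject: Draft\n\nPlease review.",
    "From: bob@lawfirm.example\nTo: alice@corp.example\n"
    "Subject: Re: Draft\n\nComments attached.",
]
edges = communication_map(raw_emails)
```

Pairs with higher counts indicate parties who communicate frequently, which may help a user decide which names or addresses to include as secondary search terms.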
[0076] In some embodiments, the method further includes processing the documents of the plurality of documents for review prior to identifying the initial plurality of search term results. Processing includes lemmatization of the corresponding documents, and at least one primary search term is a lemma.
[0077] In some embodiments, the method further includes processing the documents of the plurality of documents for review prior to identifying the initial plurality of search term results, wherein processing includes normalization, sentence tokenization, or removal of stop words.
[0078] In some embodiments, each primary search term of the plurality of primary search terms is one of a single word, a continuous series of words, or a plurality of words located within a threshold number of words of each other.
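The third form of primary search term, a plurality of words located within a threshold number of words of each other, may be illustrated by the following minimal sketch. The tokenization rule and the function names are assumptions made for the example.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_threshold(text: str, first: str, second: str, threshold: int) -> bool:
    """Return True if `first` and `second` occur within `threshold`
    words of each other, in either order."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= threshold
               for i in first_positions
               for j in second_positions)


sentence = "The legal team sent its advice on Friday."
near = within_threshold(sentence, "legal", "advice", threshold=5)
far = within_threshold(sentence, "legal", "advice", threshold=3)
```

Here "legal" and "advice" are four words apart, so the term matches with a threshold of five words but not with a threshold of three.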
[0079] In some embodiments, the method includes the generation of additional search term results from the initial plurality of search term results via human or artificial intelligence supplementation.
[0080] In some embodiments, the identification of the initial plurality of search term results potentially responsive to the search query is based on a Boolean comparison of each of the plurality of primary search terms to each of the plurality of documents.
BRIEF DESCRIPTION OF THE DRAWINGS
[0081] Figures 1A and 1B schematically illustrate an exemplary review process in accordance with this disclosure compared with a traditional review process.
[0082] Figure 2 is a flowchart illustrating an exemplary computer-based method of filtering search results in accordance with this disclosure.
[0083] Figures 3A-3C illustrate an exemplary document processed in accordance with the exemplary method of FIG. 2.
[0084] Figures 4A and 4B illustrate individual search term results in accordance with the exemplary method of FIG. 2.
[0085] Figure 5 illustrates the classification of groups of documents based on exemplary search term results in the context of the exemplary method of FIG. 2.
[0086] Figures 6A-6B illustrate the presentation of a document component, occasionally referred to herein as a “snippet,” to a user in the context of the exemplary method of FIG. 2. Such a document component may comprise all or part of a document unit, such as a text string, sentence, paragraph, table, or table cell. Other document parsing schemas are possible as well.
[0087] Figures 7A-7B illustrate the presentation of a full document to a user in the context of the exemplary method of FIG. 2.
[0088] Figure 8 illustrates the extraction of email addresses from the body of a document in accordance with this disclosure.
[0089] Figure 9 illustrates the identification of domains associated with email addresses extracted in FIG. 8.
[0090] Figure 10 illustrates the presentation of domains identified in FIG. 9 to a user.
[0091] Figure 11 illustrates the population of the presentation of FIG. 10 with additional context.
[0092] Figure 12 illustrates a visualization of relationships in accordance with this disclosure.
[0093] Figure 13 illustrates lemmatization of document text in accordance with this disclosure.
[0094] Figure 14 illustrates the expansion of search terms for use in generating search results in accordance with this disclosure.
[0095] Figure 15 illustrates an email legend to be excluded from or automatically classified within search results in accordance with this disclosure. Such an email legend may correspond to a footer containing boilerplate language, for example.
[0096] Figures 16A-D illustrate a process for classifying email legends in accordance with this disclosure.
[0097] Figures 17A-17B illustrate the automatic classification of search term results originating from legends in accordance with this disclosure.
[0098] Figures 18A-18B illustrate an interface for bulk classification of documents in accordance with this disclosure.
[0099] Figures 19A-19B illustrate the grouping and categorization of search results by subject matter in accordance with this disclosure.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[00100] The description of illustrative embodiments according to principles of the present invention is intended to be read in connection with the accompanying drawings, which are to be considered part of the entire written description.
[00101] The features and benefits of the invention are illustrated by reference to the exemplified embodiments. Accordingly, the invention expressly should not be limited to such exemplary embodiments illustrating some possible non-limiting combination of features that may exist alone or in other combinations of features; the scope of the invention being defined by the claims appended hereto.
[00102] This disclosure describes the best mode or modes of practicing the invention as presently contemplated. This description is not intended to be understood in a limiting sense, but provides an example of the invention presented solely for illustrative purposes by reference to the accompanying drawings to advise one of ordinary skill in the art of the advantages and construction of the invention. In the various views of the drawings, like reference characters designate like or similar parts.
[00103] A system and method described herein allow a user to quickly and efficiently filter search results and/or documents by viewing grouped search results in context. As discussed in more detail below, such grouping may follow a deduplication process, such that the grouped search results have already been deduplicated. By grouping search results in situations where context is related or identical, users may be able to quickly eliminate false positives from bulk search results, thereby drastically reducing the number of documents from which those search results are drawn that require more comprehensive or fully manual review.
[00104] The system and method are described herein in the context of document review processes, such as legal reviews in preparation for litigation. In the context of such document review, documents are generally reviewed for several distinct reasons. These reviews typically include a review for responsiveness, privileged information, and confidential information.
[00105] In the context of a responsiveness review, bulk documents are reviewed to determine if they are substantively responsive to a requirement, and therefore should be included in a set of results. For example, when responding to a discovery request, the responsiveness review would identify all documents that should be disclosed in response, barring any privileged or confidential information that would prevent such disclosure.
[00106] Accordingly, a search might initially be overbroad, and a comprehensive review of those search results would typically be implemented to eliminate false positives. In the context of responsiveness review, a “false positive” would therefore be a document that should not be disclosed.
[00107] In the context of a privilege or confidentiality review, bulk documents are reviewed to determine if any documents that are substantively responsive should, nevertheless, be withheld or redacted due to some basis for privilege or confidentiality. This might be, for example, due to a particular document being a privileged communication between an attorney and a client.
[00108] Accordingly, a search to identify such privileged or confidential information might be overbroad, and a review of such results would eliminate false positives. However, in the context of these reviews, a “false positive” would be a document that should be disclosed, while true positives would be withheld or redacted.
[00109] It is noted that while this discussion is presented in the context of document review, such as in preparation for litigation, a similar method of filtering search results may be utilized in other contexts as well (e.g., privacy review). Such a method may then be used to increase the precision, speed, or efficiency of partially automated searching more generally.
[00110] Figures 1A and 1B schematically illustrate an exemplary review process in accordance with this disclosure compared with a traditional review process, such as the Electronic Discovery Reference Model (EDRM). As shown in FIG. 1A, in implementing a traditional electronic document review process, an initial search may identify a plurality of documents for review using some automated search process, and may then simply present those documents to attorneys for review. As such, all documents are reviewed in a linear fashion in their entirety to determine whether each document is responsive and, if so, whether the document is privileged or contains confidential information and therefore should be withheld or requires redaction.
[00111] In contrast, as shown in FIG. 1B, in the method disclosed herein, instead of searching for responsive documents, an initial search identifies potentially relevant fragments of documents, referred to herein as “search term results,” each made up of a search result including a search term. Once a plurality of search term results is identified in an initial search, such fragments may be presented to a user. Where those fragments contain sufficient information to make a determination with respect to a particular review, the user can then make their determination on the basis of just the search term results. Accordingly, in a typical implementation, the process may be 80% fragment based and 20% full document based.
[00112] Further, because documents are presented based on fragments, the fragments may be deduplicated, where they are identical, and/or grouped where they are similar. Accordingly, where multiple documents contain the same potentially relevant search terms with the same immediate context, an initial search may identify all instances of the potentially relevant term and the method may then group the corresponding search term results. A user may make a determination with respect to all such grouped identical search term results simultaneously.
[00113] Accordingly, as shown in the fragment-based portion of the review, processes may be automated or semi-automatic. For instance, a legend review may identify boilerplate text in order to support a review for privilege or otherwise exclude the legend from any substantive evaluation of the corresponding document. As discussed in more detail below, this may be by identifying keywords typically associated with privilege (such as terms explicitly indicating that a document may be privileged), and determining if those keywords appear only in boilerplate text. In such a scenario, the keyword may not actually indicate anything with respect to the substance of the corresponding document. For example, an attorney email may include a signature block having a legend indicating possibly privileged content which does not bear on whether any particular email is privileged.
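The determination of whether a keyword appears only in boilerplate text may be sketched as follows, under the simplifying assumption that each legend has already been identified and can be matched as an exact string. The function name, the sample legend, and the sample emails are hypothetical; the legend classification process of this disclosure is not reproduced here.

```python
from typing import List


def occurs_outside_legend(text: str, keyword: str, legends: List[str]) -> bool:
    """Return True if `keyword` still occurs in `text` after every
    known legend has been stripped out (exact string match assumed)."""
    stripped = text
    for legend in legends:
        stripped = stripped.replace(legend, " ")
    return keyword.lower() in stripped.lower()


legend = "This email may contain privileged and confidential information."
lunch_email = "See you at lunch on Friday.\n" + legend
advice_email = "This memo is privileged legal advice.\n" + legend

lunch_flag = occurs_outside_legend(lunch_email, "privileged", [legend])
advice_flag = occurs_outside_legend(advice_email, "privileged", [legend])
```

In this sketch the first email contains the keyword only within its legend, so the keyword says nothing about the substance of that email, whereas the second email uses the keyword in its body as well.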
[00114] Further, search term results and documents as a whole may be handled in bulk based on the fragments including the identified search terms. This is discussed in more detail below in reference to the method disclosed.
[00115] Figure 2 is a flowchart illustrating an exemplary computer-based method of filtering search results in accordance with this disclosure. Further, FIGS. 3A-3C illustrate an exemplary document processed in accordance with the exemplary method of FIG. 2. The document presented in FIGS. 3A-3C may inform the portions of a document referred to using specific terms, and will therefore be referenced throughout the discussion of the method.
[00116] Initially, a plurality of documents is identified (200) for review. This may be by loading a database or receiving a batch of documents for review. In some cases, such a batch of documents may be accompanied by a file manifest, or a load file, containing extracted metadata. Alternatively, the plurality of documents may be initially created or compiled by scraping underlying sources of documents, such as email accounts, or by receiving access to files. Further, the documents may be retrieved from hard copies, and as such may be the result of a scanning process.
[00117] It is assumed generally that the plurality of documents is in a condition that allows for automated review. As such, the plurality of documents are typically text documents or contain a text layer that can be processed. Alternatively, if documents are initially provided as image files, text may be extracted using, for example, an optical character recognition (OCR) process. Similarly, the documents may contain features in order to ease the ability of an automated system to process the documents further. For example, the documents may contain tags created in a markup language, such as HTML or XML, or the documents may be provided as standard format email files.
[00118] In some embodiments, the plurality of documents may be processed (210) prior to executing any search operation. This may include, for example, text normalization (such as, e.g., lower or upper case transformations), sentence tokenization, removal of stop words, punctuation, numbers, emojis, or other elements, lemmatization of terms, or other processing steps, discussed in more detail below with respect to FIG. 13.
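A minimal sketch of several of these processing steps (case normalization, sentence tokenization, and removal of stop words, punctuation, and numbers) is given below. The stop word list is a short illustrative sample, the sentence splitting rule is naive, and lemmatization is omitted because it would typically rely on a dedicated language-processing library; all names are assumptions for this example.

```python
import re
from typing import List

# Short illustrative stop word list; a real list would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was"}


def preprocess(document: str) -> List[List[str]]:
    """Lowercase the document, split it into sentences, and return the
    remaining word tokens of each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip().lower())
    processed: List[List[str]] = []
    for sentence in sentences:
        # Letters only, which drops punctuation, numbers, and emojis.
        tokens = re.findall(r"[a-z]+", sentence)
        kept = [token for token in tokens if token not in STOP_WORDS]
        if kept:
            processed.append(kept)
    return processed


tokens_by_sentence = preprocess(
    "The Contract was signed in 2021. Send a copy to the Attorney!"
)
```

For the sample text, the sketch yields two sentences, reduced to the tokens "contract", "signed" and "send", "copy", "attorney" respectively.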
[00119] The method then proceeds to identify search terms (220) for use in the search process. The search terms used for the search may vary based on the type of search to be performed. As such, a search for documents responsive to a discovery request may include substantive terms identified by the parties involved in a legal conflict for review. Alternatively, or in addition, an automated or semi-automated process may be implemented to identify additional appropriate terms (e.g., via the use of synonyms). Such a list of additional appropriate terms may be generated by human and/or artificial intelligence.
[00120] Alternatively, a search for documents that are believed to be privileged may rely on a different set of search terms. Search terms related to responsiveness may be substantive, while search terms related to privilege may instead relate to names of individuals, such as attorneys, as well as law firms, each of which may be likely to have privileged discussions.
[00121] The search terms used may take different forms, and may include, for example, individual phrases, terms within a range of each other, stemmed terms, terms including wildcards, and Boolean operators, among others.
[00122] Once search terms are selected, the method implements a search (230) based on those search terms. As a result of the implemented search, the method identifies (240) an initial plurality of search term results potentially relevant to a search query. Each search term result included in the initial plurality of search term results contains an occurrence of a search term included in a plurality of search terms included in the search. The initial plurality of search term results may be used to generate additional search term results via human or artificial intelligence supplementation (e.g., unsupervised machine learning algorithms such as transformers, text embeddings such as vector representations, and other similarity-based methods).
[00123] The search query discussed herein is the objective of the particular search being implemented. As such, in the context of a responsiveness search, the search query is to generally identify documents responsive to a discovery request. Similarly, for a privilege review, the search query would instead be to determine whether a document is privileged.
[00124] Accordingly, the identification of search term results potentially relevant to the search query (at 240) is typically based on whether the search term result contains an occurrence of a search term of the plurality. It is noted that in some embodiments, the search methodology may be implemented repeatedly. For example, a first search may be implemented for a responsiveness review while a second search may be applied to the batch of documents or only to documents determined to be responsive to determine if any such responsive documents are privileged. In some embodiments, a search may precede the responsiveness search as well, such as in order to identify legends to exclude from later search results. In discussing such results, search terms associated with a first search pass may be referred to as primary search terms while search terms associated with a second search pass may be referred to as secondary search terms. It will be understood that the classifications of “primary” and “secondary” are for the purpose of distinguishing sets of search terms, and do not necessarily indicate timing or importance of the searches relative to each other.
[00125] Each search term result contains an occurrence of a search term and typically contains context associated with the search term. As shown in FIGS. 3A-3C, the search term result may then include the search term (which may be either a single word or phrase or a grouping of words or phrases within a range) as well as search term context surrounding the search term. Each such search term result is drawn from one of the plurality of documents for review, and is a fragment of the document from which it is drawn.
[00126] Typically, because the search term result includes context, it would be longer than the search term contained therein. Such a search term result may then be, for example, a sentence containing the corresponding search term. Alternatively, the search term result may be a corresponding search term combined with a previously defined number of leading and/or following characters. For example, the search term result may be the search term combined with 80 preceding and 80 following characters.
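A minimal sketch of the fixed-window variant follows, assuming a simple case-insensitive literal match. The function name, the dictionary layout of a result, and the default window of 80 characters (taken from the example above) are illustrative choices rather than the disclosed implementation.

```python
import re

def find_results(doc_id, text, term, window=80):
    """Return one fragment per occurrence of `term`, with up to `window`
    characters of context on each side of the occurrence."""
    results = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, m.start() - window)       # clamp at start of document
        end = min(len(text), m.end() + window)   # clamp at end of document
        results.append({
            "doc_id": doc_id,                    # document the fragment is drawn from
            "term": m.group(0),                  # the matched search term as it appears
            "fragment": text[start:end],         # search term plus surrounding context
            "span": (m.start(), m.end()),        # position of the term in the document
        })
    return results
```

Each returned fragment is tied back to its source document through `doc_id`, which allows results to be traced to the underlying document later in the method.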
[00127] The method then proceeds to group search term results (250) of the initial plurality of search term results that are identical to each other. As such, a plurality of groups of search term results may then be defined, such that a first group of search term results comprises search term results different than those of a second group of search term results. All search term results of the first and second group are drawn from the initial plurality of search term results and as noted above, the search term results are typically larger than the underlying search term. Accordingly, search term results are defined as identical, and are thereby grouped, only if contents of the search term results other than the search term are identical.
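The grouping at (250) can be sketched as below. This is an illustration under stated assumptions: each result is a dictionary with a `fragment` field, and fragments are compared after collapsing whitespace, a normalization that the disclosure does not specify.

```python
from collections import defaultdict

def group_identical(results):
    """Group search term results whose fragments are identical.

    Assumption: two fragments count as identical when they match
    after runs of whitespace are collapsed to single spaces.
    """
    groups = defaultdict(list)
    for r in results:
        key = " ".join(r["fragment"].split())
        groups[key].append(r)
    return dict(groups)
```

Because the key is the full fragment, two results fall in the same group only when their surrounding context matches as well as the search term.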
[00128] Grouping in this manner may be referred to elsewhere herein as “deduplication.” Because the search term results represent fragments of documents, rather than complete documents, the fragments may be duplicative where similar sentences appear across a large number of documents. If such a fragment does not indicate a positive search result, then any identical fragment would similarly not indicate a positive search term result. As such, by grouping, or deduplicating, such identical search term results, the results may be reviewed more efficiently.
[00129] In this implementation of the method, the grouping described (and provided in step 250) represents a deduplication step. However, as noted above, in other embodiments, deduplication and grouping may represent distinct steps. In such an embodiment, the search term results may first be deduplicated (at 250) followed by a second grouping step (at 255) where similar search term results that have already been deduplicated (at 250) may then be grouped.
[00130] Once grouped, a representative search term result is typically defined for each group (260) such that the representative result can be presented to a user (at 270) for evaluation. The user may then review the representative search term result and determine (at 280) if the search term result, taken alone, provides sufficient context for the underlying search term to determine if the search term result is relevant to the search query.
[00131] If the user determines that the representative search term result provides sufficient context, they may then indicate that the representative search result is to be removed from the search results, or that it should be retained in the results. The user may indicate that the representative search term result is to be removed because, for example, it represents a false positive.
[00132] The method may then receive an indication (290) from the user that the representative search term result is to be removed or retained. In such a scenario, where the user indicates that the search term result is to be removed, the method then removes (300) all search term results of the corresponding group from the initial plurality of search term results to define a modified plurality of search term results (310). Alternatively, the method may instead receive an indication (at 290) from the user that the representative search term result is to be retained, and in such a scenario, the corresponding group may continue to be present in the modified plurality of search term results.
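The removal at (300) and the resulting modified plurality (310) might be sketched as follows. The mapping of group keys to member results, the `"remove"` and `"retain"` labels, and the choice to keep undecided groups are assumptions of this sketch.

```python
def apply_decisions(groups, decisions):
    """Build the modified plurality of search term results.

    groups:    mapping of group key -> list of results in that group.
    decisions: mapping of group key -> "remove" or "retain", as indicated
               by the user for the group's representative result.
    Groups with no decision are kept (assumption: undecided results
    remain available for later review).
    """
    modified = []
    for key, members in groups.items():
        if decisions.get(key) == "remove":
            continue  # drop every result in the group, not just the representative
        modified.extend(members)
    return modified
```

A single indication on a representative result thus removes the whole group, which is where the efficiency gain of grouping comes from.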
[00133] Figures 4A and 4B illustrate individual search term results in accordance with the exemplary method of FIG. 2. While the example relates to use in a privilege review, a similar approach is used in other types of reviews discussed herein. FIG. 4A shows an example of a term that was included in the plurality of search term results potentially relevant to a search query because it included the term “attorney work product,” which implies that the document may be privileged. The search term result then includes the context, which appears to assert privilege. As such, a user may indicate that the result is a true positive result by, for example, ticking a box that indicates that the result should be “pinned” and thereby retained in the modified plurality of search results. Various interface implementations are contemplated, and in the illustrated embodiment, the user indicates a true positive response by clicking a check icon indicating that the result should be retained in the modified plurality of search results.
[00134] In contrast, FIG. 4B shows an example of a term that was included in the plurality of search term results potentially relevant to a search query because it included the term “confidential and privilege.” However, when viewing the search term results, it is clear that this language was part of boilerplate, such as a legend, and merely indicated that documents may contain privileged information. As such, this language alone does not indicate relevance to the search query, and the user may indicate that the representative search term result is therefore a false positive and can be removed by ticking the box corresponding to the trash icon, or selecting an “X” icon.
[00135] In order to enhance the contrast between the true positive search result in FIG. 4A and the false positive search result in FIG. 4B, the result in FIG. 4A may be shaded in, e.g., green, while the result in FIG. 4B may be shaded in, e.g., red.
[00136] As shown, the representative search term result shown in FIG. 4B was part of a group that included 14 identical search term results. Accordingly, upon receiving the indication from the user (at 290), the method removes all 14 results (at 300).
[00137] This process may then be repeated for each group of the plurality of groups until all groups have been evaluated. For any group for which the method receives an indication from a user that the representative search term result is to be removed, all search term results of the corresponding group are then removed. Accordingly, once the method receives the indication (at 290), a representative search term result of a different group may be presented to a user (at 270). In some embodiments, multiple such representative search term results are presented simultaneously, such that a user can continue down a list. This is shown in interface examples discussed in more detail below.
[00138] Figure 5 illustrates the classification of groups of documents based on exemplary search term results in the context of the exemplary method of FIG. 2. As shown, several groups are presented, and for each group, the user can provide an indication that the representative search term should be removed or retained (at 290). The indication for each line item, or search result, may then be shown on the right, while the results themselves may be shaded in e.g., green, red, or a neutral color to further visually represent the result.
[00139] Figures 6A-6B illustrate the presentation of a document component, or snippet, to a user in the context of the exemplary method of FIG. 2. In some embodiments, the user may choose not to classify a representative search result, thereby leaving a box unchecked. Similarly, or alternatively, the user may choose to proactively indicate that a representative search result is ambiguous (320) and that further context is required in order to classify the document. In such a scenario, the method may then present (330) a document component containing the search term result to the user. The component is typically larger than the corresponding search term result, but smaller than the document from which the search term result is drawn. Such a component example is illustrated in FIGS. 3A-3C.
[00140] Upon receiving a further indication from the user that the component is to be removed from the search term results (340), the corresponding representative search term result is removed from the corresponding group of search term results (350). However, because additional context was required for the user to make the determination, the remaining search term results of the group remain in the modified plurality of search term results until they are separately evaluated.
[00141] Accordingly, a different representative search term result is typically defined for the corresponding group (at 260) such that the replacement representative result can be presented to a user (at 270) for evaluation and only the specific search term result removed (at 350) is removed from the modified plurality defined (at 310). The process then repeats until all such results are reviewed. Because the user had previously indicated that the search term result did not provide sufficient context, in some such scenarios, the user would continue to review all search term results of the corresponding group consecutively.
[00142] In some embodiments, where the larger component is identical across multiple search term results, such components may be presented to the user as a group for evaluation in a similar manner to the review of the search term results themselves.
[00143] Figures 7A-7B illustrate the presentation of a full document to a user in the context of the exemplary method of FIG. 2. If, after reviewing a component, the user still cannot make a determination with respect to the relevant search term result, the method may proceed to present the corresponding full document to the user for review. As in the case of the component review discussed above, if a user indicates that a full document is to be removed from the search results, such a decision would remove the corresponding search term result from a group, but would not impact a classification associated with the group as a whole.
[00144] In some embodiments, if the user cannot make a determination based on the component, the user may simply indicate that the document is ambiguous or otherwise quarantine the document rather than classifying it. In this way, any full document review can be deferred to later in the process.
[00145] After evaluating a number of groups of search term results, the method may then return to the plurality of documents initially identified for review. As such, each search term result remaining in the modified plurality of search term results after such processing is traced back to the underlying document from which it was drawn. Each document from which any search result of the modified plurality of search results is drawn is defined (360) as a potentially relevant document.
[00146] In this way, the search term results may be initially reviewed independently, outside of the larger context of the documents from which they are drawn. If the search term results without any broader context are sufficient to determine that a particular result should not be included in the modified plurality of search results, it can thereby be removed.
[00147] Because search results are reviewed by users in the context of search term results, rather than as complete documents, multiple search term results may be drawn from a single document. As such, any single true positive search term result would be sufficient for inclusion of the corresponding document as a potentially relevant document, and any document is defined as potentially relevant to the search query so long as any single search term result associated with the corresponding document remains in the modified plurality of search term results. For example, if a document included three distinct sentences or clauses that appeared as search term results and one of them was indicated by the user for removal, the remaining two search term results drawn from that document would remain in the modified plurality of search term results, and may thereby lead to inclusion of the corresponding document as potentially relevant.
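The rule that a document is potentially relevant so long as any one of its search term results remains can be sketched as a set construction. The `doc_id` field and the function name are assumptions of this sketch.

```python
def potentially_relevant_documents(modified_results):
    """Return the IDs of documents with at least one search term result
    remaining in the modified plurality of search term results."""
    return {r["doc_id"] for r in modified_results}
```

In the three-sentence example above, removing one of the three results leaves two results carrying the same document ID, so that document still appears in the returned set.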
[00148] In some embodiments, the complete process is repeated in sequence for different types of searches, as noted above. Accordingly, once a set of documents is defined as potentially relevant, the method may utilize that set as the identified plurality of documents for review (at 200) for a second search. For example, the first search may be for responsiveness while the second search may be for privilege.
[00149] Figure 8 illustrates the extraction of email addresses from the body of a document in accordance with this disclosure. Figure 9 illustrates the identification of domains associated with email addresses extracted in FIG. 8. Figure 10 illustrates the presentation of domains identified in FIG. 9 to a user. Figure 11 illustrates the population of the presentation of FIG. 10 with additional context.
[00150] In some embodiments, a privilege check is implemented independent of, or in addition to, a main search. In implementing such a privilege check the method may first implement a first pass of filtering the search results to identify a plurality of potentially relevant documents. The method may then proceed to search documents for secondary search terms. As discussed above, the secondary search terms may be different than the primary search terms, and in the example discussed herein, the secondary search terms may comprise names or contact information of relevant parties.
[00151] In such an embodiment, if documents are identified as containing a secondary search term, such a document may be retained for further review in order to determine if the document is privileged and therefore should not be disclosed. The further review may then identify a document that is at least partially privileged or confidential. In some embodiments, or in some scenarios, a document may then be removed from the defined set of potentially relevant documents. In other embodiments or scenarios, the method may proceed to implement a redaction process while retaining the corresponding document.
[00152] As noted above, in some scenarios, the secondary search terms comprise names or contact information. Such names or contact information may be extracted from the plurality of documents for review themselves. For example, the method may further comprise extracting the names or contact information from header or body information of the plurality of documents for review.
[00153] This may be, for example, by extracting email addresses from a load file accompanying a batch, or it may be by extracting email addresses from body text using, for example, regular expressions, email parsing libraries, natural language processing (NLP), machine learning, third-party APIs, or other automated or semi-automated processes. An example of a regular expression designed to extract an email address is shown, for example, in FIG. 8.
[00154] As shown in FIG. 9, the method may then proceed to parse out a domain for each email address extracted from the load file and email body. The method may similarly extract names and roles of such relevant parties. Figures 10 and 11 then illustrate the use of the method to extract additional information related to the parties. Such information may be used by humans, rule-based algorithms, or artificial intelligence algorithms, among other methods, to determine if or confirm that the domain name corresponds to a law firm or any other party that may indicate or imply a privilege claim.
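As a hedged illustration of email extraction followed by domain parsing, the sketch below uses a deliberately simplified pattern. The pattern is an assumption of this sketch and is not the expression shown in FIG. 8; the function names are likewise hypothetical.

```python
import re
from collections import Counter

# Simplified email pattern (an assumption; not the expression of FIG. 8).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return the distinct email addresses found in `text`, lowercased and sorted."""
    return sorted({m.lower() for m in EMAIL_RE.findall(text)})

def domain_counts(emails):
    """Count how many extracted addresses belong to each domain."""
    return Counter(e.split("@", 1)[1] for e in emails)
```

The resulting domain list could then be presented to a user, or compared against an external directory, to decide which domains correspond to law firms.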
[00155] For each name extracted, the retrieval of secondary information may be from a data source external to the plurality of documents, and the secondary information may then be used to determine whether to include the corresponding name or contact information as a secondary search term. The external data source may be the internet or some other external source, such as, e.g., a database or directory of law firms.
[00156] Figure 12 illustrates a visualization of relationships in accordance with this disclosure. In some embodiments, after extracting names or contact information from email header or body information of the plurality of documents for review, the method proceeds to generate a map illustrating communications between parties associated with each name or contact information extracted. Such a map may be based on relationships illustrated in email transmissions, such as email transmissions between parties or that certain parties are copied on the same communications.
[00157] This approach may illustrate relationships between individual people, parties, and entities, and may be based on temporal aspects (such as timing of communications), type and name of documents, type and name of attachments to the document, and entities involved. Upon displaying such a map to the user, the method may receive an indication from the user that at least one name or contact information visualized in the map should be included as a secondary search term.
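One simple way to derive such a map is to count how often two addresses appear on the same message. The sketch below assumes each message is a dictionary with hypothetical `from`, `to`, and `cc` fields; temporal and attachment information is omitted.

```python
from collections import Counter
from itertools import combinations

def communication_map(messages):
    """Count, for each pair of addresses, the messages on which both appear
    (as sender, recipient, or copied party)."""
    edges = Counter()
    for msg in messages:
        parties = {msg["from"], *msg.get("to", []), *msg.get("cc", [])}
        # Sort so that each pair is always recorded in the same order.
        for a, b in combinations(sorted(parties), 2):
            edges[(a, b)] += 1
    return edges
```

The edge counts can serve as weights when the map is drawn, so that frequent correspondents stand out to the user selecting secondary search terms.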
[00158] In some embodiments, authors of privileged content may be further identified using existing named entity recognition engines, such as GPT, Bard, GATE, NLTK, and/or spaCy.
[00159] Figure 13 illustrates lemmatization of document text in accordance with this disclosure. As discussed above, the plurality of documents may be processed (210) prior to executing any search operation. This may include, for example, normalization, sentence tokenization, removal of stop words/punctuation/numbers/emojis and others, lemmatization of terms, or other processing steps. In some embodiments, processing (210) may further include text embedding or transformers.
[00160] In the example shown in FIG. 13, a comparison is presented between stemming and lemmatizing a document. In some existing methodologies, certain words in documents are stemmed so that certain variations of the same word are identified as identical to a particular search term. However, such an approach ignores the morphological root of a word, and may therefore miss certain relevant variations of a term.
[00161] Instead, the method disclosed herein may lemmatize the plurality of documents prior to implementing the search (at 230). Lemmatization includes conversion of each word to its base, or dictionary form. The method pairs this with the modification of all search terms to lemmas. Such an approach considers context and increases accuracy of the search. As one example, when comparing the terms “studying” and “studies,” a stemming approach may search for “study,” thereby ignoring the similar term “studies.” Alternatively, the stemmed term may be “stud,” thereby including additional unrelated false positives. However, lemmatizing both terms would result in the term “study” appearing in both contexts.
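The contrast can be illustrated with a toy example. Both functions below are assumptions made for illustration: the stemmer is a crude suffix stripper, and the lemmatizer is a lookup in a small hypothetical dictionary, whereas a practical system would rely on a linguistic resource.

```python
def naive_stem(word):
    """Crude suffix stripping, for illustration only."""
    word = word.lower()
    for suffix in ("ying", "ies", "ing", "es", "s", "y"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary mapping inflected forms to dictionary forms.
LEMMAS = {"studying": "study", "studies": "study", "studied": "study"}

def lemmatize(word):
    """Return the dictionary form of `word` when known, else the word itself."""
    return LEMMAS.get(word.lower(), word.lower())
```

With this toy stemmer, “studying,” “studies,” and the unrelated word “studs” all reduce to “stud,” which is the false-positive risk noted above. The lookup-based lemmatizer maps “studying” and “studies” to “study” and leaves “studs” unchanged.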
[00162] Figure 14 illustrates the expansion of search terms for use in generating search results in accordance with this disclosure. A search term to be included as part of a search may be, for example, a single word, a continuous series of words, or a plurality of words located within a threshold number of words of each other. In this context, the identification of the initial plurality of search term results potentially responsive to the search query may be based on a Boolean comparison of each of the plurality of primary search terms to each of the plurality of documents.
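A proximity term and a simple Boolean comparison can be sketched as follows. The tokenizer, the function names, and the `all_of`, `any_of`, and `none_of` parameters are assumptions of this sketch rather than the disclosed query syntax.

```python
import re

def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within(text, first, second, max_distance):
    """True if `first` and `second` occur within `max_distance` words of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean comparison: every `all_of` word, at least one `any_of` word
    (when any are given), and no `none_of` word must appear in the text."""
    words = set(tokens(text))
    return (all(t in words for t in all_of)
            and (not any_of or any(t in words for t in any_of))
            and not any(t in words for t in none_of))
```

For example, `within(text, "attorney", "memo", 4)` tests whether the two words fall within four words of each other in either order.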
[00163] In some embodiments, after a user identifies a set of search terms, the method may suggest additional search terms or combinations for use in the search. As shown in FIG. 14, a user may then choose potential synonyms to include in such results.
[00164] Figure 15 illustrates an email legend to be excluded from or automatically classified within search results in accordance with this disclosure. As discussed above, an email legend may be a form paragraph that contains keywords relevant to a search (such as, for example, the term “privileged”), but would not impact the classification of a particular document as privileged. This is because the legend is boilerplate text and is therefore not relevant to the content of the document itself.
[00165] Accordingly, in some embodiments, the method further comprises identifying at least one form paragraph prior to identifying the initial plurality of search results. In some such embodiments, the method then defines a search term result as part of the initial plurality of search term results only upon determining that the corresponding search term result is not drawn from the at least one form paragraph. In this way, the form paragraph would not trigger the inclusion of a document in the search term results, and such language would not lead to review.
[00166] In alternative implementations, after the method identifies at least one form paragraph, the method reviews the initial plurality of search term results prior to displaying the representative search term results to the user. Where the method determines that any particular group was drawn from an identified form paragraph, the method may then preemptively mark the corresponding group as slated for removal. Accordingly, a user may modify the indication, but would not be required to proactively further indicate that a particular representative search term result is to be removed.
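The preemptive marking described above might be sketched as follows. The test used here, that a fragment is drawn from a form paragraph when its whitespace-normalized, lowercased text is a substring of the paragraph, is an assumption of this sketch, as are the names and the `"remove"` label.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def premark_legend_groups(groups, form_paragraphs):
    """Return default decisions marking groups drawn from a form paragraph.

    groups:          mapping of group key -> representative fragment text.
    form_paragraphs: iterable of identified form paragraph (legend) texts.
    Assumption: a fragment is 'drawn from' a form paragraph when its
    normalized text is a substring of the normalized paragraph.
    """
    paragraphs = [normalize(p) for p in form_paragraphs]
    decisions = {}
    for key, fragment in groups.items():
        frag = normalize(fragment)
        if any(frag in p for p in paragraphs):
            decisions[key] = "remove"  # a default only; the user may still override it
    return decisions
```

Groups that do not match any form paragraph receive no default and are presented to the user for evaluation in the ordinary way.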
[00167] Figures 16A-D illustrate a process for classifying email legends in accordance with this disclosure. Such a classification process may be similar to the main method described above for classifying search term results generally. As such, the method may initially utilize keywords, or search terms, expected to appear in a legend or form paragraph. Such keywords may include, for example, “privileged” or “confidential” among others.
[00168] The method may then deduplicate and/or group any form paragraphs identified using such keywords so that each is presented to a user only once, and the user may indicate that a candidate form paragraph is, indeed, a legend, or that it should otherwise be excluded from search results.
[00169] Accordingly, one example of each potential legend is presented to the user such that the user can indicate that a legend is verified, rejected, or that it should be quarantined for further review. Such classification may be resilient across batches, such that a single database may be utilized across multiple search processes.
[00170] Figures 17A-17B illustrate the automatic classification of legends in accordance with this disclosure. As shown, if a search result during a later search is drawn from a verified legend, the corresponding group may be preemptively marked for removal from the initial plurality of search term results. Alternatively, the corresponding group may not be presented to the user at all.
[00171] Figures 18A-18B illustrate an interface for bulk classification of documents in accordance with this disclosure. As discussed above, a user may be presented with a large number of groups of search term results in a single interface in which all such results may be quickly classified.
[00172] Figures 19A-19B illustrate the grouping and categorization of search results by subject matter in accordance with this disclosure. As shown, in some embodiments, where a search term result might otherwise be ambiguous, it may be presented to a user grouped with similarly categorized search term results which may then be classified as a group.
[00173] The term “data processing apparatus” and like terms encompass all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
[00174] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
[00175] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
[00176] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[00177] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Computers may further be provided in other forms, such as in the form of handheld devices or smartphones, as well as in the form of tablet devices. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s client device in response to requests received from the web browser. A device for providing interaction with a user may be referred to as a user interface device.
[00178] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
[00179] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
[00180] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[00181] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[00182] While the present invention has been described at some length and with some particularity with respect to the several described embodiments, it is not intended that it should be limited to any such particulars or embodiments or any particular embodiment, but it is to be construed with reference to the appended claims so as to provide the broadest possible interpretation of such claims in view of the prior art and, therefore, to effectively encompass the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the invention, not presently foreseen, may nonetheless represent equivalents thereto.

Claims

What is claimed is:
1. A computer-based method of filtering search results, the method comprising:
   identifying a plurality of documents for review;
   identifying an initial plurality of search term results potentially relevant to a search query, each search term result of the initial plurality containing an occurrence of a primary search term of a plurality of primary search terms and each search term result of the plurality of search term results being drawn from one of the plurality of documents for review, and wherein each search term result of the plurality of search term results is smaller than the one of the plurality of documents from which it is drawn;
   grouping search term results of the initial plurality of search term results that are identical to each other to define a plurality of groups of search term results, such that a first group of search term results of the plurality of groups of search term results comprises search term results different than those of a second group of search term results of the plurality of groups of search term results;
   displaying a representative search term result of the first group of search term results for evaluation;
   upon receiving an indication from a user that the representative search term result is to be removed, removing all search term results of the first group from the initial plurality of search term results to define a modified plurality of search term results;
   defining each document from which any search result of the modified plurality of search term results is drawn as a potentially relevant document.
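For illustration only, the grouping and removal steps recited in claim 1 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `SearchTermResult`, `group_identical`, `filter_results`, and `potentially_relevant_documents` are hypothetical, identity between results is assumed to mean exact equality of snippet text, and the user's indication is modeled as a callback.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchTermResult:
    """Hypothetical record for one search term result (a snippet)."""
    doc_id: str  # identifier of the document the snippet was drawn from
    text: str    # snippet text containing a primary search term


def group_identical(results):
    """Group results whose snippet text is exactly identical."""
    groups = defaultdict(list)
    for result in results:
        groups[result.text].append(result)
    return dict(groups)


def filter_results(results, should_remove):
    """Show one representative per group to `should_remove` (a stand-in for
    the user's indication) and drop every member of a removed group."""
    kept = []
    for members in group_identical(results).values():
        representative = members[0]
        if not should_remove(representative):
            kept.extend(members)
    return kept


def potentially_relevant_documents(kept_results):
    """A document stays potentially relevant while any of its results remain."""
    return {result.doc_id for result in kept_results}
```

In this sketch a single decision on a representative removes every identical result at once, and a document drops out only when none of its results survive, which matches the behavior described in claim 5.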
2. The method of claim 1 wherein each search term result of the plurality of search term results is larger than the search term, such that search term results are defined as identical only if contents of the search term results other than the search term are identical.
3. The method of claim 2 wherein each search term result is either a sentence containing the corresponding search term or is the corresponding search term combined with a previously defined number of leading or following characters.
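As a hedged illustration of the two result shapes named in claim 3, the following Python sketch extracts either the sentence containing a search term or the term together with a fixed number of leading and following characters. The function names, the default window of 20 characters, the case-insensitive matching, and the naive sentence splitter are all assumptions made for this example.

```python
import re


def window_results(text, term, n_chars=20):
    """Return each occurrence of `term` with up to `n_chars` characters of
    leading and following context (case-insensitive match, an assumption)."""
    results = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - n_chars)
        end = min(len(text), match.end() + n_chars)
        results.append(text[start:end])
    return results


def sentence_results(text, term):
    """Return each sentence containing `term`. Sentences are split naively
    after '.', '!' or '?' followed by whitespace, which is an assumption."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if term.lower() in s.lower()]
```

Either function yields results that are larger than the search term but smaller than the source document, so identical surrounding context can be grouped as described in claim 2.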
4. The method of claim 2 wherein, upon receiving an indication from the user that the representative search term result is ambiguous, displaying a document component containing the search term result to the user, wherein the component is larger than the corresponding search term result but smaller than the document from which the search term result is drawn, and wherein, upon receiving a further indication from the user that the snippet is to be removed, removing the corresponding representative search term result from the at least one group of search term results and defining a different search term result from the group of search term results as the representative search term result for display to the user.
5. The method of claim 1, wherein multiple search term results are drawn from a single document of the plurality of documents for review, such that the document is defined as potentially relevant to the search query so long as any single search term result associated with the corresponding document remains in the modified plurality of search term results.
6. The method of claim 1 further comprising displaying a representative search term result for each group of the plurality of groups of search term results and receiving an indication from the user that the corresponding representative search term result is relevant to the search query or that the corresponding representative search term result is to be removed.
7. The method of claim 1 further comprising displaying a representative search term result for each group of the plurality of groups of search term results and receiving an indication from the user that the corresponding representative search term result is relevant to the search query, that the corresponding representative search term result is to be removed, or that the user cannot make a determination based on the representative search term result.
8. The method of claim 1 further comprising identifying at least one form paragraph prior to identifying the initial plurality of search term results and defining a search term result as part of the initial plurality of search term results only upon determining that the corresponding search term result is not drawn from the at least one form paragraph.
9. The method of claim 1 further comprising identifying at least one form paragraph prior to identifying the initial plurality of search term results and preemptively indicating to a user that a particular representative search term result presented for evaluation is drawn from the at least one form paragraph.
10. The method of claim 1 further comprising displaying each potentially relevant document to the user for further review and receiving an indication from the user that the corresponding document is either relevant or irrelevant.
11. The method of claim 1 further comprising, upon defining a group of potentially relevant documents: identifying at least one document of the group of potentially relevant documents as containing a secondary search term, and retaining the at least one document for further review.
12. The method of claim 10 wherein the further review identifies the document as at least partially privileged or confidential and implements a redaction process.
13. The method of claim 10 wherein the secondary search terms comprise names or contact information extracted from the plurality of documents for review.
14. The method of claim 13 further comprising: extracting the names or contact information from header information of the plurality of documents for review; for each name extracted, retrieving secondary information from a data source external to the plurality of documents; determining, based on the secondary information, whether to include the corresponding name or contact information as a secondary search term.
15. The method of claim 14, wherein the data source external to the plurality of documents is the internet or an external database.
16. The method of claim 13, wherein the names or contact information are extracted from a load file accompanying batches of documents or wherein the names or contact information are extracted from body text using Regular Expressions, Email Parsing Libraries, Natural Language Processing (NLP), Machine Learning, or third-party APIs.
17. The method of claim 13 further comprising: extracting names or contact information from email header information of the plurality of documents for review; generating a map illustrating communications between parties associated with each name or contact information extracted, the map based on indications that emails were transmitted between those parties or that those parties were copied on the same communications; displaying the map to the user; and receiving an indication from the user that at least one name or contact information visualized in the map should be included as a secondary search term.
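One possible way to build the data behind the communications map of claim 17 is sketched below using only the Python standard library. It is an assumption-laden example: the function name `communication_map` is hypothetical, input is assumed to be raw RFC 5322 message text, and the map is reduced to counts of sender-to-recipient edges over the To and Cc headers. Links between parties copied on the same message could be added in the same way; rendering the map for the user is out of scope here.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def communication_map(raw_emails):
    """Count directed (sender, recipient) edges from email headers.

    Recipients are taken from the To and Cc headers; addresses are
    lower-cased so the same party is not counted under two spellings.
    """
    edges = Counter()
    for raw in raw_emails:
        msg = message_from_string(raw)
        senders = [addr.lower()
                   for _, addr in getaddresses(msg.get_all("From", []))]
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr.lower()
                      for _, addr in getaddresses(recipient_headers)]
        for sender in senders:
            for recipient in recipients:
                edges[(sender, recipient)] += 1
    return edges
```

The resulting counter can be passed to any graph or plotting tool, and the addresses on heavily used edges are natural candidates to offer the user as secondary search terms.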
18. The method of claim 1 further comprising processing the documents of the plurality of documents for review prior to identifying the initial plurality of search term results, wherein processing includes lemmatization of the corresponding documents, and wherein at least one primary search term is a lemma.
19. The method of claim 1 further comprising processing the documents of the plurality of documents for review prior to identifying the initial plurality of search term results, wherein processing includes normalization, sentence tokenization, or removal of stop words.
20. The method of claim 1 wherein each primary search term of the plurality of primary search terms is one of a single word, a continuous series of words, or a plurality of words located within a threshold number of words of each other.
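The third form of primary search term in claim 20, a plurality of words located within a threshold number of words of each other, can be illustrated with a short Python sketch. The function name `within_threshold` is hypothetical, tokenization by `\w+` is an assumption, and "within a threshold" is interpreted here as the distance between the first and last matched word positions being at most the threshold.

```python
import re
from itertools import product


def within_threshold(text, words, threshold):
    """Return True if every word in `words` occurs in `text` such that the
    span between the earliest and latest chosen occurrence is at most
    `threshold` token positions (an assumed reading of 'within')."""
    tokens = re.findall(r"\w+", text.lower())
    positions = [
        [i for i, token in enumerate(tokens) if token == word.lower()]
        for word in words
    ]
    if any(not p for p in positions):
        return False  # at least one word never occurs
    # Try every combination of one occurrence per word.
    return any(max(combo) - min(combo) <= threshold
               for combo in product(*positions))
```

The exhaustive combination check is adequate for short snippets; a sliding window would scale better on long documents.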
21. The method of claim 1 further comprising the generation of additional search term results from the initial plurality of search term results via human or artificial intelligence supplementation.
22. The method of claim 1 wherein the identification of the initial plurality of search term results potentially responsive to the search query is based on a Boolean comparison of each of the plurality of primary search terms to each of the plurality of documents.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263352733P 2022-06-16 2022-06-16
US63/352,733 2022-06-16

Publications (1)

Publication Number Publication Date
WO2023244505A1 (en) 2023-12-21

Family

ID=89191801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/024931 WO2023244505A1 (en) 2022-06-16 2023-06-09 Method for filtering search results based on search terms in context

Country Status (1)

Country Link
WO (1) WO2023244505A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312725A1 (en) * 2009-06-08 2010-12-09 Xerox Corporation System and method for assisted document review
US20130179405A1 (en) * 2006-11-28 2013-07-11 Commvault Systems, Inc. Systems and methods for creating copies of data, such as archive copies
US20140201203A1 (en) * 2013-01-15 2014-07-17 Prafulla Krishna System, method and device for providing an automated electronic researcher
US20140317147A1 (en) * 2013-04-22 2014-10-23 Jianqing Wu Method for Improving Document Review Performance
US20150074102A1 (en) * 2005-12-05 2015-03-12 Collarity, Inc. Generation of refinement terms for search queries
US20210004416A1 (en) * 2019-07-02 2021-01-07 Microsoft Technology Licensing, Llc Extracting key phrase candidates from documents and producing topical authority ranking

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23824450

Country of ref document: EP

Kind code of ref document: A1