EP2601573A1 - Method and system for integrating web-based systems with local document processing applications

Method and system for integrating web-based systems with local document processing applications

Info

Publication number
EP2601573A1
Authority
EP
European Patent Office
Prior art keywords
program code
computer program
entity
document
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP11823869.0A
Other languages
German (de)
English (en)
Other versions
EP2601573A4 (fr)
Inventor
Marc Light
Joel Hurwitz
Khalid Al-Kofahi
Craig Larson
Kevin Koch
David Demoss
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Reuters Global Resources ULC
Original Assignee
Thomson Reuters Global Resources ULC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/806,116 (US9501467B2)
Priority claimed from US12/806,119 (US11386510B2)
Application filed by Thomson Reuters Global Resources ULC
Publication of EP2601573A1
Publication of EP2601573A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation

Definitions

  • the present invention relates generally to natural language processing, information retrieval and more particularly to determining relevancy of terms within documents in the context of searching for authority, such as legal authority, and in facilitating the generation of documents, such as legal briefs.
  • the invention relates to determining how relevant or important terms or aspects are to documents, and in particular to the content of those documents.
  • the invention relates to processes, software and systems for use in delivery of services related to the legal, corporate, and other professional sectors and more particularly delivery of such services in connection with a subscriber's work function, e.g., preparing documents in a word processing environment and application.
  • the invention relates to a system that presents searching functions to users, such as subscribers to a professional services related service, processes search terms and applies search syntax across document databases, and displays search results generated in response to the search function and processing.
  • ISP Information Service Provider
  • websites such as over the Internet
  • the user's system can interface with an ISP's service to check proper citation form and to check on the status of the relied on authority to confirm that the statute has not been revised or repealed or that a case has not been reversed or otherwise called into question.
  • Systems may include an applet or application executing locally on the user's computer that interfaces with the ISP network-based system.
  • an application runs locally at a user's computer or access device that is operating a word processor application and automatically, such as by a user manipulating via a user interface screen, accesses the ISP service over a network connection, e.g., the Internet.
  • the ISP then applies one or more search engines across one or more databases to retrieve documents in response to terms identified in the user-created document or user defined queries or search terms.
  • the search engine(s) compare the terms that appear in the document (e.g., "summary judgment") to arrive at a set of one or more documents within a database or network of databases for presenting to the user.
  • the system may also perform a series of enhanced functions to rank or otherwise score or present the documents to the user.
  • the system may use functions such as Term Frequency-Inverse Document Frequency (TFIDF) in comparing terms appearing in a document against a collection of documents.
  • TFIDF Term Frequency-Inverse Document Frequency
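The TFIDF comparison mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a raw relative term frequency, and the plain logarithmic IDF variant; the function names `tokenize` and `tfidf_scores` are hypothetical, and the patent text does not say which TFIDF weighting the system actually uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_scores(document, collection):
    """Return {term: tf-idf} for `document` scored against `collection`.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(N / df), with df = number of collection documents containing the term
    Assumption: a term absent from the whole collection is treated as df = 1.
    """
    if not collection:
        raise ValueError("collection must contain at least one document")
    tokens = tokenize(document)
    counts = Counter(tokens)
    doc_sets = [set(tokenize(d)) for d in collection]
    scores = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for words in doc_sets if term in words)
        idf = math.log(len(collection) / max(df, 1))
        scores[term] = tf * idf
    return scores
```

Under these assumptions, a term such as "judgment" that is frequent in the working document but rare in the collection scores higher than a term that appears in most collection documents.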
  • Although ISP search engines can be used to search for non-standard terms and strings, a single-layer search is often ineffective, or at least less effective, when dealing with such terms because the engines are limited to case law. For instance, existing ISP search engines are likely to find zero or few relevant cases on an issue represented in non-standard form or terms.
  • the known systems suffer from the disadvantage of being less effective when dealing with uncommon or non-standard terms or expressions and, therefore, fail to identify and present documents, e.g., case law, that would otherwise be helpful and of interest as being related to the uncommon or non-standard terms used by the user in the user-created document.
  • In addition to testimony, the attorney must consider and attempt to identify, collect and incorporate into the witness outline a vast collection of pleadings, documents, exhibits, etc., for planning and for fast and effective reference and possible display at and during trial. For instance, where an attorney is questioning a witness at trial it is a recognized need to be able to reference the past testimony of the witness and others to good effect and to quickly locate and present, such as by overhead projector, video screen, Elmo and other means, documents as exhibits to assist in the questioning and presentation of evidence to a jury or other fact-finder.
  • West provides a service called LiveNote that allows users to: receive a live feed of a transcript, audio and video directly on the attorney's or user's computer; stream a live transcript, audio and video feed off-site to remote participants; effectively manage transcripts and related evidence in a case; perform sophisticated full-text searches across transcripts in a case to quickly retrieve critical testimony; highlight, annotate and analyze all transcripts; view hyperlinked exhibits; create dynamic reports on keywords, issues, annotations and exhibit lists that will automatically update as a case evolves; quickly prepare PowerPoint slides of transcript text synced with video to present at trial, hearings, or meetings; share cases over a network so multiple team members can work simultaneously, or save a project locally and synchronize the work to the network case at a later time; control a deposition or hearing by integrating innovative technology with realtime resources; and collaborate online swiftly, efficiently and securely at various locations.
  • West LiveNote may also be used in an online fashion, e.g., LiveNote Web, to provide users additional access and functionality.
  • Remote Access Server is an additional online type service similar to LiveNote Web.
  • LiveNote Web and RAS as well as other such systems, allow users with subscriptions to login to a case over the World Wide Web. After logging in, users may download case information, including transcripts and documents, to their computers and work from a web-based or local application, such as West LiveNote.
  • the present inventors recognized a need to provide information consumers relational and event information about entities, such as companies, persons, cities, that are mentioned in electronic documents.
  • documents such as news feeds, SEC (Securities and Exchange Commission) filings or scientific articles, may indicate that Company A merged with Company B, that Lawyer C moved to Firm D, or that the interaction of protein E with protein F produces result G.
  • the present inventors devised, among other things, systems, methods, and software that allow users to readily access additional informational resources, such as online legal research tools, while using other applications, such as word processors.
  • the invention is directed to providing a seamless user experience in connecting functions between word processing applications and ISP searching and research services.
  • the invention provides an additional layer of searching over the prior art and an enhanced searching capability to ascertain and present documents responsive to text or terms appearing in a user's working document that may not match perfectly or neatly in the manner generally presented in relevant case law, statutes and the like. Often the situation arises where a user uses loose terms or expression or may not know the exact term of art or phrase or legal standard that applies in researching or writing about a particular issue.
  • the invention may be used in connection with searching based on known terms but is particularly powerful when a user uses terms not traditionally used in connection with an issue or a subject, e.g., "everyone agrees to the underlying events" as opposed to "no genuine issue of material fact" in the context of summary judgment proceedings.
  • the invention provides the enhanced feature of searching not only primary sources, e.g., case law and statute databases, but also searches secondary sources of collections or sets of referencing texts to identify and present case law relevant to an issue being researched.
  • Reference Text documents included in Reference Text Collections or Sets (RTC or RTS), e.g., ALR, are documents that are not part of the body of law or direct legal authority but that do cite to case law, statutes, regulations and other legal authorities.
  • the invention processes the search criteria to yield a responsive set of referencing text documents from the RTC based on a user search request or query, such as may be highlighted or otherwise derived from a working document operating in a word processor application by the user.
  • the responsive set of referencing text documents is identified by matching search terms or criteria with text appearing in the referencing text documents that is associated with case law cited in the referencing text documents.
  • the system identifies those citations related to the highlighted or search terms found in the referencing text documents to yield a set of "referencing text results", which is a set of case law cited in the referencing text documents.
  • the invention generates a set of search results comprised of two sets of case law for presenting to a user on a subject of interest.
  • the first set of case law is generated by performing the search on the primary case law database and the second set of case law is generated based on the citations contained in the set of referencing text documents that relates to the user search request.
  • the invention provides a seamless integration of searching functions and database resources from the word processor environment that includes not only primary case law but also secondary sources of non-case law.
  • the invention, when searching from the word processing environment for terms or a highlighted statement contained within a working document, provides an additional layer of searching beyond traditional ISP systems and an enhanced way of searching for responsive legal authority based on terms that are not traditionally used and that appear in secondary sources, e.g., ALR.
  • the system provides searching in both the primary and the secondary sources and presents responsive case law from the primary source and case law that is cited in responsive referencing text documents.
  • the system may rank, together or separately, the two sets of case law, the primary search results from the primary database of case law and the set of referencing text results.
  • the system may also reduce, such as through a de-duplication process, the set of search results or the component search results.
  • the system may display to the user the respective responsive search results either combined or separated.
  • the set of search results is then available for user examination and may be incorporated into the working document.
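The two-layer search described above (a primary case law search plus cases cited in responsive referencing texts, followed by de-duplication and ranking) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: relevance is a toy word-overlap count, and `ReferenceText`, `overlap`, `primary_search`, `referencing_text_search` and `combined_results` are hypothetical names, as are the case identifiers in the usage note.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceText:
    """A secondary-source document together with the case ids it cites."""
    text: str
    cited_cases: list = field(default_factory=list)


def overlap(query, text):
    """Count distinct query words that also occur in `text` (toy relevance score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def primary_search(query, case_db):
    """Layer 1: score each case in the primary database directly against the query."""
    results = {}
    for case_id, case_text in case_db.items():
        score = overlap(query, case_text)
        if score > 0:
            results[case_id] = score
    return results


def referencing_text_search(query, reference_texts):
    """Layer 2: match the query against secondary sources, then collect cited cases."""
    results = {}
    for ref in reference_texts:
        score = overlap(query, ref.text)
        if score > 0:
            for case_id in ref.cited_cases:
                results[case_id] = max(results.get(case_id, 0), score)
    return results


def combined_results(query, case_db, reference_texts):
    """Merge both result sets, de-duplicate by case id, rank by best score."""
    merged = dict(primary_search(query, case_db))
    for case_id, score in referencing_text_search(query, reference_texts).items():
        merged[case_id] = max(merged.get(case_id, 0), score)
    return sorted(merged, key=lambda case_id: (-merged[case_id], case_id))
```

With a loosely worded query such as "everyone agrees to the underlying events", the primary layer may return nothing, while a referencing text that uses similar wording still surfaces the cases it cites.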
  • One exemplary computer-implemented system provides an add-on software framework that integrates into a host word processing application on a client access device.
  • the invention provides a web-based control of or interaction with desktop applications.
  • the add-on software framework allows users to select from one or more web applications on a web server, with each of the web applications capable of controlling operation of the host word processing application (via appropriate APIs and an embedded browser control within the framework).
  • the web applications facilitate access to information from the information-retrieval services and incorporation of the information in the document or in metadata associated with the document.
  • the invention further provides an enhanced experience by providing a system that automatically or semi-automatically derives information associated with user documents in a word processing environment, not only to access an ISP collection of search tools and documents but also to utilize secondary sources of documents, e.g., ALR, AmJur, Headnotes, law review articles, in confirming legal authority and in presenting argument in work product, such as legal briefs and decisions.
  • the invention provides a computer useable medium having a set of executable code for enabling electronic communications between a word processing program of a client access device and an information services provider system (ISP).
  • the set of executable code comprises the following sets of computer program code executable by the processor.
  • a first set of computer program code for operatively connecting to the word processing program.
  • a second set of computer program code for operatively connecting to the information services provider system.
  • a third set of computer program code for accepting a user search request initiated by a user of the word processing program.
  • a fourth set of computer program code for transmitting the user search request to the information services provider system.
  • a fifth set of computer program code for receiving a set of search results, the set of search results comprising a set of referencing text results.
  • a sixth set of computer program code for displaying within the word processing program at least a portion of the set of referencing text results.
  • the third set of computer program code may comprise code for identifying a highlighted portion of text within the word processing program.
  • the word processing program may be either Microsoft Word or Corel WordPerfect.
  • the set of referencing text results preferably comprises case law and the set of search results comprises a primary set of case law results derived from an ISP case law database.
  • the computer useable medium may further comprise a computer program code for combining the set of referencing text results and the primary set of case law results.
  • the computer useable medium may comprise a memory within the information services provider system and further comprise a seventh set of computer program code for receiving from the ISP the first set of computer program code, the second set of computer program code, the third set of computer program code, the fourth set of computer program code, the fifth set of computer program code, and the sixth set of computer program code at the client access device; and an eighth set of computer program code for installing at the client access device the first set of computer program code, the second set of computer program code, the third set of computer program code, the fourth set of computer program code, the fifth set of computer program code, and the sixth set of computer program code on the client access device.
  • the invention provides a computer-implemented method for enabling electronic communications between a word processing program operating on a client access device and a computer-based information services provider system (ISP).
  • the method comprises the steps of: operatively connecting to a word processing program operating on a client access device; operatively connecting to an ISP; accepting a user search request initiated by a user of the word processing program;
  • the step of accepting a user search request may comprise identifying a highlighted portion of text within a document associated with the word processing program.
  • the method may further comprise receiving from the ISP a set of computer program code at the client access device, the set of computer program code adapted to execute on the client access device to perform in whole or in part the steps of (a)-(f); and installing the set of computer program code on the client access device.
  • the invention provides a client access device, such as a computer.
  • the device includes: a processor adapted to execute code; a memory for storing executable code; a word processing program executed by the processor; means for establishing electronic communications with an information services provider system (ISP) having a first database having a primary set of documents; a first set of computer program code for operatively connecting to the word processing program; a second set of computer program code for operatively connecting to the information services provider system; a third set of computer program code for accepting a user search request initiated by a user of the word processing program; a fourth set of computer program code for transmitting the user search request to the information services provider system; a fifth set of computer program code for receiving a set of search results, the set of search results comprising a set of referencing text results; and a sixth set of computer program code for receiving for display within a user interface of the word processing program at least a portion of the set of referencing text results.
  • ISP information services provider system
  • the device displays within a user interface of the word processing program at least a subset of the primary set of documents and at least a portion of the set of referencing text results. Moreover, at least a portion of one or both of the primary set of documents and the set of referencing text results may be ranked with respect to relevancy to data associated with the user search request. Also, the referencing text results may comprise case law derived from case citations contained in non-case law referencing text documents identified in a database other than the first database.
  • the present invention provides a network-based, computer-implemented information services provider system (ISP) having a set of executable code for enabling data exchange with a word processing program remotely operating on a client access device, the system comprising: a processor adapted to execute code; a memory for storing executable code; a first database accessible by the processor and having stored therein a primary set of documents; a first set of computer program code adapted to operatively connect to the word processing program; a second set of computer program code adapted to receive search data associated with a user search request initiated by a user of the word processing program; a third set of computer program code adapted to generate a set of search results, the set of search results comprising a set of primary search results from the first database and a set of referencing text results derived from a database other than the first database; and a fourth set of computer program code adapted to transmit for display within a user interface of the word processing program at least a portion of the set of search results including at least a
  • the present inventors further devised, among other things, systems and methods for named-entity tagging, resolving and event and relationship extraction.
  • This further present invention addresses the above-discussed needs as well as others by incorporating, linking or otherwise accessing the vast amounts of documents, testimony and data collected over the course of a litigation or other proceeding as well as harnessing the research resources of an ISP for use in outlining and presenting and eliciting testimony and evidence, such as at trial.
  • a legal professional may utilize a word processor application or component and highlight, tag, insert links or references to video, insert links or references to documents, insert links or references to case law, briefs or pleadings, etc., in preparing such documents.
  • ISPs may provide an applet or application executing locally on the user's computer that interfaces with the ISP network-based system and that may be used separate and standalone.
  • a legal team may have onsite a database(s) of documents, testimony, videotape, exhibits, etc., in electronic form.
  • the team may have one or more computers connected to display technology to present information, documents, videotape, etc., accessible from the database.
  • the present invention provides an Outline feature for use in a computer/software-based Litigation Support System ("LSS"), such as Thomson Reuters Corporation's West LiveNote and West Case Notebook software-based products.
  • the outline feature operates within the LSS to allow users to make outlines of cases and to perform other enhanced functions.
  • West Case Notebook is a software program that helps attorneys keep all case-related documents in one place while they perform all the necessary parts of litigation.
  • West Case Notebook easily integrates with Westlaw. Any research done on Westlaw® can be moved into a Case Notebook file, where users can annotate, search and report on the research and other documents.
  • West Case Notebook provides the following user enhancements: organize case documents, pleadings, legal research and information about "characters", i.e., individuals or organizations connected to the case; classify case documents, research and information by annotating notes and pre-defined, color-coded issues; export Westlaw research with comments, issue tags, KeyCite status and live links directly into a Case Notebook file; receive realtime feed at depositions or court and leave with a usable electronic transcript saved into a legal team's case file; locate information quickly with summary reports on specific issues or data, and with flexible full text searching targeted to particular data sets such as specific transcripts or documents; organize sub-sets of documents and information using data groups; and remote access to case file.
  • West Case Notebook organizes all essential case information in a centralized electronic database. This allows a legal team to enter and share key facts, documents, main characters, evidence, pleadings, legal research and more. Case Notebook users are able to easily search for and find "characters", i.e., the names of major participants in cases or people involved in cases, and associated information, e.g., "character information." These "characters" may be directly input into the system or may be derived or "found" by the system in processing documents such as transcripts, case law, etc. The system "tags" or "pins" or otherwise associates references with the characters and provides tools that allow users to research the names or "characters" for a variety of purposes.
  • the system of the further present invention creates and inserts "Character Smart Tags" or "Smart Tags" for associating characters with documents, exhibits, testimony, outline information, etc., e.g., metadata.
  • the names of characters input into or found by the system such as appearing in transcripts, documents, and pleadings, are marked, such as by underlining, highlighting, etc., for perception and action by the user. For instance, a user right-clicking an underlined name will open a context menu.
  • the underlines are referred to as Character Smart Tags or simply Smart Tags.
  • the term "document" should be given a broad meaning to include all of the above mentioned items in whatever form and including
  • the further present invention provides character maintenance functionality based on software or program code (Entity Maintenance Module - EMM) that, in one implementation, is embedded in an LSS, e.g., West LiveNote Case Notebook, and will recognize the names of people (referred to as characters) involved in a specific case.
  • the character maintenance or EMM aspect of the LSS will search for names in the properties of documents, pleadings, and transcripts. It will search the text of transcripts and perform a character recognition process, such as by use of Adobe Acrobat or similar technology, to "OCR" the documents and pleadings, and list the primary name in, for example, a Character Display pane.
  • EMM working within an LSS, e.g., West LiveNote Case Notebook, will underline the primary names and their variants (referred to as aliases). Users will be able to access Smart Tag context menus for more information about the character, including data on Westlaw. Users will also have the option to turn off automated Character Smart Tag creation and create Smart Tags manually.
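As a rough illustration of how primary names and their aliases might be located for underlining, the sketch below returns character offsets for each mention. It assumes case-insensitive whole-word matching with longer names taking precedence; `find_smart_tags` and the sample names are hypothetical and are not taken from the EMM or West products.

```python
import re


def find_smart_tags(text, characters):
    """Return sorted (start, end, primary_name) spans for each character mention.

    `characters` maps a primary name to a list of aliases. Longer names are
    matched first so a full name is not broken up by a shorter alias, and
    overlapping matches are skipped.
    """
    names = []
    for primary, aliases in characters.items():
        for name in [primary, *aliases]:
            names.append((name, primary))
    names.sort(key=lambda pair: len(pair[0]), reverse=True)

    spans = []
    taken = []  # character ranges that already carry a tag
    for name, primary in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            free = all(match.end() <= s or match.start() >= e for s, e in taken)
            if free:
                taken.append((match.start(), match.end()))
                spans.append((match.start(), match.end(), primary))
    return sorted(spans)
```

Each returned span could then drive the underline and the context menu for that character; turning automated tagging off would simply mean not calling such a function and letting the user supply spans manually.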
  • the system may use any of a variety of XML-based rules or constructs or other suitable schemas or formats in encoding documents or files.
  • the LSS may be integrated with or incorporate other services to enhance and leverage reporting and legal videography litigation functions.
  • West Case LiveNote is the legal industry's benchmark for transcript and evidence management and may be used in conjunction with reporting services, such as West Court Reporting Services.
  • Such integrated systems may include or interface with word processing or other software for text editing.
  • the invention allows users to insert copied text from transcripts, copied text from documents and pleadings, annotation text, questions and answers from transcripts, and electronic outlines.
  • the outline feature may be implemented as a software-based add-on to an existing subscription-based service or product.
  • a "Transcript Summary" feature may be an add-on for Case Notebook subscribers that allows users to type summaries for specific lines of transcripts.
  • An exemplary system includes an entity tagger, an entity resolver, a text segment classifier, and a relationship extractor.
  • the entity tagger receives an input text segment, and tags named entities within the segment as being a person, company, or place.
  • the entity resolver accesses authority files, and associates the persons and companies named in the text segment with specific entries in the authority files.
  • the text segment classifier determines whether the entity tagged and resolved text segment includes a relationship event, such as job-change event or merger and acquisition.
  • the relationship extractor determines the role of named entities in the text segment within the event. For example, the extractor determines for a merger and acquisition event, which named company was the acquirer and which was acquired.
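The four components above (entity tagger, entity resolver, text segment classifier, relationship extractor) can be pictured as a small pipeline. The sketch below is rule-based and purely illustrative: the authority file `AUTHORITY`, the company names, the keyword classifier and the "entity before the verb is the acquirer" heuristic are all assumptions, and a real system would more likely use trained models.

```python
import re

# Hypothetical authority file mapping company names to record ids.
AUTHORITY = {"Acme Corp": "ORG-001", "Widget Inc": "ORG-002"}


def tag_entities(segment, known_names):
    """Entity tagger: mark known company names found in the segment."""
    return [(m.start(), name, "company")
            for name in known_names
            for m in re.finditer(re.escape(name), segment)]


def resolve_entities(tagged, authority):
    """Entity resolver: attach the authority-file id to each tagged entity."""
    return [(pos, name, kind, authority.get(name)) for pos, name, kind in tagged]


def classify_segment(segment):
    """Text segment classifier: flag a merger-and-acquisition event by keyword."""
    if re.search(r"\b(acquired|merged with)\b", segment):
        return "merger_acquisition"
    return None


def extract_roles(segment, resolved):
    """Relationship extractor for active-voice "X acquired Y" sentences only."""
    verb = re.search(r"\bacquired\b", segment)
    if verb is None:
        return None
    before = [e for e in resolved if e[0] < verb.start()]
    after = [e for e in resolved if e[0] > verb.end()]
    if not before or not after:
        return None
    acquirer = max(before, key=lambda e: e[0])  # closest entity before the verb
    acquired = min(after, key=lambda e: e[0])   # closest entity after the verb
    return {"acquirer": acquirer[3], "acquired": acquired[3]}
```

The positional heuristic would assign the wrong roles to passive constructions such as "X was acquired by Y", which is one reason the role-assignment step is treated as a separate component.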
  • the further present invention provides a computer-implemented method comprising: accessing a preexisting entity list; analyzing a first document to detect an entity, the entity comprising a person, place, or organization, the first document being associated with a current legal event; resolving the entity with the preexisting entity list and: if the entity is not present in the preexisting entity list, adding the entity to the preexisting entity list and generating a first set of relationship data associated with the relationship between the first document and the entity; or if the entity is present in the preexisting entity list, generating a first set of relationship data associated with a relationship between the first document and the entity; repeating the resolving step for each distinct entity detected in the first document; and storing the first set of relationship data.
  • the method is further characterized in that the detected entity is one of the group consisting of attorney names, judge names, courts, names of parties to a lawsuit, expert names, witness names, and law firm names.
  • the method is further characterized in that the first set of relationship data includes a first set of location data representing one or more locations in the first document in which the entity appears.
  • the further present invention provides a computer-implemented method comprising: accessing a preexisting entity list; analyzing a first document to detect an entity, the entity comprising a person, place, or organization, the first document being associated with a current legal event; resolving the detected entity with the preexisting entity list and, if the detected entity is not present in the preexisting entity list, generating a list of new entities; generating respective sets of relationship data representing a relationship between the first document and each respective detected entity; repeating the resolving step for each distinct entity detected in the first document and adding each distinct entity not present in the preexisting entity list to the list of new entities; and storing the respective sets of relationship data.
  • the method further characterized by displaying a user interface adapted to allow a user to select and/or deselect one or more of the new entities.
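The resolving steps recited above can be sketched in a few lines. The sketch assumes that entity detection has already produced a list of names, that the preexisting entity list is an in-memory set, and that location data means character offsets; `resolve_document` and the sample names are hypothetical.

```python
def resolve_document(doc_id, text, detected_entities, entity_list):
    """Resolve detected entities against `entity_list` (a set, updated in place).

    Returns (relationship_data, new_entities). relationship_data maps each
    distinct entity to the document id and the character offsets at which the
    entity appears; new_entities lists entities that were not yet in the list.
    """
    relationship_data = {}
    new_entities = []
    for entity in dict.fromkeys(detected_entities):  # distinct, order preserved
        if entity not in entity_list:
            entity_list.add(entity)
            new_entities.append(entity)
        locations = []
        start = text.find(entity)
        while start != -1:
            locations.append(start)
            start = text.find(entity, start + 1)
        relationship_data[entity] = {"document": doc_id, "locations": locations}
    return relationship_data, new_entities
```

In the variant where the user selects or deselects new entities, the `new_entities` list could be shown for confirmation before anything is added to the entity list, rather than adding each entity immediately as this sketch does.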
  • the further invention provides a computer useable medium having a set of executable code for enabling electronic communications between a word processing program of a client access device and an information services provider system (ISP), the set of executable code comprising: a first set of computer program code adapted to access a preexisting entity list; a second set of computer program code adapted to analyze a first document to detect an entity, the entity comprising a person, place, or organization, the first document being associated with a current legal event; a third set of computer program code adapted to resolve the entity with the preexisting entity list and: if the entity is not present in the preexisting entity list, adding the entity to the preexisting entity list and generating a first set of relationship data associated with the relationship between the first document and the entity; or if the entity is present in the preexisting entity list, generating a first set of relationship data associated with a relationship between the first document and the entity; a fourth set of computer program code adapted to repeat the resolving step
  • the computer useable medium further characterized by a sixth set of computer program code adapted to generate smart tags based on the first set of relationship data, whereby subsequent display of the first document includes displaying a set of smart tags at a set of locations in the first document associated with the entity.
  • the computer useable medium further characterized by a seventh set of computer program code adapted to generate, in response to a report request, a signal based upon the set of smart tags; and an eighth set of computer program code adapted to generate a computer display associated with the signal.
  • The invention further provides a computer-implemented method comprising: analyzing a first document to detect entities appearing in the document, the first document being associated with an event; detecting a first entity in the first document; generating a first set of relationship data representing a relationship between the first document and the detected first entity; comparing the detected first entity with a set of entity data derived from an existing authority database of known entities; and updating the authority database of known entities including storing the first set of relationship data.
  • Figure 1 is a first schematic diagram illustrating an exemplary computer-based system for implementing the present invention.
  • Figure 2 is a second schematic diagram illustrating an exemplary computer-based system for implementing the present invention.
  • Figure 3 is a search flow diagram illustrating an exemplary method of implementing the present invention.
  • Figure 4 is a flow diagram illustrating a database and document accessing aspect of the present invention.
  • Figure 5 is a schematic diagram of a hardware configuration of a processor-based system for implementing the present invention.
  • Figure 6 is a workflow associated with processing the Drafting Assistant aspect of the present invention.
  • Figures 7A-7C represent a logon and access aspect in conjunction with the present invention.
  • Figure 7D represents a matter control aspect in conjunction with the present invention.
  • Figure 8 is a workflow for determining compatibility of applications and controls in conjunction with the present invention.
  • Figures 9A-9B are screen shots representing an IIT controls aspect in conjunction with the present invention.
  • Figure 10 is a workflow for selecting controls in conjunction with the present invention.
  • Figure 11 is a screen shot associated with a user-selected control in conjunction with the present invention.
  • Figures 12-14B are workflows for accessing documents and templates and importing documents in conjunction with the present invention.
  • Figure 15 is a screen shot representing a control and search and import aspect of the present invention.
  • Figures 16 and 17 are a workflow and screen shot illustrating a user selected
  • Figures 18 through 20 are a workflow and screen shots illustrating a user selected ISP search and results aspect of the present invention.
  • Figures 21A through 26 are a workflow and screen shots illustrating the Locate Authority UI and search aspect of the present invention.
  • Figures 27A-27D illustrate a series of screen shots illustrating a search results screen resulting from processing the present invention.
  • Figure 28 is a block and flow diagram of an exemplary system for named-entity tagging, resolving and event extraction, which corresponds to one or more
  • Figure 29 is a diagram illustrating guided sequence decoding for named-entity tagging which corresponds to one or more embodiments of the present invention.
  • Figure 30 is a block diagram of an exemplary named-entity tagging, resolution, and event extraction system corresponding to one or more embodiments of the present invention.
  • Figure 31 is a flow chart of an exemplary method of named-entity tagging and resolution and event extraction corresponding to one or more embodiments of the present invention.
  • Figure 32 is a flow chart of another exemplary method of named-entity tagging and resolution corresponding to one or more embodiments of the present invention.
  • Figures 33-46 illustrate a series of screen shots associated with the user interface aspects and control aspects and display aspects corresponding to one or more embodiments of the present invention.
  • the present invention provides, among other things, software platform components that enable an application to perform several functions without leaving the document and the host application.
  • the document could become a software platform.
  • These functions include for example extracting key context indicators such as document type (memo, pleading, agreement etc), jurisdiction and governing law (Orange County, New York etc.) and storing them, for example, in a data structure logically associated with the user and/or the document.
  • a document identifier is also stored to uniquely associate the document with the user.
  • Some embodiments store the data as metadata linked to the document; others within subscriber data for an online legal research service (or a professional information research service.)
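  • As a rough illustration of the data structure described above, the following sketch stores context indicators under a key that ties a document identifier to a user. The field names, the use of a UUID as the document identifier, and the in-memory dictionary are all assumptions made for the example; the specification does not define a concrete layout.

```python
import uuid
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DocumentContext:
    document_id: str    # uniquely associates the document with the user
    user_id: str
    document_type: str  # e.g. memo, pleading, agreement
    jurisdiction: str
    governing_law: str


def make_context(user_id, document_type, jurisdiction, governing_law):
    """Build a context record with a freshly generated document identifier."""
    return DocumentContext(
        document_id=str(uuid.uuid4()),
        user_id=user_id,
        document_type=document_type,
        jurisdiction=jurisdiction,
        governing_law=governing_law,
    )


def store_context(store, ctx):
    """Store the record as metadata keyed by (user, document)."""
    store[(ctx.user_id, ctx.document_id)] = asdict(ctx)
    return store
```

  • The same record could equally be kept as metadata linked to the document or within subscriber data for the research service; only the keying by user and document identifier matters for the sketch.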
  • the system also presents relevant content options to users based on the context of the document being drafted.
  • the system may include functionality that automatically extracts jurisdiction, document type and title from the document and allows searching similar content on WestLaw or WestLaw Business.
  • the system may include functionality to extract key legal entities from the document and use this information to enhance the document by adding relevant content.
  • the system may automatically extract judge and party names, link automatically to profiles, extract and validate KeyCite (KC) Flags (West BriefTools, West Knowledge Management (West KM)), and provide guidance on citation format (West CiteAdvisor).
  • FIG. 1 shows an exemplary Integrated System 100 comprising an online information-retrieval (or legal research) system adapted to integrate with a client-operated document processing system.
  • System 100 includes at least one web server that can automatically control one or more aspects of an augmented document-processing application on a client access device.
  • the document-processing application, for example, the Microsoft Word application, is augmented with an add-on framework that integrates into the graphical user interface of the application and includes a browser control that can access one or more web-based applications and allow macro-type scripts of the web-based applications or services to control the document-processing application.
  • System 100 includes one or more databases 110, one or more servers 120, and one or more access devices 130.
  • Databases 110 include a set of primary databases (PDC) 112, a set of secondary databases (RTC) 114, and a set of metadata databases 116.
  • Metadata databases 116 include, for instance, case law and statutory citation relationships, KeyCite data, depth of treatment data, quotation data, headnote assignment data, and ResultsPlus secondary source recommendation data. Other embodiments may include non-legal databases that include financial, scientific, or health-care information. Still other embodiments provide public or private databases, such as those made available through WESTLAW, INFOTRAC, and more generally any open web or Internet content. Also, in some embodiments, primary and secondary connote the order of presentation of search results and not necessarily the authority or credibility of the search results.
  • a wireless or wireline communications network such as a local-, wide-, private-, or virtual-private network
  • Server 120 is generally representative of one or more servers for serving data in the form of webpages or other markup language forms with associated applets, ActiveX controls, remote-invocation objects, or other related software and data structures to service clients of various "thicknesses." More particularly, server 120 includes a processor module 121, a memory module 122, a subscriber database 123, a primary search module 124, a metadata research module 125, and a user-interface module 126.
  • Processor module 121 includes one or more local or distributed processors, controllers, or virtual machines. In the exemplary embodiment, processor module 121 assumes any convenient or desirable form.
  • Memory module 122, which takes the exemplary form of one or more electronic, magnetic, or optical data-storage devices, stores subscriber database 123, primary search module 124, secondary search module 125, and user-interface module 126.
  • Subscriber database 123 includes subscriber-related data for controlling, administering, and managing pay-as-you-go or subscription-based access of databases 110.
  • subscriber database 123 includes one or more user preference (or more generally user) data structures.
  • one or more aspects of the user data structure relate to user customization of various search and interface options. To this end, some embodiments include user profile information such as jurisdiction of practice, area of practice, and position within a firm.
  • Primary search module 124 includes one or more search engines and related user-interface components for receiving and processing user queries against one or more of databases 110.
  • one or more search engines associated with search module 124 provide Boolean, tf-idf, and natural-language search capabilities.
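  • To make the tf-idf capability mentioned above concrete, here is a small ranking sketch. It assumes whitespace tokenization, term frequency normalized by document length, and an unsmoothed logarithmic inverse document frequency; the specification does not say which tf-idf variant the search engines use, and `tf_idf_rank` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf_rank(query, documents):
    """Return indices of matching documents, best tf-idf score first."""
    tokenized = [doc.lower().split() for doc in documents]
    total = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scored = []
    for index, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # tf normalized by length, idf = log(N / df)
                score += (tf[term] / len(tokens)) * math.log(total / df[term])
        if score > 0:
            scored.append((score, index))

    # Highest score first; ties broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]
```

  • A term that occurs in every document contributes nothing under this weighting, which is the usual reason tf-idf favors distinctive query terms over common ones.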
  • Secondary module 125 includes one or more search engines for receiving and processing queries against one or more of databases 114. Some embodiments charge a separate or additional fee for searching and/or accessing documents from the secondary databases.
  • Information-integration-tools (IIT) framework module 126 (or software framework or platform) includes machine readable and/or executable instruction sets for wholly or partly defining software and related user interfaces having one or more portions thereof that integrate or cooperate with one or more document-processing applications.
  • Exemplary document-processing (or document-authoring or -editing) applications include word processing applications, email applications, presentation applications, and spreadsheet applications. (More about the module 126 is described below.) In the exemplary embodiment, these applications would be hosted on one or more access devices, such as access device 130.
  • the invention may also include a metadata research module that includes one or more search engines for receiving and processing queries against metadata databases 116 and aggregating, scoring, filtering, recommending, and presenting results.
  • the metadata module includes one or more feature vector builders and learning machines to implement the functionality described herein. Some embodiments charge a separate or additional fee for accessing documents from the second database.
  • a user-interface module that includes machine readable and/or executable instruction sets for wholly or partly defining web-based user interfaces over a wireless or wireline communications network on one or more access devices, such as access device 130.
  • Access device 130 is generally representative of one or more access devices.
  • access device 130 takes the form of a personal computer, workstation, personal digital assistant, mobile telephone, or any other device capable of providing an effective user interface with a server or database.
  • access device 130 includes a processor module 131 comprising one or more processors (or processing circuits), a memory 132, a display 133, a keyboard 134, and a graphical pointer or selector 135.
  • Processor module 131 includes one or more processors, processing circuits, or controllers. In the exemplary embodiment, processor module 131 takes any convenient or desirable form. Coupled to processor module 131 is memory 132.
  • Memory 132 stores code (machine-readable or executable instructions) for an operating system 136, a browser 137, and document processing software 138.
  • memory 132 also includes document management software and time and billing system software not shown in FIG. 1. In some embodiments, this software may be hosted on a separate server.
  • operating system 136 takes the form of a version of the Microsoft Windows operating system.
  • browser 137 takes the form of a version of Microsoft Internet Explorer. Operating system 136 and browser 137 not only receive inputs from keyboard 134 and selector 135, but also support rendering of graphical user interfaces on display 133.
  • document processing software 138 includes one or more word processing applications, e.g., Microsoft Word word processing software, PowerPoint presentation software, Excel spreadsheet software, and Outlook email software.
  • Document processing software is shown integrated with information-integration tools 1381, which may be, for example, downloaded from server 120 via a wired or wireless communication link established with, for example, an ISP.
  • an integrated document-processing and information-retrieval graphical user interface 139 is defined in memory 132 and rendered on display 133.
  • interface 139 presents data in association with one or more interactive control features (or user-interface elements).
  • each of these control features takes the form of a hyperlink or other browser-compatible command input.
  • User selection of some control features results in retrieval and display of at least a portion of the corresponding document within a region of interface 139.
  • FIG. 1 shows regions as being simultaneously displayed, some embodiments
  • interface 139 includes document-processing tool bar region
  • region 1393 includes control and display elements for external content and services, such as a listing of one, two, or more web apps (or locally supported apps) provided by server 120 and databases 110, specifically the web apps and framework components of module 126.
  • Region 1393 includes control and display elements for metadata content related to completing a task related to authoring a document loaded into document-processing (active editing) window 1392.
  • region 1393 may list contact data regarding all persons, such as law-firm and client personnel, opposing legal counsel and court personnel, and witnesses associated with a legal case for which the loaded document is being prepared. Such entities and persons are referred to herein interchangeably as "entity", "person", "company", and "named entity".
  • region 1393 includes specific workflow information and control elements related to the user who launched the document-processing application and/or generic workflow information accessible via the user.
  • the user may select a workflow step or task within region 1393 and initiate update of the content or available tools and services of module 126.
  • the information integration tools include local desktop tools, such as BriefTools, CiteLink, DealProof, LiveNote, local server tools and services, such as West km knowledge management system, ES, and Elite accounting, and remote tools and services, such as KeyCite and other Thomson Reuters or third-party tools and services. These tools are made available through an exemplary software platform or framework of module 126.
  • LSS Litigation Support System
  • FIG. 2 shows another exemplary embodiment of the overall system.
  • the framework generally allows for building applications that operate in a user desktop workflow scenario.
  • the exemplary framework or platform can be broken down into the following layers or silos.
  • Hooks: mechanism in the host application, such as a toolbar button in the MS Word word processing application, to invoke the container.
  • Container: the area, such as a command bar object in the MS Word application, where the feature applications are hosted.
  • Applications: feature applications that support a specific set of features.
  • Service Blocks: infrastructure pieces that feature applications can leverage.
  • A hook, in the exemplary embodiment, is designed as a mechanism for users to open the container from a host application.
  • the hook loads itself inside that host application and then loads the container.
  • a hook also introduces a uniform way to see the content.
  • the hook, through the use of application programming interfaces (APIs), provides a way to access, extract, and/or insert data of the particular opened document within the host application.
  • a host application could be any Microsoft desktop application, WordPerfect, Adobe Professional, or a web browser (e.g., Internet Explorer, Netscape, Firefox, etc.).
  • the host application is Microsoft Word.
  • the exemplary embodiment provides a single add-in for all supported Word versions.
  • One way of achieving this support is to add an abstraction layer based on the use of reflection into the version specific library to allow the same code to work for all versions of Word.
  • the abstraction layer is based on the most recent version, and falls back on earlier supported method calls if needed. It also fails gracefully when the functionality is missing in the Word version. Additionally, the layer changes the add-in so that it determines the correct version-specific library to load and makes all method calls to the Word object model using reflection.
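  • The fallback behavior of such an abstraction layer can be sketched as follows. The example uses Python's `getattr` as a stand-in for .NET reflection; the class names and the method names `InsertTextAt` and `InsertText` are invented for illustration and are not claimed to exist in the Word object model.

```python
class WordAdapter:
    """Calls the newest available method by name, falling back to older ones."""

    def __init__(self, word_app):
        self._app = word_app

    def call(self, method_names, *args, default=None):
        # method_names is ordered from most recent to oldest supported call.
        for name in method_names:
            method = getattr(self._app, name, None)  # reflection-style lookup
            if callable(method):
                return method(*args)
        # No supported method exists in this version: fail gracefully.
        return default


class NewerWord:
    """Stand-in for a recent host version (hypothetical methods)."""

    def InsertTextAt(self, text):
        return f"new:{text}"

    def InsertText(self, text):
        return f"old:{text}"


class OlderWord:
    """Stand-in for an earlier host version lacking the newer method."""

    def InsertText(self, text):
        return f"old:{text}"
```

  • Because every call goes through a by-name lookup, the same add-in code can run against either stand-in, which mirrors the stated goal of one add-in for all supported versions.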
  • UI real-estate is an area on the screen set aside for the container and a toolbar button.
  • the host application is responsible for creating this space and creating an instance of Forms.DynamicContainer. Generally a window is created as the parent of the DynamicContainer. Additionally, the host is responsible for providing the ability to resize the area for the DynamicContainer.
  • the software platform is a managed .NET product with the Common Language Runtime (CLR).
  • CLR is a platform for software development that provides services by consuming metadata.
  • the software platform provides support and help for creating unmanaged host integrations using
  • the CLRLoader can be used to load the CLR into the process, and invoke a designated managed class in a separate assembly to bridge into managed code and the rest of the add-in implementation.
  • the CLRLoader is a COM object that can be created using standard COM methods (CoCreateInstance(), etc.). It provides an interface that starts the CLR, and can load a managed class from an assembly with information provided in a configuration file.
  • CLRLoader must be given the HostShim Attribute and the user must define a method called "Configure" that returns a void and has a single "object" parameter.
  • the software platform host application should implement the interface. Additionally, all the interfaces defined in the project, file document.cs are implemented on a set of classes to provide access to the document content of the host application.
  • the container is designed to host feature application features and functions. However, some embodiments host the feature application itself. Hosted within the container is a browser control or mini embedded browser.
  • the browser control does application user interface (UI) rendering and script execution.
  • An exemplary browser control is Internet Explorer but any web browser or equivalent would be acceptable as well.
  • UI rendering refers to displaying the user interface of the feature application within the container.
  • the feature application UIs are developed using HTML and Cascading Style Sheets (CSS) but some embodiments use other browser based technologies, such as ASP.Net pages, Silverlight applications, Adobe Flash applications, etc.
  • Much of the functionality of the feature applications is implemented in the JavaScript programming language.
  • Embedded in the browser control is a JavaScript execution engine that reads the script and performs the requested operations defined in the JavaScript program.
  • Feature applications are designed with the intent of reusing the software platform and functionality. They are developed independently but may be dependent on the software platform components. For example, one app inserts and updates flags. Assuming the software platform already has a communication service block and diagnostics service block (service blocks described in further detail below), the communication service block could be used to gather flag information and the diagnostics service block could be used to add tracing and logging into the application as well as add exception handling into the application.
  • Another example feature application provides linking to referenced documents. This feature application relies on Office Integration to provide a handle to the document in focus within Word. The application should also include the ability to select referenced documents for analysis. Assuming once again that a diagnostics service block exists with the software platform, the diagnostics service block could be used to add tracing and logging into the application as well as add exception handling into the application.
  • the UI can take the form of a static HTML page or other web application language.
  • the inclusion of a script tag for the inject.cs script file facilitates access to the desktop injected items of the Host and ServiceLocator.
  • the ServiceLocator is used to create instances of other Desktop Services by name.
  • the UI location is constrained by the container, and thus influences design of the UI.
  • the exemplary embodiment references the two JavaScript files (inject.cs and Load.cs) that are a part of the software platform main web package. JavaScript interacts with the desktop services provided. This gives access to a JavaScript reference to the "host" object as well as the "locator" ServiceLocator object. Finally, if the application provides a desktop service, the service implementation (See Software Platform Exemplary Service Practices section) is provided in an installable package.
  • Feature applications call service blocks which are designed with the intent of reusability and expose the services of those feature applications.
  • the purpose of service blocks is to supply local reusable components to a feature application.
  • the functionality can be accessed via JavaScript and/or by referencing the necessary .NET assemblies. Examples of application building platform components that can be leveraged are more fully detailed and set forth in U.S. Published Application Publ. No. 2010/0115401, the entirety of which is incorporated herein by reference.
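  • The relationship between a feature application, the ServiceLocator, and reusable service blocks might look like the sketch below. It is written in Python rather than the JavaScript or .NET the platform uses, and every class, method, and service name in it is a hypothetical stand-in; in particular, the communication block returns placeholder data and makes no network call.

```python
class ServiceLocator:
    """Creates service block instances by name."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def create(self, name):
        try:
            factory = self._factories[name]
        except KeyError:
            raise LookupError(f"no service block registered as {name!r}") from None
        return factory()


class DiagnosticsService:
    """Stand-in diagnostics block: records trace messages in memory."""

    def __init__(self):
        self.entries = []

    def trace(self, message):
        self.entries.append(message)


class CommunicationService:
    """Stand-in communication block: returns a placeholder flag per citation."""

    def fetch_flags(self, citations):
        return {citation: "unknown" for citation in citations}


def update_flags(locator, citations):
    """Feature-application logic that reuses two service blocks."""
    diagnostics = locator.create("diagnostics")
    communication = locator.create("communication")
    diagnostics.trace(f"requesting flags for {len(citations)} citation(s)")
    return communication.fetch_flags(citations), diagnostics.entries
```

  • The feature application depends only on service names, so a block can be replaced or upgraded without changing the application code that consumes it.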
  • an add-on framework is installed and one or more tools or APIs on server 120 are loaded onto one or more client devices 130.
  • this entails a user directing a browser in a client access device, such as access device 130, to an internet-protocol (IP) address for an online information-retrieval system, such as the Westlaw system, and then logging onto the system using a username and/or password.
  • Successful login results in a web-based interface being output from server 120, stored in memory 132, and displayed by client access device 130.
  • the interface includes an option for initiating download of information integration software with corresponding toolbar plug-ins for one or more applications.
  • download administration software ensures that the client access device is compatible with the information integration software and detects which document-processing applications on the access device are compatible with the information integration software. With user approval, the appropriate software is downloaded and installed on the client device.
  • an intermediary "firm" network server may receive one or more of the framework, tools, APIs, and add-on software for loading onto one or more client devices 130 using internal processes.
  • a user may then be presented an online tools interface in context with a document-processing application.
  • this entails a user launching and opening or creating a document using one or more of the following independent applications: Microsoft Word word processing
  • "word processor" and "word processing application" refer broadly to "document processors" and "document processing applications" and the use of "word" and "document" should be given broad meaning in the context of units of
  • Add-on software for one or more of these applications is simultaneously invoked, which in turn results in presentation of the add-on menu.
  • the add-on menu includes a listing of web services or applications and/or locally hosted tools or services. A user selects a tool via the tools interface, for example manually with a pointing device. Once selected, the tool, or more precisely its associated instructions, is executed. In the exemplary embodiment, this entails communicating with corresponding instructions or a web application on server 120, which in turn may provide dynamic scripting and control of the host word processing application using one or more APIs stored on the host application as part of the add-on framework.
  • the user launches the host application (e.g., Microsoft Word, WordPerfect, etc.) to work on a document, e.g., a legal brief or memorandum.
  • a Word processor Software Framework (WSF) interface includes code, add-on or module that may be loaded as an add-on to the host application, e.g., App 138.
  • the user opens a document and selects the desired WSF Application from a list of applications presented via the integrated UI elements.
  • WSF displays the application within the WSF Container and navigates the embedded browser to the application's base URL (server 120, appropriate portion of IIT module 126).
  • WSF applications can be installed and run as: Local HTA (i.e., locally installed HTML, JS, CSS, etc.); Enterprise web application (intranet or extranet); or Internet web application, for example.
  • WSF injects the WSF Document API references into the JavaScript execution engine for access from the application's JavaScript.
  • the document displayed in the active edit window of the host application, such as a word processing application, preserves the context of the application in WSF (i.e., each document has its own instance of WSF which can be customized based on user preferences).
  • the WSF JavaScript execution engine allows the application code to run.
  • the application can use the WSF APIs to access the contents of the opened host (e.g., Microsoft Word, WordPerfect, etc.) document, including making modifications to these documents.
  • the WSF API's exposed to the client include but are not limited to: collection of Open Documents, including API methods for accessing Document specific data; collections of Paragraphs, Footnotes, Endnotes, Tables of Authority, hyperlinks, images and many other document content objects within a specific open document; and the ability to create a Location object to represent a given textual location within the document.
  • the WSF API methods that are called by the application in turn will call methods exposed by the Host application (e.g., Microsoft Word).
  • the manner in which these calls are done is Host application specific and dependent on facilities exposed by the Host application.
  • the WSF manages the mappings between its own API and the functionality exposed by the Host. Additionally, the application can use native browser capabilities and other WSF functionality to communicate with web services available locally on the host machine, at the enterprise (intranet or extranet), or over the Internet.
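  • One way to picture the mapping between a host-neutral document API and host-specific calls is the sketch below. `DocumentApi`, `Location`, and `FakeHostDocument` are hypothetical names; the real WSF API surface and the host calls it maps to are not specified here, so a plain list of paragraph strings stands in for the host document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A textual location: paragraph index plus a character range."""
    paragraph: int
    start: int
    end: int


class FakeHostDocument:
    """Stand-in for a host application's open document."""

    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)


class DocumentApi:
    """Host-neutral API; each method maps onto whatever the host exposes."""

    def __init__(self, host_document):
        self._host = host_document

    def paragraphs(self):
        return list(self._host.paragraphs)

    def find(self, text):
        """Return the Location of the first occurrence of text, or None."""
        for index, paragraph in enumerate(self._host.paragraphs):
            start = paragraph.find(text)
            if start != -1:
                return Location(index, start, start + len(text))
        return None

    def text_at(self, location):
        paragraph = self._host.paragraphs[location.paragraph]
        return paragraph[location.start:location.end]

    def replace(self, location, new_text):
        """Modify the host document at the given location."""
        paragraph = self._host.paragraphs[location.paragraph]
        self._host.paragraphs[location.paragraph] = (
            paragraph[:location.start] + new_text + paragraph[location.end:]
        )
```

  • Supporting another host would mean writing a different class behind the same `DocumentApi` methods, leaving application code unchanged.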
  • FIG. 2 illustrates another representation of an exemplary system 200 that carries out the processes described herein using a combination of hardware, software, and communications networking.
  • system 200 provides a framework for searching, retrieving, analyzing, and ranking claims and/or patent documents as well as a system for monitoring user subscription rights and access and for downloading tools and software associated with providing enhanced services to subscribed users.
  • System 200 may be used in conjunction with a system 204 offering of an information or professional services provider (ISP), e.g., West Services Inc., a part of Thomson Reuters Corporation, and include an Information Integration and Tools Framework and Applications module 126, as described hereinabove.
  • system 200 includes a Central Network Server/Database Facility 201 comprising a Network Server 202, a Database of documents, e.g
  • the Central Facility 201 may be accessed by remote users 210, such as via a network 226, e.g., Internet. Aspects of the system 200 may be enabled using any combination of Internet or (World Wide) WEB-based, desktop-based, or application WEB-enabled components.
  • the remote user system 210 in this example includes a GUI interface operated via a computer 211, such as a PC computer or the like, that may comprise a typical combination of hardware and software including, as shown in respect to computer 211, system memory 212, operating system 214, application programs 216, graphical user interface (GUI) 218, processor 220, and storage 222 which may contain electronic information 224 such as electronic documents.
  • the methods and systems of the present invention, described in detail hereafter, may be employed in providing remote users access to a searchable database.
  • remote users may search a document database using search queries based on patent claims to retrieve and view patent documents of interest. Because the volume of documents is quite high, the invention provides scoring and ranking processes that facilitate an efficient and highly effective, and much improved, searching and retrieving operation.
  • Client side application software may be stored on a machine-readable medium and comprise instructions executed, for example, by the processor 220 of computer 211, and presentation of web-based interface screens facilitates the interaction between user system 210 and central system 201.
  • the operating system 214 should be suitable for use with the system 201 and browser functionality described herein, for example, Microsoft Windows Vista (business, enterprise and ultimate editions), Windows 7, or Windows XP Professional with appropriate service packs.
  • the system may require the remote user or client machines to be compatible with minimum threshold levels of processing capabilities, e.g., Intel Pentium III, speed, e.g., 500 MHz, minimal memory levels and other parameters.
  • Central system 201 may include a network of servers, computers and databases, such as over a LAN, WLAN, Ethernet, token ring, FDDI ring or other
  • Software to perform functions associated with system 201 may include self-contained applications within a desktop or server or network environment and may utilize local databases, such as SQL 2005 or above or SQL Express, IBM DB2 or other suitable database, to store documents, collections, and data associated with processing such information.
  • the various databases may be relational databases.
  • In relational databases, various tables of data are created and data is inserted into, and/or selected from, these tables using SQL, or some other database-query language known in the art.
  • a database application such as, for example, MySQL™, SQL Server™, Oracle 8i™, 10G™, or some other suitable database application may be used to manage the data.
  • SQL Object Relational Data Schema
  • FIG. 5 is an exemplary representation of a machine in the example form of a computer system 500 within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein.
  • the system 500 may be used to implement the /system/modules/interfaces.
  • the machine operates as a standalone device or may be connected (e.g., networked) to other machines.
  • the machine may operate in the capacity of a server or a client machine in server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • the machine may comprise a server computer, a client computer, a personal computer (PC), a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • machine shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
  • the example computer system 500 includes a processor 502 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 504 and a static memory 506, which communicate with each other via a bus 508.
  • the computer system 500 may further include a video display unit 510, a keyboard or other input device 512, a cursor control device 514 (e.g., a mouse), a storage unit 516 (e.g., hard-disk drive), a signal generation device 518, and a network interface device 520.
  • the storage unit 516 includes a machine-readable medium 522 on which is stored one or more sets of instructions (e.g., software 524) embodying any one or more of the methodologies or functions illustrated herein.
  • the software 524 may also reside, completely or at least partially, within the main memory 504 and/or within the processor 502 during execution thereof by the computer system 500, the main memory 504 and the processor 502 also constituting machine-readable media.
  • the software 524 may further be transmitted or received over a network 526 via the network interface device 520.
  • while machine-readable medium 522 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
  • the term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention.
  • the term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
  • the invention may be used in connection with searching based on known terms but is particularly powerful when a user uses terms not traditionally used in connection with an issue or a subject, e.g., "everyone agrees to the underlying events" as opposed to "no genuine issue of material fact" in the context of summary judgment proceedings.
  • the invention provides the enhanced feature of searching not only primary sources (Fig. 1 - Primary DBs 112), e.g., case law and statute databases, but also searching secondary sources of collections or sets of referencing texts (Fig. 1 - Secondary DBs 114).
  • the resulting set of referencing text documents yielded by the second layer of searching is then used to identify and present primary source case law relevant to an issue being researched.
  • the invention provides an added layer of searching within a wholly separate and distinct body of reference documents or texts and then uses that secondary source search to further search primary source databases, thereby enriching and enhancing the set of primary source documents ultimately provided to the user.
  • the invention enhances the effectiveness of the overall system performance.
  • RTC or RTS Reference Text Collections or Sets
  • ALR Reference Text Collections or Sets
  • the invention processes the search criteria to yield a responsive set of referencing text documents from the RTC based on a user search request or query, such as may be highlighted or otherwise derived from a working document operating in a word processor application by the user.
  • the responsive set of referencing text documents are identified by matching search terms or criteria with text appearing in the referencing text documents that is associated with case law cited in the referencing text documents.
  • the system identifies those citations related to the highlighted or search terms found in the referencing text documents to yield a set of "referencing text results", which is a set of case law cited in the referencing text documents.
  • the invention generates a set of search results comprised of two sets of case law for presenting to a user on a subject of interest.
  • the first set of case law is generated by performing the search on the primary case law database and the second set of case law is generated based on the citations contained in the set of referencing text documents that relates to the user search request.
  • the invention provides a seamless integration of searching functions and database resources from the word processor environment that includes not only primary case law but also secondary sources of non-case law.
  • the invention, when searching from the word processing environment for terms or a highlighted statement contained within a working document, provides an additional layer of searching beyond traditional ISP systems and provides an enhanced way of searching for responsive legal authority based on terms not traditionally used and that appear in secondary sources, e.g., ALR.
  • the system provides searching in both the primary and the secondary sources and presents responsive case law from the primary source and case law that is cited in responsive referencing text documents.
  • the system may rank, together or separately, the two sets of case law, the primary search results from the primary database of case law and the set of referencing text results.
  • the system may also reduce, such as through a de-duplication process, the set of search results or the component search results.
  • the system may display to the user the respective responsive search results either combined or separated.
  • the set of search results is then available for user examination and may be incorporated into the working document.
  • a user highlights text within a word processing application.
  • the system uses the highlighted text as a query, or to derive a query, to search a primary document collection/database, e.g., Primary DB 112 of Fig. 1.
  • the system uses this same information to search a reference text collection/database, e.g., RTC 114 of Fig. 1.
  • text may be normalized before it is used as a search query.
  • the Novus search API may do some standard normalization before executing the search.
  • the query may be identical for each search and can be run simultaneously.
  • the system aggregates, ranks, and/or re-ranks the search results, either separately or in the aggregate.
  • the system may invoke enhanced functions of ISP services, such as a de-duplication process, to further refine the search results.
  • the queries may both be "natural language searches" using the same Novus search APIs.
  • the searches may be metadata restricted, for example, to specify jurisdiction.
  • the processes of steps 304-308 may be performed in part outside the user experience, including: receiving a ranked set of results for the document collection search; receiving a ranked set of results for the reference text collection search; re-ranking the aggregated results.
  • the system returns for display via a user interface a set of results to a user (optionally displaying only primary documents, e.g., cases, from the PDC 112 and/or separately displaying a list of secondary or Reference Text documents from the RTC 114).
  • the user performs, such as via GUI, further operations to incorporate aspects of the search into the user document opened in the word processing application, e.g., 139/1392 of Figure 1.
  • re-ranking involves taking aggregated results and applying a statistical model to re-rank the results.
  • the re-ranking algorithm receives search result lists from both searches. The lists are filtered by jurisdiction and other criteria. Also, for instance, "writ denied" cases from the referencing text collection may be filtered out before being sent to the re-ranking algorithm.
  • the aggregated set of results could contain duplicate cases with different rankings; the system usually takes the higher-ranked case. For example, Case A could have been found in the Primary Document Collection, e.g., 112, and ranked #1; Case A could also be found in the Reference Text Collection, e.g., 114, and ranked #2.
  • Case A from the PDC collection would be used and the Case A from the Reference Text Collection would be discarded before the statistical model is run.
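  • The duplicate handling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Hit` record, its field names, and the tie-breaking in favor of the primary collection are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    case_id: str  # hypothetical case identifier
    rank: int     # 1 is the best rank within its own result list
    source: str   # "PDC" (primary collection) or "RTC" (referencing texts)


def deduplicate(primary, referencing):
    """Keep one hit per case, preferring the better (lower-numbered) rank."""
    best = {}
    for hit in list(primary) + list(referencing):
        kept = best.get(hit.case_id)
        # Strict comparison: on a tie, the earlier (primary) hit is retained.
        if kept is None or hit.rank < kept.rank:
            best[hit.case_id] = hit
    return sorted(best.values(), key=lambda h: (h.rank, h.source))


primary = [Hit("Case A", 1, "PDC"), Hit("Case B", 2, "PDC")]
referencing = [Hit("Case C", 1, "RTC"), Hit("Case A", 2, "RTC")]
merged = deduplicate(primary, referencing)
```

Here Case A survives only once, as the PDC hit ranked #1, and the merged list would then be handed to whatever re-ranking model is in use.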
  • UI User Interface, e.g., 139/1393
  • the source e.g., primary or secondary
  • the results from the ISP or primary source may be listed, ranked or not, and in a second pane the results from the secondary source, e.g., referencing texts from sources such as ALR, AmJur, etc., may be presented.
  • a variety of search functions may be performed on either or both sets, separately or
  • the collections may be arrived at by Natural Language search on cases.
  • Case search could be an all cases search with filter available at any time afterward but before it gets to the user.
  • it could be a specific case search of only certain jurisdictions, court levels. For instance, about 100 cases may be passed through to be re-ranked. The number of cases returned for ranking or presenting may be limited.
  • the system may be structured so that the Reference Text Collection contains "Pseudo Documents" and operates as follows. Each Case has a Pseudo Document within the RTC collection. Pseudo Documents contain references, citations and a GUID for the case, e.g., a litigation maintained in a Litigation Support System (LSS), such as West Case Notebook.
  • LSS Litigation Support System
  • Reference: a predetermined amount of text that supports the proposition for which the case is being cited.
  • search for citations within the briefs and case databases. Once a citation is found, collect a predetermined amount of text immediately preceding the citation (every citation is one reference).
  • the system associates the reference and related citation to a Case ID/Doc GUID.
  • the system concatenates new references onto existing GUID if there is one or it creates a pseudo document if the GUID has not been seen before.
  • the system stores Pseudo Documents within the RTC collection if, for example, they have a pre-defined number or more references, e.g., 10. If they do not have the requisite number of references then the system stores the Pseudo Documents in a separate collection. If the references in the separate collection become greater than, for example, 10 for a pseudo document then the pseudo document is moved to the RTC collection. Also, the system may be configured to truncate Pseudo Documents at a set number or threshold, e.g., 500 references.
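  • A minimal sketch of the pseudo-document bookkeeping described above, assuming the citations have already been extracted as (GUID, reference text) pairs. The function name, the in-memory dictionaries, and the two constants (taken from the example values 10 and 500 in the text) are illustrative assumptions only.

```python
from collections import defaultdict

MIN_REFERENCES = 10   # example threshold from the text
MAX_REFERENCES = 500  # example truncation limit from the text


def build_pseudo_documents(citations):
    """Group reference text by case GUID and route each pseudo document.

    citations: iterable of (case_guid, reference_text) pairs.
    Returns (rtc, holding): pseudo documents with enough references for the
    RTC collection, and those kept in a separate collection until they grow.
    """
    references = defaultdict(list)
    for guid, text in citations:
        # Concatenate onto an existing GUID or start a new pseudo document,
        # truncating once the reference limit is reached.
        if len(references[guid]) < MAX_REFERENCES:
            references[guid].append(text)

    rtc, holding = {}, {}
    for guid, refs in references.items():
        target = rtc if len(refs) >= MIN_REFERENCES else holding
        target[guid] = refs
    return rtc, holding


citations = [("guid-1", f"ref {i}") for i in range(12)] + [("guid-2", "only ref")]
rtc, holding = build_pseudo_documents(citations)
```

Re-running the function over a growing citation set would move a pseudo document from the separate collection into the RTC collection once it reaches the threshold.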
  • the Pseudo Document may be a structured document with, for example, the following fields:
  • the system may be configured to add "padding" in between each chunk of referencing text (fields 3, 4, 5, and n) above. This is because the search engine, e.g., West's Novus platform, may be configured to return the text surrounding only the best matching portion of the Pseudo Document.
  • the search engine may also identify which is the best matching portion within the document, and may flag the text surrounding the best matching portion. For instance, if the referencing text for citing document B matches the user's query best, then, because of the padding, the best matching text returned will be only that for citing document B. The referencing text for documents A and C is simply too "far" away from the best matching portion due to the padding. This approach may be used to facilitate the UI usage of the documents returned from the referencing text search.
  • the padding has no effect on the search itself, as the search doesn't recognize the padding - it's only used to determine which text to return as the best matching portion with no pollution from adjacent referencing text.
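  • The effect of the padding can be illustrated with a toy model. The sketch below is not how the Novus platform works internally; it assumes a hypothetical `<pad>` token, a fixed-width snippet window, and a simple term-count score, purely to show why padding at least as wide as the window keeps adjacent referencing text out of the returned snippet.

```python
PAD = "<pad>"  # hypothetical padding token, ignored when matching


def join_with_padding(chunks, pad_len):
    """Concatenate referencing-text chunks with pad_len padding tokens between them."""
    tokens = []
    for i, chunk in enumerate(chunks):
        if i:
            tokens.extend([PAD] * pad_len)
        tokens.extend(chunk.split())
    return tokens


def best_window(tokens, query, width):
    """Return the non-padding tokens of the fixed-width window with most query terms."""
    terms = {t.lower() for t in query.split()}

    def score(start):
        return sum(tok.lower() in terms for tok in tokens[start:start + width])

    start = max(range(max(1, len(tokens) - width + 1)), key=score)
    return [tok for tok in tokens[start:start + width] if tok != PAD]


chunks = [
    "union duty of fair representation",           # citing document A
    "judicial review must be highly deferential",  # citing document B
    "summary judgment standard applies",           # citing document C
]
tokens = join_with_padding(chunks, pad_len=6)
snippet = best_window(tokens, "judicial review deferential", width=6)
```

Because the padding run is as long as the window, no window can cover text from two different citing documents, so the snippet here contains only the text for citing document B.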
  • Figures 6 and 7A-7C illustrate methods of installing and updating the software platform in association with the present invention.
  • a software platform containing a base package for an application that includes a software platform built on a .NET framework and COM technology, a feature application, and, optionally, an updater.
  • the user downloads this package and deploys the software platform along with the feature application and possibly the updater.
  • Another option is to download and deploy the individual components separately in install order of the .NET framework, software platform, a feature application.
  • the updater can be installed anytime after the software platform is installed.
  • the updater and the software platform are independent of each other.
  • Figures 7D and 10 illustrate an exemplary manner of handling matter control in the context of the exemplary implementation of the invention.
  • the "matter" refers to a particular litigation or other legal proceeding for which a file or working area is set up on an LSS, for instance on Case Notebook.
  • the LSS may include a set of existing template or genericized document types to assist the user in preparing documents of the sort commonly associated with a broad range of litigated issues.
  • the documents may include genericized, or previously prepared, Pleadings, Motions, and Memoranda, which may include the following Motions: Alter Judgment; Certify Class; Compel; Compel Arbitration; Compel Discovery; Consolidate; Declare a Mistrial; Directed Verdict; Dismiss; Dismiss for Lack of Jurisdiction; Limine; Intervene; Joinder; Judgment Notwithstanding the Verdict; Judgment as a Matter of Law; Judgment on Partial Findings; Judgment on the Pleadings; Judgment Under Rule 54(b); New Trial; Partial Summary Judgment; Permanent Injunction; Preliminary Injunction;
  • the set of genericized documents may also include the following documents: Trial Brief; Pleadings; Complaints; Answers and Counterclaims; and Briefs.
  • the User shall have the ability to access
  • Figure 9 illustrates a manner in which a user opens a word processing application.
  • the IIT aspect of the invention has been loaded and resides at the client access device, e.g., computer, 130 and presents to the user via a GUI control options, which may be presented in any of a number of acceptable ways including via toolbar, ribbon, container, dialog boxes, etc.
  • Figure 9A illustrates a GUI presenting control options via a ribbon.
  • Figure 9B illustrates control options appearing in a container.
  • Exemplary controls include: locate authority; check formatting; ISP/Westlaw search; Import documents; and Preferences, for example. If the User selects Locate Authority, the system launches the Locate Authority feature. If the User selects Check Formatting, the system launches the Rules Based
  • Figure 12 illustrates a screen shot in which a user has opened a word processor for editing a document shown in the right-hand pane (corresponds to 1392 region of UI 139 of Figure 1) and within left-hand panes the user has access to ISP solution functionality (corresponds to 1393 region of UI 139 of Figure 1).
  • the user has selected Transcripts and is presented with a list of available transcripts to open including opening into Case Notebook.
  • Figures 13 and 14A and 14B illustrate workflows for importing files and folders into the LSS including browsing capabilities.
  • the Drafting Assistant System includes an organizational group labeled "Templates/Model Documents" for storing documents not originating in Case Notebook. Users will have access to Templates/Model Documents even if they do not subscribe to Case Notebook. Folders and Content contained within Templates/Model Documents will be the same regardless of which matter a User has selected, or even if a User has not selected a matter from Case Notebook.
  • the default folders for Templates/Model Documents are as follows: Model Documents, Language, West Templates.
  • Where a firm makes networked materials available via Repository functionality, Users shall have both personal and firm folders and documents. Default firm folders are as follows: Model Documents, Language, West Templates. In a network environment, default personal folders are as follows: Model Documents, Language. All folders and content contained within Templates/Model Documents will be stored locally on the User's computer - either hard drive or network drive. All Users will have the ability to perform functions on network documents and folders.
  • Import is accessed via the Ribbon/Toolbar/Pulldown, where the User can select from the following options: Search and Import Local/Network Content; Import Current Document; or Import Selected Text. Access can also be via the Toolbar, Container, or dialog.
  • Figures 16-20 relate to a user performing searching functions outside the word processing application but within the context and UI 139 of the combined experience.
  • Figure 16 describes the process by which a user selects a function, e.g., ISP search - in this example West Solutions, Westlaw Search.
  • the user may be presented with a logon screen to access the ISP search services and/or content. This may depend on an existing subscription at the individual or firm level. Preferences associated with the user's account with the ISP may also be implemented.
  • the user experience with respect to the ISP aspect is preferably viewed as seamless and consistent within the host word processing application.
  • Figure 18 illustrates an exemplary workflow associated with a user selecting the "Locate Briefs or Motions" link in the Westlaw Search pane of Figure 17 and is self-explanatory.
  • Figures 19 and 20 illustrate UIs, and in particular the IIT region 1393 of UI 139 of Figure 1, associated with inputting KeyRules search criteria, Figure 19, and displaying search results, Figure 20.
  • Figures 21 through 27 relate to a user's ability to highlight sections of text from an open document in the word processing application and to perform a search based on the present invention to return useful search results for use in preparing the working document, including incorporating excerpts from the researched authority.
  • Figures 21A and 21B illustrate a workflow in which a user highlights a section of text in the word document, e.g., document open in right-pane region 1392 of UI 139, in order to search on the terms of interest in the search IIT region 1393 of UI 139.
  • the flow as represented in the figures explains the process.
  • Figure 22 is a workflow that illustrates the process for a user to, after performing a search using the information integrated tools and resources available in region 1393, select text from the document/authority displayed in region 1393 for "copying and pasting" into the word processor document in region 1392.
  • Figure 23 illustrates a UI presented to the user in IIT region 1393 and
  • Figure 24 illustrates a UI screen, UI 139, presented to a user for performing the process described above and in connection with Figures 21A-21B.
  • the User shall have the ability to identify text in the document being drafted which may require citation to legal authority.
  • the User shall have the ability to mark authority to visibly flag text requiring authority so that the User or the System can return later to provide the appropriate citation.
  • the User shall have the ability at any time during drafting to launch a process that will use a Westlaw query to suggest legal authority for text flagged as requiring authority.
  • the user has highlighted the text "Because unions are inevitably required to represent employees with conflicting interests, judicial review of union action must be highly deferential" from the word processing document in the right-hand region 1392.
  • the drafting assistant component of the system presents the user with a "Mark to Locate Authority" tool to delineate the text to be searched for finding authority, e.g., case law or statutes stored in PDC 112.
  • Figure 24 shows the highlighted text as having the search delineated by the markers "START AUTHORITY" and "ENDAUTHORITY." A second text excerpt is also shown as having been marked.
  • the User shall have the ability to go to the next set of authority markers without performing a search by selecting the Next button.
  • the user may then enter additional search criteria in the IIT region 1393 of UI 139, e.g., "Authority Type" (case law, secondary sources, statutes, and administrative codes) as well as "Date" and "Jurisdiction" criteria and restrictions.
  • the user may then click on the "Begin Locate Authority Search" button to launch a search within the ISP.
  • Figures 27A-27D illustrate the resulting search results screens associated with the Locate Authority process.
  • Figure 28 shows an exemplary named entity tagging and resolving system 2100.
  • system 2100 includes an entity tagger 2110, an entity resolver 2120, and authority files 2130.
  • Tagger 2110, resolver 2120, and authority files 2130 are implemented using machine-readable data and/or machine-executable instructions stored on memory 2102, which may take a variety of consolidated and/or distributed forms.
  • Entity tagger 2110, which receives textual input in the form of documents or other text segments, such as a sentence 2109, includes a tokenizer 2111, a zoner 2112, and a statistical tagger 2113.
  • Tokenizer 2111 processes and classifies sections of a string of input characters, such as sentence 2109. The process of tokenization is used to split the sentence or other text segment into word tokens. The resulting tokens are output to zoner 2112.
  • Zoner 2112 locates parts of the text that need to be processed for tagging, using patterns or rules. For example, the zoner may isolate portions of the document or text having proper names. After that determination, the parts of the text that need to be processed further are passed to statistical sequence tagger 2113.
  • Statistical sequence tagger 2113 (or decoder) uses one or more unambiguous name lists (lookup tables) 2114 and rules 2115 to tag the text within sentence 2109 as company, person, or place or as a non-name. The rules and lists are regarded herein as high-precision classifiers.
  • Exemplary pattern rules can be implemented using regex+Java, Jape rules within GATE, ANTLR, and so forth.
  • a sample rule for illustration dictates that if a sequence of words is capitalized and ends with "Inc.", then it is tagged as a company or organization.
  • the rules are developed by a human (for example, a researcher) and encoded in a rule formalism or directly in a procedural programming language. These rules tag an entity in the text when the preconditions of the rule are satisfied.
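  • As an illustration of such a hand-written rule, the "capitalized sequence ending with Inc." example can be approximated with a regular expression. The text mentions regex+Java, JAPE, and ANTLR as rule formalisms; Python is used here only for brevity, and the pattern and function name are assumptions rather than the rule actually deployed.

```python
import re

# One or more capitalized words (an optional trailing comma is allowed),
# followed by the literal suffix "Inc."
COMPANY_RULE = re.compile(r"(?:[A-Z][\w'&-]*,?\s+)+Inc\.")


def tag_companies(text):
    """Return every text span that satisfies the rule's precondition."""
    return [match.group() for match in COMPANY_RULE.finditer(text)]


companies = tag_companies("Hank's Hardware, Inc. has a sale going on right now.")
```

A rule this simple will also swallow a capitalized sentence-initial word that happens to precede a company name, which is one reason such rules are combined with lists and a statistical tagger.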
  • Exemplary name lists identify companies, such as Microsoft, Google, AT&T, Medtronics, Xerox; places, such as Minneapolis, Fort Dodge, Des Moines, Hong Kong; and drugs, such as Vioxx, Viagra, Aspirin, Penicillin.
  • the lists are produced offline and made available during runtime.
  • a large corpus of documents for example, a set of news stories, is passed through a statistical model and/or various rules (for example, a CRF model) to determine if the name is considered
  • Exemplary rules for creating the lists include: 1) being listed in a common noun dictionary; and 2) being used as company name more than ninety percent of the time the name is mentioned in a corpus.
  • the lookup tagger also finds systematic variants of the names to add to the unambiguous list.
  • the lookup tagger guides and forces partial solutions. Using this list assists the statistical model (the sequence tagger) by immediately pinning that exact name without having to make any statistical determinations.
  • Examples of statistical sequence classifiers include linear chain conditional random field (CRF) classifiers, which provide both accuracy and speed. Integrating such high-precision classifiers with the statistical sequence labeling approach entails first modifying the feature set of the original statistical model by including features corresponding to the labels assigned by the high-precision classifiers, in effect turning "on" the appropriate label features depending on the label assigned by the external classifier. Second, at run time, a Viterbi decoder (or a decoder similar in function) is constrained to respect the partially labeled or tagged sequences assigned by the high-precision classifiers.
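  • The run-time constraint can be sketched with a small Viterbi decoder in which pinned positions admit only their pre-assigned tag. This is a simplified illustration: the toy `emit` and `trans` log-scores stand in for a trained CRF, and the tag set, pinned positions, and function names are assumptions.

```python
import math


def constrained_viterbi(tokens, tags, emit, trans, pinned=None):
    """Most probable tag path; pinned maps token index -> forced tag.

    emit(token, tag) and trans(prev_tag, tag) return log-scores.
    """
    pinned = pinned or {}

    def allowed(i):
        # A pinned position offers the decoder exactly one tag.
        return [pinned[i]] if i in pinned else tags

    # best[i][tag] = (score of best path ending in tag at i, previous tag)
    best = [{t: (emit(tokens[0], t), None) for t in allowed(0)}]
    for i in range(1, len(tokens)):
        column = {}
        for t in allowed(i):
            prev, score = max(
                ((p, s + trans(p, t)) for p, (s, _) in best[i - 1].items()),
                key=lambda item: item[1],
            )
            column[t] = (score + emit(tokens[i], t), prev)
        best.append(column)

    # Backtrace from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


def emit(token, tag):
    # Toy model: capitalized tokens lean towards ORG.
    p_org = 0.6 if token[0].isupper() else 0.1
    return math.log(p_org if tag == "ORG" else 1.0 - p_org)


def trans(prev_tag, tag):
    return math.log(0.5)  # uniform transitions, for illustration only


tokens = ["Microsoft", "on", "Monday", "announced", "a"]
free_path = constrained_viterbi(tokens, ["ORG", "O"], emit, trans)
pinned_path = constrained_viterbi(
    tokens, ["ORG", "O"], emit, trans, pinned={0: "ORG", 2: "O"}
)
```

With the toy scores, the unconstrained decoder mislabels "Monday" as an organization, while pinning "Microsoft" to ORG and "Monday" to non-name removes those options from the search without retraining anything.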
  • CRF linear chain conditional random field
  • This form of guided decoding provides several benefits.
  • the third benefit is an ease of customization that stems from an elimination of a need to retrain the decoder if new rules and list items are added.
  • Figure 29 is a conceptual diagram showing how a text segment "Microsoft on Monday announced a" is pretagged and how this pretagging (or pinning) constrains the possible tags or labeling options that a decoder, such as a Viterbi decoder, has to process.
  • the statistical sequence tagger calculates the probability of a sequence of tags given the input text.
  • the parameters of the model are estimated from a corpus of training data, that is, text where a human has annotated all entity mentions or occurrences. (Unannotated text may also be used to improve the estimation of the parameters.)
  • the statistical model then assembles training data, develops a feature set and utilizes rules for pinning. Pinning is a specific way to use a statistical model to tag a sequence of characters and to integrate many different types of information and methods into the tagging process.
  • the statistical model locates the character offset positions (that is, beginning and end) in the document for each named entity.
  • the document is a sequence of characters; therefore, the character offset positions are determined. For example, within the sentence “Hank's Hardware, Inc. has a sale going on right now," the piece of text “Hank's Hardware, Inc.” has an offset position of (0, 20).
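  • The offset convention in the example above (zero-based start, inclusive end) can be checked with a few lines of code. The helper name is hypothetical, and a plain substring search stands in for the statistical model that would actually locate the entity.

```python
def inclusive_offsets(document, mention):
    """Return (begin, end) character offsets of mention, with end inclusive."""
    start = document.find(mention)
    if start < 0:
        return None
    return (start, start + len(mention) - 1)


sentence = "Hank's Hardware, Inc. has a sale going on right now,"
span = inclusive_offsets(sentence, "Hank's Hardware, Inc.")
```

The mention is 21 characters long, so it occupies positions 0 through 20, matching the (0, 20) offset given in the text.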
  • the sequence of characters has a beginning point and an ending point; however the path in between those points varies.
  • Regular expressions: contains an uppercase letter, last char is a dot, acronym format, contains a digit, punctuation
  • Single-word lists: last names, job titles, loc words, etc.
  • Multi-word lists: country names, country capitals, universities, company names, state names, etc.
  • Copy features: copy features from one token to neighboring tokens, for example, the token two to the left of me is capitalized (Cap@-2)
  • First-sentence features: copy features from first-sentence words to others
  • Abbreviation feature: copy features of a name to mentions of its abbreviation
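  • A minimal sketch of the regular-expression and copy features listed above. Only the "Cap@-2" naming comes from the text; the other feature names, the two-token copy window, and the function itself are illustrative assumptions.

```python
import re


def token_features(tokens, window=2):
    """Compute simple per-token features, then copy them to neighboring tokens."""
    base = []
    for tok in tokens:
        feats = set()
        if tok[:1].isupper():
            feats.add("Cap")
        if tok.endswith("."):
            feats.add("EndsWithDot")
        if re.search(r"\d", tok):
            feats.add("HasDigit")
        base.append(feats)

    copied = []
    for i, feats in enumerate(base):
        merged = set(feats)
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                # "Cap@-2" means: the token two to the left is capitalized.
                merged.update(f"{name}@{offset}" for name in base[j])
        copied.append(merged)
    return copied


features = token_features(["Microsoft", "on", "Monday", "announced", "a"])
```

For the token "Monday" this yields both "Cap" (its own feature) and "Cap@-2" (copied from "Microsoft").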
  • the features computation does not calculate features for isolated pinned tokens.
  • the computations combine hashes, combine tries, and combine regular expressions. Features are only computed when necessary (for example punctuation tokens are not in any hashes so do not look them up).
  • the Viterbi algorithm (or an algorithm similar in function) is used to efficiently find the most probable sequence of tags given the input and the trained model. After the algorithm determines the most probable sequence of tags, the text, such as tagged sentence 2119, where the entities are located is passed to a resolver, such as entity resolver 2120.
  • Entity resolver 2120 provides additional information on an entity by matching an identifier for an external object within authority files 2130 to which the entity refers.
  • the resolver in the exemplary embodiment uses rules instead of a statistical model to resolve named entities.
  • the external object is a company authority file containing unique identifiers.
  • the exemplary embodiment also resolves person names.
  • the exemplary resolver uses three types of rules to link names in text to authority file entries: rules for massaging the authority file entries, rules for normalizing the input text, and rules for using prior links to influence future links. Other embodiments include integrating the statistical model and resolver.
  • authority file 2130 is a database of information about entities.
  • an authority file entry for Swatch might have an address for the company, a standard name such as Swatch Ltd., the name of the current CEO, and a stock exchange ticker symbol.
  • Each authority file entry has a unique identity. In the previous example a unique id could be ID:345428, "Swatch Ltd.", Nicholas G. Hayek Jr., UHRN.S.
  • the goal of the resolver is to determine which entry in the authority file corresponds to a name mention in text.
  • Swatch Group refers to entity ID:345428.
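  • A rule-based lookup of this kind can be sketched as follows. The sketch is not the resolver algorithm of the exemplary embodiment: the in-memory authority file, the suffix list, and the normalization steps are assumptions chosen so that both the stored name and the mention reduce to the same key.

```python
import re

# Hypothetical in-memory authority file keyed by unique id; the entry mirrors
# the Swatch example from the text.
AUTHORITY_FILE = {
    "345428": {"name": "Swatch Ltd.", "ticker": "UHRN.S"},
}

# Assumed list of corporate designators stripped during normalization.
SUFFIXES = {"ltd", "inc", "corp", "co", "group", "ag", "sa"}


def normalize(name):
    """Lowercase, drop punctuation, and strip corporate designators."""
    words = re.sub(r"[^\w\s]", "", name).lower().split()
    core = [w for w in words if w not in SUFFIXES]
    return " ".join(core or words)


# "Massage" the authority file entries once, up front.
INDEX = {normalize(entry["name"]): entity_id
         for entity_id, entry in AUTHORITY_FILE.items()}


def resolve(mention):
    """Return the authority-file id for a name mention, or None if unresolved."""
    return INDEX.get(normalize(mention))


entity_id = resolve("Swatch Group")
```

An ambiguous name such as Acme would need additional evidence, for example prior links in the same document, before a single entry could be chosen.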
  • resolving names like Swatch is relatively easy in comparison to a name like Acme.
  • a number of related but different companies may be possible referents. What follows is a heuristic resolver algorithm used in the exemplary embodiment:
  • ORG i.e., stock exchange abbreviations
  • Figure 30 shows an exemplary system 2300 which builds onto the components of system 2100 with a classifier 2310 and a template extractor 2320, which are shown as part of memory 2102, and understood to be implemented using machine-readable and machine-executable instructions.
  • Classifier 2310, which accepts tagged and resolved text such as sentence 2129 from resolver 2120, identifies sentences that contain extractable relationship information pertaining to a specific relationship class. For example, if one is interested in the hiring relationship where the relationship is hire(firm, person), the filter (or classifier) 2312 identifies sentence (1.1) as belonging to the class of sentences containing a hiring or job-change event and sentence (1.2) as not belonging to the class. (1.1) John Williams has joined the firm of Skadden & Arps as an associate.
  • the exemplary embodiment implements classifier 2310 as a binary classifier.
  • building this binary classifier for relationship extraction entails:
  • Sentences containing requisite number of tagged entity types are called candidate sentences; 5) Identifying 500 positive instances from the candidate set and 500 negative instances.
  • a sentence in the candidate set that actually contains a relationship of interest is called a positive instance.
  • a sentence in the candidate set that does not contain a relationship of interest is called a negative instance. All sentences within the candidate set are either positive or negative instances.
  • Creating a classifier that combines selected features with selected training methods. Exemplary training methods include naive Bayes and Support Vector Machine (SVM). Exemplary features include co-occurring terms and syntax trees connecting relationship entities; and
  • a range of filters that are either document-dependent filters or complex relation detection filters based on machine learning algorithms are developed and tools that easily retarget new document types.
  • the structure of a document type provides very reliable clues on where the sought after information can be found.
  • the filter is flexible and automatically detects promising areas in a document.
  • a filter that includes a machine learning tool for example Weka
  • Weka machine learning tool
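As a rough illustration of the binary classifier described above, the following sketch trains a small naive Bayes model on candidate sentences whose entity names have been replaced by type placeholders. It is only a sketch under stated assumptions: the class name `NaiveBayesSentenceClassifier`, the whitespace tokenizer, the bag-of-words features, and the four training sentences are hypothetical and do not come from the specification, which calls for about 500 positive and 500 negative instances and also mentions SVM training and syntax-tree features.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenizer; a stand-in for real feature extraction."""
    return sentence.lower().split()


class NaiveBayesSentenceClassifier:
    """Binary multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, sentences, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.class_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens, label):
        total = sum(self.word_counts[label].values())
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for token in tokens:
            if token in self.vocabulary:  # words unseen in training are ignored
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total + len(self.vocabulary)))
        return score

    def predict(self, sentence):
        """Return True when the sentence looks like a positive instance."""
        tokens = tokenize(sentence)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


# Hypothetical training sentences with entity types substituted for names.
positives = ["PERSON has joined the firm of ORG as an associate",
             "ORG hired PERSON as a partner"]
negatives = ["PERSON represented ORG in the appeal",
             "ORG was sued by PERSON last year"]
clf = NaiveBayesSentenceClassifier().fit(
    positives + negatives, [True, True, False, False])
```

Calling `clf.predict(...)` on a new candidate sentence returns whether it is assigned to the positive (relationship-bearing) class.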
  • Template extractor 2320 extracts event templates from positively classified sentences, such as sentence 2319, from classifier 2310.
  • extracting templates from sentences involves identifying the name entities participating in the relationship and linking them together so that their respective roles in the relationship are identified.
  • a parser is utilized to identify noun phrase chunks and to supply a full syntactic parse of the sentence.
  • extractor 2320 entails:
  • a sentence containing a job change event is one that describes an attorney joining a law firm or other organization in a professional capacity.
  • the target corpora from which job change events are extracted are legal newspaper databases.
  • the minimal number of tagged entities which qualify a sentence for inclusion in the candidate set is one lawyer name and one legal organization name.
  • One way to efficiently collect positive and negative training instances is to stratify samplings. This can be done by sorting the sentences according to the head word of the verb phrase that connects a person with a law firm in the sentence. Then collect all head verbs that occur at least five times under a single bucket. After collection, select five example sentences from each bucket randomly and mark them as either positive or negative examples. For each bucket that yields only positive examples, add all remaining instances to the positive example pool.
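The stratified sampling procedure above might be sketched as follows. This is a minimal sketch, assuming that head-verb extraction has already been done and that each input item is a `(head_verb, sentence)` pair; the function names, the fixed random seed, and the reading of "at least five times" as a per-bucket threshold are assumptions, not details from the specification.

```python
import random
from collections import defaultdict


def stratified_samples(sentences, min_count=5, per_bucket=5, seed=0):
    """Bucket (head_verb, sentence) pairs by verb and sample from frequent buckets."""
    buckets = defaultdict(list)
    for head_verb, sentence in sentences:
        buckets[head_verb].append(sentence)
    rng = random.Random(seed)
    samples = {}
    for verb, members in buckets.items():
        if len(members) >= min_count:  # keep only verbs seen at least min_count times
            samples[verb] = rng.sample(members, per_bucket)
    return buckets, samples


def expand_positive_pool(buckets, samples, labels):
    """Add the unsampled remainder of every all-positive bucket to the positive pool.

    `labels` maps each manually marked sample sentence to True (positive)
    or False (negative).
    """
    pool = []
    for verb, sampled in samples.items():
        if all(labels[s] for s in sampled):
            pool.extend(s for s in buckets[verb] if s not in sampled)
    return pool
```

The manual marking step stays outside the code: a reviewer supplies `labels` for the sampled sentences, and only buckets whose samples are all positive contribute their remaining sentences.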
  • the job change event extractor moves identified entities from a positively classified job change event sentence into a structured template record.
  • the template record identifies the roles the named entities and tagged phrases play in the event.
  • the template below (which also represents a data structure) is in reference to sentence 1.1 above.
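The original template is not reproduced in this text. As a hedged illustration only, a template record for sentence 1.1 might be modeled as below; the class name `JobChangeTemplate`, its field names, and the optional ID fields are assumptions rather than the structure used in the exemplary embodiment.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobChangeTemplate:
    """Roles played by tagged entities in a hire(firm, person) event."""
    person: str
    firm: str
    position: Optional[str] = None
    person_id: Optional[str] = None  # resolver-assigned ID, if available (assumed field)
    firm_id: Optional[str] = None    # resolver-assigned ID, if available (assumed field)


# Sentence 1.1: "John Williams has joined the firm of Skadden & Arps as an associate."
record = JobChangeTemplate(
    person="John Williams", firm="Skadden & Arps", position="associate")
record_dict = asdict(record)
```

A flat record like this maps directly onto a database row, which fits the later step of storing extracted data.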
  • classifier 2310 determines whether tagged and resolved sentences (or more generally text segments) from entity resolver 2120 include a merger and acquisition event, that is, an event in which one company merges with or acquires another company.
  • the target corpora for extracting merger and acquisition events are financial news wire articles.
  • the minimal number of tagged entities which qualifies a sentence for inclusion in the candidate set is two company names.
  • To help collect training data, utilize structured records from a merger and acquisitions database on the Westlaw® information-retrieval system (or other suitable information-retrieval system) to identify merger and acquisition events that have taken place in the recent past.
  • a net income announcement event occurs when a company announces it has expected or actualized net income over a specific time frame.
  • the target corpora for extracting merger and acquisition events are financial news wire articles.
  • the minimal number of tagged entities which qualifies a sentence for inclusion in the candidate set is one company name and the phrase "net income" or the word "profit".
  • To efficiently find positive instances, extract net income information from SEC documents for particular companies; a candidate is positive when the named company in the sentence and the dollar amount or percentage increase in profit for a time period line up with information from an SEC document.
  • Negative instances are found when the data for a particular company does not line up with SEC filings.
  • the net income announcement event extractor moves identified entities from a positively classified net income announcement event sentence into a structured template record.
  • the template record identifies the roles the named entities and tagged phrases play in the event.
  • An additional embodiment of the present invention includes a tool that generates sentence paraphrases starting from the seed templates provided by a user.
  • the tool takes sentences that indicate an event with high precision with the actual entities replaced by their generic types.
  • the sentence is searched for in a corpus and the actual entity identities are obtained.
  • other sentences with the same entities are located in the corpus (perhaps in a narrow time window), which serve as paraphrases for the initial sentence.
  • This step can now be repeated with the newly acquired sentences.
  • the sentences can be ordered according to frequencies of component phrases and manually checked to generate gold data.
  • One example is a time-stamped news corpus from different news sources, where the same event is likely to be covered by different sources;
  • ORG1 acquired ORG2 means this is an MA sentence with ORG1 being the buyer and ORG2 being the target.
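A seed template of this kind can be pictured as a pattern over sentences in which entity names have been replaced by generic types. The sketch below is an assumption-laden illustration: the regular expression, the helper names `generalize` and `extract_ma_event`, and the example sentence using the names Acme and Swatch are hypothetical, and real paraphrase acquisition would search a corpus rather than match a single pattern.

```python
import re

# Seed template: "ORG1 acquired ORG2" marks an MA sentence (buyer, then target).
SEED_PATTERN = re.compile(r"(?P<buyer>ORG\d+) acquired (?P<target>ORG\d+)")


def generalize(sentence, entities):
    """Replace each known organization name with a generic slot ORG1, ORG2, ..."""
    slots = {}
    for i, name in enumerate(entities, start=1):
        slot = f"ORG{i}"
        if name in sentence:
            sentence = sentence.replace(name, slot)
            slots[slot] = name
    return sentence, slots


def extract_ma_event(sentence, entities):
    """Return the buyer and target if the generalized sentence matches the seed."""
    generic, slots = generalize(sentence, entities)
    match = SEED_PATTERN.search(generic)
    if match is None:
        return None
    return {"event": "MA",
            "buyer": slots[match.group("buyer")],
            "target": slots[match.group("target")]}


event = extract_ma_event("Acme acquired Swatch on Monday.", ["Acme", "Swatch"])
```

Because the roles are read from the slot positions in the pattern, the order in which the tagged entities are listed does not change which one is reported as the buyer.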
  • Another embodiment entails extraction of information from tables found in text.
  • An SVM classifier (or another classifier similar in function) distinguishes tables from non-tables. Tables that are only used for formatting reasons are identified as non-tables. In addition, tables are classified as tables of interest, such as background, compensation, etc.
  • the feature set comprises text before and after the tables as well as n-grams of the text in the table. The tables of interest are then processed according to the following:
  • the table has to be partitioned in the labels and the values. For the exemplary table below, the system determines that the money amounts are values and the rest are labels;
  • Figure 31 shows a flow chart 2400 of an exemplary method of operating a named entity tagging, resolution, and event extraction system, such as system 2300 in Figure 30.
  • Flow chart 2400 includes blocks 2410-2460, which are arranged and described serially. However, other embodiments also provide different functional partitions or blocks to achieve analogous results.
  • Block 2410 entails breaking the extracted text into tokens. Execution proceeds at block 2420.
  • Block 2420 entails locating parts of the extracted text that need to be processed. In the exemplary embodiment, this entails use of zoner 2112 to locate candidate sentences for processing. Execution then advances to block 2430.
  • Block 2430 entails finding the named entities within the processed parts of extracted text. Then the entities of interest in the candidate sentences are tagged.
  • Candidate sentences are sentences from the target corpus that might contain a relationship of interest. For example, one embodiment identifies text segments that indicate job-change events; another identifies segments that indicate merger and acquisition activity; yet another identifies segments that may indicate corporate income announcements. Execution continues at block 2440.
  • Block 2440 entails resolving the named entities. Each entity is attached to a unique ID that maps the entity to a unique real world object, such as an entry in an authority file. Execution then advances to block 2450.
  • Block 2450 classifies the candidate sentences.
  • the candidate sentences are classified into two sets: those that contain the relationship of interest and those that do not. For example, one embodiment identifies text segments that indicate job-change events; another identifies segments that indicate merger and acquisition activity; yet another identifies segments that may indicate corporate income announcements. When the text is classified, execution advances to block 2460.
  • Block 2460 entails extracting the relationship of interest using a template. More specifically, this entails extracting entities from text containing the relationship and placing the entities in a relationship template that properly defines the relationship between the entities.
  • the extracted data may be stored in a database, but it may also involve more complex operations such as representing the data according to a timeline or mapping it to an index.
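The control flow of blocks 2410-2460 can be pictured with the toy pipeline below. Every implementation detail is an assumption made for illustration: the `AUTHORITY` dictionary stands in for an authority file, tagging is done by simple lookup, the classifier is a single keyword test, and tokenization is invoked inside the classifier rather than as a separate first pass. Only the ordering of the stages follows the flow chart.

```python
import re

# Hypothetical authority file: name -> (unique ID, entity type).
AUTHORITY = {
    "John Williams": ("P-001", "PERSON"),
    "Skadden & Arps": ("F-001", "FIRM"),
}


def tokenize(text):
    """Block 2410: break text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def zone(text):
    """Block 2420: locate candidate segments (here, naive sentence splitting)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tag(sentence):
    """Block 2430: find named entities (here, by authority-file lookup)."""
    return [name for name in AUTHORITY if name in sentence]


def resolve(names):
    """Block 2440: attach each entity to its unique ID and type."""
    return {name: AUTHORITY[name] for name in names}


def classify(sentence):
    """Block 2450: toy stand-in for the trained binary classifier."""
    return "joined" in tokenize(sentence)


def extract(resolved):
    """Block 2460: place the entities in a hire(firm, person) template."""
    template = {}
    for name, (entity_id, entity_type) in resolved.items():
        role = "person" if entity_type == "PERSON" else "firm"
        template[role] = {"name": name, "id": entity_id}
    return template


def run(text):
    """Run all stages and return one template per positively classified sentence."""
    events = []
    for sentence in zone(text):
        resolved = resolve(tag(sentence))
        types = {entity_type for _, entity_type in resolved.values()}
        if types == {"PERSON", "FIRM"} and classify(sentence):
            events.append(extract(resolved))
    return events
```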
  • Some embodiments of the present invention are implemented using a number of pipelines that add annotations to text documents, each component receiving the output of one or more prior components.
  • These implementations use the Unstructured Information Management Architecture (UIMA) framework, ingest plain text, and decompose the text into components.
  • Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files.
  • the framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
  • UIMA additionally provides a subsystem that manages the exchange between different modules in the processing pipeline.
  • the Common Analysis System (CAS) holds the representation of the structured information Text Analysis Engines (TAEs) add to the unstructured data.
  • CAS Common Analysis System
  • the TAEs receive results from other UIMA components and produce new results that are added to the CAS.
  • all results stored in the CAS can be extracted from there by the invoking application (for example, database population) via a CAS consumer.
  • Primitive TAEs for example, tokenizer, sentence splitter
  • Other embodiments use alternatives to the
  • a character analysis and processing procedure 2500 begins at step 2502 with the LSS/EMM initialized with a set or list of existing character names and the associated alias names of those characters.
  • the LSS will construct this list from its relational database (RDB).
  • RDB relational database
  • the system initiates processing by passing EMM the contents of a document.
  • EMM finds characters in the document and generates a set of characters for passing to the LSS. These may be characters from the original list or new characters.
  • EMM returns to LSS a set or list of found characters which may be new or existing characters or aliases.
  • EMM also returns a set or list of document location information that represents where in the document each character was found. For instance, the locations may comprise page, line, and start and end positions within the document.
  • LSS takes the returned character list or set and updates its relational database (RDB). This process may include adding new characters and updating existing ones. Also, existing characters may be updated with new aliases.
  • EMM may also identify and collect address, contact and other information associated with a character found in a document and return a set of such information to LSS. LSS may then update address, contact or other information associated with a character.
  • LSS takes the location set or list, translates it to the internal document location representation and stores in a relational table for that document.
  • the end user can access for viewing and further action the updated character set or list in the LSS interface.
  • the end user can also access for viewing and further action smart tags in the document associated with characters involved in an event, e.g., a litigation.
  • the Character Recognition Process performed by the EMM of the LSS system operates as follows.
  • the LSS integrates with a component, EMM, to recognize "characters," e.g., persons, entities, company names, that appear within part or all of a document, e.g., within the text or body of a document.
  • This process may be performed across a set of documents. For instance, in the legal context, decisions rendered in cases result in published opinions, orders or other documents that are of interest to legal professionals.
  • LSS systems provide searching functions to enable users, such as attorneys, to search, identify and examine documents of interest. For instance, an attorney may be interested in reviewing decisions rendered by a certain court, judge or other entity.
  • An LSS may maintain an existing relational DB of character or entity records associated with a collection of case law.
  • the present invention may be used, for instance on a periodic basis as decisions are rendered and published, to update the RDB to further associate published decisions with existing characters, such as judges, attorneys, parties, etc.
  • the present invention may be used to allow the LSS to create a new character record.
  • the LSS may be an integrated solution, such as West's LiveNote and Case Notebook solutions, and may include centralized components, such as web-servers and databases, and may involve localized applications that are downloaded and stored locally such as at a client computer or server.
  • Case Notebook stores data in "Cases" and each case can contain many documents in various formats.
  • the EMM provides an xml based messaging system for inter-process communication between EMM and LSS.
  • LSS starts the EMM executable as desired, on a periodic scheduled basis, or as needed to process a set of documents to recognize characters and/or maintain the RDB.
  • the LSS opens a named pipe to communicate with that process. Essentially, LSS sends xml, receives a response, then sends more xml, etc.
  • LSS sends EMM a set or list of characters. This character list or set is used for all content in the session. Characters have a name, and they can also have a list of aliases or nicknames, e.g., one alias for the name "David” is "Dave.”
  • LSS then sends a set of documents or content, e.g., each document or content may be sent one item at a time.
  • a process translates the document's internal coordinate system into a coordinate system configured in the EMM.
  • transcripts are stored with document locations specified by a page, a line and a position on that line.
  • Word Processing files are stored with document locations specified by an offset position from the start of the file.
  • Image locations are specified by a page along with a rectangle on that page (i.e., an x,y origin and a width and height).
  • the EMM document location may be the same as the transcript document location.
  • EMM then processes the document to identify characters.
  • the EMM may identify characters both from the existing list (derived from the LSS RDB), and it may also identify new characters that do not correspond to any character records maintained by the LSS.
  • EMM sends to the LSS a set or list of new characters, along with a set or list of location information representing where in the document each character can be found.
  • LSS then merges the returned character list with the case's character set maintained at the RDB - this may also be referred to as an authority DB.
  • the EMM may simply return to the LSS a complete list of characters identified in the set of documents, leaving to the LSS the functionality of determining duplication within the returned character set vis-a-vis the existing or authority character set.
  • LSS repeats steps (3) and (4) until it has no more documents to scan. It then shuts down the EMM process.
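The specification does not give the XML schema exchanged between LSS and EMM. Purely as an illustration of what such messages could look like, the sketch below serializes a character list with aliases and parses a reply carrying page, line, start and end positions. All element and attribute names (`characters`, `character`, `alias`, `found`, and so on) are invented for this example, and the named-pipe transport is left out.

```python
import xml.etree.ElementTree as ET


def build_character_list(characters):
    """Serialize {name: [aliases]} into a hypothetical <characters> message."""
    root = ET.Element("characters")
    for name, aliases in characters.items():
        node = ET.SubElement(root, "character", {"name": name})
        for alias in aliases:
            ET.SubElement(node, "alias").text = alias
    return ET.tostring(root, encoding="unicode")


def parse_found_characters(xml_text):
    """Parse a hypothetical EMM reply into (name, page, line, start, end) tuples."""
    found = []
    for node in ET.fromstring(xml_text).iter("found"):
        location = tuple(int(node.get(key)) for key in ("page", "line", "start", "end"))
        found.append((node.get("name"),) + location)
    return found


# Example from the text: one alias for the name "David" is "Dave".
request = build_character_list({"David": ["Dave"]})
```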
  • the LSS may also include code to transform LSS-related content coordinate systems into the EMM coordinate system.
  • a module may be provided to transform Word Processing coordinates into EMM coordinates.
  • Word Processing files have coordinates that are stored as a single number, which is a character offset from the beginning of the file. These are transformed into EMM coordinates by "walking down" the document. Every 75 characters the process walks forward to the end of a word. For each such instance the process recognizes this 75+ character string as a line. For every 25 lines, the process adds those lines to a page.
  • the reference to "character” is not to an entity or name, as used elsewhere in this specification, but rather to individual discrete, base units of linguistic expression. For example, the single "character” "David" comprises five characters.
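The "walking down" transformation can be sketched as follows, assuming lines of at least 75 characters extended to the end of the current word and 25 lines per page, as described above. The function names, the choice to start the next line after the separating whitespace, and the 1-based page and line numbers with a 0-based position are assumptions.

```python
from bisect import bisect_right


def build_line_starts(text, min_line_len=75):
    """Return the character offset at which each synthetic line starts."""
    starts = [0]
    pos = 0
    while pos + min_line_len < len(text):
        end = pos + min_line_len
        # Walk forward to the end of the current word.
        while end < len(text) and not text[end].isspace():
            end += 1
        # Skip the whitespace separating this line from the next (assumption).
        while end < len(text) and text[end].isspace():
            end += 1
        if end >= len(text):
            break
        starts.append(end)
        pos = end
    return starts


def to_location(starts, offset, lines_per_page=25):
    """Map a character offset to (page, line, position-on-line)."""
    line_index = bisect_right(starts, offset) - 1
    page = line_index // lines_per_page + 1
    line = line_index % lines_per_page + 1
    return page, line, offset - starts[line_index]
```

Precomputing the line starts once per document lets each later offset lookup run as a binary search instead of another walk through the text.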
  • the LSS may also include code to transform LSS-related Image coordinates into EMM coordinates. Images have words located in rectangles on pages. To transform these rectangles into lines, the LSS leverages the fact that its OCR engine lists words in the traditional English order (i.e., it starts from the top left, moves right, and then back to the left when the line is ended). Accordingly, the process runs down the list of rectangles. If the y coordinates of the word do not overlap with the previous word (which would indicate a move to the next line), or if the x coordinates are less than the previous rectangle (which would indicate a carriage return equivalent), then the process starts a new line.
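The line-grouping rule for OCR word rectangles might look like the sketch below. The `WordBox` type and function names are hypothetical, and the rectangles are assumed to arrive in reading order with an x,y origin plus width and height, as stated above.

```python
from typing import NamedTuple


class WordBox(NamedTuple):
    """A word and its rectangle on the page (x, y origin plus width and height)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def _y_overlap(a, b):
    """True when the vertical extents of two rectangles overlap."""
    return a.y < b.y + b.height and b.y < a.y + a.height


def group_into_lines(boxes):
    """Start a new line when y ranges do not overlap or x moves back to the left."""
    lines = []
    previous = None
    for box in boxes:
        if previous is None or not _y_overlap(previous, box) or box.x < previous.x:
            lines.append([])
        lines[-1].append(box)
        previous = box
    return lines


boxes = [WordBox("Hello", 10, 10, 40, 12), WordBox("world", 55, 11, 40, 12),
         WordBox("next", 10, 30, 30, 12), WordBox("line", 45, 30, 30, 12)]
lines = group_into_lines(boxes)
```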
  • "Automatically Creates Characters from Full Text” 2610 controls whether names of persons and organizations are automatically added to the user-displayed Characters table when the Entity Tagger tags them in the FULL TEXT data imported and stored in the LSS, for example in West Case Notebook.
  • "Automatically Creates Characters from Properties” 2612 controls whether names of persons and organizations are automatically added to the user- displayed Characters table when the user manually enters the name into a specific metadata property, e.g. Deponent Name. If the box is selected or "checked” then the EMM
  • the user may optionally de-select the checkbox appearing beneath the heading "Automatically Creates Characters from Full Text.”
  • the EMM does not automatically display new names in the Characters table when the Character Recognition software tags words in the full text of data imported into the system.
  • the EMM software still tags names; however, they are stored in a side table, for example, for the user to analyze at a later time and potentially add to the main Characters table. This may be a default setting.
  • the selected document e.g., memorandum l3.doc
  • the EMM Character Recognition process runs on the words indexed from the target document. After the EMM Character Recognition process is completed, the user does not see some of the entity names of persons and organizations
  • the options box 3002 allows the user to run, for example, a Characters report for the new entities and Profile the new entities on a separate part of the LSS or using an outside or separate professional services system, e.g., Westlaw.
  • the LSS may include a "onePass" type user authorization feature that permits seamless integration and flow to some or all of additional research or other tools and systems.
  • a user may also be presented with a typical "login” screen to access the outside or separate service or tool.
  • the user may select the entity "Apache Nitrogen Products, Inc.” 3006 and select "Profile on Westlaw” to display a further option box 3004 from which the user may select "Person & Company Library” feature.
  • Figure 38 shows a screen, following any required login process, for performing this added service of a search for the selected entity using the selected resource.
  • Figure 39 illustrates a series of reports resulting from the additional search.
  • search results may also be brought into the LSS system for use in performing professional services.
  • the results may include documents relating to a case and/or entity of interest to the user and may be incorporated into a documents database, may be processed for smart tagging, may be excerpted for deposition outline, etc.
  • the processes described above may now be performed on any imported document from the outside or added service.
  • the user has selected document "memorandum7.doc.”
  • the document 3500 (memorandum7.doc) is then imported into the LSS, e.g., West Case Notebook, and the EMM Character Recognition process runs on the words indexed from the target document.
  • LSS e.g., West Case Notebook
  • the user also has right-click options associated with this Smart Tagged name, appearing in the Characters table.
  • the user chooses the Characters Report 3502 right-click option for Frank Ermis.
  • the Characters Report runs and returns the reference to Frank Ermis's name in the full text of the document currently stored in the LSS.
  • the user may then click the link to the document titled "memorandum7" to view the full document referencing the name Frank Ermis.
  • the LSS retrieves the full document referencing the name Frank Ermis, highlighting the reference. This is useful when the user wants to quickly see the thousands of references to a Character of the litigation appearing across potentially thousands of documents stored in the LSS.
  • the user double-clicks the entity listing "Enron North America Corp." 3802 to view the Properties dialog 3804 of the user-displayed Characters table 1800.
  • the "Details” tab of the dialog 3804 is presented, but the user may click on the "Aliases” tab to add alias information for the "Enron North America Corp” entity.
  • Upon selecting the Aliases tab, the user is presented with the Aliases screen 3900, including the "Other Aliases & Characters" table 3902 on the right side of the dialog box.
  • This table or list 3902 displays a list of entities displayed in the Characters table, as well as the entities tagged in the data by the EMM Character Recognition software.
  • Table 3 A compensation table
  • Step 1, which is implemented to maintain efficiency, entails identifying tables that have a reasonable chance of containing the desired relation before deep analysis is applied.
  • the tables containing the desired information are quickly identified using relation-specific classifiers based on supervised machine learning.
  • In Step 2, we distinguish label columns and label rows from the values inside those tables. The same supervised machine learning approach is used, but the training data differs from that in Step 1.
  • In Step 3, after the label rows and label columns are identified, an elaborate procedure is applied to these complex tables to ensure that semantically coherent labels are not separated into multiple cells and that multiple distinct labels are not squashed into one cell.
  • the goal here is to associate each value with its labels in the same column and the same row.
  • the result of Step 3 is a list of attribute-value pairs.
  • In Step 4, a rule-based inference module goes through each attribute-value pair and identifies the desirable ones to populate the officers and directors database.
  • Before providing the details of those steps, we first describe the annotation used for the supervised learning employed in both Step 1 and Step 2.
  • To make our system more robust against lexical variations and table variations, we employed supervised machine learning in Step 1 and Step 2. In supervised learning, one of the most challenging and time-consuming tasks is obtaining the labeled examples. To make our approach reusable across different domains, we developed a scheme that minimizes the human annotation effort needed.
  • isGenuine: a flag indicating whether this is a genuine table or a non-genuine table.
  • relations: the relations that a table contains, such as "name+title", "name+age", "name+year+salary" or "name+year+bonus", or a combination of them.
  • isContinuous: a flag indicating whether this table is a continuation of the previous genuine table.
  • lastLabelRow: the row number of the last label row.
  • lastLabelColumn: the column number of the last label column associated with each relation.
  • valueColumn: the number of the column that contains the desired values for each relation.
  • the flag isContinuous indicates if the current table is a continuation of the previous table. If it is, the current table can "borrow" the boxhead from previous table since such information is missing.
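One way to picture the annotation scheme is as a small record per table, plus a helper that lets a continuation table "borrow" the boxhead of the previous genuine table. This is a sketch under assumptions: the snake_case field names mirror isGenuine, relations, isContinuous, lastLabelRow, lastLabelColumn and valueColumn; row numbers are taken to be 1-based; and `boxhead_for` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableAnnotation:
    is_genuine: bool                                    # isGenuine
    relations: List[str] = field(default_factory=list)  # e.g. "name+year+salary"
    is_continuous: bool = False                         # isContinuous
    last_label_row: int = 0                             # lastLabelRow (assumed 1-based)
    last_label_column: Dict[str, int] = field(default_factory=dict)  # per relation
    value_column: Dict[str, int] = field(default_factory=dict)       # per relation


def boxhead_for(tables, index):
    """Return the label rows for table `index`, borrowing from the previous
    genuine table when the table is marked as a continuation.

    `tables` is a list of (TableAnnotation, rows) pairs in document order.
    """
    while index > 0 and tables[index][0].is_continuous:
        index -= 1
        while index > 0 and not tables[index][0].is_genuine:
            index -= 1  # skip formatting-only tables in between
    annotation, rows = tables[index]
    return rows[:annotation.last_label_row]
```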
  • Table classification: Much past work in table classification has focused on distinguishing between genuine and non-genuine tables (Wang & Hu 2002). For information extraction, we need to go a step further: we also need to know whether a table contains the desired information before we perform expensive operations on it. To identify tables that contain desired relations, we employed LIBSVM (Chang & Lin 2001), a well-known implementation of support vector machines. Based on the annotated tables, a separate model is trained for each desired relation. In the SEC domain, a table might contain multiple relations. Exemplary features include:
  • Label row and column classification: Based on the annotated data, LIBSVM is again used to classify which rows belong to the boxhead and which columns belong to the stub.
  • the training data for the models are words in the desired tables that were manually identified as box-head and stubs by using lastLabelRow and lastLabelColumn features. Other features used include the frequency of label words, the frequency of name words, and frequency of numbers.
  • the exemplary embodiment uses a different label column classifier for each relation, since the lastLabelColumn might differ between different relations, as explained in the Annotation Section.
  • Step 1 specifically addresses the issue with the use of columnspan and rowspan in HTML tables, as has been done in (Chen, Tsai, & Tsai 2000).
  • In Table 3, without copying the original labels into spanning cells, the label "annual compensation" would not be attached to the value "1,300,000" using just the HTML specification. By doing this step, we only need to associate all the labels in the boxhead in that particular column with the value and can ignore other columns.
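Copying labels into spanning cells can be illustrated with the sketch below, which expands colspan and rowspan into a plain rectangular grid. The input representation (each cell as a `(text, colspan, rowspan)` tuple), the function names, and the small table loosely modeled on the compensation example are assumptions; `<salary>` is a placeholder, not a value from the source.

```python
def expand_spans(rows):
    """Expand (text, colspan, rowspan) cells into a rectangular grid of strings,
    copying each label into every cell it spans."""
    grid = []
    pending = {}  # (row, col) -> text carried down by an earlier rowspan
    for r, row in enumerate(rows):
        out = []
        c = 0
        cells = iter(row)
        while True:
            if (r, c) in pending:  # slot already filled from a row above
                out.append(pending.pop((r, c)))
                c += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, colspan, rowspan = cell
            for dc in range(colspan):
                out.append(text)
                for dr in range(1, rowspan):
                    pending[(r + dr, c + dc)] = text
            c += colspan
        grid.append(out)
    return grid


def labels_for(grid, column, label_rows):
    """Collect the boxhead labels above a value column."""
    return [grid[r][column] for r in range(label_rows)]


rows = [
    [("Name", 1, 2), ("Fiscal Year", 1, 2), ("Annual Compensation", 2, 1)],
    [("Salary", 1, 1), ("Bonus", 1, 1)],
    [("John T. Chambers", 1, 1), ("2005", 1, 1), ("<salary>", 1, 1), ("1,300,000", 1, 1)],
]
grid = expand_spans(rows)
```

After expansion, the labels for the column holding "1,300,000" can be read straight down that column, without consulting neighboring columns.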
  • In Step 2, we use certain layout information, such as underlines, empty lines, or background color, to determine when a label is really complete.
  • In SEC filings, there are many instances where a label is broken up into multiple cells in the boxhead or stub. In those cases, we want to recreate the semantically meaningful labels to facilitate later relation extraction, a process that is heavily dependent on the quality of the labels attached to the values. For example, in Table 3, based on the separator in row 5, cells "John T. Chambers", "President, Chief Executive", and "Officer and Director" are merged into one cell, with a line break marker (#) inserted at each original break position. The new cell is "John T.
  • In Step 4, heuristic rules are applied to identify subheaders. For example, if there is no value in the whole row except for the first label cell, then that label cell is classified as a subheader. The subheader label is assigned as part of the label to every cell below it until a new subheader label cell is encountered.
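The subheader heuristic can be sketched as below. It assumes each row is a list of strings whose first cell is the stub label; the function name, the " / " separator used to join a subheader to a label, and the group names in the example are all hypothetical.

```python
def propagate_subheaders(rows):
    """Attach the most recent subheader to the label of each row beneath it.

    A row counts as a subheader when only its first cell is non-empty.
    """
    result = []
    subheader = None
    for row in rows:
        label, values = row[0], row[1:]
        if label.strip() and not any(v.strip() for v in values):
            subheader = label.strip()  # remember it; the row itself holds no values
            continue
        full_label = f"{subheader} / {label}" if subheader else label
        result.append([full_label] + values)
    return result


rows = [
    ["Group A", "", ""],
    ["John T. Chambers", "2005", "1,300,000"],
    ["Group B", "", ""],
    ["Officer X", "2005", ""],
]
labeled = propagate_subheaders(rows)
```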
  • Step 5 splits certain columns into multiple columns to ensure that a value cell does not contain multiple values.
  • the first cell in first column is "name and principal position”.
  • the system detects the word "and" and splits the column into two columns, "name" and "principal position", and performs similar operations on all the cells in the original column.
  • the cell on row 2 is the result of merging 3 cells, with line break markers between the strings from the original cells. By default, we use the first line break marker to break the merged cell into two cells. After this transformation, we have "John T. Chambers" and "President, Chief" that correspond to "name" and "principal position". This type of operation is not limited to "and", but also applies to certain parentheses,
  • Step 6 deals with repeated sequences in the last label column.
  • In Table 3, we are fortunate that all the cells under "fiscal year" contain only 1 value. There are instances in our corpus where such information is represented inside the same cell with a line break between each value. In such cases, there are no lines between these values, and the resulting table looks cleaner and thus visually more pleasing. It is certainly incorrect to assign all 3 years "2005, 2004, 2003" to the cell containing bonus information "1,300,000". To address this, our system performs repeated sequence detection on all last label columns. If a sequence pattern, which doesn't always have to be exactly the same, is detected, the repeated sequence is broken into multiple cells so that each cell can be assigned to the associated value correctly.
  • This process is designed to maximize precision and recall while minimizing the annotation effort.
  • Each component can be modified to take advantage of the domain specific information to improve its performance.

Abstract

The present invention relates to a method and system that allow users to easily access online legal research tools while using other applications. An exemplary computer-implemented system uses companion software that integrates into a host word processing application on a client access device. This integrated companion software lets users select from an extensible listing of one or more web applications on a web server, each of which can control operation of the host word processing application. The web applications facilitate retrieving and accessing information from information-retrieval services and from secondary-source reference texts, and incorporating that information into the document or into metadata associated with the document. The invention provides a seamless user experience across host applications, information service providers (ISPs), such as legal research databases and legal research tools, and secondary sources, such as reference texts related to primary source documents, for example case law and statutes, associated with the ISP's service.
EP11823869.0A 2010-08-05 2011-08-05 Method and system for integrating web-based systems with local document processing applications Withdrawn EP2601573A4 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12/806,116 US9501467B2 (en) 2007-12-21 2010-08-05 Systems, methods, software and interfaces for entity extraction and resolution and tagging
US12/806,119 US11386510B2 (en) 2010-08-05 2010-08-05 Method and system for integrating web-based systems with local document processing applications
PCT/US2011/001391 WO2012033511A1 (fr) 2010-08-05 2011-08-05 Method and system for integrating web-based systems with local document processing applications

Publications (2)

Publication Number Publication Date
EP2601573A1 true EP2601573A1 (fr) 2013-06-12
EP2601573A4 EP2601573A4 (fr) 2014-03-19

Family

ID=45810918

Family Applications (1)

Application Number Title Priority Date Filing Date
EP11823869.0A Withdrawn EP2601573A4 (fr) 2010-08-05 2011-08-05 Method and system for integrating web-based systems with local document processing applications

Country Status (3)

Country Link
EP (1) EP2601573A4 (fr)
CA (2) CA2807494C (fr)
WO (1) WO2012033511A1 (fr)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11455350B2 (en) * 2012-02-08 2022-09-27 Thomson Reuters Enterprise Centre Gmbh System, method, and interfaces for work product management
WO2015192212A1 (fr) * 2014-06-17 2015-12-23 Maluuba Inc. Server and method for classifying entities of a query
US10185559B2 (en) 2014-06-25 2019-01-22 Entit Software Llc Documentation notification
US20160162467A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
US11487951B2 (en) * 2017-09-18 2022-11-01 Microsoft Technology Licensing, Llc Fitness assistant chatbots
US11270213B2 (en) * 2018-11-05 2022-03-08 Convr Inc. Systems and methods for extracting specific data from documents using machine learning
SE1950154A1 (en) * 2019-02-11 2020-06-02 Roxtec Ab A computerized method of producing a customized digital installation guide for building a sealed installation of one or more cables, pipes or wires by assembling ordered and delivered transit components to form a transit
US11741191B1 (en) * 2019-04-24 2023-08-29 Google Llc Privacy-sensitive training of user interaction prediction models
CN110264161A (zh) * 2019-06-21 2019-09-20 唐山开用网络信息服务有限公司 Public security case-related property management and collection platform
US20210117503A1 (en) * 2019-10-18 2021-04-22 Coupang Corp. Computer-implemented systems and methods for manipulating an electronic document
WO2022150838A1 (fr) * 2021-01-08 2022-07-14 Schlumberger Technology Corporation Exploration and production document content and metadata scanner
CN113986248B (zh) * 2021-11-03 2023-05-16 抖音视界有限公司 Code generation method and apparatus, computer device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7836010B2 (en) * 2003-07-30 2010-11-16 Northwestern University Method and system for assessing relevant properties of work contexts for use by information services
MX2008014893A (es) * 2006-05-23 2009-05-28 David P Gold System and method for organizing, processing and presenting information.
CA2710421A1 (fr) * 2007-12-21 2009-07-09 Marc Light Extraction of entities, events and relations
US8019769B2 (en) * 2008-01-18 2011-09-13 Litera Corp. System and method for determining valid citation patterns in electronic documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
No further relevant documents disclosed *
See also references of WO2012033511A1 *

Also Published As

Publication number Publication date
EP2601573A4 (fr) 2014-03-19
CA2807494A1 (fr) 2012-03-15
WO2012033511A1 (fr) 2012-03-15
CA2807494C (fr) 2020-02-11
CA3060498C (fr) 2023-01-31
CA3060498A1 (fr) 2012-03-15

Similar Documents

Publication Publication Date Title
US9501467B2 (en) Systems, methods, software and interfaces for entity extraction and resolution and tagging
CA2807494C (fr) Method and system for integrating web-based systems with local document processing applications
US11386510B2 (en) Method and system for integrating web-based systems with local document processing applications
Kiryakov et al. Semantic annotation, indexing, and retrieval
US9218344B2 (en) Systems, methods, and software for processing, presenting, and recommending citations
Meyer et al. OntoWiktionary: Constructing an ontology from the collaborative online dictionary Wiktionary
Feldman The answer machine
Tito et al. Icdar 2021 competition on document visual question answering
Säily et al. Explorations into the social contexts of neologism use in early English correspondence
Wang et al. Seeft: Planned social event discovery and attribute extraction by fusing twitter and web content
Qumsiyeh et al. Searching web documents using a summarization approach
Upshall Text mining: Using search to provide solutions
Weal et al. Ontologies as facilitators for repurposing web documents
Cunningham et al. Knowledge management and human language: crossing the chasm
Belerao et al. Summarization using mapreduce framework based big data and hybrid algorithm (HMM and DBSCAN)
Galitsky et al. Building chatbot thesaurus
Kasmuri et al. Building a Malay-English code-switching subjectivity corpus for sentiment analysis
Jhajj et al. Use of Artificial Intelligence Tools for Research by Medical Students: A Narrative Review
Walker et al. Prompting Datasets: Data Discovery with Conversational Agents
Varvel Jr et al. Google Digital Humanities Awards recipient interviews report
Amitay What lays in the layout
Salman et al. Doc‐KG: Unstructured documents to knowledge graph construction, identification and validation with Wikidata
Van Hooland et al. Exploring Large-Scale Digital Archives–Opportunities and Limits to Use Unsupervised Machine Learning for the Extraction of Semantics
Jóhannesson Entity linking for Icelandic
Formanek Exploring the potential of large language models and generative artificial intelligence (GPT): Applications in Library and Information Science

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20130225

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20140217

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 7/00 20060101AFI20140211BHEP

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: THOMSON REUTERS GLOBAL RESOURCES UNLIMITED COMPANY

17Q First examination report despatched

Effective date: 20180315

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: THOMSON REUTERS GLOBAL RESOURCES UNLIMITED COMPANY

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20200130