WO2008070415A2 - Apparatus and method for collecting information distributed in a network - Google Patents

Apparatus and method for collecting information distributed in a network

Info

Publication number
WO2008070415A2
Authority
WO
WIPO (PCT)
Prior art keywords
network
search
accordance
information
collection
Prior art date
Application number
PCT/US2007/084728
Other languages
English (en)
Other versions
WO2008070415A3 (fr
Inventor
Robert P. Erickson
David A. Fox
Original Assignee
Deepdive Technologies Inc.
Priority date
Filing date
Publication date
Application filed by Deepdive Technologies Inc.
Publication of WO2008070415A2
Publication of WO2008070415A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • the present disclosure relates generally to the field of collection of electronic documents, and more particularly to the identification and collection of electronic documents, including documents accessible in a network environment, and to an apparatus and method for same.
  • Embodiments presently disclosed address a need for automated electronic discovery of information in an electronic form, e.g., corporate documents including e-mails and their attachments.
  • automated identification and preservation of electronic information such as documents, files, e-mails, and other forms of data that reside within a computer network, for both compliance-auditing and litigious-discovery purposes, for example.
  • automation and minimal interference in day-to-day operations is highly desirable.
  • electronic discovery is a process by which information in an electronic form, e.g., files, e-mails, and other forms of data, or information, is collected, reviewed, and produced from a computer network. Such collection can be in response to an investigative request, e.g., a discovery request in a litigation, internal audit, compliance audit, etc.
  • Embodiments of the present disclosure provide an automated collection of electronic information.
  • Embodiments of the present disclosure comprise identification of network resources which store electronic information and collection of such information in accordance with criteria (e.g., criteria defined by a court of law), and preserving the identified electronic information, or data.
  • collection comprises a three-phase approach: (i) pre-discovery, which includes locating connected devices (e.g., networked computers and/or computers associated with specific persons of interest, or custodians); (ii) copying data subject to specific file filtering constraints that can narrow the dataset size while preserving characteristic properties, or metadata; and (iii) optionally culling the copied data, which can include applying search criteria to the filtered data to further refine the collection.
  • Embodiments of the present disclosure provide automated collection of electronic documents, including documents scattered about the network, which collection can be performed once or on a repetitive basis.
  • Figure 1 illustrates a block diagram of a representation of a network of computing devices and peripherals in which one or more embodiments of the present disclosure can be used.
  • Figure 2, which comprises Figures 2A to 2H, illustrates client/server model message type examples for use in accordance with one or more embodiments of the present disclosure.
  • Figure 3 provides an illustrative example of a block diagram of an internal architecture of a search appliance in accordance with one or more embodiments of the present disclosure.
  • Figure 4, which comprises Figures 4A to 4D, provides examples of scoring in accordance with one or more embodiments of the disclosure.
  • Figure 5, which comprises Figures 5A and 5B, provides an example of scoring in exemplary cases in accordance with one or more embodiments of the present disclosure.
  • Figure 6 illustrates a flowchart of process steps to create and update an index in accordance with one or more embodiments of the present disclosure.
  • Figure 7 provides an illustrative example of a block diagram of a search appliance used in indexing and searching in accordance with one or more embodiments of the present disclosure.
  • Figure 8 illustrates a flowchart of process steps to score and rank search results in accordance with one or more embodiments of the present disclosure.
  • Figure 9, which comprises Figures 9A and 9B, provides an illustrative example of a database schema used in one or more embodiments of the disclosure.
  • Figure 10 provides an example of a 3-ary trie tree in accordance with at least one disclosed embodiment.
  • Figure 11, which comprises Figures 11A and 11B, provides an example of a process for use in connection with one or more embodiments of the present disclosure.
  • Figure 12, which comprises Figures 12A and 12B, provides an example of pseudo code of a script for use in discovering shared resources in accordance with one or more embodiments.
  • a networked information search, identification and collection apparatus and method identify information stored in electronic form and provide for collection of identified information using a set of criteria.
  • a computer network can be physically interrogated using an automated process, which locates computers on the network and shared resources associated with the network computers.
  • persons associated with the identified computers, referred to as custodians, can be identified.
  • Embodiments provide such automation, and deliver pre-discovery that enhances the ability and efficiency of identifying custodians of interest, which can be used to supplement, or replace, interviews with personnel. Identifying computers of a corporate network and their associations with custodians of interest via an automated process represents a new and valuable asset to the litigious-discovery and compliance-auditing markets.
  • metadata can include file attributes, such as modification time, which, along with other criteria, can be used to filter the imaged data so that a specific, responsive subset can be collected.
  • filtering and criteria used for filtering can be dictated by a court of law.
  • collection can be automatically and repeatedly performed, and can include filtering, copying, and preserving electronic data, over the computer network, to predetermined target storage, without intrusively imaging hard drives, and without intrusive installation of agents on host machines.
  • embodiments presently disclosed can be used for electronic discovery and electronic compliance auditing — whether for internal, regulatory, or pre-litigation purposes.
  • the resultant collection storage can be used for subsequent review and action.
  • a culling operation can be performed on the copied data, as a data reduction step.
  • Culling can comprise one or more searches using a search tool/utility to identify that information from the copied information which satisfies specific search criteria, e.g., keyword search strings defined in a litigation proceeding and/or by the court.
  • Culling can be used to produce a subset of the collected data responsive to search criteria, for example.
  • the responsive set, or subset, of data, except privileged documents, such as confidential lawyer-client e-mail correspondences, can then be "produced", or otherwise made available for review.
  • a culling step can be optional.
  • Embodiments presently disclosed provide an apparatus, or device, referred to herein as a collection device, which comprises both hardware and software and is configured to function as a network search and collection appliance.
  • the collection device can be used to perform automated, and in some cases repeated, collection of electronic documents stored on computing devices typically coupled to a computer network (e.g., a corporate or enterprise computer network such as an intranet, wide area network, etc.).
  • the device can be used to collect electronic data, which can include sensitive electronic data, by: (i) discovering the locations of electronic data (e.g., custodial data) accessible via a computer network to which the device is connected and (ii) copying discovered data to storage such as one or more removable hard drives, using filtering criteria. Characterizing information about the data, such as its creation and modification times, can be preserved in copying the electronic data.
  • electronic data copied to a data store can be further culled using keyword searching or other search tools. Electronic data copied and optionally culled in accordance with embodiments presently disclosed can be reviewed via any means presently known or later discovered.
  • Embodiments of the present disclosure allow for incorporation of additional functionality and/or integration of third-party toolsets, e.g., via web services.
  • the collection device can be a portable device or can be in effect a more permanent component, e.g., a rack-mounted device.
  • the collection device can be easily configured within the target network.
  • Embodiments disclose an extensible and scalable collection device that integrates with existing corporate networks to provide non-intrusive electronic discovery and collection of loose (e.g., disparately-located) documents.
  • the collection device can assist in collection of information on a corporate network as part of a corporate compliance program, legal proceeding, or other investigation, for example.
  • the collection device can comprise one or more collection devices. In a case that multiple collection devices are used, the devices can function in a federated configuration.
  • Each collection device can store collected content in persistent data storage, such as one or more hard drives.
  • the persistent storage can comprise one or more removable hard drives that can be safely removed, e.g., hot-swappable, during operation.
  • Removable persistent storage provides an ability to examine the stored contents apart from the collection device. However, it should be apparent that in embodiments of the present disclosure the collection device is used to examine the collected information.
  • the collection device connected to a computer network searches and identifies network shares, and indexes the contents of such shares, which process is unobtrusive and avoids installation of software agents on host machines in a target network.
  • users can be defined, together with associated profiles. Actual users may be representative of one or more of three basic profiles, depending on their roles within the enterprise. Examples of a user include device administrator, investigative consultant, and corporate officer, which can be the same or different users.
  • a device administrator can be a user with information technology (IT) experience. Such a user can be either a member of an investigative team (e.g., either internal or external to a corporation or other entity being investigated) or the host company's IT staff, which can be dependent on, for example, whether the collection device is deployed for litigation or auditing purposes, respectively.
  • Such a user can be responsible for integrating and maintaining the collection device within a target network infrastructure and can be tasked with obtaining authorizing credentials, such as domain-browsing and file-share-mounting usernames and passwords, for input by another authority.
  • the device administrator can be a person who interacts with the administrative interfaces of the collection device, which allow for configuration of the product as a network device.
  • An investigative consultant can be a person who specializes in collection of electronic data from the target network, e.g., for purposes of litigation. Such a person can perform network discovery of computers and their file shares, and may make requests of the device administrator for authorizing credentials, if network discovery indicates such a need. Once network discovery is performed using the collection device, the investigative consultant can use filter criteria and initiate a collection to collect information in a persistent store, e.g., a removable hard drive from the discovered network resources. With collection completed this person may elect to remove the hard drive for further offsite culling, review, and production. This person interacts with network discovery and collection interfaces of the collection device.
  • the persistent store can be protected, using various protection schemes now known or later discovered, which can include various encryption and hashing techniques, for example.
  • a corporate officer is typically engaged in compliance auditing. Such a person usually delegates tasks to the device administrator to set up and maintain one or more collection devices. In particular, the corporate officer may delegate to the device administrator the task of scheduling periodic collection, as well as subsequent maintenance of removable hard drives. Also, the corporate officer may collaborate with a contracted service, such as one that performs the duties ascribed to the investigative consultant, a person in the employ of this service company. The investigative consultant may be contracted to assist in initial setup and may be called upon from time-to-time if, in the routine of auditing, discoveries of interest should warrant.
  • the corporate officer may elect to: (i) personally perform auditing on removable hard drives containing collected data, either using search technology provided by the collection device or a third party; (ii) contract a third party, such as that of the investigative consultant, to perform this service on data collected to removable hard drives; or (iii) leverage both the culling methods of in-house and consulting services, depending on the needs and policies of the company.
  • the corporate officer takes authoritative action based on review of responses to the culling efforts.
  • the corporate officer may choose to interact with the optional search interface of the collection device, depending on needs and policies, as mentioned.
  • a portable version of the collection device can be introduced into a corporate network by a team of investigators.
  • the device administrator of the team installs the collection device(s) within the network and begins an initial discovery procedure using administrative credentials provided by the IT staff.
  • the device administrator queries IT staff to reconcile these exceptions, for example turning machines on that are off, locating machines that have left the network, and/or providing additional authentication credentials.
  • the investigative consultant assumes the responsibility of collecting data from target machines.
  • the investigative consultant configures filtering criteria and begins the collection of data to removable hard drives of one or more collection devices situated in the corporate network. If a removable hard drive is filled to capacity then it can be swapped with a blank one, allowing collection to continue. Removable hard drives can be delivered to a laboratory for further investigation, review, and production.
  • a 1U rack-mounted version of the collection device is installed on a computer network of a company wishing to maintain a compliance-auditing presence within its enterprise, a presence which can be overseen by the corporate officer.
  • the device administrator, a member of the company's IT staff reporting to the corporate officer, is responsible for the installation and maintenance of the collection device.
  • the device administrator performs initial network discovery, ensuring that access (authorizing credentials) and coverage (scalability) of the network is complete. If time constraints or discontinuous network topology require, additional collection devices can be introduced to accommodate the scale of the collection process.
  • the investigative consultant, reporting to or contracted by the corporate officer, configures the filtered copying of data to removable hard drives of the one or more installed collection devices.
  • Network discovery and collection may be scheduled to occur periodically, thus automating the process, such that collection to the same removable hard drive captures new, modified, and missing files.
  • Analysis of collected content can be performed by the investigative consultant or the corporate officer, depending on corporate practices and policies. The corporate officer can be ultimately responsible for the overall collection process, its analysis, and the corrective measures that ensue.
  • the collection device discovers the workgroups/domains, computers, and Windows file-shares of a corporate network, wherein "loose" documents can be found, and correlates use of discovered machines with specific custodians of interest.
  • a collection device can be easily connected to and configured on a corporate network and, once connected, can perform network discovery in a timely fashion and report findings and any exceptions that occur during the discovery process. Examples of findings include a list of the workgroups/domains, their computers, Windows file-shares of said computers, and associations between computer users and these machines. Examples of exceptions include an inability to connect to domains and machines for which a network presence has been determined.
  • the collection device distinguishes between lack of physical connection and denial of access.
  • the collection device provides an opportunity for input of authorizing credentials, at which point, the discovery process continues to greater depth, e.g., to examine the device for which authorization is allowed to identify network shares and/or stored information.
  • network discovery reports can highlight changes in the network from one report to the next, such as new or missing computers, and changes in user associations with computers.
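As a rough illustration of how two successive discovery reports might be compared, the sketch below diffs two snapshots. The `DiscoveryReport` structure and the `diff_reports` function are hypothetical names introduced here for illustration; the source does not describe the report format.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveryReport:
    """Hypothetical snapshot: computer name -> set of associated user names."""
    users_by_computer: dict[str, set[str]] = field(default_factory=dict)


def diff_reports(previous: DiscoveryReport, current: DiscoveryReport) -> dict:
    """Summarize what changed between two discovery runs."""
    prev, curr = previous.users_by_computer, current.users_by_computer
    new_computers = sorted(curr.keys() - prev.keys())
    missing_computers = sorted(prev.keys() - curr.keys())
    # Computers seen in both runs whose user associations differ.
    changed_users = {
        name: {
            "added": sorted(curr[name] - prev[name]),
            "removed": sorted(prev[name] - curr[name]),
        }
        for name in sorted(prev.keys() & curr.keys())
        if prev[name] != curr[name]
    }
    return {
        "new_computers": new_computers,
        "missing_computers": missing_computers,
        "changed_users": changed_users,
    }
```

Keeping each report as a plain mapping makes the comparison a matter of set differences, which is one simple way to surface new or missing computers between runs.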
  • An interactive network discovery interface of the collection device can be used to monitor and respond to discovery events.
  • the collection device copies a subset of electronic data from computers of a network to one or more removable hard drives located on the device using a filter.
  • criteria used as a filter include absolute file path (e.g., both machine name and directory path), file name wildcard, range of file date and time (e.g., creation, modification, and access times), file signature (e.g., file types and/or extensions), system-file exclusion, and custodian name.
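The filter criteria listed above could be combined along the following lines. This is a minimal sketch, not the disclosed implementation: `FileRecord` and `CollectionFilter` are hypothetical types, only a few of the criteria are shown (machine, file-name wildcard, modification-time range, custodian), and a criterion left as `None` is assumed to be "not applied".

```python
from dataclasses import dataclass
from datetime import datetime
from fnmatch import fnmatch
from pathlib import PureWindowsPath
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical metadata for one file seen on a mounted share."""
    machine: str
    path: str                      # path relative to the share
    modified: datetime
    custodian: Optional[str] = None


@dataclass
class CollectionFilter:
    """Hypothetical filter; a field left as None is not applied."""
    machines: Optional[set] = None
    name_patterns: Optional[list] = None   # wildcard syntax, e.g. "*.doc"
    modified_after: Optional[datetime] = None
    modified_before: Optional[datetime] = None
    custodians: Optional[set] = None

    def matches(self, rec: FileRecord) -> bool:
        name = PureWindowsPath(rec.path).name
        if self.machines is not None and rec.machine not in self.machines:
            return False
        # Compare case-insensitively, as Windows file names are.
        if self.name_patterns is not None and not any(
            fnmatch(name.lower(), p.lower()) for p in self.name_patterns
        ):
            return False
        if self.modified_after is not None and rec.modified < self.modified_after:
            return False
        if self.modified_before is not None and rec.modified > self.modified_before:
            return False
        if self.custodians is not None and rec.custodian not in self.custodians:
            return False
        return True
```

All criteria are combined with a logical AND here; whether the actual device combines them that way is an assumption.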
  • data is copied in its native form, preserving content, file metadata (including file access times), and directory structure.
  • Files can be uniquely distinguished from one another via a hash algorithm, for example. Copied files can optionally be de-duplicated.
  • support can be provided for Unicode files and file naming convention, including preservation of Arabic, Russian, Chinese, and Japanese written languages encoded within file content and file path names.
  • a software tool can be provided to transfer data stored on a removable drive to a third-party medium, if desired.
  • the copying process can be repeated and/or scheduled periodically, with the ability to detect and record new and changed files, as well as files that no longer exist, or ghost files.
  • the copying process can generate a report of new, modified, and ghost files, as well as a list of exceptions encountered.
  • filtering exceptions include files that: (i) meet the filter criteria but cannot be copied, (ii) do not meet the filter criteria and have incomplete or incomprehensible metadata, (iii) do not meet the filter criteria and whose signatures cannot be ascertained, and (iv) do not meet the filter criteria and are known to be password protected.
  • files corresponding to exceptions are copied and preserved on the target removable hard drive, with an indication in the reported exception list.
  • the collection device provides an interactive collection interface to facilitate the copying of data to removable hard drives, subject to the filtering criteria described above.
  • the collection device offers an optional utility for culling copied data.
  • a culling utility provides functionality consistent with the search technology on which it is based.
  • the culling utility can be customized to meet specific requirements of a discovery or investigative project.
  • Various culling utilities can be used with embodiments of the present disclosure, with selection of one or another utility being based on each one's strengths and weaknesses, for example.
  • the collection device is designed to permit the application of external third-party culling utilities and services to electronic data collected onto removable hard drives. The choice of culling utilities and services is left to the user.
  • the culling utility of the collection device can be made accessible through an interactive search interface, for example.
  • the collection device comprises collection features associated with network discovery, electronic documents discovery and copying, and document-content search.
  • the collection device is comprised of both hardware and software.
  • the hardware runs the Linux 2.6.14+ operating system, with support for a disk array, e.g., a RAID.
  • the collection device is configured as a portable device, e.g., a MaxVision 8070MRA portable workstation with two removable SATA hard drives.
  • the collection device comprises a processor (e.g., a Pentium 4, 800MHz FSB (925XE chipset)), memory (e.g., 4GB DDR), two removable hard disk drives (e.g., 500 GB, SATA), a network adapter (e.g., 1 gigabit Ethernet), and a monitor (e.g., 17" LCD monitor).
  • the collection device comprises a server, e.g., an enterprise server.
  • the server comprises a 1U rack-mounted Dell 1950 PowerEdge server with two SATA hard drives, one of which is removable.
  • the rack-mounted form factor can include standard enterprise-level support and maintenance features.
  • Exemplary hardware can include a processor (e.g., dual-core Intel® Xeon® 5150, 4MB Cache, 2.66GHz, 1333MHz FSB), a fixed disk drive (e.g., 250GB, SATA, 3.5-inch, 7.2K RPM hard disk drive), a plurality of removable disk drives (e.g., 500GB, SATA, 3.5-inch, 7.2K RPM removable disk drives), memory (e.g., 4GB 533MHz (4x1GB), Dual Ranked DIMMs memory), and a network interface (e.g., a dual-embedded Broadcom® NetXtreme II 5708 Gigabit Ethernet NIC).
  • the collection device can be extensible.
  • Network discovery, electronic data collection, and content search functionality is invoked through web service requests encapsulating SOAP (Simple Object Access Protocol) envelopes; both of the following are supported: (i) SOAP over HTTP and (ii) SOAP within web forms over HTTP.
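To make the SOAP-over-HTTP request style concrete, the sketch below builds a SOAP 1.1 envelope with the Python standard library. The service namespace `urn:example:collection-device` and any operation or parameter names passed in are assumptions for illustration only; the source does not publish the device's actual service schema.

```python
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace
SERVICE_NS = "urn:example:collection-device"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> bytes:
    """Wrap a hypothetical operation and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    op = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for key, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(op, f"{{{SERVICE_NS}}}{key}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)
```

The resulting bytes would typically be sent as the body of an HTTP POST with a `text/xml` content type; the endpoint URL and any `SOAPAction` header value would depend on the device and are not given in the source.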
  • a third party can integrate the collection device into an existing product offering using an application programming interface, API.
  • third party toolsets can be distributed with the collection device, acting as host server, through the use of an SDK.
  • the collection device is scalable.
  • the scalability of the collection device facilitates meeting the demands of large corporate computer networks through federation of multiple devices and their web services, for example.
  • network discovery, electronic data collection, and content search can be delegated to devices of a non-hierarchical cluster and accessed through a common SOAP client.
  • a search interface provided by a culling utility of the collection device, as invoked via SOAP, permits the return of aggregated search results from queries made across the removable hard drives of cluster-configured devices, for example.
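One way a common client might aggregate results from several cluster-configured devices is sketched below. It assumes each device returns its hits already sorted by descending score; the `Hit` tuple and `aggregate_hits` function are hypothetical, and the scoring itself is not modeled.

```python
import heapq
from itertools import islice
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit returned by one device."""
    score: float
    device: str
    path: str


def aggregate_hits(per_device: Iterable[list], limit: int = 10) -> list:
    """Merge per-device result lists (each sorted by descending score)."""
    merged = heapq.merge(*per_device, key=lambda h: h.score, reverse=True)
    # The merge is lazy, so only the first `limit` hits are materialized.
    return list(islice(merged, limit))
```

A lazy k-way merge avoids re-sorting the combined result set, which matters when each device contributes a long list but only the top few hits are shown.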
  • a collection device design can be configured to avoid large concurrent access.
  • the collection device is configured to handle such large concurrent access/use.
  • the collection device is installed and configured on a TCP/IP network using a UDP client/server model and protocol.
  • software implementing the model and using the protocol can be shipped with the collection device, such that the device can be automatically installed and configured on the network using the software once the device is physically connected to a network.
  • the collection device can include administrative, collection, and search interfaces, which can be accessed from a web browser.
  • the collection device can be configured to provide a SOAP-over-HTTP Windows GUI client, which can be installed on non-9x Windows computers from CD.
  • a GUI application can comprise configuration of the collection box via UDP, Windows native GUI representations of the administrative, collection, and search interfaces of each collection device, management, viewing, and printing of discovery, collection, and culling reports, and copying of removable-drive data to an external file system, in a format representative of a directory hierarchy from which the data was copied.
  • the collection device can receive software updates, which include patches, hot fixes, and feature enhancements, which can be received automatically (at a configurable rate) or manually (via interactive upload), if at all, for example.
  • the update feature can be fully configurable from the administrative interface of the product.
  • the collection device can interoperate with various advanced Windows networking and security features, employing NetBIOS, DNS, and broadcasts to discover and access workgroups, domains and computers of a network, examples of which include NetBIOS workgroups, Windows NT domains, and Windows 2000/2003 domains.
  • Windows domain controllers and individual computers can be queried to develop one or more lists of network and local users, respectively, and, under CIFS/SMB, file shares of discovered servers, particularly default shares, can be perused for user profile information, so that network devices can be found and correlated with use by specific custodians.
  • the network discovery interface allows for input of authorizing credentials, either interactively or in batch, via an XML-formatted file that can be uploaded to the collection device.
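A batch credentials file of this kind could be parsed as sketched below. The XML layout (a `credentials` root with `credential` elements carrying a `target` attribute and `username`/`password` children) is entirely an assumption made for illustration; the source states only that the file is XML-formatted.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Credential(NamedTuple):
    target: str     # domain or machine name the credential applies to
    username: str
    password: str


def parse_credentials(xml_text: str) -> list:
    """Parse a batch of credentials from a hypothetical XML layout."""
    root = ET.fromstring(xml_text)
    creds = []
    for node in root.iter("credential"):
        target = node.get("target")
        username = node.findtext("username")
        password = node.findtext("password")
        if not target or username is None or password is None:
            raise ValueError("credential entry is missing a required field")
        creds.append(Credential(target, username, password))
    return creds
```

Rejecting incomplete entries up front keeps a malformed batch from surfacing later as a confusing access-denied exception during discovery.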
  • Pre-discovery features, which represent components of the network discovery strategy, can include NetBIOS, WINS, DNS and Active Directory, for example.
  • The NetBIOS-over-TCP/IP protocol set allows a collection of candidate machines on the current network to be found by opening connections to the SMB ports (139 and 445) of all possible addresses on the network. Machines that accept connections are candidates for file-share interrogation, either with NetBIOS session wrapping or raw.
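The port-probing step can be sketched with plain TCP connection attempts, as below. This is an illustrative sketch under stated assumptions: the function names, the CIDR-style network argument, and the timeout are choices made here, and the loop is sequential for clarity where a real survey of a large network would likely probe concurrently.

```python
import ipaddress
import socket

SMB_PORTS = (139, 445)  # NetBIOS session service and direct-hosted SMB


def accepts_connection(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def find_candidates(network: str, ports=SMB_PORTS, timeout: float = 0.5) -> dict:
    """Map each responding address in a CIDR block to its open ports."""
    candidates = {}
    for addr in ipaddress.ip_network(network).hosts():
        open_ports = [p for p in ports if accepts_connection(str(addr), p, timeout)]
        if open_ports:
            candidates[str(addr)] = open_ports
    return candidates
```

For example, `find_candidates("192.168.1.0/24")` would return only the addresses that accepted a connection on port 139 or 445, which are then the candidates for file-share interrogation.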
  • Support for Windows Internet Name Service (WINS), rather than using broadcasts, permits machines to be queried to obtain their node names and status; unresolved lookups are carried forward with IP addresses.
  • Browse lists so obtained can be parsed to search for previously unknown workgroups and servers, which can be added to the list of hosts to be queried.
  • Windows Active Directory can be used to find domain member servers and shared folders. Obtaining the names of domain member servers from the LDAP directory rather than searching for them on the network helps streamline discovery on class A and B networks, where ping flooding may take considerable time. Machine names so obtained are parsed to search for previously unknown workgroups and servers, which are added to the list of hosts to be queried. Also, obtaining share names and locations from the directory offers advantages over querying machines directly, especially when some shared resources may be located on machines that aren't running at the time of the initial network survey. Windows global catalog servers can be used to find available shared folders. This feature allows shared resources to be obtained from an entire forest of domains rather than just the current domain. Examples of other protocols for use with the collection device include NFS, NetWare and AppleTalk.
  • the collection device copies a subset of files, e.g., loose files, from computers identified on a corporate network to a designated removable hard drive located on the device, which contains a Linux ext2 file system supporting Unicode file names, including the wide characters of Arabic, Russian, Chinese, and Japanese.
  • copying is performed by mounting selected file shares read-only onto the local file system of the removable hard drive using the CIFS/SMB protocol, from which vantage point file metadata and content can be examined.
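On a Linux host, a read-only CIFS mount is commonly requested through `mount -t cifs` with the `ro` option. The sketch below only builds the argument list and does not run it; the function name, the mount point, and the use of a credentials file are assumptions for illustration, not details taken from the source.

```python
def cifs_readonly_mount_argv(host: str, share: str, mount_point: str,
                             credentials_file: str) -> list:
    """Build the argument list for a read-only CIFS mount (not executed here)."""
    # "ro" requests a read-only mount; "credentials=" names a file holding
    # the username and password so they do not appear on the command line.
    options = ",".join(["ro", f"credentials={credentials_file}"])
    return ["mount", "-t", "cifs", f"//{host}/{share}", mount_point, "-o", options]
```

The returned list could be passed to `subprocess.run(argv, check=True)` by a process with sufficient privileges on a system where the CIFS mount helper is installed.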
  • Data is then copied from mount points to permanent residence on the removable hard drive, subject to a complete set of filtering criteria, preserving file content, true native format, file metadata, and directory structure.
  • the copying process is configurable and executable from an interactive collection interface of the collection device.
  • the collection process can be manually invoked at any time or scheduled for automatic invocation, e.g., at regular intervals.
  • Copied file content is maintained along with a database (e.g., PostgreSQL database) comprising file MD5 hash values and file metadata.
  • file access times for the encountered file can be recorded.
  • Additional collections to the same removable hard drive detect new, modified, and ghost files via differences in encountered versus recorded MD5 hash values.
  • the copying process includes a report of apparent new, modified, and ghost files, as well as a list of exceptions.
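The comparison of recorded versus encountered hash values can be sketched as a difference between two mappings. In the sketch below, `recorded` and `encountered` are hypothetical dictionaries from file path to MD5 hex digest; in the described device the recorded values would come from the database rather than an in-memory dictionary.

```python
def classify_changes(recorded: dict, encountered: dict) -> dict:
    """Compare path -> MD5 maps from the previous and current collection."""
    return {
        # Present now, but never recorded before.
        "new": sorted(p for p in encountered if p not in recorded),
        # Present in both runs with a different hash.
        "modified": sorted(
            p for p in encountered
            if p in recorded and encountered[p] != recorded[p]
        ),
        # Recorded before, but no longer found on the share.
        "ghost": sorted(p for p in recorded if p not in encountered),
    }
```

Note that a renamed file would appear here as one ghost file plus one new file, since the comparison is keyed on path; whether the device treats renames specially is not stated in the source.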
  • a copying filter can comprise one or more of absolute file path, file name, file date and time, file signature, system file exclusion, custodian name, and file de-duplication.
  • a list of absolute file paths can be specified, wherein each item of the list contains both a machine name and directory path, which is relative to the shared file folder. This can be used to determine which file shares to mount onto the designated removable hard drive.
  • a list of file-name filters can be specified wherein each filter can include standard wildcard syntax, such as an asterisk. This criterion can also be used to designate file extensions.
  • One or more times e.g., creation, modification, and access
  • Metadata including such times can be resolved via the CIFS/SMB protocol and the extended file attributes support of the Linux kernel, for example.
  • File types can be specified from a list of recognized types.
  • File extensions can be specified as part of the file name.
  • File-type recognition can be implemented by a derivative of the Magic database deployed on Linux within the KDE application package, for example.
  • System files can be distinguished from files created by users and excluded from copying.
  • System files can be distinguished via a standard set of hash values, which are provided by NSRL/NIST, for example.
  • a list of the names of custodians can be specified to restrict copied data to a subset of file shares known to be correlated with use by users matching the names of custodians. This can be used to determine which file shares to mount onto the designated removable hard drive.
  • An optional filter flag allows for de-duplication of files copied to the designated removable hard drive, which can be based on MD5 hash-value comparisons, for example.
  • duplicate files can be identified in a PostgreSQL database and a collection report, and are not physically copied to the removable hard drive.
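
Two of the filters above, file-name wildcards and MD5-based de-duplication, can be sketched together. This is an illustrative sketch under stated assumptions: `select_files` is a hypothetical name, files are passed in as (path, content) pairs rather than read from mounted shares, and the returned duplicate list stands in for the database and collection report entries.

```python
import fnmatch
import hashlib


def select_files(files, patterns, deduplicate=True):
    """Apply file-name wildcard filters and optional MD5 de-duplication.

    files: iterable of (path, content_bytes) pairs.
    Returns (copied_paths, duplicates), where each duplicate is a
    (duplicate_path, first_copied_path) pair that is reported, not copied.
    """
    first_seen = {}              # MD5 digest -> first path copied with that content
    copied, duplicates = [], []
    for path, content in files:
        name = path.rsplit("/", 1)[-1]
        if not any(fnmatch.fnmatchcase(name, p) for p in patterns):
            continue             # file name matches no filter
        digest = hashlib.md5(content).hexdigest()
        if deduplicate and digest in first_seen:
            duplicates.append((path, first_seen[digest]))
            continue
        first_seen.setdefault(digest, path)
        copied.append(path)
    return copied, duplicates
```

A pattern such as `*.doc` designates a file extension, as the file-name filter description notes.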
  • a Windows-based software tool can be provided to transfer data stored on a removable hard drive to a third-party file system.
  • the tool allows restoration of the directory structure of the copied files, in a format akin to the original file shares of interest. This can be useful when third-party culling tools and/or services are applied to the collected data.
  • the collection device can include an optional utility for the culling of data copied to removable hard drives.
  • This utility can be an integration of the dtSearch search technology accessible through an interactive search interface of the collection device, for example.
  • This utility can be optional, as it may not be suitable for all users.
  • the collection device permits use of external third-party culling utilities and services on files collected and preserved on its removable hard drives.
  • a Windows-based software tool can be provided to facilitate the use of third-party culling utilities and services, by restoring the collected data to a file system in which the original directory structure is reproduced, for example. The choice of culling utilities and services can be left to the user.
  • Embodiments of the present disclosure include a licensing model, such as a token-based metered licensing model tied to storage capacity.
  • a licensing model for use with the collection device enables eDiscovery service firms and litigation support teams, for example, to pay for their actual use of the collection device and to pass on those license fees to their clients as part of their normal service charges.
  • a licensing model for use with the collection device involves charging a client a collection fee based on a set rate per gigabyte of amount copied and/or storage used.
  • fees and fee ranges include a lower-end fee of $400 per gigabyte and a higher-end fee of more than $1,500 per gigabyte.
  • it should be apparent that these fees are merely exemplary and that other fees can be used in connection with embodiments of the present disclosure.
  • software tokens can be sold, which are tied to storage capacity needed for the collection, for example.
  • customers using the collection device are able to purchase a token of any size capacity, e.g., in whole gigabytes, from a website, install the token to the collection device, and use the collection device up to the capacity limit of the purchased token(s).
  • a token can be tied to a specific collection device through a serial number and to a specific job or legal action through a customer or court-defined reference number, for example.
  • the collection device can be configured to meter the actual storage used during the collection and decrement the capacity purchased. When the capacity limit is reached, the collection device can stop the collection process.
  • users can purchase additional tokens at any time, e.g., online via a website.
  • software tokens are purchased in advance of use, and a collection device prohibits collection unless there is a non-zero value remaining of purchased capacity.
  • the collection device can provide one or more reports of actual use including any installed tokens and their capacity, so information can be cross-referenced to purchases of tokens. This information can provide backup information for a service firm to provide to their customers for billing purposes, for example.
  • Figure 11, which comprises Figures 11A and 11B, provides an example of a process for use in connection with one or more embodiments of the present disclosure.
  • a collection size is estimated.
  • An eDiscovery service firm or other authorized agent that uses the collection device can estimate the storage capacity in gigabytes needed to complete a collection.
  • the collection can be for a given legal action, and the estimate can be based on selected network volumes or file shares.
  • other tools and methods for estimating storage capacity can be used with embodiments of the present disclosure.
  • customer approval is obtained from the service firm's client, or customer.
  • the eDiscovery service firm provides a size and cost estimate to the law firm or other party that is contracting for the electronic document collection, and obtains their approval before proceeding with the collection.
  • a website, or web server configured to provide functionality and a website interface, is accessed using the collection device.
  • An authorized representative of the company using the collection device for example, goes to a secure eCommerce section of a website to login and purchase a software token for the collection.
  • a company account can be created automatically when a customer purchases a collection device, e.g., an initial collection device.
  • a company administrator for the customer can create separate user accounts, if needed.
  • a collection device is selected.
  • the accessed website can provide a listing of collection devices associated with the customer, each device identified by its respective serial number.
  • the customer selects the serial number of the appliance intended for use with a given action and token.
  • a token is purchased for a specific collection device.
  • embodiments can use a token for more than one collection device.
  • a job number is created.
  • a customer creates a reference or job number for this collection or selects a pre-existing reference from a drop down list, for example.
  • tokens can be tied to specific legal actions or customer jobs.
  • a token capacity is specified. For example, a customer enters the capacity in gigabytes for use with the token. A minimum capacity, e.g., 1 gigabyte, can be used, and tokens can be issued in whole numbers of gigabytes.
  • a token selection is confirmed.
  • the website can present a confirmation screen displaying the specifications for the token and the purchase price. The customer can cancel, edit the selection, or confirm it and proceed to checkout.
  • the token is purchased. The customer enters their payment information, and if authorized, the website processes the transaction.
  • the token is generated. Once the payment transaction is complete, the website automatically generates a token and makes it available for download over a secure connection, for example.
  • the token is transferred (e.g., downloaded) to the collection device. If the collection device has web access, the customer can download it directly to the appliance, for example.
  • the token can be represented as a text file, which can be uploaded manually to the appliance, e.g., over a network connection.
  • the token is validated.
  • the collection device can check and ensure that the software token is valid and then add the specified capacity to the total collection capacity for the specific job number.
  • a collection device can be used for multiple jobs at the same time. However, for auditing and accounting purposes, it may be desirable to limit a token for use with a specific job, or job number.
  • the collection device performs the collection. The customer uses the collection device to perform collection and the collection device tracks the storage used against the capacity authorized by the token(s).
  • the collection device outputs a warning. The collection device maintains a current storage used and capacity remaining from tokens purchased.
  • the collection device can further estimate an amount of storage needed to complete the collection based on the selected volumes and file shares.
  • the collection device can issue a notification when the remaining capacity reaches a low-water mark, which can be configured/selected by the user.
  • the collection device can continue to warn the user as available capacity declines at an interval chosen by the user and specified in terms of capacity (gigabytes, or hundreds of megabytes).
  • collection can be stopped if it is determined that token capacity is reached.
  • the collection device stops the collection process automatically when the total storage capacity purchased for the specific job number used has been reached.
  • the collection device can issue a notification that collection is stopped.
  • the user can obtain another token and additional capacity as described, and the collection process can be resumed with the additional token/capacity.
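
The metering behavior described above (decrementing purchased capacity, warning at a low-water mark, stopping when capacity is reached, and resuming after another token is installed) might look like the following sketch. The class name `CapacityMeter` and its methods are hypothetical. Refusing a file that would exceed the remaining capacity is an assumption about how "capacity reached" is detected, and repeated warnings at a user-chosen interval are simplified to one warning per copy at or below the mark.

```python
class CapacityMeter:
    """Tracks storage used against purchased token capacity (one unit throughout)."""

    def __init__(self, purchased: int, low_water_mark: int):
        self.remaining = purchased
        self.low_water_mark = low_water_mark
        self.stopped = False
        self.events: list[str] = []

    def add_token(self, capacity: int) -> None:
        """Install additional capacity so a stopped collection can resume."""
        self.remaining += capacity
        self.stopped = False

    def record_copy(self, size: int) -> bool:
        """Return True if the copy is allowed, decrementing remaining capacity."""
        if self.stopped:
            return False
        if size > self.remaining:            # assumption: refuse a copy that would overrun
            self.stopped = True
            self.events.append("collection stopped")
            return False
        self.remaining -= size
        if self.remaining <= self.low_water_mark:
            self.events.append(f"warning: {self.remaining} remaining")
        return True
```

The `events` list is a stand-in for the notifications the device issues to the user.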
  • the collection is completed.
  • the collection device can be used to generate a report showing details of the collection performed and a record of all tokens installed and used by job number.
  • the website can provide a record of purchase transactions.
  • the token comprises storage capacity in gigabytes (e.g., whole numbers, 5 digits to 99,999 GB), a purchaser's identification information (e.g., company name - 25 alphanumeric characters), a unique identification for the token (e.g., which can be cross-referenced to a customer invoice generated by the website for accounting and audit purposes), a serial number of the target collection device, a job reference (e.g., 15 alphanumeric characters), date and time created (e.g., Unix date/time), and a field indicating whether the token is for evaluation purposes (e.g., Boolean).
  • the token is encrypted and digitally signed to prevent forgery and allow authentication.
  • the token can be downloaded directly to the collection device and installed by an authorized user.
  • the token may be downloaded from a website from which it is purchased to a text file that can be manually uploaded by an authorized user to the collection device over the network.
  • the token is valid for the appliance with the serial number for which the token was created. If an attempt is made to install the token to an unauthorized collection device, the installation fails and the user is notified.
  • the token is not transferable to another collection device, once the token is installed on a collection device.
  • excess capacity from a purchased token on a collection device cannot be transferred to another collection device.
  • an attempt to transfer a token, or download and install an identical token results in failure processing, e.g., which can include termination of the installation, or transfer, and a failure notification.
  • an authorized user can choose to transfer capacity from one job, e.g., by job number, to another, if available, on the same appliance.
  • a token can be issued for evaluation purposes, e.g., to evaluate a collection device.
  • a token issued for evaluation cannot be used for another action or job, e.g., a legal action or other paid transaction.
  • Evaluation tokens can be limited in capacity, e.g., 5 Gigabytes. If a purchased token and an evaluation token are installed on the same collection device, the capacity of the purchased token will supplant any remaining storage capacity from the evaluation token.
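
The installation rules above can be illustrated with a short sketch. The `Token` and `Appliance` names and fields are hypothetical, modeled on the token contents listed earlier. Decryption and signature verification are deliberately omitted, and the rule that a purchased token supplants remaining evaluation capacity is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: str          # unique identification, cross-referenced to an invoice
    serial_number: str     # serial number of the target collection device
    job_reference: str     # e.g., up to 15 alphanumeric characters
    capacity_gb: int       # whole gigabytes, e.g., 1 to 99,999
    evaluation: bool = False


class Appliance:
    def __init__(self, serial_number: str):
        self.serial_number = serial_number
        self.installed_ids: set[str] = set()
        self.capacity_by_job: dict[str, int] = {}

    def install(self, token: Token) -> int:
        """Install a token; return the total capacity now available for its job."""
        if token.serial_number != self.serial_number:
            raise ValueError("token was issued for a different collection device")
        if token.token_id in self.installed_ids:
            raise ValueError("an identical token is already installed")
        if not isinstance(token.capacity_gb, int) or not 1 <= token.capacity_gb <= 99999:
            raise ValueError("capacity must be a whole number of gigabytes")
        if len(token.job_reference) > 15:
            raise ValueError("job reference is too long")
        self.installed_ids.add(token.token_id)
        total = self.capacity_by_job.get(token.job_reference, 0) + token.capacity_gb
        self.capacity_by_job[token.job_reference] = total
        return total
```

Keying capacity by job reference mirrors the auditing rationale given earlier for limiting a token to a specific job number.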
  • the collection device is coupled to a search appliance used to provide access, including indexing and search access, to information located on one or more intranets, the Internet, or both.
  • the collection device is a part of the search appliance.
  • the networked search apparatus also referred to herein as a network search device or network search appliance, and method comprise configuration, indexing, and searching capabilities to facilitate networked information search and retrieval.
  • Figure 1 provides a block diagram of a representation 100 of a network of computing devices and peripherals in which one or more embodiments of the present disclosure can be used.
  • computers 150, 160, and 170, at least one instance of search appliance 180, and at least one data server 190 are coupled via a network 120.
  • an optional printer 110 and an optional fax machine 140 are shown.
  • individuals, business entities and the like for example, can efficiently and effectively access and manage the storing, indexing, accessing, and retrieving of electronic data as described herein.
  • Optional printer 110 and an optional fax machine 140 are standard peripheral devices that can be used for transmitting or outputting paper-based documents, notes, search results, reports, etc. in conjunction with the queries and transactions processed by computer-based system 100. It should be apparent that optional printer 110 and optional fax machine 140 are merely representative of the many types of peripherals that can be utilized in conjunction with the present disclosure, and that other peripheral devices can be used with one or more embodiments of the present disclosure and no such device is excluded by its omission in Figure 1.
  • Network 120 is any suitable computer communication link or communication mechanism, including a hardwired connection, an internal or external bus, a connection for telephone access via a modem or high-speed T1 line, radio, infrared or other wireless communications, private or proprietary local area networks (LANs) and wide area networks (WANs), as well as standard computer network communications over the Internet or a network internal (e.g., an "intranet") to an enterprise, or entity, via a wired or wireless connection, or any other suitable connection between computers and computer components known to those skilled in the art, whether currently known or developed in the future.
  • portions of network 120 can suitably include a dial-up phone connection, broadcast cable transmission line, Digital Subscriber Line (DSL), ISDN line, or similar public utility-like access link.
  • network 120 can comprise one or more network segments.
  • At least a portion of network 120 comprises a standard wired or wireless Internet connection between the various components of computer-based system 100.
  • Network 120 provides for communication between the various components coupled to network 120, which allows for information to be transmitted between devices coupled thereto.
  • a user of a computer system, e.g., computer 150, 160, or 170, connected to network 120 can gain access, based on access privileges corresponding to the user, to data and information accessible via network 120.
  • network 120 serves to link the physical components of computer-based system 100 together, regardless of their physical proximity.
  • data server 190 and computers 150, 160, and 170 can be geographically remote and physically separated from each other.
  • computers 150, 160 and 170 can be any type of computer known to those skilled in the art that is capable of being configured for use with computer-based system 100 as described herein. This includes laptop computers, desktop computers, tablet computers, pen-based computers and the like. Computers 150, 160, and 170 are most preferably commercially available computers such as a Linux-based computer, PC-based computers, or Macintosh computers. However, as those skilled in the art should appreciate, the methods and apparatus presently disclosed apply equally to any computer or computer system, regardless of whether the computer is a traditional "mainframe" computer, a multi-user computing apparatus or a single user device, such as a personal computer or workstation.
  • handheld and palmtop devices can also provide examples of devices that can be deployed as computers 150, 160 and 170. It should be apparent that any operating system or hardware platform can be anticipated, and that many different hardware and software platforms can be configured, to be deployed as computers 150, 160 and 170. Various hardware components and software components (not shown) known to those skilled in the art can be used in conjunction with computers 150, 160 and 170.
  • Data server 190 together with computers 150, 160 and 170, are preferably configured to store and retrieve data, some or all of which is sharable via network 120.
  • Various hardware components such as external monitors, keyboards, mice, tablets, hard disk drives, recordable CD-ROM/DVD drives, jukeboxes, fax servers, magnetic tapes, and other devices known to those skilled in the art can be used in conjunction with data server 190, and computers 150, 160 and 170.
  • data server 190 can be configured with various additional software components (not shown) such as database servers, web servers, firewalls, security software, and the like. While a single data server 190 is shown connected to network 120 of Figure 1, it should be apparent that embodiments of the present disclosure contemplate and embrace any number of data servers 190.
  • the various data servers can vary in size, complexity and capability, but can all generally be capable of being configured to index and retrieve information via network 120 in accordance with embodiments presently disclosed.
  • data server 190 can represent a network accessible data server that is configured to store data files for later retrieval by the users of computers 150, 160 and 170 via network 120.
  • a typical transaction can be represented by a request (e.g., identify, retrieve, access, etc.) for information directly stored on data server 190 or on some other computer or computer system that is logically connected to data server 190, for example.
  • a request for information can include requests involving any type of digitized data, whether voice, text, graphics, etc. and the information can be stored in any format known now or later developed/identified.
  • search appliance 180 represents a network accessible computing system configured to act as a network-based indexing and search apparatus capable of indexing data, receiving search queries and processing the search queries to return one or more data files accessible via network 120, and any other appropriately designated computers, that are responsive to the search queries.
  • a typical transaction can be represented by a request for files containing certain keywords or phrases from the data store of data server 190 or stored on some other computer or computer system that is logically connected to data server 190.
  • the request to retrieve data can include search requests involving any type of digitized data, whether voice, text, graphics, etc. and the information can be stored in any format now known or later developed/identified.
  • search appliance 180 is configurable automatically via a UDP client/server model.
  • a user interface comprising displayable web pages using a standard web browser can be used in configuring search appliance 180.
  • the search appliance 180 is physically connected to network 120.
  • Once search appliance 180 is connected to network 120, as is described in more detail below, search appliance 180 transmits a message containing identification information via User Datagram Protocol (UDP) and network 120 to configure search appliance 180.
  • search appliance 180 can be used to identify sharable resources available on the network, and maintain a search repository, or database, of search information.
  • search appliance 180 uses the search database to search information on the network.
  • search results are scored, or ranked, according to one or more scoring mechanisms.
  • the UDP client/server model used in one or more embodiments of the disclosure addresses an issue present when installing a network appliance on a network, such as network 120. That is, when configuring a network appliance, such as search appliance 180, on network 120, it is necessary to configure the device for network communications, e.g., TCP/IP Ethernet communication. For example, in a TCP/IP network environment, an IP address and subnet mask should be established for search appliance in order to operate over TCP/IP within the network in which it is deployed.
  • It is possible to use a manual configuration approach, e.g., manually setting network parameters for search appliance 180.
  • the manual configuration approach assumes a fairly sophisticated knowledge of network configuration needs. It would therefore be beneficial to be able to configure search appliance 180 for network 120 automatically.
  • Another approach, which can be used with embodiments of the present disclosure, to configure search appliance 180 involves the use of BOOTP, or the superseding and encompassing DHCP, to obtain IP settings.
  • search appliance 180 is configured to use any one or a combination of one or more of these.
  • search appliance 180 e.g., identify valid IP settings, for communication on network 120.
  • this approach provides an ability to establish initial communication between search appliance 180 and data server 190.
  • the UDP client/server model contemplates the use of a set of connectionless UDP broadcast messages that can be used to communicate between a network device, e.g., network data server 190, and search appliance 180, without the need for search appliance 180 to be configured with TCP/IP settings, e.g., a TCP/IP address.
  • While the client/server model is described with reference to UDP, other protocols can be used.
  • While a communication protocol defining a set of messages used to communicate with search appliance 180 is described, it should be apparent that other message types can be used to communicate with search appliance 180 via UDP, or another network protocol.
  • the communication protocol defines a structure for messages used in implementing the UDP client/server model.
  • examples are provided to illustrate end-user network setup using the UDP client/server model.
  • messages can be passed between UDP client and server. More particularly, message types are presented in terms of commands issued by the UDP client, e.g., a networked device such as data server 190, to one or more UDP servers, e.g., search appliance 180.
  • a typical command consists of a message sent by a UDP client to one or more UDP servers listening on a dedicated port.
  • a response message can be in the form of a message sent by one or more UDP servers back to the UDP client, which in turn listens on its own dedicated port.
  • messages in the form of UDP limited broadcasts are connectionless, and thus, without state. There is no guarantee that an intended recipient of a message receives the message. Messages are broadcast to all devices on the network segment. Examples of messages/commands that can be used with the UDP client/server model of one or more embodiments of the disclosure are shown in Figure 2.
  • Figure 2, which comprises Figures 2A to 2H, illustrates client/server model message type examples for use in accordance with one or more embodiments of the present disclosure.
  • The first command, the POL message, is issued by a UDP client, e.g., data server 190, to identify all of the UDP servers, e.g., instances of search appliance 180, in a network, or network segment.
  • a UDP server that receives a POL message can reply with a PLR message.
  • Using identification information provided with the PLR message, additional messages can be sent to specific instances of search appliance 180 to cause search appliance 180 to perform an operation specified by the message.
  • For example, another message that can be issued by a UDP client, a GET message, requests IP information from a specific UDP server (e.g., a specific instance of search appliance 180).
  • the intended UDP server can reply using a GTR message, which contains the requested information.
  • Another message that can be issued by a UDP client, a SET message, requests the recipient UDP server to set its IP state.
  • the intended UDP server can reply with an STR message, which indicates the result, e.g., success or failure, of the requested operation.
  • An RES message can be issued by a UDP client to instruct a specific instance of the search appliance 180 to initiate a reset operation to reset its state, which is accompanied by a restart of the appliance.
  • each message is no greater than 512 bytes in length.
  • the remaining types of messages identified above are sent by a search appliance 180 to the UDP client in reply.
  • Each message body identifies the sender via a MAC address field.
  • the POL message sent by the UDP client is intended for all UDP servers that might be listening.
  • the remaining message types are intended for a specific recipient, as is identified by its MAC address in the message body.
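
Because every message is broadcast to the whole segment, each appliance has to decide whether a datagram concerns it. The sketch below shows one way to make that decision, using the header layout given for Figures 2B to 2H (1-byte version, 3-byte type, 6-byte client MAC, then a 6-byte target MAC in addressed commands). The function name `should_handle` is hypothetical, and encoding the message type as three ASCII letters is an assumption.

```python
MAX_MESSAGE_LEN = 512                       # each message is no greater than 512 bytes
ADDRESSED_COMMANDS = {b"GET", b"SET", b"RES"}


def should_handle(datagram: bytes, own_mac: bytes) -> bool:
    """Decide whether a UDP server (appliance) should act on a broadcast datagram."""
    if len(datagram) > MAX_MESSAGE_LEN or len(datagram) < 10:
        return False                        # oversized, or too short for the common header
    msg_type = datagram[1:4]                # assumption: type stored as three ASCII letters
    if msg_type == b"POL":
        return True                         # polling is intended for every listening server
    if msg_type in ADDRESSED_COMMANDS:
        return datagram[10:16] == own_mac   # act only when this appliance is the target
    return False                            # replies (PLR, GTR, STR) are for the client
```

Filtering on the MAC address in the message body is what lets an appliance be addressed before it has any IP settings.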
  • Figures 2B to 2H provide examples of message formats for use with one or more embodiments of the present disclosure.
  • any other format including varying lengths for fields described herein, can be used for a request for the identities of network devices for use with embodiments of the present disclosure.
  • the polling message, e.g., POL, can be sent by a bootstrap client to each of the search appliances 180 (e.g., as a broadcast message) on the network to request the identities of the appliances on the physical network.
  • the message requests the identities of instances of network appliance 180 connected to the network, or portion thereof.
  • the message comprises a field 210 to identify a version of the message protocol, a field 211 to identify the message type and a field 212 to identify the MAC address of the bootstrap client.
  • FIG. 2C provides an example of a polling response message sent in reply to a polling message in accordance with one or more disclosed embodiments.
  • the polling response message, PLR, is sent by a search appliance 180 to a bootstrap client in response to the POL message.
  • a search appliance 180 can send a PLR message to return its MAC address and optionally its hostname.
  • the format of the PLR message shown in Figure 2C comprises field 210 which identifies a protocol version, field 211 which identifies the message type, field 212 which identifies the MAC address of the client, field 213 which identifies the MAC address of the responding search appliance 180, and field 214 which identifies a hostname of the responding search appliance 180.
  • fields 210, 211, 212 and 213 can be 1-byte, 3-bytes, 6-bytes and 6-bytes, respectively, in length.
  • Field 214 can be a variable byte length field, e.g., from zero to two hundred and fifty-five bytes.
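
The POL and PLR layouts described above can be packed and parsed with fixed-width fields. The following is a sketch under stated assumptions: the function names are hypothetical, the message type is encoded as three ASCII letters, the hostname is ASCII, and the variable-length hostname is assumed to occupy the remainder of the datagram, since the text does not say how its length is conveyed.

```python
import struct

HEADER = struct.Struct("!B3s6s")    # field 210 version, field 211 type, field 212 client MAC


def build_pol(client_mac: bytes, version: int = 1) -> bytes:
    """Build a polling (POL) message: version, type, and client MAC."""
    if len(client_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return HEADER.pack(version, b"POL", client_mac)


def build_plr(client_mac: bytes, appliance_mac: bytes, hostname: str = "", version: int = 1) -> bytes:
    """Build a polling response (PLR): header, appliance MAC (213), hostname (214)."""
    name = hostname.encode("ascii")
    if len(client_mac) != 6 or len(appliance_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if len(name) > 255:
        raise ValueError("hostname is limited to 255 bytes")
    return HEADER.pack(version, b"PLR", client_mac) + appliance_mac + name


def parse_plr(data: bytes) -> dict:
    """Parse a PLR message into its fields."""
    version, msg_type, client_mac = HEADER.unpack_from(data)
    if msg_type != b"PLR" or len(data) < 16:
        raise ValueError("not a complete PLR message")
    return {
        "version": version,
        "client_mac": client_mac,
        "appliance_mac": data[10:16],
        "hostname": data[16:].decode("ascii"),   # assumption: rest of datagram
    }
```

With these widths a POL message is 10 bytes and a PLR message is between 16 and 271 bytes, within the 512-byte limit.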
  • the bootstrap client can address a specific instance of search appliance 180 to obtain additional information from the appliance.
  • the GET message is sent by the bootstrap client to request information from a search appliance 180, such as the current network configuration of the appliance (e.g., the appliance's network (e.g., IP) address).
  • the GET message can include authentication information, e.g., identifier, password or other authentication information, which the search appliance 180 can use to authenticate the requester (e.g., the bootstrap client).
  • Figure 2D provides an example of a GET message format for use with one or more disclosed embodiments.
  • Field 210 identifies a protocol version
  • field 211 identifies the message type
  • field 212 contains the MAC address of the bootstrap client
  • field 213 contains the MAC address of the responding search appliance 180
  • field 215 contains authentication information (e.g., identifier, password and/or other authentication information) for use in authenticating the requester (e.g., the bootstrap client) to the search appliance 180.
  • fields 210, 211, 212 and 213 can be 1-byte, 3-bytes, 6-bytes and 6-bytes, respectively, in length.
  • Field 215 can be variable in length, e.g., from zero to two hundred and fifty-five bytes.
  • the response can be in the form of a GTR message having a format such as that shown in Figure 2E.
  • the authentication information contained in the GET message can be used to authenticate the requester. For example, search appliance 180 can authenticate the requester using the authentication information in the GET message before the search appliance 180 decides to respond and sends the GTR message.
  • the GTR message returns the current IP address and subnet mask of the search appliance 180.
  • a gateway configuration can be subsequently performed via an HTTP interface.
  • the GTR message format shown in Figure 2E comprises field 210 to identify a protocol version, field 211 to identify the message type, field 212 to identify the MAC address of the bootstrap client, field 213 to identify the MAC address of the responding search appliance 180, fields 221 and 222 to identify the IP address and subnet mask of the search appliance 180, and field 223 to carry a DHCP flag.
  • the DHCP flag indicates whether the search appliance 180 is configured to use DHCP (e.g., value of "0x01"), or whether the search appliance 180 successfully leased an address from the DHCP server (e.g., value of "0x02"), for example.
  • fields 210, 211, 212, 213, 221, 222 and 223 can be 1-byte, 3-bytes, 6-bytes, 6-bytes, 4-bytes, 4-bytes and 1-byte in length, respectively.
  • In response to receiving the network information from search appliance 180, a bootstrap client can send a command to the appliance to configure its IP settings.
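
Since every GTR field has a fixed width, the whole message can be described by a single format string. The sketch below is illustrative only: the function names and the returned dictionary are hypothetical, the type is assumed to be three ASCII letters, and the DHCP flag is returned as a raw value because the text does not say whether 0x01 and 0x02 are exclusive values or combinable bits.

```python
import ipaddress
import struct

# Fields 210, 211, 212, 213, 221, 222, 223: 1 + 3 + 6 + 6 + 4 + 4 + 1 = 25 bytes.
GTR = struct.Struct("!B3s6s6s4s4sB")
DHCP_CONFIGURED = 0x01      # appliance is configured to use DHCP
DHCP_LEASED = 0x02          # appliance successfully leased an address


def build_gtr(client_mac, appliance_mac, ip, mask, dhcp_flag, version=1) -> bytes:
    """Build a GTR reply carrying the appliance's IP address and subnet mask."""
    return GTR.pack(version, b"GTR", client_mac, appliance_mac,
                    ipaddress.IPv4Address(ip).packed,
                    ipaddress.IPv4Address(mask).packed, dhcp_flag)


def parse_gtr(data: bytes) -> dict:
    """Parse a GTR reply and derive the network the appliance sits on."""
    if len(data) != GTR.size:
        raise ValueError("unexpected GTR length")
    version, msg_type, client_mac, appliance_mac, ip, mask, flag = GTR.unpack(data)
    if msg_type != b"GTR":
        raise ValueError("not a GTR message")
    ip_addr = ipaddress.IPv4Address(ip)
    mask_addr = ipaddress.IPv4Address(mask)
    network = ipaddress.IPv4Network(f"{ip_addr}/{mask_addr}", strict=False)
    return {"appliance_mac": appliance_mac, "ip": ip_addr,
            "mask": mask_addr, "network": network, "dhcp_flag": flag}
```

Deriving the network from the address and mask gives a bootstrap client a quick check that the values it is about to SET are mutually consistent.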
  • the SET message can be sent by the bootstrap client to the search appliance 180 to set its IP address and subnet mask, together with authentication information.
  • Figure 2F provides an example of such a SET message, for use with one or more embodiments of the present disclosure.
  • Field 210 identifies a protocol version
  • field 211 identifies the message type
  • field 212 contains the MAC address of the bootstrap client
  • field 213 contains the MAC address of the responding search appliance 180
  • field 215 contains authentication information (e.g., identifier, password and/or other authentication information) for use in authenticating the requester (e.g., the bootstrap client) to the search appliance 180.
  • Fields 221 and 222 contain the network address information (e.g., IP address and subnet mask) for use by the search appliance 180 to configure its network settings. In accordance with such an embodiment, fields 210, 211, 212 and 213 can be 1-byte, 3-bytes, 6-bytes, and 6-bytes, respectively, in length.
  • Field 215 can be variable in length, e.g., from zero to two hundred and fifty-five bytes.
  • Fields 221 and 222 can be 4-bytes in length.
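  • As a hedged sketch only, a SET message with the fields listed above might be assembled as follows. The text does not state how the length of variable-length field 215 is conveyed or the order of the fields on the wire; a single length byte preceding field 215 and the listing order of the fields are assumptions of this sketch, as are the function and parameter names.

```python
import socket
import struct


def build_set(version: int, client_mac: bytes, appliance_mac: bytes,
              auth: bytes, ip: str, mask: str) -> bytes:
    """Assemble a SET payload: header, authentication field, IP and mask."""
    if len(client_mac) != 6 or len(appliance_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    if len(auth) > 255:
        raise ValueError("authentication field is limited to 255 bytes")
    # Fields 210 to 213: version (1), type (3), client MAC (6), appliance MAC (6).
    header = struct.pack("!B3s6s6s", version, b"SET", client_mac, appliance_mac)
    # ASSUMPTION: one length byte precedes the variable-length field 215.
    auth_field = bytes([len(auth)]) + auth
    # Fields 221 and 222: 4-byte IP address and 4-byte subnet mask.
    addresses = socket.inet_aton(ip) + socket.inet_aton(mask)
    return header + auth_field + addresses
```

  • A one-byte length prefix is consistent with the stated upper bound of two hundred and fifty-five bytes, although other framings are equally possible.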
  • Search appliance 180 can send a response to the SET message, such as an STR message, which indicates a status or outcome of the SET operation.
  • the outcome can indicate a success (e.g., return code has a non-zero value) or failure status (e.g., return code has a value of zero).
  • the STR message can include further information to describe the status in more detail.
  • the STR message can describe a failed operation outcome.
  • Figure 2G provides an example of a message, e.g., an STR message, indicating a configuration operation (e.g., set and reset operations) outcome in accordance with one or more embodiments.
  • Field 210 identifies a protocol version
  • field 211 identifies the message type
  • field 212 contains the MAC address of the bootstrap client
  • field 213 contains the MAC address of the responding search appliance 180
  • field 220 identifies a "status code" of the operation
  • field 217 contains a message further describing the success or failure of the configuration operation.
  • fields 210, 211, 212, 213 and 220 can be 1-byte, 3-bytes, 6-bytes, 6-bytes and 1-byte, respectively, in length.
  • Field 217 can be a variable length field, e.g., from zero to two hundred and fifty-five bytes.
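  • The following Python sketch illustrates one way a bootstrap client might interpret an STR message. It assumes that field 217 occupies the remainder of the datagram and carries text, which the description does not specify; the names used are hypothetical. The success test follows the convention stated above, in which a non-zero code indicates success.

```python
import struct

# Fields 210, 211, 212, 213 and 220: version (1), type (3), client MAC (6),
# appliance MAC (6), status code (1).
STR_HEADER = "!B3s6s6sB"
STR_HEADER_SIZE = struct.calcsize(STR_HEADER)


def parse_str(payload: bytes):
    """Return (succeeded, status_code, detail_text) for an STR payload."""
    if len(payload) < STR_HEADER_SIZE:
        raise ValueError("payload shorter than the fixed STR header")
    _version, _msg_type, _client_mac, _appliance_mac, status = struct.unpack(
        STR_HEADER, payload[:STR_HEADER_SIZE]
    )
    # ASSUMPTION: field 217 is whatever follows the header, up to 255 bytes.
    detail = payload[STR_HEADER_SIZE:]
    if len(detail) > 255:
        raise ValueError("field 217 is limited to 255 bytes")
    succeeded = status != 0  # non-zero means success per the description
    return succeeded, status, detail.decode("utf-8", errors="replace")
```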
  • an RES message can be used to reset the state of search appliance 180.
  • the RES message requests that the search appliance 180 reset its state to a default configuration, e.g., a factory default configuration.
  • Figure 2H provides an example of a message, e.g., RES, to reset the search appliance 180 in accordance with one or more embodiments.
  • Field 210 of the RES message identifies a protocol version
  • field 211 identifies the message type
  • field 212 contains the MAC address of the bootstrap client
  • field 213 contains the MAC address of the responding search appliance 180.
  • fields 210, 211, 212 and 213 can be 1-byte, 3-bytes, 6-bytes and 6-bytes, respectively, in length.
  • each instance of search appliance 180 continuously runs a UDP server and is configured in the factory to accept an IP address leased to it by a DHCP server running in its network. If a DHCP server does not exist in the network, in accordance with embodiments disclosed herein, TCP/IP configuration of search appliance 180 can be performed through commands received by the UDP server executing in search appliance 180, using the UDP client/server model described above.
  • the UDP client/server model described herein can be used to: (i) discover all search appliances 180 connected to the network, e.g., network 120, (ii) obtain the IP address and subnet mask of a specified search appliance 180 so discovered, and/or (iii) set the IP address and subnet mask of a specified search appliance 180 so discovered.
  • search appliance 180 boots in a network containing a DHCP server. In such a case, search appliance 180 obtains a valid IP address from the DHCP server, and network setup of the search appliance 180 can be completed without the UDP client/server model described herein.
  • the following are among the alternatives available to the user in a case that the network contains a DHCP server.
  • the end user need not take any action.
  • the UDP client/server bootstrap client can be run on a network server to discover a search appliance 180 connected to the network, for example, to obtain the IP settings as provided by the DHCP server, or to change the IP settings to another static IP address.
  • search appliance 180 boots in a network that does not contain a DHCP server.
  • search appliance 180 waits for its IP address and subnet mask to be set, e.g., using the SET command of the UDP client/server model from the UDP server.
  • the end user configures the appliance within the network by running the program code which implements the UDP bootstrap client on the network device, e.g., data server 190.
  • the UDP bootstrap client communicates with instances of search appliance 180, as described above, to discover one or more instances of search appliance 180, and/or to issue the command to set its IP address and subnet mask, to configure search appliance 180 for network communications.
  • the UDP bootstrap client can be run to discover one or more instances of search appliance 180.
  • the bootstrap client can be used to obtain an IP address and subnet mask of one or more instances of search appliance 180, reset an IP address and subnet mask of one or more instances of search appliance 180 to static values, or reset one or more instances of search appliance 180 to a factory configuration.
  • Although Figure 1 shows only a few computers 150, 160, and 170 connected to network 120, it is anticipated that dozens or hundreds or even thousands of similarly configured computers 150, 160, and 170 can be "indexed" and searched using instances of search appliance 180.
  • multiple computers 150, 160, and 170 can be configured to communicate with search appliance 180 and one or more data servers 190 and with each other via network 120.
  • Using search appliance 180, a user of a computer, such as one of computers 150, 160, and 170, can initiate a search request to locate and retrieve desired data files from data server 190, for example, with the search request being received and processed by search appliance 180. In response to receipt of such a request, search appliance 180 can, if appropriate, provide access to the requested data files to the requester. As discussed above, in accordance with one or more embodiments and using search appliance 180, a user of one of computers 150, 160, and 170, for example, can request and retrieve information in this fashion not only from data server 190, but also from any other computer or computer system coupled to network 120 and indexed using search appliance 180.
  • Using search appliance 180, it is possible to submit a search request, review the results of a search, and index volumes of data located on a local shared resource, at a remote location connected to network 120, and across an intranet and the Internet.
  • search appliance 180 can also be configured with various additional software components (not shown) such as servers, firewalls, comprehensive security software, and the like. Given the relative advances in the state-of-the-art computer systems available today, it is anticipated that functions of search appliance 180 can be provided by many standard, readily available computing devices and systems configured in accordance with at least one embodiment presently disclosed.
  • Search appliance 180 suitably comprises at least one Central Processing Unit (CPU) or processor 310, a main memory 320, a memory controller 330, an auxiliary storage interface 340, and a terminal interface 350, all of which are interconnected via a system bus 360. It should be apparent that various modifications, additions, or deletions can be made to search appliance 180 illustrated in Figure 3 within the scope of the present disclosure such as the addition of cache memory or the addition of other peripheral devices, for example. Figure 3 is not intended to be an exhaustive example, but is presented for purposes of illustration.
  • Processor 310 performs computation and control functions of search appliance 180, and comprises a suitable central processing unit (CPU).
  • processor 310 can comprise a single integrated circuit, such as a microprocessor, or can comprise any suitable number of integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of a processor.
  • Processor 310 suitably executes one or more software programs contained within main memory 320.
  • Auxiliary storage interface 340 allows search appliance 180 to store and retrieve information from auxiliary storage devices, such as external storage mechanism 370, magnetic disk drives (e.g., hard disks or floppy diskettes) or optical storage devices (e.g., CD-ROM).
  • DASD direct access storage device
  • DASD 380 can be a floppy disk drive that can read programs and data from a floppy disk 390.
  • signal bearing media include: recordable type media such as floppy disks (e.g., disk 390) and CD ROMS, and transmission type media such as digital and analog communication links, including wireless communication links.
  • Memory controller 330, through use of an auxiliary processor (not shown) separate from processor 310, is responsible for moving requested information from main memory 320 and/or through auxiliary storage interface 340 to processor 310. While, for the purposes of explanation, memory controller 330 is shown as a separate entity, those skilled in the art understand that, in practice, portions of the function provided by memory controller 330 can reside in the circuitry associated with processor 310, main memory 320, and/or auxiliary storage interface 340.
  • Terminal interface 350 allows users, system administrators and computer programmers to communicate with search appliance 180, normally through separate workstations or through stand-alone computer systems such as computer systems 170 of Figure 1.
  • Although search appliance 180 depicted in Figure 3 contains only a single main processor 310 and a single system bus 360, it should be understood that the present disclosure applies equally to computer systems having multiple processors and multiple system buses.
  • Although system bus 360 of one or more embodiments of the present disclosure is a typical hardwired, multi-drop bus, any connection means that supports bi-directional communication in a computer-related environment can be used.
  • Main memory 320 preferably contains an operating system 321, user interface 322, database management system 323, together with program code to implement functionality described in connection with embodiments of the present disclosure, such as index mechanism 324, search mechanism 325, report mechanism 326, scoring mechanism 327, and security mechanism 328.
  • the term "memory" as used herein refers to any storage location in the virtual memory space of search appliance 180. It should be understood that main memory 320 need not necessarily contain all parts of all components shown. For example, portions of operating system 321 can be loaded into an instruction cache (not shown) for processor 310 to execute, while other files can be stored on magnetic or optical disk storage devices (not shown).
  • Database management system 323 can be a relational database management system, which can use or implement a data model, or schema, definitions, and data stored according to the data model, such as is described in connection with one or more embodiments disclosed herein. The data stored using database management system 323 can change from query to query, depending on updates made to the stored data using database management system 323.
  • search appliance 180 can include additional components, not shown.
  • embodiments of the present disclosure include a security mechanism 328 for verifying and validating user access to the data files located by search appliance 180.
  • Security mechanism 328 can be incorporated into operating system 321 in accordance with one or more disclosed embodiments.
  • security mechanism 328 can be configured to provide different levels of security and/or encryption for computers 150, 160, and 170 and data server 190 of Figure 1.
  • the level of security applied by security mechanism 328 can be determined by the nature of a given search request and/or response to the search request, including the identity of the requestor.
  • security mechanism 328 can be contained in, or implemented in conjunction with, hardware components such as hardware-based firewalls, routers, switches, dongles, and the like.
  • operating system 321 includes software used to operate and/or control search appliance 180.
  • processor 310 typically executes operating system 321.
  • Operating system 321 can be a single program or, alternatively, a collection of multiple programs that act in concert to perform the functions of an operating system. Any operating system now known to those skilled in the art, or later developed/identified, can be used with one or more embodiments of the present disclosure.
  • Although user interface 322 can take another form, it can comprise web pages, which can be displayed, using a browsing software application such as one identified herein, on a monitor coupled to search appliance 180, and/or displayed on a monitor coupled to a computer connected to search appliance 180 via network 120, such as computer systems 150, 160 and 170.
  • User interface 322 can be used to configure the various components shown in memory 320, including index mechanism 324, search mechanism 325, report mechanism 326, scoring mechanism 327, and security mechanism 328.
  • Database management system 323 is representative of any suitable database now known to those skilled in the art, and/or later developed/identified.
  • database management system 323 is a relational database
  • database management system 323 uses a Structured Query Language (SQL) to manipulate (e.g., create, update, query, etc.) data stored in the database.
  • Although database management system 323 is shown residing in main memory 320, it should be apparent that database management system 323 can also be physically stored in a location other than main memory 320.
  • database management system 323 can be stored on external storage device 370 or DASD 380 and coupled to search appliance 180 via auxiliary storage I/F 340.
  • database 323 can contain keywords for the content contained or accessible via a corporate intranet or the Internet.
  • database management system 323 can consist of multiple disparate databases stored on many different computers or computer systems.
  • search appliance 180 includes a network interface for connecting to network 120, together with the network protocols needed to communicate via network 120.
  • search appliance 180 includes the suite of protocols typically referred to as the Transmission Control Protocol/Internet Protocol, or TCP/IP.
  • Index mechanism 324 is a configurable indexing tool for categorizing various types of information and creating an index to be used in conjunction with searching and retrieving information over network 120, such as from data server 190.
  • Index mechanism 324 can be configured manually with various levels of user intervention or programmatically, depending on the specific type of data to be indexed.
  • Index mechanism 324 can perform an initial index and can be configured to re-index the data files contained in database 323 at user-specified intervals, which index can be used to facilitate searching contents of database 323.
  • Search mechanism 325 can include a web-based software application accessible via a graphical user interface, such as user interface 322, to request and retrieve information from database 323.
  • search mechanism 325 can include a Natural Language Processor (NLP) based search engine which, in conjunction with the other components of search appliance 180, such as indexing mechanism 324, index 329 and scoring mechanism 327, can be used as a robust search tool for locating and retrieving desired content.
  • a user of computers 150, 160, and 170 of Figure 1 can access search mechanism 325 via a standard web browser such as Safari, FireFox, Netscape, Internet Explorer, etc.
  • search mechanism 325 can serve as an interface to the information stored in database 323. It is anticipated that various reports related to the information contained in database 323 can be generated by report mechanism 326, which can include a browser-based user interface for displaying search results.
  • Report mechanism 326 can output, either via hard copy or display on a monitor, a variety of reports, including reports of the results from accessing database 323 via search mechanism 325. These reports can include the results of the various searches performed by a user of a computer, such as computer system 170 of Figure 1. These various reports can be formatted and presented to the user based on the specific type of request made by the user and the type of information to be returned to the user.
  • In accordance with embodiments of the present disclosure, scoring mechanism 327 can be configured to score and rank the results obtained by search mechanism 325 in response to a user's search request, or query.
  • Any number of scoring methodologies can be employed by scoring mechanism 327 to score search results so that the results can be ranked in a way most likely to present relevant results first.
  • scoring mechanism 327 can be user configurable, allowing the user to determine which features and scoring factors (weighting methods) to apply when search results are returned in response to a given search query.
  • scoring mechanism 327 comprises a scoring mechanism to score documents returned from a search query based on a total number, or frequency, of occurrences of the N unique stem words contained in the original search query.
  • equation (1) set forth below provides an example of an equation used to determine a score for the m th result:
  • $$S^{(m)} = \sum_{i=1}^{N} X_i^{(m)} \qquad (1)$$
  • where $X_i^{(m)}$ is the frequency of occurrence of the i th stem word within the m th document.
  • this "frequency weighting" formula may not provide any special consideration for occurrences of more than one stem word in a document. Using this scoring scheme, the sum of the frequencies of all the stem words is measured.
  • Figure 4, which comprises Figures 4A to 4D, provides examples of scoring for use in accordance with one or more embodiments of the disclosure.
  • Table 400 of Figure 4A includes column 406 which identifies three documents, each of which has corresponding frequency counts for first and second search terms shown in columns 407 and 408, and a score for each of the three documents shown in column 409 and rows 401 to 403.
  • Referring to Figure 4A, a scoring example is provided for a search query involving two unique stem words, in which two results are returned with the same score.
  • Column 404 identifies a given document, i.e., m equals 1, 2 or 3, each one of rows 401 to
  • Column 403 corresponds to a given stem word.
  • Column 405 identifies the frequency of occurrence of the first stem word in the m th document.
  • column 406 identifies the frequency of occurrence of the second stem word in the m th document.
  • Column 407 provides a ranking for each document based on the frequencies of occurrence associated with each stem word, which ranking can be calculated using equation (1) above.
  • row 401 corresponding to the first result contains 10 occurrences of the first stem word while row 402, which corresponds to the second result, contains 5 occurrences of each of the two stem words.
  • Because the measure of relevancy is the sum of the number of occurrences across all of the stem words, both documents would be scored the same and would have the same relevance in the search result.
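  • The frequency weighting described above can be illustrated with a short Python sketch. The function name and the document labels are hypothetical; the frequency counts are those given above for rows 401 and 402.

```python
def frequency_score(frequencies):
    """Sum of the per-stem-word frequencies for one document."""
    return sum(frequencies)


# Frequencies of the two stem words in the first and second results.
results = {"doc1": [10, 0], "doc2": [5, 5]}
scores = {name: frequency_score(freqs) for name, freqs in results.items()}
```

  • Both entries of `scores` come out equal, which is the tie that the correlation-aware scoring discussed next is intended to break.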
  • scoring mechanism 327 takes into account the simultaneous occurrences of stem words in the same document, which document might be considered to be more relevant than another document which contains fewer stem words.
  • scoring mechanism 327 determines a score for a search result taking into account occurrences of multiple keywords, or stem words, in a single document. In accordance with such embodiments, scoring mechanism 327 determines a score for a result using a product of frequencies, e.g., $X_i^{(m)} X_j^{(m)}$, in order to quantify correlation, for example.
  • In other words, equation (1) is expanded using combinatorial analysis, introducing combinations of the products of frequencies, in ever higher-order products, up to an order equal to the number of stem words in a given multi-keyword search query.
  • each product created in this fashion can be scaled to the size of the original term, and thus, to each term that precedes it in the expansion. This can be accomplished by dividing each product by the appropriate multiplicative power of the original scoring formula, e.g., $S^{(m)}$ of equation (1) above.
  • the result is the original scoring formula corrected by higher- order correlations between stem words within the document.
  • An example of such a formula which can be used for a query involving N unique stem words is as shown below:
  • Figure 4B provides an example of an outcome using equation (5) in accordance with at least one embodiment of the disclosure.
  • Table 420 of Figure 4B provides column 406 which identifies three documents, each of which has corresponding frequency counts for first and second search terms shown in columns 407 and 408, and a score for each of the three documents shown in column 409 and rows 401 to 403.
  • the scoring formula becomes:
  • Figure 5, which comprises Figures 5A and 5B, provides an example of scoring in exemplary cases in accordance with one or more embodiments of the present disclosure.
  • Figure 5A provides an example of scoring involving one, two and three terms in accordance with at least one embodiment.
  • equation (4) can be used to score results in a case that the search query comprises a single search, or stem, word.
  • Equation (5) illustrates a scoring technique in a case that a search query contains two terms.
  • Equation (5) includes a first order portion, which sums the frequency of occurrences of the first and second terms independent of a simultaneous occurrence of the stem words in a document, and a second order portion, which adjusts for the simultaneous occurrence of the terms in a document.
  • a third order can be used to adjust for the simultaneous occurrence of all three terms in a document, as shown in equation (6) of Figure 5.
  • the first order portion corresponds to a summation of the frequencies of occurrence of each of the three terms in a document independent of a simultaneous occurrence of two or more of the stem words
  • the second term of this formula corrects for the simultaneous occurrences of pairs of the three words within the document
  • the third term corrects for the simultaneous occurrence of all three words in the document.
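  • The equations of Figure 5 are not reproduced in this text, so the following Python sketch is only one possible reading of the description above and should not be taken as the disclosed formula. It assumes that the order-k portion is the sum of the products of frequencies over every combination of k stem words, divided by the first-order sum raised to the power k - 1 so that each portion stays on the scale of the original score. The function name is hypothetical.

```python
from itertools import combinations
from math import prod


def correlated_score(frequencies):
    """First-order frequency sum plus scaled higher-order product terms."""
    first_order = sum(frequencies)
    if first_order == 0:
        return 0.0  # no stem word occurs, so there is nothing to scale by
    score = float(first_order)
    for order in range(2, len(frequencies) + 1):
        # Sum of products over every combination of `order` stem words.
        products = sum(prod(combo) for combo in combinations(frequencies, order))
        # ASSUMPTION: scale the order-k sum by first_order ** (k - 1).
        score += products / first_order ** (order - 1)
    return score
```

  • Under this assumed scaling, a document with frequencies (10, 0) scores 10.0, while a document with frequencies (5, 5) scores 12.5, so the document containing both stem words ranks higher. These values follow from the sketch and are not taken from the tables of Figure 4.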
  • column 406 identifies five documents, which have corresponding frequency counts for first, second and third search terms shown in columns 407 to 409, respectively, and a score for each of the five documents shown in column 410 and rows 401 to 405.
  • Although equation (6) accounts for multiple keywords appearing in the same document, under certain circumstances, it might overemphasize the relevance of lesser matches that happen to have large total counts of occurrences.
  • the scoring formula of equation (6) can be modified for those cases where N > 1. More particularly, it is possible to introduce an adjustable cutoff number $A \le N$, where A represents a minimum threshold number of unique stem words. The score corresponding to a document is set to zero if the number of unique stem words appearing in a document is less than A.
  • the threshold number can be used to address a case in which a result has a high aggregate frequency of occurrence across the N stem words, but has little correlation between the stem words. In such a case, the threshold can be used to determine whether or not to eliminate a result from the search results returned to a user.
  • the scoring formula of equation (6) can be modified as shown in equation (7) shown in Figure 5B.
  • Table 430 of Figure 4D provides column 406 which identifies the five documents shown in table 430 of Figure 4C.
  • Each of the documents has corresponding frequency counts for first, second and third search terms shown in columns 407 to 409, respectively, and a score for each of the five documents shown in column 410 and rows 401 to 405.
  • Other results depicted in table 430 have scores of zero, based on equation (7).
  • Using a threshold to determine a score, e.g., using equation (7), it is possible to identify the relevance of documents based on simultaneous occurrence of multiple stem words of a search query.
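  • The cutoff behavior described above can be sketched in Python as follows. Because equation (7) itself is not reproduced in this text, the sketch applies the threshold in front of a pluggable base scorer; the default base scorer is the plain frequency sum of equation (1), used here only as a stand-in, and the function and parameter names are hypothetical.

```python
def thresholded_score(frequencies, min_terms, base_score=sum):
    """Return zero unless at least `min_terms` distinct stem words occur."""
    if min_terms > len(frequencies):
        raise ValueError("the cutoff A cannot exceed the number of stem words N")
    # Count the unique stem words that actually appear in the document.
    matched = sum(1 for frequency in frequencies if frequency > 0)
    if matched < min_terms:
        return 0
    # ASSUMPTION: any scoring function may be substituted for `base_score`.
    return base_score(frequencies)
```

  • With a cutoff of two, a document with frequencies (40, 0, 0) is scored zero despite its high aggregate count, whereas a document with frequencies (3, 2, 0) keeps its base score.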
  • other criteria can be used alone or in combination with the scoring techniques discussed above.
  • the user can select any or all of the various features of scoring mechanism 327 including without limitation standard frequency weighting and/or enhanced frequency weighting.
  • search appliance 180 can include a security mechanism 328.
  • Security mechanism 328 is configured to provide a security model for providing enhanced search results, based on the identity and role of the searcher.
  • security mechanism 328 employs a log-in model where each user must have a user ID and a password to authenticate their identity on the network and to access search mechanism 325. Security mechanism 328 is described in more detail below.
  • Index 329 represents the index that is constructed by index mechanism 324, based on the content stored in shares accessible via network 120. Index 329 is used by search mechanism 325 to locate content relevant to a given search query presented by a user of a computer, such as one of computers 150, 160, and 170. Index 329 can be periodically rebuilt at a configurable interval in order to accurately reflect any changes made to the content in shares accessible via network 120.
  • Although index 329 is shown separately from database management system 323, it should be appreciated that index 329 can be created and maintained using database management system 323.
  • a discussion of one example of a data model used for indexing and searching is provided below.
  • Although index mechanism 324, search mechanism 325, report mechanism 326, scoring mechanism 327, and security mechanism 328 are shown as separate entities in Figure 3, they can be combined into a single software program or application or program product.
  • Referring now to Figure 6, a process 600 of maintaining and updating an index for the data files used in conjunction with a search appliance in accordance with one or more embodiments of the present disclosure is depicted.
  • indexing of the data files can be performed on shared resources determined to be available via network 120 at step 610.
  • network 120 is searched to identify shared, or sharable, resources, or shares. More particularly, search appliance 180 searches, also referred to herein as crawling or web crawling, the network for sharable resources, or shares, and maintains/updates a repository of information, using database management system 323, associated with each share to facilitate indexing and/or search.
  • Search appliance 180 is capable of performing network searches, including all files stored on a server or network of servers determined to be shared, not mere HTTP (index.htm) searches.
  • a sharable resource can be a hard disk drive, or other storage media, fixed or removable, or one or more folders, files, documents, pages etc. stored thereon, with "sharable" access rights.
  • sharable resources can include web pages typically displayed via web browser.
  • the initial index can be built using database management system 323 and index mechanism 324 (step 620). Indexing can be accomplished by any means now known to those skilled in the art, or later developed/identified. In accordance with one or more embodiments, as part of the indexing methodology, the creation date and/or last modified date for each data file is captured and stored.
  • a keyword database is constructed (step 630) using the key words or terms contained in the data files stored on data server 190.
  • the keyword database can be accessed by search mechanism 325 to identify search result items in response to submission of a search query submitted by a user, for example.
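  • As a minimal sketch of the keyword database concept, the Python below builds an in-memory mapping from each keyword to the files containing it, together with occurrence counts. The tokenization rule, the absence of stemming, and all names are assumptions of this sketch; the disclosure leaves the indexing method open.

```python
import re
from collections import defaultdict


def build_keyword_index(documents):
    """Map each lowercase keyword to {file path: occurrence count}."""
    index = defaultdict(dict)
    for path, text in documents.items():
        # ASSUMPTION: keywords are runs of letters and digits; no stemming.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][path] = index[word].get(path, 0) + 1
    return index


def lookup(index, word):
    """Return the files containing `word` and their occurrence counts."""
    return dict(index.get(word.lower(), {}))
```

  • The per-file counts returned by `lookup` are the kind of frequency data a scoring step could consume when ranking results.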
  • an index and/or a keyword database can be re-built to identify changes in sharable resources, e.g., resources for which the sharable characteristics have changed, and/or to identify changes in content to be reflected in the index.
  • a period of time can be used to determine when to re-build one or more of the index and keyword database.
  • process 600 can continue at steps 640 and 650 in order to wait for such a time.
  • a previously captured creation date and/or last modified date can be examined and compared with a modification date associated with each file that is to be indexed. If there has been no change in the relevant date, then the file need not be re-indexed and the key words associated with that file need not be modified in the keyword database. However, if an existing file has been modified, as determined by comparing the previously captured date with the new file modification date, for example, the new modification date can be captured and the document can be re-indexed and the keywords associated with that document can be updated in the keyword database.
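  • The date comparison described above can be sketched as follows. The dictionaries standing in for the stored and current modification dates, and the `reindex` callback, are hypothetical stand-ins for whatever storage and indexing routines a given embodiment uses.

```python
def files_to_reindex(stored_dates, current_dates):
    """List files that are new or whose modification date has changed."""
    stale = []
    for path, modified in current_dates.items():
        # A missing stored date means the file has never been indexed.
        if stored_dates.get(path) != modified:
            stale.append(path)
    return sorted(stale)


def refresh(stored_dates, current_dates, reindex):
    """Re-index only the changed files and record their new dates."""
    for path in files_to_reindex(stored_dates, current_dates):
        reindex(path)
        stored_dates[path] = current_dates[path]
```

  • Unchanged files are skipped entirely, so a periodic rebuild costs time roughly in proportion to the amount of changed content rather than to the size of the share.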
  • security mechanism 328 can be configured to provide various levels of security functionality.
  • both indexed content and query results are protected from unauthorized access by security mechanism 328.
  • the approach to securing data from unauthorized access can be implemented at the enterprise level and also deployed at the desktop, as appropriate or desired, for example.
  • security mechanism 328 comprises an internal database, used by security mechanism 328 to track a variety of user and context sensitive information in order to ensure access to information only by approved system users.
  • database 740 can comprise data from multiple disparate data stores and the security assigned to the data in database 740 can vary from dataset to dataset.
  • database 740 is comprised of three separate data stores identified as domain 1, domain 2, and domain 3.
  • security for search results returned by search mechanism 325 and reported via report mechanism 326 can be implemented via a role-based administration of web services.
  • a system of one or more federated servers can be constructed in which a password-protected, server-shared database is used to define relational tables that store various types of administrative information and correspondences.
  • users, groups, domains, user roles, and domain groups are defined security components and used by security mechanism 328 to allow or deny access to various types of data stored in database 740 or potentially accessible via search mechanism 325, depending on the status of the various security components.
  • each user is placed in one or more groups, such as groups 710, 720 and 730, with each group identified as having access to particular domains and/or data files.
  • security mechanism 328 can be used to provide customized search results and protect sensitive data files.
  • User 1, User 2, and User 3 are assigned to user group 710.
  • User 3 and User 4 are assigned to user group 720.
  • user 4 and user 5 are assigned to user group 730.
  • each of user 2, user 3, and user 5 submits the same search query to database 740. However, because each of these users is assigned to different user groups, the results that are provided in response to their respective queries can be substantially different.
  • In response to the search request from user 2, security mechanism 328 allows dataset 750 to be returned to user 2. In response to the same search request received from user 3, dataset 760 is returned. Finally, in response to the same search request submitted by user 5, security mechanism 328 allows dataset 770 to be returned.
  • the various system user security components can define all registered users of the system and provide a framework or methodology for determining which users are authorized to access which information.
  • the information relative to each user is stored in the database tables associated with the database for security mechanism 328.
  • various fields can include at least the unique username and a password for each user of search appliance 180 of Figure 1.
  • group permissions can be similarly stored in a database table which includes fields such as a name for each permission group, where a permission group name is a customized text string descriptive of a role or function of the enterprise, such as "sales," "support," or "admin."
  • a user can inherit security-related permissions and restrictions, based on the specific group permissions for the group to which the user is assigned.
  • Searchable domains are stored in a database table whose fields define the location, such as a website URI text string, of each domain from which content can be extracted by indexing operations conducted by index mechanism 324 at the request of a user.
  • a user can be restricted to searching only those domains that are identified in the searchable domains tables for that user and/or for the specific group to which that user belongs.
  • User roles can be stored in a database table whose fields serve to relate system users to group permissions, thus defining one or more roles a user plays within an enterprise. Specifically, a field exists in which a primary key of the system users table can appear in multiple records, each time uniquely corresponding to a second field containing a primary key of an entry of the group permissions table.
  • domain groups can be stored in a database table whose fields serve to relate searchable domains to group permissions, thus associating a domain with one or more group permissions of the enterprise.
  • a field can exist in which a primary key of the searchable domains table can appear in multiple records, each time uniquely corresponding to a second field containing a primary key of an entry of the group permissions table.
  • the above-discussed database tables and their relationships can be used to provide a role-based security protocol to protect the results returned from a given user search request. More particularly, using the same security components and sequence/numbering scheme identified above, a specific security protocol can be implemented.
  • User authentication is provided via a match of input username and password to those stored in the system users table, identifying the user as the individual claimed.
  • the text string names of groups of the enterprise are obtained from the group permissions table. Domains of content within or without the enterprise are obtained from the searchable domains table.
  • the user roles table indicates the groups to which the authenticated user belongs.
  • the domain groups table indicates, for a given searchable domain, what groups of users can access that domain's content, and thus, via the user roles table and the matching of group permissions primary keys, what searchable domains the authenticated user has privilege to see.
  • the above administrative information can be used to filter the query of a search request, so as to return only information from those domains the authenticated user is permitted to see, based on that individual's role within the enterprise.
  • the level of granularity of search restriction can be at a level of a searchable domain, in a case that group permissions are assigned to searchable domains.
  • access can be, but usually is not, granted to users at the level of individual documents, as in a typical file system.
  • an administrator can define searchable domains with a granularity that can vary from finely grained (e.g., at a single-file level), to medium grained (e.g., at a set-of-sub-directories level), to coarsely grained (e.g., at an entire-website level).
  • the granularity of group permissions can be variable, depending on how the searchable domains are defined. Since documents of a common level of sensitivity are typically grouped together, domains are generally defined correspondingly.
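  • a minimal sketch of the role-based filtering described above, written in Python with an in-memory SQLite database. The five table names follow the text; the column names (ukey, gkey, dkey), the group labels, and the assignment of one domain per group are assumptions made for illustration only. Passwords are compared in plain text here solely to keep the sketch short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE system_users       (ukey INTEGER PRIMARY KEY, username TEXT UNIQUE, password TEXT);
CREATE TABLE group_permissions  (gkey INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE searchable_domains (dkey INTEGER PRIMARY KEY, uri TEXT);
CREATE TABLE user_roles         (ukey INTEGER, gkey INTEGER, UNIQUE (ukey, gkey));
CREATE TABLE domain_groups      (dkey INTEGER, gkey INTEGER, UNIQUE (dkey, gkey));
"""


def build_demo_db():
    """Populate the tables with the user/group example from the text."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO system_users VALUES (?, ?, ?)",
                   [(n, f"user{n}", f"pw{n}") for n in range(1, 6)])
    # Group labels are hypothetical; the keys reuse reference numerals 710-730.
    db.executemany("INSERT INTO group_permissions VALUES (?, ?)",
                   [(710, "sales"), (720, "support"), (730, "admin")])
    db.executemany("INSERT INTO searchable_domains VALUES (?, ?)",
                   [(1, "domain1"), (2, "domain2"), (3, "domain3")])
    # users 1-3 in group 710, users 3-4 in group 720, users 4-5 in group 730.
    db.executemany("INSERT INTO user_roles VALUES (?, ?)",
                   [(1, 710), (2, 710), (3, 710), (3, 720),
                    (4, 720), (4, 730), (5, 730)])
    # Assumed mapping: each group may see exactly one domain.
    db.executemany("INSERT INTO domain_groups VALUES (?, ?)",
                   [(1, 710), (2, 720), (3, 730)])
    return db


def permitted_domains(db, username, password):
    """Authenticate, then return the domain URIs the user may search.

    Returns None when the username/password pair does not match.
    """
    row = db.execute(
        "SELECT ukey FROM system_users WHERE username = ? AND password = ?",
        (username, password)).fetchone()
    if row is None:
        return None
    rows = db.execute(
        """SELECT DISTINCT d.uri
             FROM user_roles ur
             JOIN domain_groups dg     ON dg.gkey = ur.gkey
             JOIN searchable_domains d ON d.dkey = dg.dkey
            WHERE ur.ukey = ?
            ORDER BY d.uri""",
        (row[0],)).fetchall()
    return [uri for (uri,) in rows]
```

  • the list returned by permitted_domains could then be used to restrict a search query to the listed domains, so that two users submitting the same query receive different result sets.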
  • search mechanism 325 in conjunction with database 323 and index mechanism 324 can be deployed to perform the requested search and retrieve the results (step 820).
  • scoring mechanism 327 can be deployed to determine a scoring of the search results. Scoring mechanism 327 can use any of the equations described herein to score results, as discussed above. As shown in Figure 8, any one or more of the various weighting mechanisms previously described can be used, together or as alternatives, to score the search results. For example, in accordance with one or more embodiments, search results can be determined by applying frequency weighting (e.g., "enhanced frequency weighting") (step 830). In accordance with one or more embodiments, the application of one or more weighting factors can be user-configurable, and it is possible for each user to configure scoring mechanism 327 for maximum benefit.
  • the search results can be ordered (step 840) and presented to the user (step 850). In this fashion, the search results can be enhanced and customized for each individual user of search appliance 180.
  • search mechanism 325 can use a search model to facilitate searching performed in response to a query consisting of one or more keywords, for example.
  • the search model includes a data model used for searching, indexing and ranking operations, techniques such as word stemming and parts-of-speech tagging, and a lexicon that can learn new words encountered while performing initial and incremental indexing.
  • the search model can use a pipeline architecture, as is described in more detail below.
  • the search model can also include scoring, or ranking, of search result items, e.g., documents, such as that performed using scoring mechanism 327 to rank the results of a query used with one or more embodiments of the present disclosure.
  • Word stemming can be used to remove common morphological and inflectional endings from words, so as to normalize terms.
  • One example of such a word stemming mechanism is the Martin Porter Stemming Algorithm.
  • One example of parts-of-speech tagging is the University of Pennsylvania (Penn) Treebank Tagset.
  • to illustrate a search model which can be used in accordance with one or more embodiments, an illustrative description of a design of data structures used, the layout of the supporting database, and incremental indexing is provided. More particularly, the layout of the database and how it is used to maintain long-term storage of the index constructed from document content is discussed.
  • An illustrative example of a database schema used in one or more embodiments of the disclosure is shown in Figure 9, which comprises Figures 9A and 9B.
  • the schema includes key, domain, uri, page, lexicon, rank and word tables described below.
  • use of primary or foreign keys can be limited in order to allow the insert of new records via a file import mechanism rather than through the use of the SQL INSERT statement. It should be apparent that most, if not all, database vendors do not permit a file import if the table to which data is being imported defines an auto incrementing field and/or explicit foreign key relationships.
  • a file import mechanism can be used in embodiments of the present disclosure to achieve efficiencies. More particularly, in view of the numbers of records to be created in generating a search model index, use of an SQL INSERT to insert records in database tables in a relational database is particularly time consuming and impractical.
  • data that is to be inserted into the database is first written to temporary files, or buffers, and then imported into the database.
  • One example of an exception to this approach involves the domain table, which defines an auto incremented index field, and the key table, which maintains counts of indices. Since relatively few records are involved, the file import mechanism need not be used in creating records in the domain and key tables.
  • the domain, uri, and page tables are used to store information about the document pages that are visited during indexing.
  • a domain refers to a location where documents can be stored, such as a website or file directory.
  • every domain that is indexed can be recorded as an entry in the domain table.
  • a document can be referred to by its Universal Resource Indicator, or URI, which can be associated with a specific domain. Every document that is indexed can be recorded as an entry in the uri table.
  • the lexicon and rank tables can be used in indexing the information accessible via network 120. More particularly, the lexicon table, which contains the learning dictionary of the keyword search model, contains an entry for every original, case-insensitive word known to the indexing algorithm, including the parts of speech of each word.
  • the pos field can be a comma-delimited list of tags constructed, for example, from the Penn Treebank tag set.
  • the lexicon table can contain an entry for every stem word that can be constructed from the set of known original words. Every entry in the lexicon table is associated with a unique index, denoted by the lkey field.
  • the skey field can be a specific lkey index corresponding to a stem word.
  • the skey field can be used to establish a relationship between every original word and its corresponding stem word, within the same table. That is, for example, every stem word entry in lexicon can be self-referential, such that the values of lkey and skey of a stem word entry can be identical.
  • An entry in the rank table records the frequency of occurrence of a stem word within a document page, as it is known within the lexicon table.
  • the word table records the positions of original words encountered during indexing, so that they can be highlighted in subsequent search result presentations.
  • the original words need only be referred to by their corresponding stem words, hence the appearance of the field skey within the definition of the word table.
  • buffering and a file import mechanism can be used in one or more embodiments of the present disclosure.
  • a data structure is used to provide a buffer for data before it is written to the database.
  • the data that is buffered corresponds to the fields in the uri, page, rank, and word tables.
  • buffered data can be written at the end of indexing, or when memory availability reaches a predefined threshold, requiring a flush of data to free the memory.
  • New records can be written to the tables from the buffered data via a file import mechanism, and existing records can be updated via an SQL UPDATE command.
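  • a minimal Python sketch of the buffer-then-import approach described above. Several points are assumptions: the flush threshold is a row count rather than a measure of available memory, the column names of the rank table are hypothetical, and, because Python's sqlite3 module exposes no file import call, a single-transaction executemany over the temporary file stands in for a vendor's bulk file import.

```python
import csv
import os
import sqlite3
import tempfile


class ImportBuffer:
    """Buffers rows for one table and loads them through a temporary file."""

    def __init__(self, db, table, columns, max_rows=1000):
        self.db = db
        self.table = table
        self.columns = tuple(columns)
        self.max_rows = max_rows  # stand-in for a memory-availability threshold
        self.rows = []

    def add(self, row):
        """Buffer one row; flush automatically when the threshold is reached."""
        self.rows.append(tuple(row))
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        """Write buffered rows to a temporary file, then import that file."""
        if not self.rows:
            return
        fd, path = tempfile.mkstemp(suffix=".csv")
        try:
            with os.fdopen(fd, "w", newline="") as fh:
                csv.writer(fh).writerows(self.rows)
            placeholders = ", ".join("?" for _ in self.columns)
            sql = (f"INSERT INTO {self.table} ({', '.join(self.columns)}) "
                   f"VALUES ({placeholders})")
            # Stand-in for a bulk file import: one transaction for the whole file.
            with open(path, newline="") as fh, self.db:
                self.db.executemany(sql, csv.reader(fh))
            self.rows.clear()
        finally:
            os.remove(path)
```

  • a caller would create one buffer per table (uri, page, rank, word), call add during indexing, and call flush once more when indexing ends.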
  • Another type of data structure used in indexing is an N-ary trie tree, where N is a number of characters (e.g., upper case) in the alphabet, plus digits and punctuation marks.
  • This tree structure can be used to hold the contents of the entire lexicon in memory and to provide fast lookups (e.g., a word lookup), for example.
  • the tree structure is populated using the contents of the lexicon table. If new words are encountered during indexing, they can be added to the tree.
  • the contents of the tree can be written back to the lexicon table.
  • the tree's contents can be written back to the lexicon table using a file import mechanism, as discussed above. For example, entries in the tree which represent new words found during indexing can be imported to the lexicon table via a temporary buffer, or file, using a file import mechanism.
  • the N-ary trie tree structure can be used with large dictionaries of words because text-string lookup within the trie structure is quite fast.
  • Each node of the tree contains an array of size N, where each element of the array is potentially a child node.
  • Figure 10 provides an example of a 3-ary trie tree in accordance with at least one disclosed embodiment.
  • tree 1000 is constructed from an alphabet consisting of the upper case letters A, B, and C.
  • Each of the elements (circles) of the 3-size rectangles, or arrays, 1002A to 1002D corresponds to a letter, with the top element in an array corresponding to A, the middle element to B, and the bottom element to C.
  • Circles 1003A to 1003D represent allocated nodes.
  • the squares 1001A to 1001D represent an allocation of data at a node, such as the parts of speech of a word.
  • the example of the 3-ary trie tree shown in Figure 10 depicts the storage of data for the words AB, ABC, C, and CC.
  • node indicator 1001A indicates that element 1003A, which corresponds to the letter "C", is allocated.
  • node indicators 1001B to 1001D correspond to allocations of letters "C", "B" and "A" in arrays 1002B to 1002D, respectively.
  • Array 1002A corresponds to the word "C".
  • the word "CC" can be formed from elements 1003A and 1003B.
  • arrays 1002B and 1002C can form the word "AB" from elements 1003B and 1003C.
  • the word “ABC” can be formed using elements 1003B, 1003C and 1003D of arrays 1002B, 1002B and 1002C, respectively, and traversal paths 1005 and 1006.
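  • the N-ary trie described above can be sketched in Python as follows. The class and method names are hypothetical; each node holds an array of N child slots plus an optional data item (for example, a parts-of-speech string) that is set only where a stored word ends.

```python
class TrieNode:
    __slots__ = ("children", "data")

    def __init__(self, n):
        self.children = [None] * n  # one potential child per alphabet symbol
        self.data = None            # allocated only at the end of a stored word


class Trie:
    def __init__(self, alphabet):
        # Map each symbol to its slot in the per-node child array.
        self.index = {ch: i for i, ch in enumerate(alphabet)}
        self.root = TrieNode(len(self.index))

    def insert(self, word, data):
        """Store data for word, allocating nodes along the path as needed."""
        node = self.root
        for ch in word:
            i = self.index[ch]
            if node.children[i] is None:
                node.children[i] = TrieNode(len(self.index))
            node = node.children[i]
        node.data = data

    def lookup(self, word):
        """Return the data stored for word, or None if word is not stored."""
        node = self.root
        for ch in word:
            i = self.index.get(ch)
            if i is None or node.children[i] is None:
                return None
            node = node.children[i]
        return node.data

    def words(self, node=None, prefix=""):
        """Yield (word, data) pairs, e.g. for writing back to the lexicon."""
        node = node or self.root
        if node.data is not None:
            yield prefix, node.data
        for ch, i in self.index.items():
            child = node.children[i]
            if child is not None:
                yield from self.words(child, prefix + ch)
```

  • building the trie with the alphabet "ABC" and inserting AB, ABC, C, and CC reproduces the 3-ary example: a lookup costs one array access per character, and a prefix such as "A" that was never stored as a word returns no data.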
  • indexing can be performed using a pipeline thread architecture. More particularly, the sequential nature of indexing can be broken up into segments and assigned to the multiplexing stages of the pipeline, so as to enhance throughput. For example, in accordance with one or more embodiments, web crawling can be assigned to the first stage of the pipeline, and the second stage can be used to perform initial format parsing of documents. Additional stages might be used for further passes through documents (such as to apply sophisticated image recognition algorithms). In one of the final stages of the pipeline, indexed content can be written to the working store.
  • a single multiplexing stage can be assigned to perform all of the tasks of indexing, from web crawling, to format parsing, to indexing of words.
  • the concatenation of all of the sequential tasks can comprise the indexing procedure.
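  • a minimal Python sketch of a pipeline thread architecture of the kind described above, assuming one thread per stage connected by queues. The three stage functions (crawl, parse, index) and the in-memory PAGES mapping are hypothetical stand-ins for web crawling, format parsing, and word indexing; error handling is omitted.

```python
import queue
import threading
from collections import Counter

DONE = object()  # sentinel that tells each stage to shut down


def run_stage(work, inbox, outbox):
    """Apply work to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(work(item))


def run_pipeline(items, stages):
    """Run items through the stages, one thread per stage, preserving order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical document store standing in for content reachable by a crawler.
PAGES = {"doc1": "Alpha beta alpha", "doc2": "gamma"}


def crawl(uri):
    return uri, PAGES[uri]


def parse(doc):
    uri, text = doc
    return uri, text.lower().split()


def index(doc):
    uri, words = doc
    return uri, Counter(words)
```

  • with this arrangement, the crawl stage can already fetch the next document while the parse and index stages are still working on earlier ones, which is the throughput benefit the text attributes to the pipeline.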
  • indexing includes a parsing of documents, or other items found on network 120, to identify new words to be added to the lexicon. In addition, with respect to each document, indexing identifies the words contained within the document, the locations of each of these words, and a frequency of occurrence of the words found in the document.
  • embodiments of the present disclosure contemplate the ability of the lexicon to learn new words.
  • the current content of the lexicon is loaded into memory, as discussed herein. This includes any predefined entries whose parts of speech and corresponding stem words have been carefully reviewed, such as by visual inspection.
  • for new words encountered during indexing, stem words can be estimated using the Porter stemming algorithm, for example.
  • each new word can be assigned a default part of speech, such as by using the NN tag of the Penn Treebank tag set, for example.
  • the lexicon of the keyword search model can be initialized, e.g., in a version shipped to the end customer, with predefined entries or no entries at all.
  • the following describes incremental indexing, which can be used with a keyword search model in one or more embodiments of the present disclosure.
  • two distinct time values, (i) the start time, index_time, of the indexing procedure and (ii) the last modification time, last_mod_time, are maintained for each document visited. These values can be stored, respectively, in the index_time and last_mod_time fields of each record of the uri table of the database schema set forth above.
  • document information stored in the uri table is preferably loaded into a data structure in memory to facilitate comparison of last modification times. If the document cannot be found in the data structure, it is added to the data structure, together with its last modification time and the start time of the present indexing. If the document is found in the data structure, then its modification time is compared to the modification time stored in the data structure corresponding to the document. If the two times are equal, then the document is not indexed again. Otherwise, the document is again fully indexed, i.e., every page, and the information pertaining to the document, including its last_mod_time and index_time, is updated in the data structure.
  • a "final scrub" of the database can be performed prior to completing an indexing operation.
  • This final scrub can remove obsolete records from the database, for example, those entries that correspond to documents that are identified during the indexing operation as no longer existing (e.g., a document no longer resides within the domains indexed by the current indexing operation) or for whatever reason no longer able to be indexed.
  • Documents so identified during an indexing operation can be removed by deleting their corresponding entries from the uri table, along with any explicit or implicit relationships to other tables in the database. Thus, for example, all pages of such documents also can be deleted from the page table.
  • Obsolete records of the uri table are those whose values within the index_time field do not equal the present start time of indexing.
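  • the incremental indexing and final scrub described above can be sketched in Python as follows. The in-memory uri table is modeled as a dictionary and index_document is a hypothetical callback. One assumption goes beyond the text: an unchanged document that is visited again has its index_time refreshed to the present start time, since otherwise the scrub rule would remove it.

```python
def incremental_index(uri_table, visited, start_time, index_document):
    """Re-index only new or modified documents, then scrub obsolete records.

    uri_table: dict mapping uri -> {"last_mod_time": ..., "index_time": ...}
    visited:   iterable of (uri, last_mod_time) pairs seen during this run
    Returns the list of uris removed by the final scrub.
    """
    for uri, last_mod_time in visited:
        entry = uri_table.get(uri)
        if entry is None or entry["last_mod_time"] != last_mod_time:
            # New or modified document: index it fully and record both times.
            index_document(uri)
            uri_table[uri] = {"last_mod_time": last_mod_time,
                              "index_time": start_time}
        else:
            # Unchanged document: skip indexing, but mark it as still present
            # (assumption, see the lead-in).
            entry["index_time"] = start_time

    # Final scrub: any record not stamped with the present start time is obsolete.
    obsolete = [uri for uri, entry in uri_table.items()
                if entry["index_time"] != start_time]
    for uri in obsolete:
        del uri_table[uri]
    return obsolete
```

  • in a database-backed version, deleting a uri record would also remove the related page, rank, and word records.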
  • the query is processed against the search model described above.
  • the example query includes a keyword, "FOO", which is taken from the user request (e.g., the user request might involve a request for documents containing the word
  • the query shown below is an SQL query involving the lexicon table of the keyword search model, which can be used to look up each unique keyword in the lexicon
  • the lexicon table of the database contains entries for words and their stems and maintains a relationship between each word and its stem.
  • results with a score of zero can be pruned from the list before return to the end user.
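  • a hedged Python sketch of a lexicon lookup followed by pruning of zero-score results. This is not the query of the disclosure: the column names word, pkey, and freq are assumptions (only lkey, skey, and pos are named in the text), and the score used here, a plain sum of stem frequencies, is a stand-in for the scoring equations of scoring mechanism 327.

```python
import sqlite3


def build_demo_index():
    """Create tiny lexicon and rank tables for illustration."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE lexicon (lkey INTEGER, word TEXT, pos TEXT, skey INTEGER);
        CREATE TABLE rank    (pkey INTEGER, skey INTEGER, freq INTEGER);
    """)
    db.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
        (1, "foo", "NN", 1),    # stem entry: lkey and skey are identical
        (2, "foos", "NNS", 1),  # original word pointing at its stem
        (3, "bar", "NN", 3),
    ])
    db.executemany("INSERT INTO rank VALUES (?, ?, ?)", [
        (10, 1, 4), (11, 1, 1), (11, 3, 2), (12, 3, 5),
    ])
    return db


def search(db, keywords, pages):
    """Score candidate pages by stem frequency and drop zero scores."""
    scores = dict.fromkeys(pages, 0)
    for keyword in {k.lower() for k in keywords}:
        # Look up each unique keyword in the lexicon to find its stem.
        row = db.execute("SELECT skey FROM lexicon WHERE word = ?",
                         (keyword,)).fetchone()
        if row is None:
            continue  # unknown word: contributes nothing to any score
        for pkey, freq in db.execute(
                "SELECT pkey, freq FROM rank WHERE skey = ?", row):
            if pkey in scores:
                scores[pkey] += freq
    hits = [(pkey, score) for pkey, score in scores.items() if score > 0]
    # Highest score first; ties broken by page key for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

  • because lookups go through the stem key, a query for "foos" matches the same pages as a query for "foo".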
  • search appliance 180 identifies servers which provide shared resources, or shares. Servers are identified using several methods depending on the characteristics of the target network.
  • search appliance 180 can browse the network address space (e.g., the network address space of search appliance 180) using network browsing tools and/or use directory services to find shared resources.
  • the search appliance 180 can locate resources by browsing the network using a browser service.
  • a browser service or server, provides a list of available resources on a network domain.
  • a master browser maintains the main or master list of computers and shared resources. For example, all workgroups or domains can have one master browser.
  • a master browser maintains a master list of shared resources, and browser servers maintain a subset of the master list of shared resources. These lists are updated periodically to reflect shared resources added or removed.
  • search appliance 180 searches network 120 to identify sharable resources using SAMBA, an open source utility suite which provides information about shared resources. Documentation for the SAMBA utility suite can be found at www.samba.org.
  • one such utility is SMBtree, which can be used to browse the network to identify a list, e.g., in the form of a tree, showing known domains, the servers in those domains, and the shares on the servers. It has been determined by the inventor of the present invention that this utility does not necessarily provide an accurate and complete listing of the domains, servers and/or shares. Accordingly, in accordance with embodiments of the present invention, other SAMBA utilities are used to supplement the SMBtree utility, in order to obtain a more complete identification of shares accessible via the network.
  • Another SAMBA utility, a master and browser lookup utility used to supplement, or in place of, the SMBtree utility, locates all of the browsers, i.e., the master browser and browser servers, on the network, together with their NetBIOS names.
  • Another utility, the SMBclient utility, is then used in embodiments of the present invention to obtain directory information from the servers identified by the former utility.
  • the SMBtree utility can be used to provide a list of the servers and shares on the servers.
  • the search appliance 180 can be configured to find shared resources by consulting a directory service.
  • search appliance 180 uses a directory access protocol (e.g., Light-weight Directory Access Protocol, or "LDAP") to consult directories, such as those directories maintained by Windows Domain Controllers, and Windows Catalog Servers, for example.
  • the process can be iteratively performed until no new servers are returned.
  • the iterative process is implemented as a PERL script.
  • Figure 12, which comprises Figures 12A and 12B, provides an example of pseudo code of a script for use in discovering shared resources in accordance with one or more embodiments.
  • search appliance 180 can examine network configuration information to determine the type of network services that are being used on the network.
  • the network configuration information can be obtained from information entered via a graphical user interface, for example.
  • search appliance 180 can be configured as a DHCP client, which communicates with a DHCP server to request network configuration information (e.g., IP address information, information regarding available domain name servers, NetBIOS servers and/or Windows™ Name Service-enabled servers, etc.).
  • This additional configuration option using manual configuration and/or DHCP configuration information retrieval, provides support for NetBIOS networks that span network segments.
  • search appliance 180 can retrieve shared resource information identified in a previous network search, as well as previously-supplied authentication information. In some cases, if not most, authentication information (e.g., username and password) must be supplied to a server to obtain information regarding the server's shared resources, or other information regarding the network.
  • search appliance 180 can use its IP address to identify an address space, e.g., a network block extent, and the IP addresses in the address space.
  • Search appliance 180 can search for devices that accept TCP connections on ports known to correspond to specific file sharing services. For example, a NetBIOS-over-TCP protocol set can be used to attempt to open a connection to a port (e.g., SMB ports 139 and/or 445).
  • An Active Directory Service (ADS) LDAP server can be identified by attempting to open a connection to port 389.
  • An accessible server is identified, and each server identified can be queried directly to identify shared resources (e.g., by obtaining a "share list" from an identified server).
  • a server name list is generated using the servers identified by a search of the address space.
  • Each LDAP server found (e.g., by attempting to open a connection to port 389) is queried to identify names of "Domain Member" servers/computers.
  • Each IP address found (e.g., by attempting to open a connection to ports 139 and 445) is used to identify a corresponding server name.
  • the NetBIOS or WINS protocols can be used to retrieve a server name corresponding to an IP address. If a server name corresponding to an IP address cannot be determined, the IP address is used as the server name.
  • An IP address can be resolved, and a corresponding server name identified, using a reverse lookup operation, e.g., via a Domain Name Service (DNS).
  • Each named server, or unresolvable IP address, identified can then be queried to obtain a share list.
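  • a minimal Python sketch of the address-space scan and name resolution described above, using only the standard library. The prefix length parameter is an assumption (the text speaks only of a network block extent), and name resolution is limited to a DNS-style reverse lookup; NetBIOS and WINS lookups are not shown. The probe and resolve parameters exist so the logic can be exercised without touching a real network.

```python
import ipaddress
import socket

SHARE_PORTS = (139, 445)  # ports associated with SMB file sharing
LDAP_PORT = 389           # port associated with LDAP directory services


def address_space(own_ip, prefix_len):
    """List the host addresses of the network block containing own_ip."""
    network = ipaddress.ip_network(f"{own_ip}/{prefix_len}", strict=False)
    return [str(host) for host in network.hosts()]


def accepts_tcp(ip, port, timeout=0.5):
    """Return True if a TCP connection to (ip, port) can be opened."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def server_name(ip, resolve=socket.gethostbyaddr):
    """Reverse-resolve ip; fall back to the IP address itself on failure."""
    try:
        return resolve(ip)[0]
    except OSError:
        return ip


def scan(own_ip, prefix_len, probe=accepts_tcp, resolve=socket.gethostbyaddr):
    """Return (file_servers, ldap_servers) found in the address space."""
    file_servers, ldap_servers = [], []
    for ip in address_space(own_ip, prefix_len):
        if any(probe(ip, port) for port in SHARE_PORTS):
            file_servers.append(server_name(ip, resolve))
        if probe(ip, LDAP_PORT):
            ldap_servers.append(server_name(ip, resolve))
    return file_servers, ldap_servers
```

  • the resulting server name list, including any unresolvable IP addresses, is what would then be queried for share lists.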
  • domain or server-level authentication credentials (e.g., login name and password) can be needed to obtain a listing of shared resources (e.g., a "share list"), in which case available authentication credentials (e.g., from configuration/initialization information) can be used.
  • a utility, such as SAMBA's SMBclient, can be used to request a "share list" from a named server, or IP address. For those servers/IP addresses lacking authentication credentials, or in a case that a server/IP address does not require authentication, the SMBclient can be used without authentication credentials. If authentication credentials are needed to retrieve the "share list", the SMBclient can be used with authentication credentials.
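  • a short Python sketch of how a share list request might be issued through the smbclient command-line tool, with and without credentials. The helper names are hypothetical. The options used are smbclient's -L (list shares on a host), -N (do not prompt for a password), and -U (user name, optionally followed by % and a password); the raw output is returned unparsed because its layout is not assumed here.

```python
import subprocess


def share_list_command(server, username=None, password=None):
    """Build the smbclient argument list for requesting a share list."""
    command = ["smbclient", "-L", server]
    if username is None:
        # No credentials available or none required: anonymous request.
        command.append("-N")
    else:
        # Note: a password given on the command line can be visible to
        # other local users in the process list.
        command += ["-U", f"{username}%{password}"]
    return command


def fetch_share_list(server, username=None, password=None, timeout=30):
    """Run smbclient and return its raw text output (requires Samba tools)."""
    completed = subprocess.run(
        share_list_command(server, username, password),
        capture_output=True, text=True, timeout=timeout)
    return completed.stdout
```

  • a caller could first try the anonymous form and retry with stored credentials when the server rejects the request.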
  • If a "share list" is obtained, it can be examined, and server name information contained in the "share list" can be used to resolve a server name.
  • if a new server name is identified from the "share list" (e.g., a new server name is listed in the "share list" and/or information contained in the "share list" is used to resolve a previously-unresolvable IP address), authentication credentials are identified (if available), and the server can be queried to retrieve its "share list", as previously discussed.
  • An obtained "share list" can be examined to identify shared resources, or shares, which can be accessed for shared files.
  • the "share list" can be examined to determine whether a previously-undiscovered domain and/or workgroup is identified, which can be added to a domain/workgroup list.
  • domain-level authentication credentials might be available for a newly-discovered domain, which credentials can be used to obtain a "share list".
  • previously-undiscovered peer servers can be identified and added to the list of servers to be queried for a listing of shared resources.
  • An iterative discovery process is used to discover named servers and IP addresses. In accordance with at least one embodiment, the iterative process continues until no new servers can be identified.
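  • the iterative discovery process described above can be sketched in Python as a simple worklist loop. The get_share_list callable is hypothetical: it is assumed to return a pair (share names, server names referenced by the share list) or None when the server cannot be queried. The loop ends when no new servers are identified.

```python
from collections import deque


def discover(seed_servers, get_share_list, credentials=None):
    """Query servers for share lists until no new servers turn up.

    seed_servers:   server names or IP addresses found by the initial scan
    get_share_list: callable(server, creds) -> (shares, referenced) or None
    credentials:    optional dict mapping server -> credentials to pass along
    Returns a dict mapping each answering server to its list of shares.
    """
    credentials = credentials or {}
    pending = deque(seed_servers)
    seen = set(seed_servers)
    shares = {}
    while pending:
        server = pending.popleft()
        result = get_share_list(server, credentials.get(server))
        if result is None:
            continue  # unreachable, or credentials missing or rejected
        share_names, referenced = result
        shares[server] = list(share_names)
        for peer in referenced:
            if peer not in seen:
                # Previously-undiscovered peer server: queue it for a query.
                seen.add(peer)
                pending.append(peer)
    return shares
```

  • tracking every server already seen guarantees termination even when two servers reference each other.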
  • Shares discovered using the above-identified iterative process can be mounted to provide access to shared files. That is, for example, a mount operation which references a network device, such as a server or storage appliance and/or a file system, storage device, directory, file, etc. of the network device, makes the referenced item available for access.
  • While the SMB protocol/file system implementation of SAMBA can be used to mount shared files discovered using the above-described iterative process, older versions of the SMB protocol do not support digital signatures, or digital signing. This can result in an incompatibility with file systems that use an authentication technique, such as digital signing, in connection with,
  • the CIFS VFS (i.e., Common Internet File System Virtual File System) is used to mount shares discovered using the above-described iterative process.
  • CIFS VFS is an open source initiative in collaboration with Samba, which allows access to shares on such devices as servers and storage appliances.
  • CIFS VFS implements digital signing, and encompasses the SMB protocol, and is compatible with newer Microsoft implementations of the CIFS protocol, of which SMB is a predecessor.
  • CIFS VFS, which implements digital signing and encompasses the SMB protocol, can be used to mount SMB file shares and the newer CIFS file shares, for example, particularly when digital signing is used within mount authentications.
  • the present disclosure provides an apparatus and method for the broad application of indexing, locating and retrieving desired information in an efficient and effective manner.
  • the information retrieval can be for purposes of information collection and copying for investigative purposes, such as discovery in a litigation, audits, or other investigations.
  • the illustrated embodiments are exemplary embodiments only, and are not intended to limit the scope, applicability, or configuration of the present disclosure in any way. Rather, the foregoing detailed description provides those skilled in the art with a convenient road map for implementing the exemplary embodiments of the present disclosure. Accordingly, it should be understood that various changes may be made in the function and arrangement of elements described in the various exemplary embodiments without departing from the spirit and scope of the present disclosure as set forth in the appended claims.

Abstract

The present invention relates generally to the field of electronic documents, and more particularly to the identification and collection of electronic documents, including documents accessible in a network environment, and to an apparatus and method for same.
PCT/US2007/084728 2006-11-14 2007-11-14 Appareil et procédé de collecte d'informations réparties dans un réseau WO2008070415A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US85874906P 2006-11-14 2006-11-14
US60/858,749 2006-11-14

Publications (2)

Publication Number Publication Date
WO2008070415A2 true WO2008070415A2 (fr) 2008-06-12
WO2008070415A3 WO2008070415A3 (fr) 2008-08-14

Family

ID=39492960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/084728 WO2008070415A2 (fr) 2006-11-14 2007-11-14 Appareil et procédé de collecte d'informations réparties dans un réseau

Country Status (1)

Country Link
WO (1) WO2008070415A2 (fr)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2237209A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Active email collector
EP2234045A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Suggesting preservation notice and survey recipients in an electronic discovery system
EP2237205A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Profile scanner
EP2234049A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Background service process for local collection of data in an electronic discovery system
EP2234050A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Predictive coding of documents in an electronic discovery system
EP2237207A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation File scanning tool
EP2234048A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Suggesting potential custodians for cases in an enterprise-wide electronic discovery system
EP2237204A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Positive identification and bulk addition of custodians to a case within an electronic discovery system
EP2234052A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Custodian management system
EP2237208A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Cost estimations in an electronic discovery system
EP2234051A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Labeling electronic data in an electronic discovery system
EP2234044A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Monitoring an enterprise network to determine specific usage of computing devices
EP2234053A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Shared drive data collection tool for an electronic discovery system
EP2234047A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Electronic discovery system
EP2234046A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Methods and apparatuses for communicating preservation notices
EP2237206A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US8396871B2 (en) 2011-01-26 2013-03-12 DiscoverReady LLC Document classification and characterization
US8572376B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Decryption of electronic communication in an electronic discovery enterprise system
US9053454B2 (en) 2009-11-30 2015-06-09 Bank Of America Corporation Automated straight-through processing in an electronic discovery system
US9667514B1 (en) 2012-01-30 2017-05-30 DiscoverReady LLC Electronic discovery system with statistical sampling
US10467252B1 (en) 2012-01-30 2019-11-05 DiscoverReady LLC Document classification and characterization using human judgment, tiered similarity analysis and language/concept analysis
RU2713761C1 (ru) * 2019-06-14 2020-02-07 Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) Method and system for determining the territorial cluster to which an IP address belongs based on transaction data
CN111786811A (zh) * 2020-05-25 2020-10-16 福建中锐电子科技有限公司 Portable on-site electronic data forensics terminal and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010105842A (ko) * 2000-05-18 2001-11-29 구자홍 Method for providing information search results using the Internet
US20050192936A1 (en) * 2004-02-12 2005-09-01 Meek Christopher A. Decision-theoretic web-crawling and predicting web-page change
WO2006118360A1 (fr) * 2005-05-04 2006-11-09 R.S.N. Co., Ltd. Subject-based trend analysis system

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8549327B2 (en) 2008-10-27 2013-10-01 Bank Of America Corporation Background service process for local collection of data in an electronic discovery system
US8504489B2 (en) 2009-03-27 2013-08-06 Bank Of America Corporation Predictive coding of documents in an electronic discovery system
EP2234050A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Predictive coding of documents in an electronic discovery system
EP2234049A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Background service process for local collection of data in an electronic discovery system
EP2237209A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Active email collector
EP2237207A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation File scanning tool
EP2234048A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Suggesting potential custodians for cases in an enterprise-wide electronic discovery system
EP2237204A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Positive identification and bulk addition of custodians to a case within an electronic discovery system
EP2234052A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Custodian management system
EP2237208A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Cost estimations in an electronic discovery system
EP2234051A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Labeling electronic data in an electronic discovery system
US8417716B2 (en) 2009-03-27 2013-04-09 Bank Of America Corporation Profile scanner
EP2234053A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Shared drive data collection tool for an electronic discovery system
EP2234045A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Suggesting preservation notice and survey recipients in an electronic discovery system
EP2234046A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Methods and apparatuses for communicating preservation notices
EP2237206A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US8200635B2 (en) 2009-03-27 2012-06-12 Bank Of America Corporation Labeling electronic data in an electronic discovery enterprise system
US8224924B2 (en) 2009-03-27 2012-07-17 Bank Of America Corporation Active email collector
US8250037B2 (en) 2009-03-27 2012-08-21 Bank Of America Corporation Shared drive data collection tool for an electronic discovery system
US8364681B2 (en) 2009-03-27 2013-01-29 Bank Of America Corporation Electronic discovery system
US9934487B2 (en) 2009-03-27 2018-04-03 Bank Of America Corporation Custodian management system
EP2234044A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Monitoring an enterprise network to determine specific usage of computing devices
EP2237205A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Profile scanner
EP2234047A3 (fr) * 2009-03-27 2010-11-24 Bank of America Corporation Electronic discovery system
US8572376B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Decryption of electronic communication in an electronic discovery enterprise system
US8572227B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Methods and apparatuses for communicating preservation notices and surveys
US8688648B2 (en) 2009-03-27 2014-04-01 Bank Of America Corporation Electronic communication data validation in an electronic discovery enterprise system
US8806358B2 (en) 2009-03-27 2014-08-12 Bank Of America Corporation Positive identification and bulk addition of custodians to a case within an electronic discovery system
US8805832B2 (en) 2009-03-27 2014-08-12 Bank Of America Corporation Search term management in an electronic discovery system
US8868561B2 (en) 2009-03-27 2014-10-21 Bank Of America Corporation Electronic discovery system
US8903826B2 (en) 2009-03-27 2014-12-02 Bank Of America Corporation Electronic discovery system
US9721227B2 (en) 2009-03-27 2017-08-01 Bank Of America Corporation Custodian management system
US9171310B2 (en) 2009-03-27 2015-10-27 Bank Of America Corporation Search term hit counts in an electronic discovery system
US9330374B2 (en) 2009-03-27 2016-05-03 Bank Of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US9542410B2 (en) 2009-03-27 2017-01-10 Bank Of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US9547660B2 (en) 2009-03-27 2017-01-17 Bank Of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US9053454B2 (en) 2009-11-30 2015-06-09 Bank Of America Corporation Automated straight-through processing in an electronic discovery system
US9703863B2 (en) 2011-01-26 2017-07-11 DiscoverReady LLC Document classification and characterization
US8396871B2 (en) 2011-01-26 2013-03-12 DiscoverReady LLC Document classification and characterization
US9667514B1 (en) 2012-01-30 2017-05-30 DiscoverReady LLC Electronic discovery system with statistical sampling
US10467252B1 (en) 2012-01-30 2019-11-05 DiscoverReady LLC Document classification and characterization using human judgment, tiered similarity analysis and language/concept analysis
RU2713761C1 (ru) * 2019-06-14 2020-02-07 Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) Method and system for determining the territorial cluster to which an IP address belongs based on transaction data
WO2020251386A1 (fr) * 2019-06-14 2020-12-17 Публичное Акционерное Общество "Сбербанк России" Determining whether an IP address belongs to a territorial group based on transaction data
CN111786811A (zh) * 2020-05-25 2020-10-16 福建中锐电子科技有限公司 Portable on-site electronic data forensics terminal and device
CN111786811B (zh) * 2020-05-25 2022-07-08 福建中锐电子科技有限公司 Portable on-site electronic data forensics terminal and device

Also Published As

Publication number Publication date
WO2008070415A3 (fr) 2008-08-14

Similar Documents

Publication Publication Date Title
WO2008070415A2 (fr) Apparatus and method for collecting distributed information in a network
US20070073894A1 (en) Networked information indexing and search apparatus and method
US9348918B2 (en) Searching content in distributed computing networks
JP6419633B2 (ja) Search system
US7865537B2 (en) File sharing system and file sharing method
US7440964B2 (en) Method, device and software for querying and presenting search results
US8027976B1 (en) Enterprise content search through searchable links
US6516337B1 (en) Sending to a central indexing site meta data or signatures from objects on a computer network
US8516582B2 (en) Method and system for real time classification of events in computer integrity system
US20090063448A1 (en) Aggregated Search Results for Local and Remote Services
US7930629B2 (en) Consolidating local and remote taxonomies
US8386476B2 (en) Computer-implemented search using result matching
US8572049B2 (en) Document authentication
US7797350B2 (en) System and method for processing downloaded data
US9584522B2 (en) Monitoring network traffic by using event log information
US20030069803A1 (en) Method of displaying content
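JP5492295B2 (ja) Content mesh search
JP5320433B2 (ja) Integrated search device, integrated search system, and integrated search method
JP2005242586A (ja) Program, device, system, and method for providing document views
JP5431475B2 (ja) Search system, search space map server device, and program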
US20060218208A1 (en) Computer system, storage server, search server, client device, and search method
JP2004046460A (ja) Access control scheme in a file management system
US6957347B2 (en) Physical device placement assistant
Albertsen The paradigma web harvesting environment
JP2009122995A (ja) Management system and management method for related processing records

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07871474; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: COMMUNICATION UNDER RULE 112(1) EPC, EPO FORM 1205A DATED 27/08/09)
122 EP: PCT application non-entry in European phase (Ref document number: 07871474; Country of ref document: EP; Kind code of ref document: A2)