US20110264637A1 - Method and a system for information identification - Google Patents

Publication number
US20110264637A1
Authority
US
United States
Prior art keywords
information
sub
items
item
comprises
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/172,998
Inventor
Ariel Peled
Ofir Carny
Lidror Troyansky
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Forcepoint LLC
Original Assignee
PortAuthority Technologies LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US45937203P
Priority to US10/815,764 (US7991751B2)
Priority to US13/172,998 (US20110264637A1)
Application filed by PortAuthority Technologies LLC
Assigned to PORTAUTHORITY TECHNOLOGIES INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: PELED, ARIEL, CARNY, OFIR, TROYANSKY, LIDROR
Publication of US20110264637A1
Assigned to JPMORGAN CHASE BANK, N.A. FIRST LIEN SECURITY AGREEMENT. Assignors: PORTAUTHORITY TECHNOLOGIES, INC., WEBSENSE, INC.
Assigned to ROYAL BANK OF CANADA. SECOND LIEN SECURITY AGREEMENT. Assignors: PORTAUTHORITY TECHNOLOGIES, INC., WEBSENSE, INC.
Assigned to ROYAL BANK OF CANADA, AS SUCCESSOR COLLATERAL AGENT. ASSIGNMENT OF SECURITY INTEREST. Assignors: JPMORGAN CHASE BANK, N.A., AS EXISTING COLLATERAL AGENT
Assigned to PORT AUTHORITY TECHNOLOGIES, INC., WEBSENSE, INC. RELEASE OF SECOND LIEN SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME: 30704/0374. Assignors: ROYAL BANK OF CANADA, AS COLLATERAL AGENT
Assigned to PORT AUTHORITY TECHNOLOGIES, INC., WEBSENSE, INC. RELEASE OF FIRST LIEN SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME: 030694/0615. Assignors: ROYAL BANK OF CANADA, AS COLLATERAL AGENT
Assigned to RAYTHEON COMPANY. PATENT SECURITY AGREEMENT. Assignors: PORT AUTHORITY TECHNOLOGIES, INC., RAYTHEON CYBER PRODUCTS, LLC (FORMERLY KNOWN AS RAYTHEON CYBER PRODUCTS, INC.), RAYTHEON OAKLEY SYSTEMS, LLC, WEBSENSE, INC.
Assigned to PORTAUTHORITY TECHNOLOGIES, LLC. CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: PORTAUTHORITY TECHNOLOGIES, INC.
Assigned to FORCEPOINT LLC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: PORTAUTHORITY TECHNOLOGIES, LLC
Application status is Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/02: Comparing digital values

Abstract

A method for detecting an information item within an information sequence obtained from a digital medium, said information item comprising any one of a specified set of prestored information items, comprising: transforming each of the set of prestored information items into a respective representation, in accordance with a predetermined transformation format; transforming the information sequence obtained from the digital medium, in accordance with the transformation format; and determining the presence of one or more of the prestored information items within the transformed information sequence, utilizing the respective representation, wherein the information items are divided into sets, and applying a security policy that depends on the number of detected information items that belong to the same set.

Description

    RELATED APPLICATION/S
  • This application is a divisional of U.S. patent application Ser. No. 10/815,764 filed on Apr. 2, 2004, which claims the benefit of priority under 35 USC 119(e) of U.S. Provisional Patent Application No. 60/459,372 filed on Apr. 2, 2003. The contents of all of the above applications are incorporated by reference as if fully set forth herein.
  • FIELD OF THE INVENTION
  • The present invention relates generally to the field of analysis of digital information. More specifically, the present invention deals with methods for fast identification of information items within electronic traffic and digital media.
  • BACKGROUND OF THE INVENTION
  • The information and knowledge created and accumulated by organizations and businesses are among their most valuable assets. As such, managing and keeping the information and the knowledge inside the organization and restricting its distribution outside is of paramount importance for almost any organization, government entity or business, and provides significant leverage of its value. Most of the information in modern organizations and businesses is represented in a digital format. Digital content can be easily copied and distributed (e.g., via e-mail, instant messaging, peer-to-peer networks, FTP and web-sites), which greatly increases hazards such as business espionage and data leakage. It is therefore essential to monitor the information traffic in order to keep the information unavailable to unauthorized persons.
  • Various bills and regulations within the United States of America and other countries add another level of importance to the problem of confidential information management and control. Regulations within the United States of America, such as the Health Insurance Portability and Accountability Act (HIPAA), the Gramm-Leach-Bliley Act (GLBA) and the Sarbanes-Oxley Act (SOXA), imply that the information assets within organizations should be monitored and subjected to an information management policy, in order to protect clients' privacy and to mitigate the risks of potential misuse and fraud. In particular, the existence of covert channels of information, which can serve conspiracies to commit fraud or other illegal activities, poses severe risk from both legal and business perspectives.
  • Another aspect of the information management problem is to make the information explicitly available to authorized persons whenever needed, so that it can be utilized in order to create value for the organization. This aspect also requires tracking the information along its life cycle.
  • Methods that attempt to track digital information and manage information and knowledge exist. One of the most prevalent methods is based on key-word and key-phrase filtering: in this case, the system attempts to recognize a pre-defined set of previously stored information items, such as key-words, numbers and key-phrases, within the content, utilizing string comparison algorithms. Such methods are in wide usage, e.g., for email filtering utilizing string matching. However, such methods may become prohibitively slow when the number of stored information items is large.
  • There is thus a recognized need for, and it would be highly advantageous to have, a method and system that allow fast and efficient recognition of a large number of keywords and key phrases within electronic traffic, which will overcome the drawbacks of current methods as described above.
  • SUMMARY OF THE INVENTION
  • It is an object of the present invention to provide a method and a system that facilitates fast and efficient detection and identification of a large number of previously stored information and data items, such as words, key-phrases, credit-card numbers, social security numbers, names, addresses, email address, account numbers, and other strings within electronic traffic.
  • According to a first aspect of the present invention, there is provided a method for detecting an information item within an information sequence obtained from a digital medium, said information item comprising any one of a specified set of prestored information items, comprising:
  • transforming each of said set of prestored information items into a respective representation, in accordance with a predetermined transformation format;
  • transforming said information sequence obtained from said digital medium, in accordance with said transformation format;
  • determining the presence of one or more of said prestored information items within said transformed information sequence, utilizing said respective representation, wherein said information items are divided into sets; and applying a security policy upon the detection of said information item in said information sequence, wherein said security policy depends on the number of detected information items that belong to the same set.
  • In a preferred embodiment of the present invention the method further comprising storing the representations in a database.
  • In another preferred embodiment of the present invention the method further comprising sorting the representations into a sorted list.
  • In another preferred embodiment of the present invention the sorting is in accordance with a tree-sorting algorithm.
  • In another preferred embodiment of the present invention, the information item comprises a single word.
  • In another preferred embodiment of the present invention the information item comprises a sequence of words.
  • In another preferred embodiment of the present invention the information item comprises a delimited sequence of sub-items.
  • In a preferred embodiment of the present invention each of the sub-items comprises a sequence of alphanumeric characters.
  • In another preferred embodiment of the present invention, a type of the information item comprises one of a group of types comprising: a word, a phrase, a number, a credit-card number, a social security number, a name, an address, an email address, and an account number.
  • In another preferred embodiment of the present invention the information sequence is provided over a digital traffic channel.
  • In another preferred embodiment of the present invention the digital traffic channel comprises one of a group of channels comprising: email, instant messaging, peer-to-peer network, fax, and a local area network.
  • In another preferred embodiment of the present invention, the information sequence comprises the body of an email.
  • In another preferred embodiment of the present invention, the information sequence comprises an email attachment.
  • In another preferred embodiment of the present invention the method further comprising retrieving the information sequence from a digital storage medium.
  • In another preferred embodiment of the present invention the digital storage medium comprises a digital cache memory.
  • In another preferred embodiment of the present invention the representation depends only on the textual and numeric content of the information item.
  • In another preferred embodiment of the present invention the transforming comprises Unicode encoding.
  • In another preferred embodiment of the present invention the transforming comprises converting all characters to upper-case characters or to lower-case characters.
  • In another preferred embodiment of the present invention the transforming comprises encoding an information item into a numeric representation.
  • In another preferred embodiment of the present invention the method further comprising applying a first hashing function to the representations.
  • In another preferred embodiment of the present invention the information sequence comprises sub-sequences.
  • In another preferred embodiment of the present invention the sub-sequences are separated by delimiters.
  • In another preferred embodiment of the present invention the sub-sequences separated by delimiters are any of: words; names, and numbers.
  • In another preferred embodiment of the present invention the method further comprising scanning the information sequence to identify the sub-sequences.
  • In another preferred embodiment of the present invention the determining is performed by matching the information item to an ordered series of the sub-sequences.
  • In another preferred embodiment of the present invention the method further comprising applying a policy upon the detection of the information item in the information sequence.
  • In another preferred embodiment of the present invention the policy is a security policy, the security policy comprises at least one of the following group of security policies: blocking the transmission, logging a record of the detection and detection details, and reporting the detection and detection details.
  • In another preferred embodiment of the present invention the information items are divided into sets, and wherein the security policy depends on the number of detected information items that belong to the same set.
  • In another preferred embodiment of the present invention each of the sets comprises information items associated with a single individual.
  • In another preferred embodiment of the present invention the information item comprises a sequence of sub-items.
  • In another preferred embodiment of the present invention the sub-items are separated by delimiters.
  • In another preferred embodiment of the present invention a sub-item comprises one of a group comprising: a word, a number, and a character string.
  • In another preferred embodiment of the present invention the determining comprises using a state machine operable to detect the sequence of delimited sub-items within the information sequence.
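  • One way to picture such a state machine is a word-level tree in which each state corresponds to a prefix of a stored phrase. The Python sketch below is illustrative only: the names `build_trie` and `find_phrases`, the nested-dictionary layout, and whitespace as the sole delimiter are assumptions, not details taken from this specification.

```python
def build_trie(phrases):
    """Build a nested-dict tree; each edge is one delimited sub-item (word)."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[None] = phrase  # terminal marker: a stored item ends in this state
    return root


def find_phrases(words, trie):
    """Return (start_index, phrase) for every stored phrase found in `words`."""
    hits = []
    for start in range(len(words)):
        node = trie  # restart the state machine after each delimiter
        for word in words[start:]:
            node = node.get(word)
            if node is None:
                break  # no stored item continues with this word
            if None in node:
                hits.append((start, node[None]))
    return hits


trie = build_trie(["john smith", "john smith jr", "account number"])
print(find_phrases("mr john smith jr account".split(), trie))
```

  • In this sketch the scan abandons a starting position as soon as a word has no outgoing edge, so most positions cost a single dictionary lookup.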
  • In another preferred embodiment of the present invention the transforming comprises:
  • applying a first hashing function to assign a respective preliminary hash value to each sub-item within the information item; and
  • applying a second hashing function to assign a global hash value to the information item based on the preliminary hash values of the sub-items.
  • In another preferred embodiment of the present invention the information sequence comprises sub-sequences, and wherein the determining comprises:
  • applying the first hashing function to assign a respective preliminary hash value to each of the sub-sequences;
  • applying the second hashing function to at least one of the preliminary hash values to assign a global hash value to the at least one of the sub-sequences; and
  • comparing the global hash value to the hash values of the series.
  • In another preferred embodiment of the present invention the sub-sequences comprise one of a group comprising: a word, a number, and a character string.
  • In another preferred embodiment of the present invention the plurality of series comprises a plurality of ordered combinations of sub-sequences within the information or data sequence.
  • In another preferred embodiment of the present invention the plurality of series comprises a plurality of combinations of sub-sequences within the information or data sequence.
  • In another preferred embodiment of the present invention the second hash function is invariant to reordering of at least two of the sub-sequences.
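  • The two-level hashing scheme, including an order-invariant variant of the second hash, can be sketched in Python as follows. The specification does not name concrete hash functions; truncated SHA-256 and summation modulo 2^64 are stand-ins chosen here for illustration, and all function names are hypothetical.

```python
import hashlib


def preliminary_hash(sub_item: str) -> int:
    """First hashing function: one 64-bit value per sub-item (assumed SHA-256)."""
    digest = hashlib.sha256(sub_item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def global_hash_ordered(sub_items) -> int:
    """Second hashing function that depends on the order of the sub-items."""
    h = hashlib.sha256()
    for s in sub_items:
        h.update(preliminary_hash(s).to_bytes(8, "big"))
    return int.from_bytes(h.digest()[:8], "big")


def global_hash_unordered(sub_items) -> int:
    """Second hashing function invariant to reordering.

    Addition modulo 2**64 is commutative, so any permutation of the
    sub-items yields the same global value.
    """
    return sum(preliminary_hash(s) for s in sub_items) % (1 << 64)


a = ["john", "smith"]
b = ["smith", "john"]
print(global_hash_unordered(a) == global_hash_unordered(b))
print(global_hash_ordered(a) == global_hash_ordered(b))
```

  • An order-invariant combination of this kind would let "Smith John" match a stored "John Smith", at the price of also matching any other permutation of the same sub-items.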
  • In another preferred embodiment of the present invention the method further comprising checking whether the delimited segment was previously stored, and continuing the detection process only if the current delimited segment was previously stored.
  • According to a second aspect of the present invention, a method for determining the absence of a specified information or data item from a list of information or data items is presented. The method comprising:
  • (a) providing an initialized array of indicators;
  • (b) for each member of the list, performing:
      • i. encoding the member with an encoding function to an integer no greater than the size of the array; and
      • ii. setting a corresponding indicator;
  • (c) encoding the specified information or data item with the encoding function; and
  • (d) determining the status of an indicator corresponding to the encoded information or data item.
  • In another preferred embodiment of the present invention a size of the array is greater than the number of items in the list.
  • In another preferred embodiment of the present invention the encoding function comprises a hashing function.
  • In another preferred embodiment of the present invention the information item comprises a string of alphanumeric characters.
  • According to a third aspect of the present invention, a method for determining the absence of a specified information or data item from a list of information or data items is presented. The method comprising:
  • (a) providing a plurality of initialized arrays of indicators, each of the arrays being associated with a respective encoding function for encoding an information or data item into an integer no greater than the size of the respective array;
  • (b) for each of the arrays, performing:
      • (i) encoding each member of the list with the respective encoding function; and
      • (ii) setting a corresponding indicator for each of the encoded members;
  • (c) encoding the specified information or data item with each of the encoding functions; and, for each of the encoded information or data items, determining the status of the corresponding indicator in the respective array.
  • In another preferred embodiment of the present invention the size of each of the arrays is greater than the number of items in the list.
  • In another preferred embodiment of the present invention at least one of the encoding functions comprises a hashing function.
  • In another preferred embodiment of the present invention the information or data item comprises a string of alphanumeric characters.
  • In a preferred embodiment of the present invention an apparatus for detecting an information item within an information sequence, the information item being any one of a specified set of information or data items, is presented. The apparatus comprising:
  • a preprocessor, for transforming the information item into a representation, in accordance with a transformation format; and
  • a scanner, for scanning the information sequence to identify sub-sequences; and
  • a comparator associated with the preprocessor and the scanner, for comparing the representation to the sub-sequences to determine the presence of the specified information item within the information sequence.
  • In a preferred embodiment of the present invention the apparatus for detecting a specified information item within an information sequence further comprising a user interface for inputting the information items.
  • In a preferred embodiment of the present invention the scanner of the apparatus is further operable to transform the information sequence in accordance with the transformation format.
  • In a preferred embodiment of the present invention the scanner is further operable to transform the sub-sequences in accordance with the transformation format.
  • In a preferred embodiment of the present invention the apparatus further comprises an information storage or a database for storing a representation of each information or data item of the set.
  • In a preferred embodiment of the present invention the information sequence is obtained from a digital medium.
  • In a preferred embodiment of the present invention the apparatus further comprising a sorter, for forming a sorted list of the respective representations of the set of information or data items.
  • In a preferred embodiment of the present invention the type of the information item comprises one of a group of types comprising: a word, a phrase, a number, a credit-card number, a social security number, a name, an address, an email address, and an account number.
  • In a preferred embodiment of the present invention the information sequence is provided over a digital traffic channel.
  • In a preferred embodiment of the present invention the apparatus further comprising retrieving the information sequence from a digital storage medium.
  • In a preferred embodiment of the present invention the digital storage medium comprises digital storage medium within a proxy server.
  • In a preferred embodiment of the present invention the apparatus further comprising a non-existence module comprising:
      • an encoder, for encoding the sub-sequences and the information or data item with an encoding function to respective integers, each of the integers being no greater than the size of the array; and
      • an array setter associated with the encoder, for setting indicators in an array of indicators in accordance with the encoded sub-sequences; and
      • a status checker associated with the encoder and the array setter, for determining the status of an indicator corresponding to the information or data item.
  • In a preferred embodiment of the present invention the encoding function comprises a hashing function.
  • The present invention successfully addresses the shortcomings of the presently known configurations by providing a method and system that facilitates fast and efficient detection and identification of a large number of previously stored information and data items, which can efficiently serve digital privacy and confidentiality enforcement as well as knowledge management.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention is herein described, by way of example only, with reference to the accompanying drawings. In the drawings:
  • FIG. 1 illustrates a system for fast detection of keywords, constructed and operative according to a preferred embodiment of the present invention.
  • FIG. 2 illustrates a system, substantially similar to the one described in FIG. 1, which also includes a fast-proof of non-existence module.
  • FIG. 3 illustrates a method for fast-proof of non-existence of items in a database, operative according to a preferred embodiment of the present invention.
  • FIG. 4 illustrates a system, substantially similar to the one described in FIG. 2, which also includes a cache filter, operable to filter out a short list of items.
  • FIG. 5 contains some examples of a tree-based data structure that facilitates detection of multi-word key-phrases.
  • FIG. 6 is a flowchart illustrating an algorithm for fast detection of key-phrases, according to a preferred embodiment of the present invention.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • The present invention describes a method and a system for detection of a large number of previously stored information items, such as words, phrases, numbers, credit-card numbers, social security numbers, names, addresses, email addresses, account numbers and other pre-defined strings of characters, within an information sequence (such as a textual document), particularly but not exclusively in digital media and electronic traffic (e.g., emails).
  • According to a first aspect of the present invention, the method comprises pre-processing of the information items, storing them in a manner that facilitates fast comparison, and then performing sequential analysis of the inspected information sequence, preferably utilizing the delimiters within the information sequence (such as spaces between words) and comparing each of the delimited segments (e.g., each word or sequence of words within a textual document) with the pre-processed information items.
  • With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
  • Reference is now made to FIG. 1, which illustrates a system for fast detection of previously stored information items, such as keywords, numbers and key-phrases, within a digital medium, constructed and operative according to a preferred embodiment of the present invention. The information items insertion module 110 allows a user to insert keywords, key-phrases, numbers and character strings, preferably using a graphical user interface (GUI). The items are first pre-processed by the pre-processor 120. In a preferred embodiment of the present invention, the pre-processing comprises transforming the stored information items into a “canonized” form, in which they are represented in a lowercase (or uppercase) Unicode representation. In a preferred embodiment of the present invention, all non-alpha-numeric characters are omitted. In another preferred embodiment of the present invention the pre-processing comprises transforming the stored information items to their base form, whenever possible (e.g., by transforming verbs to “present simple” form, removing suffixes such as “'s” and “ly”, reducing to phonetic representation etc.). In another preferred embodiment of the present invention the pre-processing comprises encoding the information items into a numeric representation in a manner that facilitates fast detection, as explained below. In another preferred embodiment of the present invention, the numeric representation depends only on the textual and numeric content of the information item. The pre-processed items are thereafter preferably sorted and stored at the storage 130. A digital content to be analyzed 140 is thereafter preferably pre-processed, as explained below, by the pre-processor 145, and then scanned by the content scanner 150, preferably utilizing the existing delimiters (e.g., spaces between words) in order to facilitate faster scanning. After each delimiter, the comparator 160 efficiently compares the sequence which started at that delimiter (usually a word, a number or a sequence of words and numbers) with the sorted items in the storage 130, preferably using one of the methods and algorithms described below. In a preferred embodiment of the present invention, the storage 130 is a database that facilitates efficient queries.
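  • The scan-and-compare flow of FIG. 1 can be illustrated with a short Python sketch. It assumes a sorted in-memory list as the storage 130, binary search as the comparison, whitespace as the only delimiter, and a hypothetical `max_words` cap on phrase length; none of these specifics is prescribed by the description above.

```python
import bisect


def build_store(items):
    """Sort the pre-processed items once so that lookups can use binary search."""
    return sorted(set(items))


def contains(store, candidate):
    """Binary-search the sorted store for an exact match."""
    i = bisect.bisect_left(store, candidate)
    return i < len(store) and store[i] == candidate


def scan(text, store, max_words=3):
    """Return (word_index, item) for each stored item that starts at a delimiter."""
    words = text.lower().split()  # assumed delimiter: whitespace
    hits = []
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length > len(words):
                break
            candidate = " ".join(words[start:start + length])
            if contains(store, candidate):
                hits.append((start, candidate))
    return hits


store = build_store(["secret", "project falcon"])
print(scan("The Project Falcon budget is secret", store))
```

  • Each lookup in this sketch costs O(log n) comparisons for n stored items, which is the kind of saving a sorted store offers over comparing against every item in turn.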
  • In a preferred embodiment of the present invention, the information within the digital medium is first pre-processed and transformed into a representation that facilitates fast comparison with the stored information items. In a preferred embodiment of the present invention, the pre-processing comprises transforms that are applied on the to-be-detected information items, such as:
      • transforming the stored information items into a “canonized” form, in which they are represented in a lowercase (or uppercase) Unicode representation;
      • omitting all non-alpha-numeric characters;
      • transforming the stored information items to their base form, whenever possible (e.g., by transforming verbs to “present simple” form, removing suffixes such as “s” and “ly”, reducing to phonetic representation etc.);
      • encoding the information items into a numeric representation in a manner that facilitates fast detection, as explained below.
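  • A minimal Python sketch of the first two transforms (lowercase Unicode form and omission of non-alphanumeric characters) follows. The function name `canonize` and the use of NFKC normalization are assumptions made for illustration; base-form reduction and numeric encoding are left out.

```python
import unicodedata


def canonize(item: str) -> str:
    """Return a lowercase, alphanumeric-only form of `item` (hypothetical helper)."""
    # Assumption: NFKC normalization folds compatibility forms such as
    # full-width digits into their ordinary counterparts.
    normalized = unicodedata.normalize("NFKC", item)
    lowered = normalized.lower()
    # Omit every character that is neither a letter nor a digit.
    return "".join(ch for ch in lowered if ch.isalnum())


print(canonize("4111-1111 1111-1111"))
print(canonize("O'Brien"))
```

  • Because this helper also drops spaces, it would be applied to each delimited sub-item separately, after the sequence has been split at its delimiters.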
  • In a preferred embodiment of the present invention, the digital medium comprises a digital traffic channel, and information items that were found within the digital traffic are then used in order to apply a policy with respect to the traffic within the channel. In a preferred embodiment of the present invention, the digital traffic channel comprises email (both email body messages and attachments), and a policy, such as security policy, is applied with respect to the emails, as described, e.g., in PCT patent application number IL02/00037, in U.S. Patent Application No. 20020129140, filed Dec. 6, 2001, and in US provisional patent application 60/475,492, filed Jun. 4, 2003, the contents of which are hereby incorporated herein by reference in their entirety. In a preferred embodiment of the present invention the security policy comprises actions such as blocking the transmission, logging a record of the detection and detection details, and reporting the detection and its details.
  • In a preferred embodiment of the present invention, the information items are divided into sets, and the security policy depends on the number of detected information items that belong to the same set. These sets may comprise, e.g., information items associated with a single individual, such as her or his name, her or his social security number, her or his address, her or his bank-account number, etc., and the policy may preclude dissemination of any two or more of these items via email.
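  • The set-dependent policy can be sketched in Python as below. The mapping from item to set identifier, the `threshold` parameter, and the "block"/"allow" return values are hypothetical; the description only states that the policy depends on how many detected items belong to the same set.

```python
from collections import Counter


def policy_action(detected, item_to_set, threshold=2):
    """Return "block" if any set has at least `threshold` distinct detected items."""
    # Count distinct detected items per set; items with no set are ignored.
    counts = Counter(
        item_to_set[item] for item in set(detected) if item in item_to_set
    )
    return "block" if any(n >= threshold for n in counts.values()) else "allow"


# Hypothetical sets: "p1" and "p2" each group the items of one individual.
item_to_set = {"jane doe": "p1", "123456789": "p1", "john roe": "p2"}
print(policy_action(["jane doe", "123456789"], item_to_set))
print(policy_action(["jane doe", "john roe"], item_to_set))
```

  • With a threshold of two, a name alone passes, while the same name together with the matching account number is blocked.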
  • In a preferred embodiment of the present invention, the digital medium comprises digital storage, and the system is operable to detect information items within the information stored in the storage, e.g., in order to detect keywords and keyphrases within a file system or within a proxy server. Such detection can be of importance both for applying a security policy and for information and knowledge management within the organization.
  • Reference is now made to FIG. 2, which illustrates a system, substantially similar to the one described in FIG. 1, where a fast proof of non-existence module 155 is introduced between the scanner 150 and the comparator 160. The proof of non-existence module is operable to prove, with a probability P, that a certain item does not exist in the list in the storage 130, thereby significantly reducing the number of queries to the storage 130.
  • Reference is now made to FIG. 3, which illustrates a method for proof of non-existence of items in a database, operative according to a preferred embodiment of the present invention. The input item 310 is preferably transformed into a numeric representation, by numeric encoding 320. The numeric representation is then subjected to one or more hash functions $h_i$ 330 that transform the numeric representation $x$ produced in stage 320 into an $L$-bit number $h_i(x)$, where the distribution of the numbers is preferably close to uniform over the range $1$ to $2^L$. Array set 340 contains a corresponding array $\alpha_i$, of length $2^L$, for each of the hash functions $h_i$. The elements of the arrays are bits, which are all initialized to zero. The element of the array $\alpha_i$ at the address $h_i(x)$ is then set to 1, indicating the existence of the element $x$.
  • Since the mapping of elements to addresses in the array is quasi-random, there is always the possibility of collisions between two different items, i.e., that $h_i(x_1) = h_i(x_2)$ while $x_1 \neq x_2$. The probability that at least one event of that kind will happen becomes close to one when the number of items becomes substantially greater than the square root of the number of addresses (i.e., $2^{L/2}$), a phenomenon known as the birthday problem. It is therefore not possible to positively indicate the existence of a certain item. However, if there is a 0 in at least one of the corresponding arrays $\alpha_i$, then one can tell for sure that the item does not exist. The method is therefore able to determine the absence of the item from the sequence, but cannot determine the presence of the item in the sequence with 100% certainty. In a preferred embodiment of the present invention, the search is stopped after the first 0 is encountered. Each of the arrays can therefore be considered a filter.
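  • The proof-of-non-existence method of FIG. 3 can be sketched in Python as follows. The class name `AbsenceFilter`, the salted SHA-256 construction standing in for the hash functions, and the use of one byte per indicator (rather than one bit) are simplifying assumptions, not details from the description.

```python
import hashlib


class AbsenceFilter:
    """A set of indicator arrays, one per hash function (illustrative sketch)."""

    def __init__(self, bits_per_address: int, num_arrays: int):
        self.size = 1 << bits_per_address  # 2**L addresses per array
        # One byte per indicator for simplicity; all indicators start at 0.
        self.arrays = [bytearray(self.size) for _ in range(num_arrays)]

    def _address(self, i: int, item: str) -> int:
        # Hypothetical i-th hash function: SHA-256 salted with the array index.
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for i, array in enumerate(self.arrays):
            array[self._address(i, item)] = 1

    def definitely_absent(self, item: str) -> bool:
        """True means the item was never added; False means it may be present."""
        for i, array in enumerate(self.arrays):
            if array[self._address(i, item)] == 0:
                return True  # stop at the first 0 encountered
        return False


f = AbsenceFilter(bits_per_address=16, num_arrays=3)
f.add("alice")
f.add("bob")
print(f.definitely_absent("alice"))
```

  • Only when `definitely_absent` returns False does the full comparison against the storage need to run, which is how a module of this kind would cut the number of queries.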
  • The array's optimal length (and the number of bits in the output of the hash function) is computed based on occupancy, the optimal being 50%, (see discussion below), which requires an array size of approximately 11.42 times the number of items for a single array, and an array size of approximately 1.42 times the number of items in the list times the number of hash functions, for a set of arrays with different hash functions. Each bit for which a respective hashed item exists is given a 1 value (in the first case this is done in the respective array).
  • Searching for an item is based on the fact that an item can only exist if all the corresponding bits are 1, so a process of computing hash functions and checking respective bits takes place, if any bit is 0, the item is not in the list of data items. When the item's hash value contain more bits then L times the number of arrays, and the different bits are statistically independent, one can simply use “bit masks” as the hash functions (i.e., selecting disjoint groups of bits from the item's hash value), however, if they do not contain enough bits, a more substantially independent scheme, such as a hash function of the basic hash function, is needed (although it might be slightly less efficient).
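  • The set-and-check procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `NonExistenceFilter`, its method names, and the use of SHA-256 as the basic hash from which disjoint bit masks are taken are all hypothetical choices made for the example.

```python
import hashlib


class NonExistenceFilter:
    """Sketch: one bit array of length 2**bits per hash function."""

    def __init__(self, bits: int, num_arrays: int):
        # Assumption: the "bit masks" are disjoint slices of one SHA-256
        # digest, so bits * num_arrays may not exceed 256.
        if bits * num_arrays > 256:
            raise ValueError("SHA-256 supplies only 256 bits for the bit masks")
        self.bits = bits
        self.num_arrays = num_arrays
        size_in_bytes = ((1 << bits) + 7) // 8
        self.arrays = [bytearray(size_in_bytes) for _ in range(num_arrays)]

    def _addresses(self, item: str):
        """Yield one address per array, taken from disjoint groups of bits."""
        digest = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest(), "big")
        mask = (1 << self.bits) - 1
        for k in range(self.num_arrays):
            yield (digest >> (k * self.bits)) & mask

    def add(self, item: str) -> None:
        """Set the bit at the item's address in every array."""
        for array, addr in zip(self.arrays, self._addresses(item)):
            array[addr >> 3] |= 1 << (addr & 7)

    def definitely_absent(self, item: str) -> bool:
        """Return True as soon as a 0 bit is found (the search stops there)."""
        for array, addr in zip(self.arrays, self._addresses(item)):
            if not array[addr >> 3] & (1 << (addr & 7)):
                return True
        return False  # all bits are 1: the item may exist, so query the storage
```

  • A `False` result from `definitely_absent` does not prove presence; it only means that the storage still has to be queried.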
  • Following is an analysis illustrating why 50% occupancy is optimal, along with some implementation considerations:
  • Defining the Following Parameters:
      • N: number of items in the database (DB)
      • L: length of filter arrays (1 bit is assigned to each location in the array)
      • $D=\log_2 L$: the number of bits required to define a location in the array.
      • $P=\frac{N}{L}$: the “density of arrays”.
      • X: number of arrays.
      • $K=XL$: the total number of assigned bits.
      • $C=\frac{K}{N}$: the number of total bits per item (the item “cost”)
  • Now, assuming a uniform distribution, the probability of collision with a specific item is $\frac{1}{L}$, and since $N=PL$, the probability of no collisions in an array filter is:
  • $$\left(1-\frac{1}{L}\right)^{N}=\left(1-\frac{1}{L}\right)^{LP}\approx\left(\frac{1}{e}\right)^{P}=e^{-P}$$
  • which is the probability of a negative result for a negative input from one array filter.
  • The failure probability of a single array filter is therefore $1-e^{-P}$.
  • And since:
  • $$L=\frac{K}{X},\qquad P=\frac{N}{L}=\frac{NX}{K}=\frac{X}{C}$$
  • The total failure probability for X filters is:
  • $$\left(1-e^{-P}\right)^{X}=\left(1-e^{-\frac{NX}{K}}\right)^{X}=\left(1-e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}\cdot\frac{K}{N}}=\left[\left(1-e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}}\right]^{\frac{K}{N}}$$
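  • Assuming N, K are constant, the minimum of
  • $$\left[\left(1-e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}}\right]^{\frac{K}{N}}$$
  • is the minimum of
  • $$\left(1-e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}}\quad\text{at}\quad\frac{NX}{K}=\frac{X}{C}\approx 0.7.$$
  • And the failure rate is
  • $$\left(1-e^{-\frac{X}{C}}\right)^{X}=\left[\left(1-e^{-\frac{X}{C}}\right)^{\frac{X}{C}}\right]^{C},$$
  • which, at 0.7, is about $0.6^{C}\cong 0.5^{X}$.
  • $$\frac{X}{C}\approx 0.7\;\Rightarrow\;\frac{C}{X}\approx 1.42,\qquad\frac{C}{X}=\frac{K}{NX}=\frac{LX}{NX}=\frac{L}{N}=\frac{1}{P}$$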
  • Which means that the optimal length of each filter array is about 42% longer (in bits) than the number of items in the list.
  • The probability of a certain bit to be zero is $\left(1-\frac{1}{L}\right)^{N}\approx e^{-P}$, so $P=0.7$ $\Rightarrow$ ~50% occupancy for each filter, which again results in a failure rate of ~$0.5^{X}$.
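  • The value 0.7 quoted in the analysis can be checked directly. The short derivation below uses the same symbols, with $P=X/C$; the auxiliary function $g$ is introduced here only for the check and is not part of the original analysis. Taking the logarithm of the per-array term and differentiating gives a stationary point at $P=\ln 2\approx 0.693$:

```latex
\begin{aligned}
g(P) &= \ln\left[\left(1-e^{-P}\right)^{P}\right] = P\,\ln\left(1-e^{-P}\right),\\
g'(P) &= \ln\left(1-e^{-P}\right) + \frac{P\,e^{-P}}{1-e^{-P}},\\
g'(\ln 2) &= \ln\tfrac{1}{2} + \frac{\ln 2\cdot\tfrac{1}{2}}{\tfrac{1}{2}} = 0,\\
\left(1-e^{-P}\right)^{P}\Big|_{P=\ln 2} &= e^{-(\ln 2)^{2}} \approx 0.62,\qquad
\left(e^{-(\ln 2)^{2}}\right)^{C} = e^{-(\ln 2)\,X} = 2^{-X}\quad\text{for } X = C\ln 2 .
\end{aligned}
```

  • Since $g$ is negative for $P>0$ and tends to zero both as $P\to 0$ and as $P\to\infty$, this stationary point is a minimum, which agrees with the figures of about $0.6^{C}$ and $0.5^{X}$ and with the 50% occupancy $1-e^{-\ln 2}=\tfrac{1}{2}$.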
  • There are two possible cases in which no 0 is encountered during the search process, and a direct query regarding the existence of the item in the storage should be made: the first is the case in which the item does exist in the list, and the second is a “false alarm” due to collisions. In order to minimize the probability of false alarms, X should be increased, at the cost of a larger memory footprint. The optimal X is a tradeoff between the memory cost and the cost of accessing the storage.
  • Because the number of arrays, X, and the number of bits required to define a location in the array, D, are both integers, we should evaluate the formulas at the neighboring integer values and choose the best result (rather than simply rounding to the nearest integer, because the formulas are not linear).
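  • The integer selection for X can be illustrated numerically. The sketch below assumes the failure-rate formula $(1-e^{-X/C})^{X}$ from the analysis and evaluates it at the two integers adjacent to the real-valued optimum $C\ln 2$; the function names `failure_rate` and `best_integer_x` are hypothetical, and the corresponding choice of an integer D is not covered.

```python
import math


def failure_rate(x: int, c: float) -> float:
    """Total failure probability (1 - e^(-X/C))^X for X filters at cost C bits per item."""
    return (1.0 - math.exp(-x / c)) ** x


def best_integer_x(c: float) -> int:
    """Evaluate the formula at both neighboring integers and keep the better one."""
    ideal = c * math.log(2)  # real-valued optimum, X/C = ln 2 (about 0.7)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda x: failure_rate(x, c))
```

  • For C=3.6 the real-valued optimum is about 2.495, whose nearest integer is 2, yet the formula gives a slightly lower failure rate at X=3, which is why the candidates are evaluated rather than rounded.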
  • Note that because of performance issues (cache thrashing) a small first filter might be a good thing regardless, but obviously not small enough to be saturated.
  • In a preferred embodiment of the present invention, the system also utilizes a cache memory that includes a short list of common words that are not keywords or an essential part of a key-phrase. Reference is now made to FIG. 4, which illustrates a system, substantially similar to the one described in FIG. 2, which also includes a cache filter 157, operable to filter out the short list described above.
  • In a preferred embodiment of the present invention, the list of information items is sub-divided into several lists, according to the frequencies of occurrence of the items in the list, such that items that are anticipated to appear frequently in the scanned content appear in a separate list from less frequent items, and a separate non-existence filter is implemented for each of the lists, thereby facilitating optimized resource allocation.
  • In many cases, the items that need to be detected are sequences of delimited segments, e.g., a sequence of words delimited by spaces (a “key-phrase”). The detection problem in this case is, in general, more involved than single word detection, since a search must be performed for a plurality of sequences of words with a variable length, and can no longer be conducted for each word separately. In the following discussion, for the sake of brevity and clarity, we will use the term “word” with respect to any delimited segment of the stored sequence of delimited segments.
  • According to a preferred embodiment of the present invention, the first word in each key-phrase is a root of a tree, and the last words are the leaves of the tree (see examples in FIG. 5). Whenever a root word is found, the corresponding tree is traversed in order to detect key-phrases.
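  • The tree arrangement can be sketched with nested dictionaries, one tree per root word. This is an illustrative assumption about the data structure, since FIG. 5 is not reproduced here; the function names, the whitespace tokenization, the lower-casing, and the use of `None` as a leaf marker are all hypothetical.

```python
def build_phrase_trees(key_phrases):
    """Map each first word (root) to a nested dict of the following words."""
    roots = {}
    for phrase in key_phrases:
        node = roots
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[None] = phrase  # leaf marker: a complete key-phrase ends here
    return roots


def find_phrases(text, roots):
    """Traverse the matching tree whenever a root word is found in the text."""
    words = text.lower().split()
    found = []
    for start, word in enumerate(words):
        node = roots.get(word)
        pos = start
        while node is not None:
            if None in node:
                found.append(node[None])
            pos += 1
            node = node.get(words[pos]) if pos < len(words) else None
    return found
```

  • Words that are not roots cost a single dictionary lookup, so traversal work is only spent where a key-phrase could begin.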
  • In a preferred embodiment of the present invention, identification of key-phrases is based on the following scheme, dubbed Word-Based Hash-List (WBHL). Basically, the algorithm comprises two phases:
      • Pre-Processing: Each word (or other delimited segment of interest) is represented by its hash value. Each key-phrase is represented by the list or the set of the hashes of its single words. (See more detailed description below)
      • Scanning and filtering: The algorithm scans the words, evaluates their hash values and utilizes a hash-table for an immediate rule-out of words that are not contained in the key-phrases. If the scanned word belongs to one or more of the key-phrases, the algorithm efficiently checks all possible candidates according to the hash values of the successive words. In case of a match, the original key-phrase is retrieved and compared with the scanned item. (See more detailed description below)
  • This method allows for commutativity, if required (i.e., “John Doe”=“Doe, John”), and for rapid clearance in cases where words from the key-phrases are not very common in the analyzed text (a probable scenario). It utilizes the fact that the basic units are words, and not characters, in order to achieve better performance than classical algorithms such as Boyer-Moore or Rabin-Karp string search, as described, e.g., in sections 6.5-6.6 of R. A. Vowels: Algorithms and Data Structures in F and Fortran, Unicomp (1999), ISBN 0-9640135-4-1, the contents of which are hereby incorporated herein by reference in their entirety. Furthermore, the performance does not depend on the number of key-phrases (as long as their constituent words are not common in the analyzed text).
  • Disadvantages of the above scheme may be:
      • The non-commutative version may be slow if the first word in one or more key-phrases is common (e.g. “the” or “that”)
      • The commutative version may be slow if any word in one or more key-phrases is common
  • The speed problem may be avoided by removing common words in the canonization process. The removal may require exact textual matching in order to avoid false positives.
  • A more detailed description of the algorithm follows:
  • Key-Phrases Pre-Processing Phase:
      • Compute hash value for each word in key phrases.
      • Build oneWordsPhrases: a hash table for the hash values of each one-word phrase.
      • Build multiWordsPhrases: a hash table for the hash values of each starting word in multi-word phrases.
      • Build multiWordsWords: a hash table for the hash values of each word in multi-word phrases.
      • For each word in multiWordsPhrases, add a hash set for each key-phrase containing that word. The hash set contains hashes of all other words in the phrase.
      • Associate the set with the text of the key-phrase in oneWordsPhrases and multiWordsPhrases.
  • Scanning & Analysis Phase:
  • Initialization:
      • “Canonize” Text
      • candidates: an empty set
      • i=0
  • Analysis:
  •   While i < number of words in the text
        Read word W(i)
        Evaluate the hash of W(i): H(W(i)) (e.g., using CRC32)
        Locate H(W(i)) in oneWordsPhrases (if it exists, do textual
        matching - compare with the actual verbatim)
        if exists:
          For each hash_set in candidates:
            If H(W(i)) not in hash_set:
              delete hash_set
            Else if size(hash_set) = 1:
              delete hash_set
              do textual match
            Else:
              delete H(W(i)) from hash_set
        Append to candidates all hash_sets associated with H(W(i)) in
        multiWordsPhrases (they should not contain H(W(i)))
        i = i+1
      end
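  • The commutative scan above can be sketched in runnable form as follows. This is one possible reading of the pseudocode under stated assumptions, not a definitive implementation: canonization is reduced to lower-casing and splitting on whitespace, `zlib.crc32` stands in for the CRC32 hash, the remaining hashes of a candidate are kept in a list rather than a set so that repeated words are handled, and the names `word_hash`, `preprocess` and `scan` are hypothetical.

```python
import zlib


def word_hash(word):
    return zlib.crc32(word.encode("utf-8"))


def preprocess(key_phrases):
    one_word = {}  # hash -> word, for one-word phrases
    multi = {}     # hash of any word -> [(hashes of the other words, phrase words)]
    for phrase in key_phrases:
        words = phrase.lower().split()
        hashes = [word_hash(w) for w in words]
        if len(words) == 1:
            one_word[hashes[0]] = words[0]
            continue
        for k, hw in enumerate(hashes):
            others = hashes[:k] + hashes[k + 1:]
            multi.setdefault(hw, []).append((others, words))
    return one_word, multi


def scan(text, one_word, multi):
    words = text.lower().split()
    found, candidates = [], []
    for i, w in enumerate(words):
        hw = word_hash(w)
        if one_word.get(hw) == w:  # hash hit confirmed by a textual match
            found.append(w)
        survivors = []
        for remaining, phrase_words, start in candidates:
            if hw not in remaining:
                continue  # delete the candidate
            if len(remaining) == 1:
                # last missing word seen: confirm with an order-free textual match
                if sorted(words[start:i + 1]) == sorted(phrase_words):
                    found.append(" ".join(phrase_words))
                continue
            rest = list(remaining)
            rest.remove(hw)
            survivors.append((rest, phrase_words, start))
        candidates = survivors
        # open new candidates for every key-phrase containing this word
        for others, phrase_words in multi.get(hw, []):
            candidates.append((others, phrase_words, i))
    return found
```

  • Words that hash to nothing in either table discard all open candidates and open none, which is the rapid clearance property mentioned earlier.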
  • The non-commutative version of the algorithm is substantially similar:
  • Key-Phrases Pre-Processing:
      • Compute hash value for each word in key phrases.
      • Build oneWordsPhrases: a hash table for the hash values of each one-word phrase.
      • Build multiWordsPhrases: a hash table for the hash values of each starting word in multi-word phrases.
      • Build multiWordsWords: a hash table for the hash values of each word in multi-word phrases.
      • For each word in multiWordsPhrases, add a hash set for each key-phrase starting with that word. The hash set contains ordered hashes of all other words in the phrase.
      • Associate the set with the text of the key-phrase in oneWordsPhrases and multiWordsPhrases.
  • Scanning & Analysis Phase:
  • Initialization:
      • “Canonize” Text
      • candidates: an empty set
      • i=0
  • Analysis:
  •   While i < number of words in the text
        Read word W(i)
        Evaluate the hash of W(i): H(W(i)) (e.g., using CRC32)
        Locate H(W(i)) in oneWordsPhrases (if it exists, do textual
        matching - compare with the actual verbatim)
        if exists:
          For each hash_set in candidates:
            If H(W(i)) not first of hash_set:
              delete hash_set
            Else if size(hash_set) = 1:
              delete hash_set
              do textual match
            Else:
              delete H(W(i)) from hash_set
        Append to candidates all hash_sets associated with H(W(i)) in
        multiWordsPhrases (they should not contain H(W(i)))
        i = i+1
      end
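  • A corresponding sketch of the non-commutative scan differs in two places: candidates are opened only by the starting word of a key-phrase, and the scanned word must equal the first remaining hash. The same assumptions apply as for any illustration of this pseudocode (whitespace tokenization, lower-casing as canonization, `zlib.crc32` as the hash), and the names `preprocess_ordered` and `scan_ordered` are hypothetical.

```python
import zlib


def word_hash(word):
    return zlib.crc32(word.encode("utf-8"))


def preprocess_ordered(key_phrases):
    one_word = {}  # hash -> word, for one-word phrases
    multi = {}     # hash of the starting word -> [(ordered hashes of the rest, phrase words)]
    for phrase in key_phrases:
        words = phrase.lower().split()
        hashes = [word_hash(w) for w in words]
        if len(hashes) == 1:
            one_word[hashes[0]] = words[0]
        else:
            multi.setdefault(hashes[0], []).append((hashes[1:], words))
    return one_word, multi


def scan_ordered(text, one_word, multi):
    words = text.lower().split()
    found, candidates = [], []
    for i, w in enumerate(words):
        hw = word_hash(w)
        if one_word.get(hw) == w:
            found.append(w)
        survivors = []
        for remaining, phrase_words, start in candidates:
            if remaining[0] != hw:
                continue  # not the next expected word: delete the candidate
            if len(remaining) == 1:
                if words[start:i + 1] == phrase_words:  # exact, ordered textual match
                    found.append(" ".join(phrase_words))
                continue
            survivors.append((remaining[1:], phrase_words, start))
        candidates = survivors
        for rest, phrase_words in multi.get(hw, []):
            candidates.append((rest, phrase_words, i))
    return found
```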
  • In a preferred embodiment of the present invention, the algorithm used for key-phrase identification comprises:
  • Pre-Processing phase: Each word is represented by its hash value. Each key-phrase is represented by a commutative (or non-commutative) hash of the hashes of the keywords that comprise that key-phrase. The commutative hash is simply the XOR of all the hashes of the words that constitute the phrase.
  • Scanning and filtering phase: The algorithm scans the words, evaluates the hash value of each word and utilizes a hash-table for an immediate rule-out of words that are not contained in the key-phrases. If the scanned word belongs to one or more key-phrases, the algorithm evaluates and checks the commutative hashes of bi-grams (two consecutive words), three-grams and so on, up to the maximum possible number of words in the key-phrases. In case of a match, the original key-phrase is retrieved and compared against the scanned text.
  • This scheme also allows for commutativity and fast clearance, and has a better worst-case behavior than the word-based hash-list. It is also easy to implement and to verify, though it may be slightly slower than the word-based hash-list in some cases. Reference is now made to FIG. 6, which is a flowchart illustrating the algorithm for fast detection of key-phrases, according to a preferred embodiment of the present invention.
  • The key-phrases pre-processing phase, 610, comprises:
  • Input: Key-Phrases and the Maximal Length of Phrase (“maxPhraseLength”)
  • Pre-Processing:
      • Compute hash value for each word in key phrases.
      • Build oneWordsPhrases: a hash table for the hash values of each one-word phrase.
      • Build multiWordsWord: a hash table for the hash values of each word in multi-word phrases.
      • Evaluate commutativeHash by XORing all the hash values of the words in multiWordsWord
      • Evaluate nonCommutativeHash, if required, by first adding the numerical value of the index wordLocationInPhrase (which can be just the order of the word in the phrase: “1” for the first word in the phrase, “2” for the second, etc.) to the hash values of the words in multiWordsWord, and then XORing all the resulting values.
      • Build multiWordsPhrases: a hash table for the hash values of each multi-word phrase.
      • Associate the hash values with the text of the key-phrase in oneWordsPhrases and multiWordsPhrases.
  • The scanning and analysis phase, 620, comprises:
  • Initialization:
      • chainLength=0
      • hashBuffer=[ ] // empty buffer
      • i=0
  • Analysis:
  •   While i < number of words in the text
        Read word W(i)
        Evaluate the hash of W(i): H(W(i)) (e.g., using CRC32)
        Locate H(W(i)) in oneWordsPhrases (if it exists, do textual match)
        if exists:
          hashBuffer += [H(W(i))] // insert H(W(i)) into the buffer
          chainLength += 1
          while chainLength <= maxPhraseLength:
            evaluate the commutative/non-commutative hash for
            hashBuffer
            check if it exists in hash-table multiWordsPhrases
            if exists:
              do textual match
            else:
              check possible matching with other initials of
              multiWordsPhrases in the buffer
              if there is a match, do textual match
        i = i+1
      end
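  • The hash-of-hashes scheme can be sketched as follows. This is a simplified illustration rather than a transcription of the pseudocode: instead of maintaining hashBuffer and chainLength, it recomputes the combined hash of every n-gram (up to the longest key-phrase) that starts at a word known to occur in some key-phrase, and it keeps one-word and multi-word phrases in a single table. The function names, the 32-bit masking after adding the word position, and the use of `zlib.crc32` are assumptions made for the example.

```python
import zlib
from functools import reduce


def word_hash(word):
    return zlib.crc32(word.encode("utf-8"))


def phrase_hash(words, commutative=True):
    hashes = [word_hash(w) for w in words]
    if not commutative:
        # add the 1-based position of each word to its hash before XORing
        hashes = [(h + pos) & 0xFFFFFFFF for pos, h in enumerate(hashes, start=1)]
    return reduce(lambda a, b: a ^ b, hashes)


def preprocess(key_phrases, commutative=True):
    table = {}           # combined hash -> list of phrase word lists
    known_words = set()  # hashes of every word that occurs in a key-phrase
    max_len = 0
    for phrase in key_phrases:
        words = phrase.lower().split()
        table.setdefault(phrase_hash(words, commutative), []).append(words)
        known_words.update(word_hash(w) for w in words)
        max_len = max(max_len, len(words))
    return table, known_words, max_len


def scan(text, table, known_words, max_len, commutative=True):
    words = text.lower().split()
    found = []
    for i, w in enumerate(words):
        if word_hash(w) not in known_words:
            continue  # immediate rule-out
        for n in range(1, max_len + 1):
            window = words[i:i + n]
            if len(window) < n:
                break  # ran past the end of the text
            for phrase_words in table.get(phrase_hash(window, commutative), []):
                # a hash hit is always confirmed by a textual match
                same = (sorted(window) == sorted(phrase_words) if commutative
                        else window == phrase_words)
                if same:
                    found.append(" ".join(phrase_words))
    return found
```

  • The textual match matters in the non-commutative case in particular, because adding small position indices before XORing does not by itself guarantee that a reordered phrase produces a different combined hash.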
  • In a preferred embodiment of the present invention, a state-machine (described, e.g., in David J. Comer, “Digital Logic and State Machine Design”, International Thomson Publishing; 3rd edition (June 1997), ISBN: 0030949041, the contents of which are hereby incorporated herein by reference in their entirety) is compiled such that each keyword or key-phrase becomes a regular expression that leaves the state-machine in an “accepting state”, thereby providing an efficient method to detect both keywords and key-phrases that contain more than one word.
  • In a preferred embodiment of the present invention, both the items in the inspected documents and the items in the list are sorted, and the comparison is performed between two sorted lists.
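  • A comparison between two sorted lists can be done in a single merge-style pass. The sketch below is a minimal illustration, assuming the items are plain strings and that duplicates can be dropped before sorting; the function name `sorted_intersection` is hypothetical.

```python
def sorted_intersection(doc_items, list_items):
    """Return the items common to both inputs using one pass over two sorted lists."""
    a, b = sorted(set(doc_items)), sorted(set(list_items))
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return common
```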
  • In a preferred embodiment of the present invention, the system includes a module that facilitates the automatic insertion of keywords and key-phrases into a keywords list, by comparing close documents with a different policy, and regarding the differences between the documents as a collection of “key-phrases”. For example, if in one standard contract the name of one of the sides is “John Doe” and in another contract the name is “Jane Smith”, then both “John Doe” and “Jane Smith” can be regarded as key-phrases. A method for comparing documents and obtaining their differences is described, e.g., in provisional patent application number 60/422,128.
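  • The contract example can be sketched with a word-level diff. The comparison method of the referenced provisional application is not reproduced here; Python's standard `difflib.SequenceMatcher` is used purely as a stand-in, and the function name `differing_phrases` is hypothetical.

```python
import difflib


def differing_phrases(doc_a, doc_b):
    """Return the word runs in which two otherwise similar documents differ."""
    a, b = doc_a.split(), doc_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    phrases = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if i2 > i1:
            phrases.append(" ".join(a[i1:i2]))  # run present only in doc_a
        if j2 > j1:
            phrases.append(" ".join(b[j1:j2]))  # run present only in doc_b
    return phrases
```

  • The returned runs are candidates only; they would still go through whatever approval step applies before being added to the keywords list.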
  • In a preferred embodiment of the present invention, the list of automatically detected keywords is further subjected to manual approval.
  • The present invention successfully addresses the shortcomings of the presently known configurations by providing a method and system for fast identification of keywords and key-phrases, which can efficiently serve current needs.
  • It is appreciated that one or more steps of any of the methods described herein may be implemented in a different order than that shown, while not departing from the spirit and scope of the invention.
  • While the present invention may or may not have been described with reference to specific hardware or software, the present invention has been described in a manner sufficient to enable persons having ordinary skill in the art to readily adapt commercially available hardware and software as may be needed to reduce any of the embodiments of the present invention to practice without undue experimentation and using conventional techniques.
  • Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.

Claims (28)

1. A method for detecting an information item within an information sequence obtained from a digital medium, said information item comprising any one of a specified set of prestored information items, comprising:
transforming each of said set of prestored information items into a respective representation, in accordance with a predetermined transformation format;
transforming said information sequence obtained from said digital medium, in accordance with said transformation format;
determining the presence of one or more of said prestored information items within said transformed information sequence, utilizing said respective representation, wherein said information items are divided into sets, applying a security policy upon the detection of said information item in said information sequence, and wherein said security policy depends on the number of detected information items that belong to the same set.
2. A method according to claim 1 wherein each of said sets comprises information items associated with a single individual.
3. A method according to claim 1, wherein a type of said information item comprises one of a group of types comprising: a word, a phrase, a number, a credit-card number, a social security number, a name, an address, an email address, and an account number.
4. A method according to claim 1, wherein said information sequence is provided over a digital traffic channel.
5. A method according to claim 4, wherein said digital traffic channel comprises one of a group of channels comprising: email, instant messaging, peer-to-peer network, fax, and a local area network.
6. A method according to claim 1, wherein said information sequence comprises the body of an email, or wherein said information sequence comprises an email attachment.
7. A method according to claim 1, further comprising retrieving said information sequence from a digital storage medium.
8. A method for detecting an information item within an information sequence obtained from a digital medium, said information item comprising any one of a specified set of prestored information items, comprising:
transforming each of said set of prestored information items into a respective representation, in accordance with a predetermined transformation format;
transforming said information sequence obtained from said digital medium, in accordance with said transformation format;
determining the presence of one or more of said prestored information items within said transformed information sequence, utilizing said respective representation, wherein said information item comprises a sequence of sub-items, wherein said sub-items are separated by delimiters, the delimiters being resilient to reordering of the sub-items, wherein said transforming comprises:
applying a first hashing function to assign a respective preliminary hash value to each sub-item within said information item; and
applying a second hashing function to assign a global hash value to said preliminary hash values of said sub-items,
wherein said information sequence comprises at least two sub-sequences, and wherein said determining comprises:
applying said first hashing function to assign a respective preliminary hash value to each of said sub-sequences;
applying said second hashing function to at least one of said preliminary hash values to assign a global hash value to at least one of said sub-sequences; and
comparing said global hash value to hash values of said sub-sequences, wherein said second hash function is invariant to reordering of at least two of said sub-sequences within a respective sub-item.
9. A method according to claim 8, wherein a sub-item comprises one of a group comprising: a word, a number, and a character string.
10. A method according to claim 8, wherein said determining comprises using a state machine operable to detect said sequence of delimited sub-items within said information sequence.
11. A method according to claim 8, wherein said determining comprises:
applying said first hashing function to assign a respective preliminary hash value to each of said sub-sequences;
applying said second hashing function to at least one of said preliminary hash values to assign a global hash value to at least one of said sub-sequences; and
comparing said global hash value to hash values of said sub-sequences.
12. A method according to claim 11, wherein said sub-sequences comprise one of a group comprising: a word, a number, and a character string.
13. A method according to claim 11, wherein said plurality of sub-sequences comprises a plurality of ordered combinations of sub-sequences within said data sequence.
14. A method according to claim 12, wherein said plurality of series comprises a plurality of combinations of sub-sequences within said data sequence.
15. A method according to claim 8, further comprising checking whether said delimited segment was previously stored, and continuing said detection process only if the current delimited segment was previously stored.
16. A method according to claim 8, wherein a type of said information item comprises one of a group of types comprising: a word, a phrase, a number, a credit-card number, a social security number, a name, an address, an email address, and an account number.
17. A method according to claim 8, wherein said information sequence is provided over a digital traffic channel.
18. A method according to claim 17, wherein said digital traffic channel comprises one of a group of channels comprising: email, instant messaging, peer-to-peer network, fax, and a local area network.
19. A method according to claim 8, wherein said information sequence comprises the body of an email, or wherein said information sequence comprises an email attachment.
20. A method according to claim 8, further comprising retrieving said information sequence from a digital storage medium.
21. A method for detecting an information item within an information sequence obtained from a digital medium, said information item comprising any one of a specified set of prestored information items, comprising:
transforming each of said set of prestored information items into a respective representation, in accordance with a predetermined transformation format;
transforming said information sequence obtained from said digital medium, in accordance with said transformation format;
determining the presence of one or more of said prestored information items within said transformed information sequence, utilizing said respective representation,
further comprising applying a policy upon the detection of said information item in said information sequence, wherein said policy is a security policy, said security policy comprises at least one of the following group of security policies: blocking transmission, logging a record of said detection and detection details, and reporting said detection and detection details.
22. An apparatus for detecting an information item within an information sequence, said information item being any one of a specified set of data items, comprising:
a preprocessor, for transforming said information item into a representation, in accordance with a transformation format;
a scanner, for scanning said information sequence to identify sub-sequences; and
a comparator associated with said preprocessor and said scanner, for comparing said representation to said sub-sequences to determine the presence of said specified information item within said information sequence; and
a non-existence module comprising:
an encoder, for encoding said sub-sequences and said data item with an encoding function to respective integers, each of said integers being no greater than the size of said array; and
an array setter associated with said encoder, for setting indicators in an array of indicators in accordance with said encoded sub-sequences; and
a status checker associated with said encoder and said array setter, for determining the status of an indicator corresponding to said data item.
23. The apparatus of claim 22, wherein said array of indicators is associated with a respective encoding function for encoding a data item into an integer no greater than the size of said respective array.
24. The apparatus of claim 23, wherein each member of said list is encodable with said respective encoding function; and wherein
a corresponding indicator is settable for each of said encoded members;
said specified data item is encodable with each of said encoding functions; and
the apparatus being configured to determine for each of said encoded data items, the status of the corresponding indicator in said array.
25. A method for determining the absence of a specified data item from a list of data items, comprising:
providing a plurality of initialized arrays of indicators, each of said arrays being associated with a respective encoding function for encoding a data item into an integer no greater than the size of said respective array;
for each of said arrays, performing:
encoding each member of said list with said respective encoding function; and
setting a corresponding indicator for each of said encoded members;
encoding said specified data item with each of said encoding functions; and
for each of said encoded data items, determining the status of the corresponding indicator in said respective array.
26. A method according to claim 25, wherein the size of each of said arrays is greater than the number of items in said list.
27. A method according to claim 25, wherein at least one of said encoding functions comprises a hashing function.
28. A method according to claim 25, wherein a data item comprises a string of alphanumeric characters.
US13/172,998 2003-04-02 2011-06-30 Method and a system for information identification Abandoned US20110264637A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US45937203P 2003-04-02 2003-04-02
US10/815,764 US7991751B2 (en) 2003-04-02 2004-04-02 Method and a system for information identification
US13/172,998 US20110264637A1 (en) 2003-04-02 2011-06-30 Method and a system for information identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/172,998 US20110264637A1 (en) 2003-04-02 2011-06-30 Method and a system for information identification

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/815,764 Division US7991751B2 (en) 2003-04-02 2004-04-02 Method and a system for information identification

Publications (1)

Publication Number Publication Date
US20110264637A1 US20110264637A1 (en) 2011-10-27

Family

ID=33101332

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/815,764 Active 2025-12-25 US7991751B2 (en) 2003-04-02 2004-04-02 Method and a system for information identification
US13/172,998 Abandoned US20110264637A1 (en) 2003-04-02 2011-06-30 Method and a system for information identification

Country Status (1)

Country Link
US (2) US7991751B2 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7143290B1 (en) * 1995-02-13 2006-11-28 Intertrust Technologies Corporation Trusted and secure techniques, systems and methods for item delivery and execution
US5778395A (en) * 1995-10-23 1998-07-07 Stac, Inc. System for backing up files from disk volumes on multiple nodes of a computer network
US6233589B1 (en) * 1998-07-31 2001-05-15 Novell, Inc. Method and system for reflecting differences between two files
US8176563B2 (en) * 2000-11-13 2012-05-08 DigitalDoors, Inc. Data security system and method with editor
US7107464B2 (en) * 2001-07-10 2006-09-12 Telecom Italia S.P.A. Virtual private network mechanism incorporating security association processor
US7134041B2 (en) * 2001-09-20 2006-11-07 Evault, Inc. Systems and methods for data backup over a network
US7873899B2 (en) * 2002-10-04 2011-01-18 Oracle International Corporation Mapping schemes for creating and storing electronic documents

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9288184B1 (en) 2013-05-16 2016-03-15 Wizards Of The Coast Llc Distributed customer data management network handling personally identifiable information
US9959397B1 (en) 2013-05-16 2018-05-01 Wizards Of The Coast Llc Distributed customer data management network handling personally identifiable information

Also Published As

Publication number Publication date
US20040199490A1 (en) 2004-10-07
US7991751B2 (en) 2011-08-02

Similar Documents

Publication Title
Sacks-Davis et al. Multikey access methods based on superimposed coding techniques
KR0160215B1 (en) High volume document image archive system and method
JP5240475B2 (en) Approximate pattern matching method and apparatus
US8463800B2 (en) Attributes of captured objects in a capture system
US6549957B1 (en) Apparatus for preventing automatic generation of a chain reaction of messages if a prior extracted message is similar to current processed message
US8010689B2 (en) Locational tagging in a capture system
US10250640B2 (en) Information infrastructure management data processing tools with tags, configurable filters and output functions
US7593938B2 (en) Systems and methods of directory entry encodings
US8683035B2 (en) Attributes of captured objects in a capture system
US6741743B2 (en) Imaged document optical correlation and conversion system
Yang et al. Near-duplicate detection by instance-level constrained clustering
Roussev Data fingerprinting with similarity digests
US7574409B2 (en) Method, apparatus, and system for clustering and classification
US20040199594A1 (en) Apparatus, methods and articles of manufacture for intercepting, examining and controlling code, data and files and their transfer
US8806615B2 (en) System and method for protecting specified data combinations
US8700561B2 (en) System and method for providing data protection workflows in a network environment
US8650199B1 (en) Document similarity detection
US20130304761A1 (en) Digital Information Infrastruture and Method for Security Designated Data and with Granular Data Stores
US7747642B2 (en) Matching engine for querying relevant documents
US20020147734A1 (en) Archiving method and system
Seo et al. Local text reuse detection
US20080059420A1 (en) System and Method for Providing a Trustworthy Inverted Index to Enable Searching of Records
JP4824352B2 (en) Method and system for detecting when an outgoing communication contains specific content
US7426752B2 (en) System and method for order-preserving encryption for numeric data
US7503070B1 (en) Methods and systems for enabling analysis of communication content while preserving confidentiality

Legal Events

Date Code Title Description
AS Assignment

Owner name: PORTAUTHORITY TECHNOLOGIES INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PELED, ARIEL;CARNY, OFIR;TROYANSKY, LIDROR;SIGNING DATES FROM 20061213 TO 20061214;REEL/FRAME:026610/0808

AS Assignment

Owner name: JPMORGAN CHASE BANK, N.A., DELAWARE

Free format text: FIRST LIEN SECURITY AGREEMENT;ASSIGNORS:WEBSENSE, INC.;PORTAUTHORITY TECHNOLOGIES, INC.;REEL/FRAME:030694/0615

Effective date: 20130625

AS Assignment

Owner name: ROYAL BANK OF CANADA, CANADA

Free format text: SECOND LIEN SECURITY AGREEMENT;ASSIGNORS:WEBSENSE, INC.;PORTAUTHORITY TECHNOLOGIES, INC.;REEL/FRAME:030704/0374

Effective date: 20130625

AS Assignment

Owner name: ROYAL BANK OF CANADA, AS SUCCESSOR COLLATERAL AGEN

Free format text: ASSIGNMENT OF SECURITY INTEREST;ASSIGNOR:JPMORGAN CHASE BANK, N.A., AS EXISTING COLLATERAL AGENT;REEL/FRAME:032716/0916

Effective date: 20140408

AS Assignment

Owner name: WEBSENSE, INC., TEXAS

Free format text: RELEASE OF SECOND LIEN SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME: 30704/0374;ASSIGNOR:ROYAL BANK OF CANADA, AS COLLATERAL AGENT;REEL/FRAME:035801/0689

Effective date: 20150529

Owner name: PORT AUTHORITY TECHNOLOGIES, INC., CALIFORNIA

Free format text: RELEASE OF SECOND LIEN SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME: 30704/0374;ASSIGNOR:ROYAL BANK OF CANADA, AS COLLATERAL AGENT;REEL/FRAME:035801/0689

Effective date: 20150529

Owner name: WEBSENSE, INC., TEXAS

Free format text: RELEASE OF FIRST LIEN SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME: 030694/0615;ASSIGNOR:ROYAL BANK OF CANADA, AS COLLATERAL AGENT;REEL/FRAME:035858/0680

Effective date: 20150529

Owner name: PORT AUTHORITY TECHNOLOGIES, INC., CALIFORNIA

Free format text: RELEASE OF FIRST LIEN SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME: 030694/0615;ASSIGNOR:ROYAL BANK OF CANADA, AS COLLATERAL AGENT;REEL/FRAME:035858/0680

Effective date: 20150529

AS Assignment

Owner name: RAYTHEON COMPANY, MASSACHUSETTS

Free format text: PATENT SECURITY AGREEMENT;ASSIGNORS:WEBSENSE, INC.;RAYTHEON OAKLEY SYSTEMS, LLC;RAYTHEON CYBER PRODUCTS, LLC (FORMERLY KNOWN AS RAYTHEON CYBER PRODUCTS, INC.);AND OTHERS;REEL/FRAME:035859/0282

Effective date: 20150529

AS Assignment

Owner name: PORTAUTHORITY TECHNOLOGIES, LLC, TEXAS

Free format text: CHANGE OF NAME;ASSIGNOR:PORTAUTHORITY TECHNOLOGIES, INC.;REEL/FRAME:039609/0877

Effective date: 20151230

AS Assignment

Owner name: FORCEPOINT LLC, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:PORTAUTHORITY TECHNOLOGIES, LLC;REEL/FRAME:043156/0759

Effective date: 20170728

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION