GB2440174A - Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases - Google Patents

Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases

Publication number
GB2440174A
Authority
GB
United Kingdom
Prior art keywords
document
electronic
words
phrases
dividing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0614332A
Other versions
GB0614332D0 (en)
Inventor
Stephen Robinson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chronicle Solutions UK Ltd
Original Assignee
Chronicle Solutions UK Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chronicle Solutions UK Ltd
Priority to GB0614332A (GB2440174A)
Publication of GB0614332D0
Priority to PCT/GB2007/050419 (WO2008009991A1)
Publication of GB2440174A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/2211

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An electronic system for automatically comparing the similarity of a first electronic document with a second electronic document, comprising processing the electronic data representing each document by: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue, or stop, words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared. The document may be captured on a network and the value may be a percentage or represented on a logarithmic scale.

Description

<p>SYSTEM</p>
<p>The present invention relates to a system for electronically comparing documents.</p>
<p>With the ever increasing dependency of organisations on computers to organise, store and communicate documentation, it is increasingly desirable to be able to control the distribution of electronic documentation. For example, it is desirable for an organisation to know who has received confidential documents, and/or to prevent unauthorised persons receiving them. Electronic documents are usually transmitted over a network in a packet form so the actual data content of the document is mixed in each packet with other electronic information such as that which determines the type of data, routing information, check data and timing information. Thus, to determine the data content, the packets must be decoded.</p>
<p>One method currently used to monitor the electronic transmission of specified documents in data traffic over a network is to attach a packet sniffer (also known as a network or protocol analyzer or Ethernet sniffer) to the network. Such a packet sniffer copies each of the data packets transmitted over the network, stores the packets and subsequently decodes and analyses the content for comparison to one or more predetermined files, looking for a match.</p>
<p>A problem with known packet sniffers is that they are only able to match identical documents. If a document is modified or paraphrased in any way then the transmitted document would not match the watched document and therefore no action will be taken. In addition, known packet sniffers are slow and generally unable to operate fully in real time.</p>
<p>The present invention seeks to provide a system of comparing documents and determining how similar their content is to one or more pre-selected documents.</p>
<p>According to the first aspect of the present invention there is provided an electronic system for automatically comparing the similarity of a first electronic document with a second electronic document, the system comprising elements arranged for electronically processing the electronic data representing each document by: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.</p>
<p>Preferably each document is divided at electronic signals representing one or more of the punctuation symbols such as full stop, comma, semi-colon, colon, single quote, double quote, question mark and exclamation mark. The term "words" is intended to mean any continuous sequence of alphanumeric characters such as letters A to Z and numbers 0 to 9.</p>
<p>The invention has particular application in computer networks for comparing documents transferred over such networks in real time.</p>
<p>The comparison value may be generated as a percentage for ease of interpretation and is preferably converted to a logarithmic scale. The system can be adapted automatically to take action if the comparison value is higher than a predetermined value.</p>
<p>A corresponding method, computer program and computer are also provided.</p>
<p>The invention will now be described, by way of example, with reference to the accompanying drawing in which the single figure is a block diagram of a method according to the present invention for comparing the similarity of electronic documents.</p>
<p>The single figure illustrates the steps involved in comparing electronic documents according to the system of the present invention. Electronic data representing a document to be tested is presented in a comparison step 10. The electronic data may be captured as raw packet data transmitted across a network, then decoded and recombined into its original form. If the document originates from network traffic then the comparison system should ideally be capable of processing each document in real-time, so that a backlog of documents does not build up. However, the invention can also be used in stand-alone mode to compare documents.</p>
<p>The data representing the document is then divided in step 12 by dividing the text into phrases at punctuation symbols preferably including: full stop, comma, semi-colon, colon, single quotation marks, double quotation marks, question mark and exclamation mark.</p>
<p>Each of the phrases is then sub-divided into words in step 14, where words are defined as any continuous sequence of alphanumeric characters, i.e. of letters A to Z and/or numbers 0 to 9. The phrases will thus be split into words at any character outside these ranges, and all such characters are considered white-space and discarded in dividing the phrases into words.</p>
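<p>Steps 12 and 14 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function names and the sample sentence are hypothetical, and only the list of punctuation marks and the definition of a word are taken from the description.</p>

```python
import re

# Punctuation marks the description lists as preferred phrase boundaries.
PHRASE_BREAKS = ".,;:'\"?!"


def split_phrases(text):
    """Step 12: split the text at every listed punctuation mark."""
    pattern = "[" + re.escape(PHRASE_BREAKS) + "]"
    # Drop pieces that are empty or contain only white-space.
    return [piece for piece in re.split(pattern, text) if piece.strip()]


def split_words(phrase):
    """Step 14: a word is a maximal run of letters A-Z and digits 0-9.

    Every other character acts as a separator and is discarded.
    """
    return re.findall(r"[A-Za-z0-9]+", phrase)


phrases = split_phrases("The report is ready; send it to Bob, please!")
words = [split_words(p) for p in phrases]
```

<p>Because the single quotation mark is a phrase boundary, a contraction such as "don't" is split into two phrases under this reading of the description.</p>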
<p>Then at step 16 each of the words in each of the phrases is examined and all glue words are discarded. Glue words are those which do not add any intrinsic subject matter, such as, for example: a, the, and it. This advantageously reduces the number of words without significantly affecting the content.</p>
<p>Next at stage 18, the remaining words in each of the phrases are sorted into alphabetical order so that their position in the phrase is no longer important.</p>
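<p>A sketch of steps 16 and 18 is given below. The glue-word list contains only the three examples named in the description, and lower-casing the words before comparison is an assumption of this sketch; the description does not say how letter case is handled.</p>

```python
# Hypothetical glue-word list: only "a", "the" and "it" are named in the text.
GLUE_WORDS = {"a", "the", "it"}


def normalise_phrase(words, glue_words=GLUE_WORDS):
    """Steps 16 and 18: drop glue words, then sort what remains.

    Lower-casing is an assumption made here so that "The" and "the"
    are treated as the same word.
    """
    kept = [w.lower() for w in words if w.lower() not in glue_words]
    return sorted(kept)


# Two phrases with the same content words in a different order
# normalise to the same list.
first = normalise_phrase(["The", "cat", "chased", "a", "mouse"])
second = normalise_phrase(["A", "mouse", "the", "cat", "chased"])
```

<p>Sorting is what makes the later comparison insensitive to word order within a phrase, so a light rewording of a sentence still yields the same normalised phrase.</p>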
<p>Each alphabetically sorted phrase is then used to generate a hash code in stage 20.</p>
<p>Any hash algorithm could be used to create the hash code; for example, the MD4 algorithm would be suitable. It is advantageous to use a hash code to represent the contents of the phrase because it requires significantly less processing time to compare hash codes than to compare the alphabetically sorted phrases.</p>
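<p>Stage 20 might look like the sketch below. MD5 is used here in place of the MD4 algorithm mentioned above, because MD4 is not available in every build of Python's hashlib; the description allows any hash algorithm. Joining the sorted words with a single space and encoding as UTF-8 are assumptions of this sketch.</p>

```python
import hashlib


def phrase_hash(sorted_words):
    """Stage 20: hash one alphabetically sorted phrase.

    MD5 stands in for MD4 here; the join character and the text
    encoding are illustrative choices, not taken from the source.
    """
    key = " ".join(sorted_words).encode("utf-8")
    return hashlib.md5(key).hexdigest()


same_a = phrase_hash(["cat", "chased", "mouse"])
same_b = phrase_hash(["cat", "chased", "mouse"])
other = phrase_hash(["cat", "chased", "rat"])
```

<p>Equal phrases always give equal digests, so comparing two phrases reduces to comparing two fixed-length strings.</p>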
<p>Once the hash codes for each phrase in a document have been created, they can be compared with the hash codes of one or more other documents at stage 22, for example with pre-selected documents whose hash codes are already stored in the system. Such pre-selected documents may consist of documents that the administrator of the system has identified, for example confidential or classified documents. The pre-selected documents are processed in the same way as the document being tested, before the system receives the first document to be tested.</p>
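<p>The comparison at stage 22 against pre-hashed watched documents can be sketched as below. The document name, the sample texts and the use of a set of hashes per document are all assumptions for illustration; MD5 again stands in for MD4, and the glue-word list is the hypothetical one built from the examples in the description.</p>

```python
import hashlib
import re

GLUE_WORDS = {"a", "the", "it"}  # hypothetical list


def document_hashes(text):
    """Run steps 12 to 20 and return the set of phrase hashes."""
    hashes = set()
    for phrase in re.split(r"[.,;:'\"?!]", text):
        words = sorted(
            w for w in re.findall(r"[a-z0-9]+", phrase.lower())
            if w not in GLUE_WORDS
        )
        if words:  # skip phrases left empty after glue-word removal
            key = " ".join(words).encode("utf-8")
            hashes.add(hashlib.md5(key).hexdigest())
    return hashes


# Watched documents are hashed once, before any traffic is tested.
watched = {
    "merger-plan": document_hashes(
        "The merger closes in March. Tell nobody, please."
    ),
}


def count_matches(text):
    """Stage 22: count phrase hashes shared with each watched document."""
    tested = document_hashes(text)
    return {name: len(tested & stored) for name, stored in watched.items()}


result = count_matches("Please tell nobody: in March the merger closes!")
```

<p>In this example only the reordered "merger" phrase matches; "Please tell nobody" does not, because the watched text splits "Tell nobody" and "please" into separate phrases at the comma.</p>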
<p>The result of the comparison between documents is preferably displayed in the form of a percentage value representing the number of matching phrases in the two documents. The more matching hash codes there are in two documents being compared, the more likely it is that the documents are related. It is advantageous to display the percentages on a logarithmic scale since this makes it easier to visually identify similarities. It is also possible for the system to be configured to flag any documents which have a percentage match over a given threshold, so that appropriate action can be taken quickly, either manually, or automatically by the system, for example to block further dissemination of the document, to block future transmission of it, or to trace and log the source of the document.</p>
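<p>One way to turn the matches into a percentage, a logarithmic value and a flag is sketched below. The description does not fix the denominator of the percentage, the form of the logarithmic scale or the threshold, so all three are assumptions: the percentage is taken over the tested document's hashes, the scale maps 0 to 100 onto 0 to 1, and the default threshold of 50 is arbitrary.</p>

```python
import math


def match_percentage(tested, watched):
    """Share of the tested document's phrase hashes found in the watched one."""
    if not tested:
        return 0.0
    return 100.0 * len(tested & watched) / len(tested)


def log_scale(pct):
    """Map 0..100 onto 0..1 logarithmically (an illustrative choice)."""
    return math.log10(1.0 + pct) / math.log10(101.0)


def should_flag(pct, threshold=50.0):
    """Flag only when the value is strictly higher than the threshold."""
    return pct > threshold


# Placeholder hash values; real ones would come from the hashing stage.
tested = {"h1", "h2", "h3", "h4"}
watched = {"h2", "h3", "h9"}
pct = match_percentage(tested, watched)  # 2 of the 4 tested hashes match
```

<p>A logarithmic scale of this kind spreads out the low percentages, which is where partial copies of a watched document would otherwise be hard to tell apart.</p>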

Claims (1)

  1. <p>CLAIMS</p>
    <p>1. An electronic system for automatically comparing the similarity of a first electronic document with a second electronic document, the system comprising elements arranged for electronically processing the electronic data representing each document by: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.</p>
    <p>2. A system according to claim 1 wherein each document is divided into phrases by dividing the electronic document text at each of the electronic signals representing the punctuation symbols: full stop, comma, semi-colon, colon, single quote, double quote, question mark and exclamation mark.</p>
    <p>3. A system according to claim 1 or 2 wherein words are defined as any continuous sequence of alphanumeric characters comprising letters A to Z and numbers 0 to 9.</p>
    <p>4. A system according to any one of the preceding claims wherein prior to dividing the document into phrases, the document is captured from a network.</p>
    <p>5. A system according to claim 4 adapted to perform the comparing in real-time.</p>
    <p>6. A system according to any one of the preceding claims comprising an additional element for taking action if the value of a document comparison is higher than a predetermined value.</p>
    <p>7. A system according to any one of the preceding claims wherein the value is generated as a percentage.</p>
    <p>8. A system according to claim 7 comprising an additional element for converting the percentage value results to a logarithmic scale.</p>
    <p>9. A method of comparing the similarity of a first electronic document with a second electronic document comprising the steps of: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.</p>
    <p>10. A computer adapted to compare the similarity of a first electronic document with a second electronic document, comprising processing elements arranged for processing the electronic data representing each document by: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.</p>
    <p>11. A computer program arranged so that when loaded on a computer the computer will automatically compare the similarity of a first electronic document with a second electronic document by performing the steps of: (a) dividing the document into phrases by splitting the electronic data at each electronic signal representing a punctuation mark; (b) dividing each of the phrases into words; (c) discarding electronic signals representing glue words; (d) within each phrase sorting the words into alphabetical order; (e) generating a hash code for each alphabetically ordered phrase; and (f) comparing the hash codes for the first document with the hash codes of the second document; and (g) generating a value indicating the proportion of hash codes which match in the first and second documents being compared.</p>
GB0614332A 2006-07-19 2006-07-19 Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases Withdrawn GB2440174A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0614332A GB2440174A (en) 2006-07-19 2006-07-19 Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases
PCT/GB2007/050419 WO2008009991A1 (en) 2006-07-19 2007-07-19 Document similarity system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0614332A GB2440174A (en) 2006-07-19 2006-07-19 Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases

Publications (2)

Publication Number Publication Date
GB0614332D0 GB0614332D0 (en) 2006-08-30
GB2440174A true GB2440174A (en) 2008-01-23

Family

ID=36998334

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0614332A Withdrawn GB2440174A (en) 2006-07-19 2006-07-19 Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases

Country Status (2)

Country Link
GB (1) GB2440174A (en)
WO (1) WO2008009991A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11177945B1 (en) 2020-07-24 2021-11-16 International Business Machines Corporation Controlling access to encrypted data

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4341705B2 (en) * 2007-07-17 2009-10-07 トヨタ自動車株式会社 In-vehicle image processing device
US8549327B2 (en) 2008-10-27 2013-10-01 Bank Of America Corporation Background service process for local collection of data in an electronic discovery system
US8250037B2 (en) 2009-03-27 2012-08-21 Bank Of America Corporation Shared drive data collection tool for an electronic discovery system
US8224924B2 (en) 2009-03-27 2012-07-17 Bank Of America Corporation Active email collector
US8417716B2 (en) 2009-03-27 2013-04-09 Bank Of America Corporation Profile scanner
US8364681B2 (en) 2009-03-27 2013-01-29 Bank Of America Corporation Electronic discovery system
US8504489B2 (en) 2009-03-27 2013-08-06 Bank Of America Corporation Predictive coding of documents in an electronic discovery system
US8200635B2 (en) 2009-03-27 2012-06-12 Bank Of America Corporation Labeling electronic data in an electronic discovery enterprise system
US9330374B2 (en) 2009-03-27 2016-05-03 Bank Of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US9721227B2 (en) 2009-03-27 2017-08-01 Bank Of America Corporation Custodian management system
US8572227B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Methods and apparatuses for communicating preservation notices and surveys
US8572376B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Decryption of electronic communication in an electronic discovery enterprise system
US8806358B2 (en) 2009-03-27 2014-08-12 Bank Of America Corporation Positive identification and bulk addition of custodians to a case within an electronic discovery system
RU2420791C1 (en) * 2009-10-01 2011-06-10 ЗАО "Лаборатория Касперского" Method of associating previously unknown file with collection of files depending on degree of similarity
US8244767B2 (en) 2009-10-09 2012-08-14 Stratify, Inc. Composite locality sensitive hash based processing of documents
US9355171B2 (en) 2009-10-09 2016-05-31 Hewlett Packard Enterprise Development Lp Clustering of near-duplicate documents
US9053454B2 (en) 2009-11-30 2015-06-09 Bank Of America Corporation Automated straight-through processing in an electronic discovery system
US11797486B2 (en) 2022-01-03 2023-10-24 Bank Of America Corporation File de-duplication for a distributed database

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000007094A2 (en) * 1998-07-31 2000-02-10 The Regents Of The University Of California Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment
WO2002010967A2 (en) * 2000-07-31 2002-02-07 Iit Research Institute System for similar document detection
US20020172425A1 (en) * 2001-04-24 2002-11-21 Ramarathnam Venkatesan Recognizer of text-based work

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6547829B1 (en) * 1999-06-30 2003-04-15 Microsoft Corporation Method and system for detecting duplicate documents in web crawls
US6658423B1 (en) * 2001-01-24 2003-12-02 Google, Inc. Detecting duplicate and near-duplicate files

Also Published As

Publication number Publication date
GB0614332D0 (en) 2006-08-30
WO2008009991A1 (en) 2008-01-24

Similar Documents

Publication Publication Date Title
GB2440174A (en) Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases
US8625642B2 (en) Method and apparatus of network artifact indentification and extraction
US7594277B2 (en) Method and system for detecting when an outgoing communication contains certain content
CN111695033A (en) Enterprise public opinion analysis method, device, electronic equipment and medium
US7543076B2 (en) Message header spam filtering
US8838599B2 (en) Efficient lexical trending topic detection over streams of data using a modified sequitur algorithm
US20150033120A1 (en) System, process and method for the detection of common content in multiple documents in an electronic system
US20120215853A1 (en) Managing Unwanted Communications Using Template Generation And Fingerprint Comparison Features
US20090043853A1 (en) Employing pixel density to detect a spam image
US20190057148A1 (en) Method and equipment for determining common subsequence of text strings
JP2011129161A (en) Duplicate document detection and presentation functions
US9542474B2 (en) Forensic system, forensic method, and forensic program
CN110222513B (en) Abnormality monitoring method and device for online activities and storage medium
CN110674529A (en) Document auditing method and document auditing device based on data security information
US10042825B2 (en) Detection and elimination for inapplicable hyperlinks
US9411877B2 (en) Entity-driven logic for improved name-searching in mixed-entity lists
US9235624B2 (en) Document similarity evaluation system, document similarity evaluation method, and computer program
JP5094487B2 (en) Information leakage inspection apparatus, computer program, and information leakage inspection method
CN113312504A (en) Management method, device, equipment and medium for content audit project
US10210241B2 (en) Full text indexing in a database system
JP2011150388A (en) System for converting file storage destination path based on secrecy section information, and method
CN113992668B (en) Information real-time transmission method, device, equipment and medium based on multiple concurrences
CN111325629A (en) Enterprise investment value evaluation method, device, server and readable storage medium
CN111045983A (en) Nuclear power station electronic file management method and device, terminal equipment and medium
US7779351B2 (en) Coloring a generated document by replacing original colors of a source document paragraph with colors to identify the paragraph and with colors to mark color boundries

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)