WO2015183127A1 - Data segmentation method - Google Patents

Data segmentation method

Info

Publication number
WO2015183127A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
functions
predicate
function
sliding window
Prior art date
Application number
PCT/RU2014/000400
Other languages
English (en)
Russian (ru)
Inventor
Леонид Валерьевич ЮРЬЕВ
Original Assignee
Общество С Ограниченной Ответственностью "Петер-Сервис Рнд"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Общество С Ограниченной Ответственностью "Петер-Сервис Рнд"
Priority to PCT/RU2014/000400
Publication of WO2015183127A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • the invention relates to the field of computer technology, automated and information systems, automated analysis of electronic documents and information flows.
  • the invention can be used to develop new and improve existing systems for checking electronic documents or digital data streams for the presence of fragments or citations from other (reference) documents or files, including both full repetition of content and multiple repetition of individual fragments in random order.
  • a missed defect, namely that the distance from the last boundary of the selected segment also affects the result of the proposed predicate function.
  • the generated segment boundaries depend not only on the contents of the sliding window, but also on all data from the previous segment boundary. Therefore, when minor local changes are made to the beginning of the data stream, another segment map can be obtained.
  • the described method also has the aforementioned disadvantages, which are significant when searching for confidential and other important information.
  • the objective of this invention is to obtain a universal segmentation method that makes it possible to efficiently identify matching fragments between digital documents containing arbitrary information, with a clear result in the form of match boundaries, without access to the original reference documents, while overcoming the significant limitations and disadvantages of the known solutions described above.
  • the technical result of this invention is improved accuracy and effectiveness in identifying matching fragments between digital documents.
  • the method of data segmentation includes the following steps: the user forms an ordered set of discrete and predicate functions based on their characteristics, taking into account the necessary segmentation parameters; the processed data are received from at least one data source; the sliding window is then shifted within the limits of the processed data, and the value of each function from the above set is determined on the data corresponding to the current position of the sliding window; after which the boundaries of the data segment are determined based on the order and values of the functions obtained in the previous step.
  • the boundaries of the data segment are determined based on the order and values of the functions obtained in the previous step for at least the last M positions, choosing the first “true” value as the boundary in accordance with the forward direction of movement and the order of the functions, where M is the desired maximum length of the segment boundary set by the user at the stage of forming the ordered set of discrete and predicate functions.
  • the predicate function is a detector of signatures (key and / or characteristic byte sequences) of data formats, files and documents.
  • the predicate function is a detector of the format and / or data structure based on static characteristics and optional heuristics.
  • the predicate function is a detector that distinguishes between the text (values and byte sequences corresponding to printed characters) and non-text information.
  • the predicate function is a detector that recognizes information in natural languages by the presence of characteristic byte sequences (n-grams).
  • the predicate function is a detector that recognizes a change in encoding (content-charset) for text information.
  • the predicate function is a detector that recognizes content-encoding for arbitrary information.
  • the predicate function is a detector that recognizes the nature of the distribution of byte values in a sliding window.
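The detector types listed above can be illustrated with a minimal sketch. The function names, the use of ASCII printable characters as the definition of "text", and the repeated-byte test are assumptions made for illustration only; the source does not specify how these detectors are implemented.

```python
import string

# Assumption: "text" means bytes that are ASCII printable characters or whitespace.
PRINTABLE = frozenset(string.printable.encode("ascii"))


def is_text(window: bytes) -> bool:
    """True when every byte in the window is an ASCII printable character."""
    return all(b in PRINTABLE for b in window)


def text_transition(prev_window: bytes, window: bytes) -> bool:
    """Predicate: true when the data changes between text and non-text
    from one sliding-window position to the next."""
    return is_text(prev_window) != is_text(window)


def is_repeated_byte(window: bytes) -> bool:
    """Predicate on the byte distribution: true when the window holds
    a single byte value repeated throughout."""
    return len(window) > 0 and len(set(window)) == 1
```

Each function returns a plain boolean, so any of them could take a place in the ordered set of predicate functions.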
  • the invention may be implemented as a device comprising: a first unit configured to obtain the desired maximum length of the segment boundary M, to form an ordered set of discrete and predicate functions based on their characteristics, taking into account the necessary segmentation parameters, and to transmit this set to a second unit; the second unit, configured to receive the processed data, to shift the sliding window within the received data while determining the value of each function from the above set on the data corresponding to the current position of the sliding window, and to transmit the obtained values and the order of the functions to a third unit; and the third unit, configured to determine the boundary of the data segment based on the received order and values of the functions for at least the last M positions, the first “true” value being selected as the boundary in accordance with the forward direction of the sliding window and the order of the functions.
  • the units are connected in series.
  • This device can be implemented as a hardware module or virtualized device.
  • By virtualization we mean emulation or imitation of this device on a computer system.
  • the invention can be implemented as a data segmentation system, including:
  • one or more command processing devices, one or more data storage devices, and one or more programs, where the one or more programs are stored on the one or more data storage devices and executed on the one or more processors, and the one or more programs include the following instructions: the user forms an ordered set of discrete and predicate functions based on their characteristics, taking into account the necessary segmentation parameters; the processed data are then received from at least one data source; the sliding window is shifted within the processed data, and the value of each function from the above set is determined on the data corresponding to the current position of the sliding window; then the boundaries of the data segment are determined based on the order and values of the functions obtained in the previous step.
  • the one or more programs additionally contain the following instructions: determine the boundaries of the data segment based on the order and values of the functions obtained in the previous step for at least the last M positions, choosing the first “true” value as the boundary in accordance with the forward direction of movement and the order of the functions, where M is the desired maximum length of the segment boundary specified by the user at the stage of forming the ordered set of discrete and predicate functions.
  • the present invention in its various embodiments can be implemented in the form of a method implemented on a computer, in the form of a system or computer-readable medium containing instructions for performing the aforementioned method.
  • the method of data segmentation includes the following steps:
  • the necessary segmentation parameters are determined.
  • the user generates an ordered set of discrete and predicate functions based on their characteristics, taking into account the necessary segmentation parameters;
  • a typical example of a suitable predicate function computes a “false / true” value, returning “true” when the hash of the window contents is equal to zero modulo a prime number.
  • the properties of the selected combination of the hashing function and the modulo reduction operation, namely the size of the hash result and the numerical value of the modulus, determine the size of the set of input values for which the result of the predicate function will be “true”.
  • as suitable hashing functions, the well-known MD5, RIPE-MD, SHA1, SHA256, etc. can be cited.
  • Another typical example is the use of a hash function followed by a comparison of the result. For example, generating the result “true” in case of equality of the lower 9 bits of the result of the function MD5 to 123.
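The two example predicates described above can be sketched with the Python standard library. Reading the digest as a big-endian integer and the function names are assumptions; the source does not state how the hash bytes are interpreted.

```python
import hashlib


def hash_mod_predicate(window: bytes, modulus: int) -> bool:
    """True when the SHA-1 hash of the window, read as a big-endian
    integer (an assumption), is equal to zero modulo `modulus`."""
    value = int.from_bytes(hashlib.sha1(window).digest(), "big")
    return value % modulus == 0


def md5_low_bits_predicate(window: bytes, bits: int = 9, target: int = 123) -> bool:
    """True when the lowest `bits` bits of the MD5 hash equal `target`,
    as in the example of the lower 9 bits being equal to 123."""
    value = int.from_bytes(hashlib.md5(window).digest(), "big")
    return value & ((1 << bits) - 1) == target
```

With a hash that is close to uniformly distributed, the first predicate fires for roughly one window in `modulus`, and the second for roughly one window in 2^9 = 512.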
  • the data may be separate files or streams of network traffic, but not limited to these examples.
  • a data source can be storage media, network devices, remote computer systems.
  • Hard drives (HDD), solid state drives (SSD), flash memory, and CD / DVD / Blu-ray readers can act as storage media.
  • the sliding window is shifted within the processed data, while the value of each function from the above set is determined on the data corresponding to the current position of the sliding window.
  • Receive the processed data. Set the sliding window to the beginning of the processed data. Successively move the sliding window forward from the current position. At each position, the contents of the window are analyzed by each of the N selected functions, taking into account their order, and the result of each function is saved. The boundaries of the data segment are determined based on the order and values of the functions obtained in the previous step.
  • the selected order of functions simultaneously determines the priority of using their result.
  • Figure 1 can serve as an illustration of the operation of the invention.
  • the probability of distortion of the result by false-positive detection of a match as a result of a hash collision should not exceed 2^-32.
  • the lower limit of the selection range W is determined by the parameters of the hash function during the formation of shingles. To comply with the conditions for the probability of hash collisions, taking into account the so-called “birthday paradox”, the size of the hash result must be at least 8 bytes (64 bits), which is also the lower limit for choosing the size of the sliding window W.
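The relationship between hash size, number of hashed items, and collision probability can be explored with the usual pairwise bound from the birthday problem. This is a generic estimate, not the calculation used in the source; the function name and the choice of bound are assumptions.

```python
def birthday_collision_bound(items: int, hash_bits: int) -> float:
    """Upper bound on the probability that at least two of `items`
    uniformly random hashes of `hash_bits` bits coincide:
    (number of pairs) / (number of possible hash values)."""
    pairs = items * (items - 1) / 2
    return min(1.0, pairs / 2.0 ** hash_bits)


# With a 64-bit hash, about 2**16 hashed items stay under the 2**-32 target,
# while 2**17 items exceed it.
under = birthday_collision_bound(2 ** 16, 64) <= 2.0 ** -32
over = birthday_collision_bound(2 ** 17, 64) > 2.0 ** -32
```

The bound grows with the square of the number of items, which is why the hash size cannot be chosen independently of the expected data volume.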
  • A match longer than 1000 bytes can be missed only if a segment larger than 1000 bytes is generated, or if the segment boundaries of the analyzed data stream and of the reference document take more than 1000 bytes to synchronize.
  • the size of the sliding window W = 10 bytes
  • SHA1 hash function
  • a set of primes as the modulus values: 593, 587, 577, 571, 277.
  • the listed numbers are selected analytically, based on established results of discrete mathematics, probability theory, and operations research.
  • the average segment length will be about 400 bytes, and the root-mean-square value will be about 500 bytes.
  • the number of segments less than 10 bytes long will be about 1.6% in quantitative terms or about 0.001% of the data volume.
  • the data segmentation method includes the following steps:
  • the beginning of the data stream is always considered a segment boundary, i.e. the start of the first segment.
  • the contents of the sliding window are first analyzed by function F1, a detector of "ZIP", "DOC" and "JPG" signatures. Suppose that in the first iteration none of the signatures is detected.
  • the SHA1 hash is calculated from the current contents of the sliding window, and the result is used further in the other five functions.
  • the first and last positions of the data stream or file are always considered segment boundaries.
  • the window is sequentially shifted from the first position forward until the edge of the window coincides with the logical end
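The example steps above can be sketched as an analyzer that, for each window position, reports the number of the first function that returns "true". The parameters W = 10, SHA1 and the five primes come from the example; the signature byte values are the commonly documented magic numbers for these formats, and matching them at the start of the window, reading the digest as a big-endian integer, and all names are assumptions.

```python
import hashlib
from typing import List

W = 10                                  # sliding window size from the example
PRIMES = (593, 587, 577, 571, 277)      # modulus values from the example
SIGNATURES = (
    b"PK\x03\x04",                          # ZIP local file header
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",    # OLE2 compound file (legacy DOC)
    b"\xff\xd8\xff",                        # JPEG start of image
)


def analyze(window: bytes) -> int:
    """Return the 1-based number of the first function that yields
    "true" for this window, or 0 when none does."""
    # F1: signature detector (assumed to match at the start of the window).
    if any(window.startswith(sig) for sig in SIGNATURES):
        return 1
    # One SHA-1 digest is computed and shared by the five modulo predicates.
    digest = int.from_bytes(hashlib.sha1(window).digest(), "big")
    for number, prime in enumerate(PRIMES, start=2):
        if digest % prime == 0:
            return number
    return 0


def analyze_stream(data: bytes) -> List[int]:
    """Apply the analyzer at every position where a full window fits."""
    return [analyze(data[i:i + W]) for i in range(len(data) - W + 1)]
```

Because the functions are tried in order, a lower number in the result means a higher-priority detector fired at that position.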
  • the size of the sliding window W is selected based on the requirements of the predicate functions that form the analyzer as described below.
  • As such functions over the sliding window, the following can be used, including, but not limited to:
  • detectors of the format and / or data structure based on static characteristics and optional heuristics. Such detectors generate a value of "true" when changing the nature of the data of the current content
  • Detectors that recognize the nature of the distribution of byte values in a sliding window. For example, including, but not limited to: duplicate byte values, uniform and / or stochastic distribution, a subset of byte values from a valid set of values, etc.
  • a hashing function (message digest) that produces a result with a distribution close to a uniform discrete one.
  • CRC cyclic checksum
  • Rabin-Karp rolling (ring) hashing; cryptographic hash functions (GOST R 34.11-2012, MD5, RIPE-MD, SHA1, SHA256, etc.), schemes
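Of the hashing options listed above, the Rabin-Karp rolling hash is the one that fits a sliding window most directly, because the hash of the next window is derived from the previous one in constant time. The sketch below is a textbook polynomial rolling hash; the base, the modulus, and the names are illustrative assumptions and do not come from the source.

```python
from typing import Iterator

BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime, chosen here only for illustration


def direct_hash(window: bytes) -> int:
    """Polynomial hash of one window, computed from scratch."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h


def rolling_hashes(data: bytes, w: int) -> Iterator[int]:
    """Yield the polynomial hash of every w-byte window of `data`,
    updating the previous value instead of rehashing the window."""
    if w <= 0 or len(data) < w:
        return
    top = pow(BASE, w - 1, MOD)  # weight of the byte leaving the window
    h = direct_hash(data[:w])
    yield h
    for i in range(w, len(data)):
        # Drop the outgoing byte, shift the rest, append the incoming byte.
        h = ((h - data[i - w] * top) * BASE + data[i]) % MOD
        yield h
```

A cryptographic hash such as SHA1 has to be recomputed for every window position, whereas this update costs a few arithmetic operations per shift; the trade-off is weaker collision resistance.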
  • As the result of the constructed predicate function, the statement (predicate) is taken that the hashing result is equal to zero modulo a selectable natural number, the selection criteria for which are described below. In other words, the remainder is calculated by dividing the resulting hash value by
  • the contents of the sliding window is inversely proportional to the selected value as a result of well-known laws of discrete mathematics.
  • the use of primes can be similarly recommended for
  • the analyzer For each position of the sliding window, the analyzer returns the serial number of the very first function that issued
  • the selected order of functions simultaneously determines the priority of using their result.
  • the next segment boundary is determined by iteratively viewing the data stream M positions forward.
  • next boundary is indicated according to the first element of the list, when its value corresponds to the first function of the analyzer, or if after the last boundary there was
  • the (first) list item is then deleted. This corresponds to the search in the table described in the section "Implementation of the invention", while keeping the table up to date without reprocessing the data.
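One possible reading of the boundary selection described above is sketched below. It assumes that the analyzer has already produced, for each position, the number of the first function that fired (0 when none did), that up to M positions after the last boundary are examined, that the highest-priority hit wins with the earliest position breaking ties, and that a boundary is forced after M positions when nothing fired. The source passages are partly truncated, so these rules and all names are assumptions rather than the patented procedure.

```python
from typing import List


def segment_boundaries(results: List[int], max_len: int) -> List[int]:
    """Return segment end offsets (exclusive) for a list of analyzer
    results, looking at most `max_len` positions ahead of the last boundary."""
    boundaries: List[int] = []
    start = 0
    total = len(results)
    while start < total:
        end = min(start + max_len, total)
        # Collect (function number, position) for every position that fired.
        hits = [(results[i], i) for i in range(start, end) if results[i] != 0]
        if hits:
            # Lowest function number = highest priority; earliest position on ties.
            _, position = min(hits)
            cut = position + 1
        else:
            cut = end  # nothing fired within max_len positions: forced boundary
        boundaries.append(cut)
        start = cut
    return boundaries
```

For example, `segment_boundaries([0, 0, 2, 0, 1, 0, 0, 0], 4)` yields the boundaries 3, 5 and 8: the first look-ahead sees only function 2, the second finds function 1, and the last ends at the end of the data.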
  • the data stream is sequentially segmented at the designated boundaries.
  • the size of the generated segments is determined by the characteristics and order of the predicate functions of the detectors in the analyzer.
  • for hashing segments, well-known hash functions (SHA1, MD5, RIPE-MD) can be used, as well as any other function that satisfies the requirements of the target application.
  • Hashing is carried out simultaneously when moving the sliding window forward, but another sequence of actions can be used.
  • the access key is the hash value obtained by
  • the hash value obtained by segmentation is used as a key when searching the index of digital fingerprints.
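The use of segment hashes as keys into an index of digital fingerprints can be sketched with an in-memory dictionary. The index layout, the document identifiers, the choice of SHA1, and the function names are hypothetical; a real index would likely be a persistent store.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Location = Tuple[str, int]  # (document id, segment number); hypothetical layout


def fingerprint(segment: bytes) -> bytes:
    """Hash of one segment, used as the lookup key."""
    return hashlib.sha1(segment).digest()


def build_index(documents: Dict[str, List[bytes]]) -> Dict[bytes, List[Location]]:
    """Map each segment fingerprint to the reference locations where it occurs."""
    index: Dict[bytes, List[Location]] = {}
    for doc_id, segments in documents.items():
        for number, segment in enumerate(segments):
            index.setdefault(fingerprint(segment), []).append((doc_id, number))
    return index


def find_matches(segments: Iterable[bytes],
                 index: Dict[bytes, List[Location]]) -> List[Tuple[int, Location]]:
    """Return (analyzed segment number, reference location) for every hit."""
    matches: List[Tuple[int, Location]] = []
    for number, segment in enumerate(segments):
        for location in index.get(fingerprint(segment), []):
            matches.append((number, location))
    return matches
```

Only fingerprints are stored in the index, so matching does not require access to the contents of the reference documents.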

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of computer technology, automated and information systems, and automated analysis of electronic documents and information flows. The data segmentation method comprises the following steps: the user forms an ordered set of discrete and predicate functions based on their characteristics, taking into account the necessary segmentation parameters; data are received from at least one data source; the sliding window is shifted within the limits of the processed data; the value of each function from the above set is determined on the data corresponding to the current position of the sliding window; the boundaries of the data segment are determined based on the order and values of the functions obtained in the previous step.
PCT/RU2014/000400 2014-05-30 2014-05-30 Data segmentation method WO2015183127A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2014/000400 WO2015183127A1 (fr) 2014-05-30 2014-05-30 Data segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2014/000400 WO2015183127A1 (fr) 2014-05-30 2014-05-30 Data segmentation method

Publications (1)

Publication Number Publication Date
WO2015183127A1 (fr) 2015-12-03

Family

ID=54699336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2014/000400 WO2015183127A1 (fr) 2014-05-30 2014-05-30 Data segmentation method

Country Status (1)

Country Link
WO (1) WO2015183127A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026397A (en) * 1996-05-22 2000-02-15 Electronic Data Systems Corporation Data analysis system and method
US20080159331A1 (en) * 2006-12-29 2008-07-03 Riverbed Technology, Inc. Data segmentation using shift-varying predicate function fingerprinting
US20100114980A1 (en) * 2008-10-28 2010-05-06 Mark David Lillibridge Landmark chunking of landmarkless regions
US20130066625A1 (en) * 2003-11-21 2013-03-14 Nuance Communications Austria Gmbh Text segmentation and label assignment with user interaction by means of topic specific language models and topic-specific label statistics

Similar Documents

Publication Publication Date Title
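CN110162750B (zh) Text similarity detection method, electronic device and computer-readable storage medium
US20140059016A1 (en) Deduplication device and deduplication method
CN110750615B (zh) Text repetition determination method and apparatus, electronic device and storage medium
CN105446964A (zh) Method and apparatus for deduplication of files
Laurenson Performance analysis of file carving tools
CN111159697A (zh) Key detection method, apparatus and electronic device
CN107085568A (zh) Text similarity discrimination method and apparatus
US11163948B2 (en) File fingerprint generation
WO2021121280A1 (fr) Multi-function agent for endpoint scanning
TWI699663B (zh) Segmentation method, segmentation system and non-transitory computer-readable medium
Oliver et al. Designing the elements of a fuzzy hashing scheme
CN116821903A (zh) Detection rule determination and malicious binary file detection method, device and medium
WO2015183127A1 (fr) Data segmentation method
US20210357363A1 (en) File comparison method
CN111159996B (zh) Method and system for comparing the similarity of short-text sets based on a text fingerprint algorithm
CN110968649B (zh) Method, device and computer program product for managing data sets
CN108874753B (zh) Method and apparatus for finding replies to a topic post, and computer device
Dandass et al. An empirical analysis of disk sector hashes for data carving
CN112685740A (zh) Compressed package security detection method, apparatus, terminal and storage medium
CN111400342A (zh) Database update method, apparatus, device and storage medium
CN113407375B (zh) Method, apparatus, device and storage medium for recovering data deleted from a database
US20220365909A1 (en) Apparatus and method for detecting target file based on network packet analysis
CN114222989B (zh) Multi-function agent for endpoint scanning
CN108664900A (zh) Method and device for identifying similarities and differences between written works
US11423208B1 (en) Text encoding issue detection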

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14893008

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 14893008

Country of ref document: EP

Kind code of ref document: A1