CN111159117B - Low-overhead file operation log acquisition method - Google Patents

Publication number
CN111159117B
Authority
CN
China
Prior art keywords
file
log
kernel
information
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911303119.5A
Other languages
Chinese (zh)
Other versions
CN111159117A (en)
Inventor
张为华
鲁云萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-17
Filing date
2019-12-17
Publication date
2023-07-04
Application filed by Fudan University
Priority to CN201911303119.5A
Publication of CN111159117A
Application granted
Publication of CN111159117B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a low-overhead file operation log acquisition method comprising the following steps: 1) collecting file operation log information in the kernel using kernel probes; 2) setting up a shared memory region into which the information collected by the kernel probes is written in kernel space, and from which user space reads that information; 3) reducing the number of logs through a deduplication algorithm, thereby reducing log acquisition overhead. Compared with the prior art, the method uses low-overhead kernel probes to collect file operation information, transfers the information from the kernel to the user layer through shared memory, and then reduces the log volume through an online deduplication algorithm, thereby reducing system overhead.

Description

Low-overhead file operation log acquisition method
Technical Field
The invention relates to the field of data protection, in particular to a low-overhead file operation log acquisition method.
Background
With the rapid development of the internet, social media, cloud computing, the Internet of Things, mobile short video, e-commerce and similar fields, the amount of data generated worldwide each year is growing explosively. The era of big data has arrived, and data has become the most important digital asset in the world. Technological development has brought great convenience to daily life, for example mobile payment, face recognition, intelligent voice and unmanned supermarkets, but it has also brought the risk of data leakage. Data leakage incidents occur one after another, and the demand for data protection keeps increasing. At present, 80% of data is stored in files, and recording a log of file operations is one of the important measures for protecting data: when data is leaked, the source of the leak can be found by backtracking through the file operation log. However, the main problem with existing log acquisition methods is that their system overhead is too high, for the following three main reasons:
(1) File operation logs are recorded by system call interception, which is expensive: the system calls of all file operations are intercepted in order to record all file operation logs.
(2) Log information is passed from kernel space to user space through the expensive printk function.
(3) The file operation log contains a large number of redundant logs and logs generated by temporary files, so the system log becomes excessively large and disk I/O overhead is high.
Because of this high system overhead, existing file operation log acquisition methods are difficult to deploy in real production environments, and the excessive log volume also incurs storage overhead. The existing solution to this problem is to collect file operation logs through a low-overhead stackable file system and then to record only some of the file operations in the system, or only some of the file operations of some users, rather than all file operations of all users. Although this reduces system overhead, it cannot record all file operations of all users: when a file whose operations were not logged is leaked, the source cannot be traced through the file operation log, and neither the person responsible nor the manner of the leak can be identified.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a low-overhead file operation log acquisition method.
The aim of the invention can be achieved by the following technical scheme:
A low-overhead file operation log acquisition method comprises the following steps:
1) Collecting file operation log information in the kernel using kernel probes;
2) Setting up a shared memory region into which the information collected by the kernel probes is written in kernel space, and from which user space reads that information;
3) Reducing the number of logs through a deduplication algorithm, thereby reducing log acquisition overhead.
In step 2), user space reads the information collected by the kernel probes from the shared memory in real time through the mmap mechanism.
In step 3), deduplication is performed by constructing a hash table in which both the keys and the values are structs: the key consists of the parts that are identical across file operation log entries, and the value is the deduplicated log information.
The deduplication algorithm comprises a filtering module and a merging module. The filtering module comprises kernel-layer filtering and user-layer filtering: kernel-layer filtering filters file operation logs, and user-layer filtering filters temporary files. The merging module merges the read and write operations on a file: when the same file is read or written multiple times, multiple consecutive read operations are merged into one read log and multiple consecutive write operations are merged into one write log.
The specific operation flow of the merging module is as follows:
First, a lookup is performed to determine whether the log entry already exists in the existing log information; if so, the entries are merged, and if not, the entry is inserted into the hash table.
The identical parts of file operation log entries comprise file information, process information and user information, specifically the process ID, parent process ID, user ID, file name and type of file operation.
The hash table has O(1) time complexity; hash collisions are resolved by chaining (the linked list method), and the hash function uses the division method.
In step 1), file operation log information is collected using eBPF at the virtual file system layer of the kernel.
Temporary files are filtered by file name, including temporary files with the suffixes .swp and .tmp.
Compared with the prior art, the invention has the following advantages:
1. The invention uses low-overhead kernel probes to collect file operation information in the kernel, transfers the kernel information to user space through low-overhead shared memory, and then reduces the log volume in user space through a deduplication algorithm, so the overall overhead is low.
2. System overhead is reduced while all file operation logs of all users are still recorded.
Drawings
Fig. 1 is a framework diagram of the present invention.
Fig. 2 is a diagram of the overhead of the present invention.
Fig. 3 is a bar graph comparing the overhead of the present invention with that of an existing log collection method.
Fig. 4 is a flow chart of the deduplication algorithm of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
As shown in Fig. 1, the invention provides a low-overhead file operation log acquisition method to address the high overhead of current file operation logging systems. It reduces system overhead while still recording all file operations. Specifically:
file operation information is collected in the kernel using low-overhead kernel probes, the information is transferred from the kernel to the user layer through shared memory, and the log volume is then reduced by a deduplication algorithm, thereby reducing log collection overhead.
The specific design scheme comprises the following steps:
First, kernel probes are used to collect the kernel information related to all file operations. Kernel probes can trace almost all kernel functions; to collect file operation information in the kernel, the functions of the virtual file system layer are traced. If the system call layer were traced instead, there would be too many system call functions: each file operation would require handling of the corresponding system calls, and different system calls may resolve to the same function in the virtual file system layer. If the file system layer were traced, each file system would have to be handled separately: different systems use different file systems with different file operation functions, so there would be too many functions to trace and a different hook function would be needed for each file system. Therefore, file operation information is collected by tracing the functions of the virtual file system layer.
Next, the invention allocates a region of memory in kernel space and maps its contents into user space through the mmap mechanism, thereby implementing shared memory. The kernel probes write the collected information into the shared memory, and user space continuously reads that information from the memory through the mmap mechanism. mmap is primarily a technique for mapping a file into memory, and it can also be used to implement shared memory.
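The shared-memory handoff described above can be illustrated with a user-space analogy. The following is a minimal sketch, not the patent's implementation: a temporary file stands in for the kernel-allocated region, and the record layout (`RECORD`), the header counter (`HEADER`), `CAPACITY` and all function names are hypothetical. It omits the synchronization a real producer and consumer would need.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: pid, uid, operation code, byte count.
RECORD = struct.Struct("<IIIQ")
HEADER = struct.Struct("<I")  # number of records written so far
CAPACITY = 64                 # assumed number of record slots


def create_region(path):
    """Create a zero-filled backing file sized for the header and all slots."""
    with open(path, "wb") as f:
        f.write(b"\x00" * (HEADER.size + CAPACITY * RECORD.size))


def write_record(region, pid, uid, op, nbytes):
    """Producer side (the kernel probe in the patent): append one record."""
    (count,) = HEADER.unpack_from(region, 0)
    if count >= CAPACITY:
        raise BufferError("region full")
    offset = HEADER.size + count * RECORD.size
    RECORD.pack_into(region, offset, pid, uid, op, nbytes)
    HEADER.pack_into(region, 0, count + 1)


def read_records(region, start=0):
    """Consumer side (user space): return the records from index `start` on."""
    (count,) = HEADER.unpack_from(region, 0)
    return [RECORD.unpack_from(region, HEADER.size + i * RECORD.size)
            for i in range(start, count)]


def demo():
    """Map the same file twice and pass two records through the mapping."""
    path = os.path.join(tempfile.mkdtemp(), "shared_region.bin")
    create_region(path)
    with open(path, "r+b") as wf, open(path, "r+b") as rf:
        producer = mmap.mmap(wf.fileno(), 0)
        consumer = mmap.mmap(rf.fileno(), 0)
        write_record(producer, 1234, 1000, 1, 4096)
        write_record(producer, 1234, 1000, 2, 512)
        records = read_records(consumer)
        producer.close()
        consumer.close()
    return records
```

Because both mappings refer to the same underlying pages, the consumer sees the producer's records without any copy through a system call such as read, which is the property the method relies on.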
Finally, the log volume is reduced by an online deduplication algorithm. Compared with the traditional sequential deduplication method, the invention deduplicates by constructing a hash table, which saves a great deal of time, improves deduplication efficiency and satisfies the requirement for error-free deduplication. The deduplication algorithm mainly comprises a filtering module and a merging module. The filtering module mainly filters the operation logs of temporary files, such as the temporary files generated when vim opens a file. The invention also designs a suitable data structure to implement merging: before merging, a lookup is performed to determine whether the log entry already exists in the existing log information; if it does, the entries are merged, and if not, the entry is inserted. To reduce overhead, the invention uses a hash table with O(1) time complexity. Different logs yield different keys, so, by the nature of the hash table, distinct logs are inserted and identical logs are not inserted again; the values in the hash table are the deduplicated log information.
For file operation logs, both the keys and the values of the hash table are structs. The key consists mainly of the parts that are identical across file operation log entries, namely file information, process information and user information: specifically the process ID, parent process ID, user ID, file name and type of file operation. The value consists mainly of the parts that differ between file read/write log entries, including the amount of data read or written and the number of reads or writes. The value is designed so that this read/write data information is not lost when read/write logs are deduplicated. For example, a file may be read ten times with a different amount of data read each time. The value is updated continuously so that the specific amounts of data read and written are preserved in the read/write operation log. In this example, hash collisions are resolved by chaining (the linked list method), and the hash function uses the division method.
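The key and value design described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented code: the class and field names (`LogKey`, `LogValue`, `ChainedHashTable`, `record`) and the default bucket count are hypothetical, Python lists stand in for linked-list chains, and Python's built-in `hash` supplies the integer that the division method reduces modulo the bucket count.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LogKey:
    """Fields that are identical across repeated entries (the hash table key)."""
    pid: int
    ppid: int
    uid: int
    filename: str
    op: str  # type of file operation, e.g. "read" or "write"


@dataclass
class LogValue:
    """Fields that differ between repeated entries (the hash table value)."""
    count: int = 0        # number of merged operations
    total_bytes: int = 0  # total amount of data read or written


class ChainedHashTable:
    def __init__(self, num_buckets: int = 1021):
        # Each bucket is a chain of (key, value) pairs.
        self.buckets: List[List[Tuple[LogKey, LogValue]]] = [
            [] for _ in range(num_buckets)
        ]
        self.size = 0

    def _index(self, key: LogKey) -> int:
        # Division method: integer hash of the key modulo the bucket count.
        return hash(key) % len(self.buckets)

    def find(self, key: LogKey) -> Optional[LogValue]:
        """Walk the chain for this key's bucket; return its value or None."""
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def record(self, key: LogKey, nbytes: int) -> LogValue:
        """Merge into an existing entry, or insert a new one if none exists."""
        value = self.find(key)
        if value is None:
            value = LogValue()
            self.buckets[self._index(key)].append((key, value))
            self.size += 1
        value.count += 1
        value.total_bytes += nbytes
        return value
```

Lookups are O(1) on average only while the chains stay short, so the bucket count should be chosen relative to the expected number of distinct keys.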
Writing kernel modules that use kernel probes directly is difficult to develop and debug, affects system stability and is not compatible across operating system versions. eBPF, by contrast, is safe and stable, is compatible with different operating system versions, can be used in real production environments and supports kernel probes. In this embodiment, eBPF is therefore used to collect file system information. eBPF collects file operation logs mainly at the virtual file system layer of the kernel: file systems are numerous and different systems choose different ones, so collecting logs of data operations at the file system layer would require collecting log information for each file system separately, which is too much work. File operations are of many types, such as read, write, copy, delete and modify attributes. Different kernel functions are selected according to the file operation, and these functions are traced through eBPF to obtain all file operation information in the kernel. eBPF also supports transferring file operation information from kernel space to user space through shared memory.
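Selecting a kernel function per file operation can be pictured as a lookup table. The patent does not name the traced functions, so the mapping below is an assumption: it lists Linux VFS-layer functions that plausibly correspond to each operation type, and `probe_targets` is a hypothetical helper. Attaching the probes themselves would be done through an eBPF front end, which is not shown here.

```python
# Assumed mapping from file operation type to a VFS-layer kernel function
# that a kernel probe could be attached to. Illustrative only: the patent
# does not list these names, and the right choice can vary by kernel version.
VFS_PROBE_TARGETS = {
    "read": "vfs_read",
    "write": "vfs_write",
    "copy": "vfs_copy_file_range",  # a copy may also appear as read + write
    "delete": "vfs_unlink",
    "modify_attributes": "notify_change",
}


def probe_targets(operations):
    """Return the sorted, de-duplicated kernel function names to trace."""
    unknown = [op for op in operations if op not in VFS_PROBE_TARGETS]
    if unknown:
        raise KeyError(f"no probe target known for: {unknown}")
    return sorted({VFS_PROBE_TARGETS[op] for op in operations})
```

Keeping the mapping at the VFS layer means one entry per operation type, regardless of which concrete file system is mounted.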
When file operation information is transferred from kernel space to user space, the deduplication algorithm implemented by the invention effectively reduces the volume of logs written to file; a smaller log volume means less disk I/O and therefore lower system overhead. The implementation of the deduplication algorithm mainly comprises the filtering module and the merging module. The filtering module is divided into kernel-layer filtering and user-layer filtering. Kernel-layer filtering filters certain file operation logs in the eBPF code, such as the logs of kernel daemons, which continuously read configuration files; these processes are filtered out by process PID in the kernel layer. User-layer filtering is performed at the user layer when information passes from the kernel layer to the user layer and mainly filters temporary files. At the current stage, temporary files are filtered mainly by file name, for example temporary files with the suffixes .swp and .tmp. Merging is implemented after the log information has been transferred from kernel space to user space through shared memory: read and write logs are merged through the hash table designed by the invention, which removes redundant entries from the read/write logs and reduces the number of logs.
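The two filtering stages can be sketched as a pair of predicates. This is a simplified model under assumptions: records are plain dictionaries with hypothetical `pid` and `filename` fields, the ignored PID set is supplied by the caller, and both stages run in one Python process, whereas in the patent the PID check runs inside the eBPF code in the kernel.

```python
TEMP_SUFFIXES = (".swp", ".tmp")  # suffixes named in the patent


def kernel_layer_keep(record, ignored_pids):
    """Kernel-layer stage: drop records from processes that are filtered by PID."""
    return record["pid"] not in ignored_pids


def user_layer_keep(record):
    """User-layer stage: drop records whose file name marks a temporary file."""
    return not record["filename"].endswith(TEMP_SUFFIXES)


def filter_records(records, ignored_pids):
    """Apply the PID filter first, then the temporary-file filter."""
    return [r for r in records
            if kernel_layer_keep(r, ignored_pids) and user_layer_keep(r)]
```

Filtering by PID before the data leaves the kernel avoids paying the transfer cost for records that would be discarded anyway.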
As shown in Fig. 4, the deduplication algorithm of the present invention proceeds as follows:
(1) First, determine whether the log information contains temporary file operation logs or other logs that need to be filtered; if so, filter them out. If not, proceed to the next step.
(2) Create a hash table and store the log information.
(3) Search the hash table to determine whether the logs contain redundant entries; if so, update the hash table and merge the log information. If not, proceed to the next step.
(4) Write the log information in the hash table to the log file.
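Steps (1) to (4) can be sketched end to end as follows. This is a minimal sketch with assumptions: Python's built-in `dict` stands in for the custom hash table, the record fields and the output line format are hypothetical, and an in-memory buffer stands in for the log file.

```python
import io


def deduplicate(records, ignored_pids=frozenset(), temp_suffixes=(".swp", ".tmp")):
    """Steps (1)-(3): filter, then merge entries that share the same key."""
    table = {}
    for r in records:
        # Step (1): skip filtered processes and temporary files.
        if r["pid"] in ignored_pids or r["filename"].endswith(temp_suffixes):
            continue
        # Steps (2)-(3): look up the key; merge if present, insert if not.
        key = (r["pid"], r["ppid"], r["uid"], r["filename"], r["op"])
        entry = table.get(key)
        if entry is None:
            table[key] = {"count": 1, "bytes": r["bytes"]}
        else:
            entry["count"] += 1
            entry["bytes"] += r["bytes"]
    return table


def write_log(table, out):
    """Step (4): write one line per merged entry to a file-like object."""
    for (pid, ppid, uid, filename, op), v in table.items():
        out.write(f"pid={pid} ppid={ppid} uid={uid} file={filename} "
                  f"op={op} count={v['count']} bytes={v['bytes']}\n")


def render(table):
    """Convenience wrapper: return the log text instead of writing a file."""
    buf = io.StringIO()
    write_log(table, buf)
    return buf.getvalue()
```

Repeated reads of the same file by the same process collapse into a single line whose count and byte total preserve what the individual entries recorded.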
Examples
The system overhead of the log acquisition method was tested in the following environment: a cluster of two machines, each with a 1.87 GHz 16-core Intel Xeon processor, 8 GB of memory and a 40 GB hard disk, running the Linux 4.15.9 operating system.
As shown in Figs. 2 and 3, to test the performance overhead of the proposed acquisition method, the read/write performance of a machine with the log acquisition system loaded is compared with that of a machine without it, showing the impact of the method on system overhead. In addition, the performance improvement is assessed by comparing the system performance of this log system with that of the traditional log system Progger. The bonnie++ tool is used for the overhead test: 100 small files of 1 KB each are created, the I/O times of a system using the log collection tool and of a system not using it are measured, and the percentage performance loss is calculated. The performance losses of DataLogger, the log collection system designed by the invention, and of Progger, an existing open-source log system, are then calculated, as shown in Table 1.
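The patent does not spell out how the percentage performance loss is computed. One plausible reading, assumed here, is the relative increase in measured I/O time over the baseline without logging; the function name and arguments are hypothetical.

```python
def performance_loss_percent(baseline_time, instrumented_time):
    """Relative slowdown of the instrumented run over the baseline, in percent.

    Assumes both arguments are I/O times in the same unit, measured for the
    same workload, and that the baseline is strictly positive.
    """
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return 100.0 * (instrumented_time - baseline_time) / baseline_time
```

Under this definition, a workload that takes 10.0 time units without logging and 10.5 with logging shows a 5% loss; these numbers are made up for illustration and are not taken from Table 1.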
Table 1. DataLogger system overhead performance test
[Table provided as an image in the original publication: Figure BDA0002322366940000061]

Claims (6)

1. A low-overhead file operation log acquisition method, characterized by comprising the following steps:
1) collecting file operation log information in the kernel using kernel probes;
2) setting up a shared memory region into which the information collected by the kernel probes is written in kernel space, and from which user space reads that information;
3) reducing the number of logs through a deduplication algorithm, thereby reducing log acquisition overhead;
wherein in step 3), deduplication is performed by constructing a hash table in which both the keys and the values are structs, the key consisting of the parts that are identical across file operation log entries and the value being the deduplicated log information;
the deduplication algorithm comprises a filtering module and a merging module, wherein the filtering module comprises kernel-layer filtering and user-layer filtering, the kernel-layer filtering filters by process PID so as to filter file operation logs, the user-layer filtering filters temporary files, and the merging module merges the read and write operations on a file such that, when the same file is read or written multiple times, multiple consecutive read operations are merged into one read log and multiple consecutive write operations are merged into one write log;
temporary files are filtered by file name, including temporary files with the suffixes .swp and .tmp.
2. The low-overhead file operation log acquisition method according to claim 1, wherein in step 2), user space reads the information collected by the kernel probes from the shared memory in real time through the mmap mechanism.
3. The low-overhead file operation log acquisition method according to claim 1, wherein the specific operation flow of the merging module is as follows:
first, a lookup is performed to determine whether the log entry already exists in the existing log information; if so, the entries are merged, and if not, the entry is inserted into the hash table.
4. The low-overhead file operation log acquisition method according to claim 1, wherein the identical parts of file operation log entries comprise file information, process information and user information, specifically the process ID, parent process ID, user ID, file name and type of file operation.
5. The low-overhead file operation log acquisition method according to claim 1, wherein the hash table has O(1) time complexity, hash collisions are resolved by chaining (the linked list method), and the hash function uses the division method.
6. The low-overhead file operation log acquisition method according to claim 1, wherein in step 1), file operation log information is collected using eBPF at the virtual file system layer of the kernel.
CN201911303119.5A 2019-12-17 2019-12-17 Low-overhead file operation log acquisition method Active CN111159117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911303119.5A CN111159117B (en) 2019-12-17 2019-12-17 Low-overhead file operation log acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911303119.5A CN111159117B (en) 2019-12-17 2019-12-17 Low-overhead file operation log acquisition method

Publications (2)

Publication Number Publication Date
CN111159117A (en) 2020-05-15
CN111159117B (en) 2023-07-04

Family

ID=70557639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911303119.5A Active CN111159117B (en) 2019-12-17 2019-12-17 Low-overhead file operation log acquisition method

Country Status (1)

Country Link
CN (1) CN111159117B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115629944B (en) * 2022-12-21 2023-04-07 杭州谐云科技有限公司 Processing method and log processing system for container log
CN115840938B (en) * 2023-02-21 2023-05-09 山东捷讯通信技术有限公司 File monitoring method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107070897A (en) * 2017-03-16 2017-08-18 杭州安恒信息技术有限公司 Network log storage method based on many attribute Hash duplicate removals in intruding detection system
CN109542341A (en) * 2018-11-06 2019-03-29 网宿科技股份有限公司 A kind of read-write IO monitoring method, device, terminal and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130159977A1 (en) * 2011-12-14 2013-06-20 Microsoft Corporation Open kernel trace aggregation

Also Published As

Publication number Publication date
CN111159117A (en) 2020-05-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant