CN111159117B

CN111159117B - Low-overhead file operation log acquisition method

Info

Publication number: CN111159117B
Application number: CN201911303119.5A
Authority: CN
Inventors: 张为华; 鲁云萍
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2019-12-17
Filing date: 2019-12-17
Publication date: 2023-07-04
Anticipated expiration: 2039-12-17
Also published as: CN111159117A

Abstract

The invention relates to a low-overhead file operation log acquisition method comprising the following steps: 1) collecting file operation log information in the kernel with a kernel probe; 2) setting up a shared memory region into which the kernel probe writes the collected information in kernel space and from which user space reads it; 3) reducing the number of logs with a deduplication algorithm, thereby reducing log acquisition overhead. Compared with the prior art, the method uses low-overhead kernel probe technology to collect file operation information, transfers the information from the kernel to the user layer through shared memory, and then reduces the log volume with an online deduplication algorithm, thereby reducing system overhead.

Description

Low-overhead file operation log acquisition method

Technical Field

The invention relates to the field of data protection, in particular to a low-overhead file operation log acquisition method.

Background

With the rapid development of the internet, social media, cloud computing, the internet of things, mobile short video, e-commerce and similar fields, the amount of data generated worldwide each year is growing explosively. The era of big data has arrived, and data has become the most important digital asset in the world. Technological development has brought great convenience to daily life, for example mobile payment, face recognition, intelligent voice and unmanned supermarkets, but it has also brought the risk of data leakage. Data leakage incidents occur one after another, and the demand for data protection keeps increasing. At present, 80% of data is stored in files, and recording a log of file operations is one of the important measures for protecting data: once data is leaked, the source of the leak can be found by tracing back through the file operation log. However, the main problem of existing log acquisition methods is that their system overhead is too large, for the following three main reasons:

(1) File operation logs are recorded by high-overhead system call interception: the system calls of every file operation are intercepted in order to record all file operation logs.

(2) Log information is passed from kernel space to user space through the expensive printk function.

(3) Because a large number of redundant logs and logs generated by temporary files exist in the file operation log, the system log is excessively large, and the disk IO overhead is large.

Existing file operation log acquisition methods impose a large system overhead, which hinders deployment in real production environments, and their excessive log volume adds storage overhead. To address this, the existing solution collects file operation logs through a low-overhead stackable file system and then records only some of the file operations in the system rather than all of them, or only the file operations of some users rather than of all users. Although this reduces system overhead, it cannot record all file operations of all users; when a file whose operations were not logged is leaked, the source cannot be traced through the file operation log, and neither the person responsible nor the manner of the leak can be found.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a low-overhead file operation log acquisition method.

The aim of the invention can be achieved by the following technical scheme:

A low-overhead file operation log acquisition method comprises the following steps:

1) Collecting file operation log information in a kernel by adopting a kernel probe;

2) Setting up a shared memory region into which the kernel probe writes the collected information in kernel space, and from which user space reads that information;

3) The number of logs is reduced through a deduplication algorithm, and the log acquisition overhead is reduced.

In the step 2), the user space reads the information acquired by the kernel probe from the shared memory in real time through a mmap mechanism.

In step 3), deduplication is performed by constructing a hash table in which both the keys and the values are structs: the keys consist of the parts that are identical across file operation logs, and the values are the deduplicated log information.

The deduplication algorithm comprises a filtering module and a merging module. The filtering module comprises kernel layer filtering, which filters file operation logs, and user layer filtering, which filters temporary files. The merging module merges the read and write operations of a file: when the same file has multiple read and write operations, multiple consecutive read operations are merged into one read log and multiple consecutive write operations are merged into one write log.

The specific operation flow of the merging module is as follows:

First, the hash table is searched to determine whether the log entry already exists among the existing log information; if it does, the entries are merged, and if not, the entry is inserted into the hash table.

The parts that are identical across file operation logs include file information, process information and user information, specifically the process ID, parent process ID, user ID, file name and type of file operation.

The hash table has O(1) time complexity, hash collisions are resolved by chaining (the linked list method), and the hash function uses the division method.
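A minimal sketch of a hash table that combines the division method with chaining is given below for illustration. The class name `ChainedHashTable`, the bucket counts, and the use of Python lists in place of true linked lists are assumptions of this sketch and do not come from the patent text.

```python
class ChainedHashTable:
    """Illustrative hash table: division-method hashing, separate chaining."""

    def __init__(self, num_buckets=97):
        # A prime bucket count is a common choice for the division method.
        self.num_buckets = num_buckets
        # Each bucket holds a chain of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Division method: h(k) = k mod m, applied to Python's integer hash.
        return hash(key) % self.num_buckets

    def get(self, key, default=None):
        # Walk the chain of the bucket the key maps to.
        for stored_key, value in self.buckets[self._index(key)]:
            if stored_key == key:
                return value
        return default

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (stored_key, _) in enumerate(chain):
            if stored_key == key:
                chain[i] = (key, value)  # key already present: overwrite
                return
        chain.append((key, value))       # new key (possibly a collision): extend chain


table = ChainedHashTable(num_buckets=7)
table.put(3, "a")
table.put(10, "b")  # 10 % 7 == 3, so this key shares a bucket with key 3
```

With short chains, lookups and insertions take constant time on average, which is the O(1) behaviour the text relies on.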

In step 1), file operation log information is collected using eBPF at the virtual file system layer of the kernel.

Temporary files are filtered by file name, including temporary files with the suffixes .swp and .tmp.

Compared with the prior art, the invention has the following advantages:

1. The invention uses a low-overhead kernel probe to collect file operation information in the kernel, transfers the kernel information to user space through low-overhead shared memory, and then reduces the log volume in user space with a deduplication algorithm, so the overall overhead is low.

2. System overhead is reduced while all file operation logs of all users are still recorded.

Drawings

Fig. 1 is a frame diagram of the present invention.

Fig. 2 is a diagram of the overhead of the present invention.

FIG. 3 is a bar graph comparing the overhead of the present invention with that of an existing log collection method.

FIG. 4 is a flow chart of the deduplication algorithm of the present invention.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples.

As shown in FIG. 1, the invention provides a low-overhead file operation log acquisition method that addresses the high overhead of current file operation log systems. It reduces system overhead while still recording all file operations. Specifically:

File operation information is collected in the kernel using low-overhead kernel probe technology, the information is transferred from the kernel to the user layer through shared memory, and the log volume is reduced by a deduplication algorithm, thereby reducing the log collection overhead.

The specific design scheme comprises the following steps:

First, a kernel probe is used to collect the kernel information related to all file operations. A kernel probe can trace almost any kernel function; to collect file operation information in the kernel, the invention specifically traces functions of the virtual file system layer. If the system call layer were traced instead, there would be too many system call functions: each file operation would require handling its corresponding system calls, and different system calls may end up in the same virtual file system layer function. If the file system layer were traced, each file system would have to be handled separately: different systems use different file systems with different file operation functions, so too many functions would have to be traced and different hook functions would be needed for different file systems. Therefore, to collect file operation information, the functions of the virtual file system layer are traced.

Then, the invention allocates a region of memory in kernel space and maps its contents into user space through the mmap mechanism, thereby implementing the shared memory. The kernel probe writes the collected information into the shared memory, and user space continuously reads that information from it through the mmap mechanism. mmap is primarily a technique for mapping a file into memory, and it can be used to implement shared memory.
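The idea of two parties exchanging records through one mapped region can be sketched in user space as follows. This is only an analogy: in the invention the buffer lives in kernel space and is written by the kernel probe, whereas this sketch maps a temporary file twice within one process. The record layout (`pid`, `uid`, byte count) and the helper names are hypothetical.

```python
import mmap
import struct
import tempfile

# Hypothetical fixed-size record: pid (uint32), uid (uint32), bytes moved (uint64).
RECORD = struct.Struct("<IIQ")


def write_record(buf, slot, pid, uid, nbytes):
    """Producer side: pack one record into the given slot of the shared buffer."""
    RECORD.pack_into(buf, slot * RECORD.size, pid, uid, nbytes)


def read_record(buf, slot):
    """Consumer side: unpack the record stored in the given slot."""
    return RECORD.unpack_from(buf, slot * RECORD.size)


with tempfile.TemporaryFile() as backing:
    backing.truncate(mmap.PAGESIZE)  # the mapping needs a non-empty file
    # Two independent mappings of the same file stand in for the writer
    # (kernel probe) and the reader (user-space collector).
    writer = mmap.mmap(backing.fileno(), mmap.PAGESIZE)
    reader = mmap.mmap(backing.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_READ)
    write_record(writer, 0, 1234, 1000, 4096)
    record = read_record(reader, 0)
    reader.close()
    writer.close()
```

The reader sees the writer's record without any copy through a system call, which is the property that makes shared memory cheaper than passing each log line through a function such as printk.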

Finally, the log volume is reduced by an online deduplication algorithm. Compared with traditional sequential deduplication, deduplicating by constructing a hash table saves a great deal of time, improves deduplication efficiency, and meets the requirement of zero errors in data deduplication. The deduplication algorithm mainly comprises a filtering module and a merging module. The filtering module mainly filters the operation logs of temporary files, such as the temporary files generated when vim opens a file. The invention also designs a suitable data structure for merging: before merging, a lookup determines whether the log entry already exists among the existing log information; if it does, the entries are merged, and if not, the entry is inserted. To reduce overhead, the invention uses a hash table with O(1) time complexity. Different logs obtain different keys after the hash function is applied, so by the nature of the hash table different logs are inserted while identical logs are not inserted again, and the values in the hash table are the deduplicated log information.

For file operation logs, the keys and values of the hash table designed by the invention are both structs. The key consists mainly of the parts that are identical across file operation logs, namely file information, process information and user information: specifically the process ID, parent process ID, user ID, file name and type of file operation. The value consists mainly of the parts that differ across file read and write operation logs, including the amount of data read or written and the number of reads or writes. The value is designed this way so that information about the data read and written is not lost when file read and write logs are deduplicated. For example, a file may be read ten times with a different amount of data read each time. The value is updated continuously, so the specific amounts of data read and written are preserved in the read and write operation logs. In this example, hash collisions are resolved by chaining (the linked list method), and the hash function uses the division method.
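The key and value layout described above might look like the following sketch. The field names, the `merge_log` helper and the sample path are hypothetical, and Python's built-in `dict` is used as the hash table for brevity even though it does not use chaining or the division method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogKey:
    """Parts that are identical across repeated logs (hashable struct)."""
    pid: int
    ppid: int
    uid: int
    filename: str
    op: str  # type of file operation, e.g. "read" or "write"


@dataclass
class LogValue:
    """Parts that differ between repeated logs, accumulated on merge."""
    count: int = 0        # number of reads or writes merged so far
    total_bytes: int = 0  # total amount of data read or written


def merge_log(table, key, nbytes):
    """Insert the key if it is new, otherwise update the existing value."""
    value = table.setdefault(key, LogValue())
    value.count += 1
    value.total_bytes += nbytes


table = {}
key = LogKey(pid=4242, ppid=1, uid=1000, filename="/tmp/report.txt", op="read")
for nbytes in (512, 1024, 100):  # three reads of differing sizes
    merge_log(table, key, nbytes)
```

Three read events collapse into a single table entry, while the count and byte total keep the information that would otherwise be lost by dropping the duplicates.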

Writing a kernel module that uses kernel probes directly makes development and debugging difficult, affects system stability, and is not compatible across operating system versions. eBPF, by contrast, is safe, stable and compatible with different operating system versions, can be used in real production environments, and supports kernel probes. In this embodiment, eBPF is therefore used to collect file system information. eBPF collects file operation logs mainly at the virtual file system layer of the kernel, because there are many kinds of file systems and different systems choose different ones; if logs were collected at the file system layer, log collection would have to be implemented for every file system, which would be too much work. File operations are of various types, such as read, write, copy, delete and modify attributes. Different kernel functions are selected according to the corresponding file operations and traced through eBPF, so as to obtain all file operation information in the kernel. eBPF also supports transferring file operation information from kernel space to user space through shared memory.

When file operation information is transferred from kernel space to user space, the deduplication algorithm implemented by the invention effectively reduces the volume of logs written to file; fewer logs mean less disk IO and therefore lower system overhead. The implementation of the deduplication algorithm mainly comprises a filtering module and a merging module. The filtering module is divided into kernel layer filtering and user layer filtering. Kernel layer filtering filters some file operation logs in the eBPF code, such as the logs of kernel daemons that continuously read configuration files; these processes are filtered by process pid at the kernel layer. User layer filtering is performed at the user layer once the information has passed from the kernel layer to the user layer, and mainly filters temporary files. At the current stage, temporary files are filtered mainly by file name, for example files with the suffixes .swp and .tmp. Merging is performed after the log information has been transferred from kernel space to user space through the shared memory: read and write logs are merged through the hash table designed by the invention, which removes redundant entries from the read and write logs and reduces the number of logs.
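The two filtering stages can be illustrated with the sketch below. In the invention the pid check runs inside the eBPF code in the kernel; here both checks are ordinary Python functions for illustration only, and the pid set, function names and sample paths are hypothetical.

```python
# Hypothetical pids of daemons whose file activity should not be logged.
FILTERED_PIDS = {1, 2}
# Temporary-file suffixes named in the text.
TEMP_SUFFIXES = (".swp", ".tmp")


def kernel_layer_keep(pid, filtered_pids=FILTERED_PIDS):
    """Stage 1 (kernel layer in the invention): drop events by process pid."""
    return pid not in filtered_pids


def user_layer_keep(filename):
    """Stage 2 (user layer): drop events on temporary files by file name suffix."""
    return not filename.endswith(TEMP_SUFFIXES)


events = [
    (1, "/etc/daemon.conf"),           # dropped by the pid filter
    (4242, "/home/u/.notes.txt.swp"),  # dropped by the suffix filter
    (4242, "/home/u/notes.txt"),       # kept
]
kept = [e for e in events if kernel_layer_keep(e[0]) and user_layer_keep(e[1])]
```

Filtering by pid as early as possible means the dropped events never need to be copied to user space at all.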

As shown in FIG. 4, the deduplication algorithm of this embodiment proceeds as follows:

(1) First, determine whether the log information contains temporary file operation logs or other logs that need to be filtered; if so, filter them out. If not, proceed to the next step.

(2) Create a hash table and store the log information.

(3) Search the hash table to determine whether the logs contain redundant entries; if so, update the hash table and merge the log information. If not, proceed to the next step.

(4) Write the log information in the hash table to the log file.
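Steps (1) to (4) can be strung together as in the sketch below. It is a simplification under stated assumptions: the event tuple layout, the reduced key (pid, file name, operation) and the output line format are hypothetical, and all events with the same key are merged rather than only consecutive ones.

```python
import io


def deduplicate(events, out):
    """Filter, merge and write file operation events; return the number of logs written."""
    table = {}  # step (2): the hash table holding merged log information
    for pid, filename, op, nbytes in events:
        # Step (1): filter out temporary-file logs.
        if filename.endswith((".swp", ".tmp")):
            continue
        key = (pid, filename, op)
        # Step (3): look the entry up; merge if present, insert otherwise.
        if key in table:
            count, total = table[key]
            table[key] = (count + 1, total + nbytes)
        else:
            table[key] = (1, nbytes)
    # Step (4): write the merged entries to the log file.
    for (pid, filename, op), (count, total) in table.items():
        out.write(f"{pid} {filename} {op} count={count} bytes={total}\n")
    return len(table)


events = [
    (7, "a.txt", "read", 10),
    (7, "a.txt", "read", 20),
    (7, "a.swp", "write", 5),
    (7, "a.txt", "write", 30),
]
log_file = io.StringIO()  # stands in for the on-disk log file
written = deduplicate(events, log_file)
```

Four incoming events produce two log lines here, so the disk receives half as many writes as it would without deduplication.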

Examples

The system overhead of the log acquisition method was tested in the following environment: a cluster of two machines, each with a 1.87 GHz 16-core Intel Xeon processor, 8 GB of memory, a 40 GB hard disk, and a Linux 4.15.9 operating system.

As shown in FIGS. 2-3, to test the performance overhead of the proposed acquisition method, the read and write performance of a machine running the log acquisition system implemented here was compared with that of a machine not running it. In addition, the system was compared with Progger, a traditional log system, to observe the performance improvement. The bonnie++ tool was chosen for the overhead test: 100 small files of 1 KB each were created, the IO time was measured on a system using the log collection tool and on a system not using it, and the percentage performance loss was calculated. The performance losses of DataLogger, the log collection system designed by the invention, and of Progger, an existing open-source log system, were calculated separately, as shown in Table 1.

TABLE 1 DataLogger System overhead Performance test
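The text does not state the formula used for the percentage performance loss. One plausible definition, assumed here, is the relative increase in IO time over the baseline without logging. The function name is hypothetical and the input values are made-up placeholders, not measurements from the patent.

```python
def performance_loss_percent(baseline_time, logged_time):
    """Relative slowdown, in percent, of a run with logging versus a run without."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return (logged_time - baseline_time) / baseline_time * 100.0


# Placeholder inputs only: 8.0 s without the collector, 10.0 s with it.
loss = performance_loss_percent(baseline_time=8.0, logged_time=10.0)
```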

Claims

1. The low-overhead file operation log acquisition method is characterized by comprising the following steps of:

3) The number of logs is reduced through a deduplication algorithm, and the log acquisition overhead is reduced;

in step 3), deduplication is performed by constructing a hash table, wherein the keys and values in the hash table are both structs, the keys are the parts that are identical across file operation logs, and the values are the deduplicated log information;

the duplicate removal algorithm comprises a filtering module and a merging module, wherein the filtering module comprises kernel layer filtering and user layer filtering, the kernel layer filtering is used for filtering through a process pid so as to realize the filtering of a file operation log, the user layer filtering is used for realizing the filtering of a temporary file, the merging module is used for merging read-write operations of the file, and when the same file has multiple read-write operations, the multiple continuous read operations are merged into a read log, and the multiple continuous write operations are merged into a write log;

2. The method for collecting the log of file operations with low overhead according to claim 1, wherein in the step 2), the user space reads the information collected by the kernel probe from the shared memory in real time through a mmap mechanism.

3. The method for collecting the operation log of the file with low overhead according to claim 1, wherein the specific operation flow of the merging module is as follows:

4. The method of claim 1, wherein the same parts in the file operation log include file information, process information and user information, and specifically include process ID, parent process ID, user ID, file name and type of file operation.

5. The low-overhead file operation log acquisition method according to claim 1, wherein the hash table has O(1) time complexity, hash collisions are resolved by chaining (the linked list method), and the hash function uses the division method.

6. The method for collecting low-overhead file operation log according to claim 1, wherein in the step 1), the file operation log information is collected by using an eBPF in a virtual file layer of the kernel.