WO2023070462A1

WO2023070462A1 - File deduplication method and apparatus, and device

Info

Publication number: WO2023070462A1
Application number: PCT/CN2021/127162
Authority: WO
Inventors: 郭小东; 张海波; 陈咸彰; 黄永兵; 刘铎; 谭玉娟
Original assignee: 华为技术有限公司; 重庆大学
Priority date: 2021-10-28
Filing date: 2021-10-28
Publication date: 2023-05-04
Also published as: CN118120212A

Abstract

Embodiments of the present application provide a file deduplication method and apparatus, and a device. In the method, during file writing and in response to a write request, the file in the write request is temporarily stored in a first storage space and compared with the files in a second storage space to determine whether the file in the write request is a duplicate file. By means of the method, duplicate files can be removed automatically during file writing, which reduces storage space occupation; the user does not need to actively initiate a file deduplication request, which reduces performance overhead.

Description

File deduplication method and apparatus, and device

Technical field

The present application relates to the technical field of communication, and in particular to a file deduplication method, apparatus, and device.

Background

The storage space of terminal devices is consumed quickly, and insufficient storage space is one of the key reasons users replace their phones. With the widespread use of the mobile Internet and smart terminals, more and more duplicate files are generated during social interaction, taking up a lot of space. To reduce the storage space occupied by duplicate files, some applications for file deduplication currently exist (such as various mobile phone cleaning tools). After being started manually by the user, such a tool scans the terminal device for duplicate files, obtains the scan results, and provides them to the user; the user then confirms and deletes the duplicate files one by one through manual operation. However, this approach takes a long time to scan and requires the user to select and remove duplicate files one by one, which is also time-consuming; and since each file may correspond to an interaction window of social software, directly deleting duplicate files may cause the interaction window to display abnormally or make the dialog unavailable. Therefore, how to effectively remove duplicate files without the user or the application being aware of it becomes a problem to be solved.

Summary of the invention

Embodiments of the present application provide a file deduplication method, apparatus, and device. The method can automatically remove duplicate files and reduce storage space occupation; it is transparent to applications and does not require users to perform complex operations, which reduces system processing overhead.

In a first aspect, an embodiment of the present application provides a file deduplication method, which is implemented by a terminal device or a device deployed on a cloud. The terminal device or the device deployed on the cloud obtains a write request that includes a first file; in response to the write request, stores the first file in a first storage space; and determines whether a second file that is the same as the first file exists in a second storage space, where the second storage space and the first storage space are located at different layers of the storage system. For example, the first storage space is located in the memory space, and the second storage space is located in the external storage space (such as a disk). In this method, when the write request is obtained, the first file included in the write request is stored in an independent storage space (the first storage space), and it is determined whether the files already stored in the second storage space include a file identical to the first file (that is, whether a duplicate file exists). Because the duplicate check is performed as write requests are obtained, the method realizes online deduplication (also called online file deduplication), which is transparent to users and applications. Unlike the existing technology, online file deduplication does not need to re-read files that have already been written to the external storage space (such as a disk) into the cache before deduplicating them, which reduces the number of repeated writes to the hard disk and avoids the write overhead caused by duplicate files. In addition, once the user turns on the file deduplication function, a duplicate check can be performed every time a write request is received, which spares the user from repeatedly performing deduplication manually and improves user experience.
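The write path described in the first aspect can be sketched in Python as follows. This is a minimal in-memory model, not the implementation of the application: the class name `OnlineDedupWriter`, the dictionaries standing in for the first and second storage spaces, and the use of a full-content SHA-256 digest as the equality test are all assumptions made for illustration.

```python
import hashlib


class OnlineDedupWriter:
    """Toy model of the write path (all names here are hypothetical)."""

    def __init__(self) -> None:
        self.first_space = {}   # staged writes; stands in for memory
        self.second_space = {}  # persisted files; stands in for disk
        self.known = {}         # content digest -> name of the persisted file

    def handle_write(self, name: str, data: bytes) -> str:
        """Return the name of the persisted file that holds `data`."""
        # Step 1: stage the incoming file in the first storage space.
        self.first_space[name] = data
        # Step 2: check whether an identical file is already persisted.
        # Assumption: equality is tested with a full-content digest.
        digest = hashlib.sha256(data).hexdigest()
        duplicate_of = self.known.get(digest)
        # Step 3: the staged copy leaves the first storage space either way.
        staged = self.first_space.pop(name)
        if duplicate_of is not None:
            return duplicate_of  # duplicate: nothing is written to disk
        self.second_space[name] = staged
        self.known[digest] = name
        return name
```

Under this sketch, writing the same bytes twice under two names persists only the first copy, and the second call returns the name of that first copy.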

In a possible design, the file deduplication method provided in the first aspect may be applied to a scenario in which an application program of a terminal device performs a write operation. The terminal device obtains a write request of the application program, where the write request includes the first file; in response to the write request, stores the first file in the first storage space; and determines whether a second file that is the same as the first file exists in the second storage space, where the second storage space and the first storage space are located at different layers of the storage system. Through this method, the terminal device can implement online file deduplication while the application program performs the write operation, thereby reducing storage space occupation. For the terminal device, the file deduplication process is transparent to the application, requires no cooperation from the application ecosystem of the terminal device, and does not require the user to perform complicated operations, so the system overhead is low.

In a possible design, when the second file does not exist, the first file is stored in the third storage space, and a buffer operation is performed on the first file in the third storage space; after the buffer operation is performed, the first file is stored in the second storage space. Through this method, a first storage space is newly created in the existing file cache space (third storage space) for performing file duplication checking operations, thereby realizing online file deduplication.

In a possible design, when the second file does not exist, a buffer operation is performed on the first file in the first storage space; after the buffer operation is performed, the first file is stored in the second storage space. Through this method, the first storage space is used to perform file duplication checking operations, thereby realizing online file deduplication; the first storage space is compatible with the existing file cache space, which eases engineering implementation; and buffer operations such as space allocation are simplified and deferred, which helps reduce system overhead.

In a possible design, if the second file exists, the link identifier of the first file is associated with the second file, where the link identifier of the first file is used to obtain the first file, and the first file is then deleted from the first storage space. Through this method, when a duplicate file exists, the system can delete the duplicate directly from the cache without generating an additional data copy, which helps reduce system overhead; and because the link identifier of the first file is associated with the file in the second storage space, the file can still be found.
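One familiar analogue of associating a link identifier with an existing file is a file-system hard link, where a second name refers to the same stored data without copying it. The sketch below uses Python's standard `os.link` for this purpose. Treating the link identifier as a hard link is an assumption made for illustration, and the helper name `link_duplicate` and the file names are hypothetical.

```python
import os
import tempfile


def link_duplicate(existing_path: str, new_path: str) -> None:
    """Make new_path a second name for the data stored at existing_path.

    Assumption: the link identifier is modeled as a hard link, so no
    data is copied and both names resolve to the same stored content.
    """
    os.link(existing_path, new_path)


# Usage: the name of the "first file" resolves to the "second file".
with tempfile.TemporaryDirectory() as root:
    second = os.path.join(root, "second.bin")
    with open(second, "wb") as fh:
        fh.write(b"payload")
    first = os.path.join(root, "first.bin")
    link_duplicate(second, first)
    print(os.path.samefile(first, second))
```

With this design choice, an application that opens the name it originally wrote still reads the expected content, even though only one copy of the data is stored.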

In a possible design, the second file is the same as the first file, which means that the characteristic information of the second file is the same as the characteristic information of the first file. Through this method, it can be determined whether the file in the write request is a duplicate file by using feature information comparison.

In a possible design, the feature information of the first file is determined according to the sampling data of the first file. Wherein, the sampled data is partial data obtained from the data of the first file through a sampling algorithm. Through this method, only a small amount of file data is sampled to obtain feature information, which is beneficial to reduce system overhead.

In a possible design, the feature information of the first file is determined according to the sampling data and file information of the first file. Wherein, the file information includes information such as file type and file size. Through this method, the combination of sampling data and file information can more accurately reflect the characteristic information of the file and the uniqueness of the characteristic information.
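As a rough illustration of the two designs above, the following Python sketch derives feature information from a few sampled chunks combined with the file type and size. The sampling scheme (evenly spaced fixed-size chunks), the SHA-256 digest, and the name `sampled_fingerprint` with its parameters are assumptions made for this example; the application does not specify a particular sampling algorithm.

```python
import hashlib


def sampled_fingerprint(data: bytes, file_type: str,
                        samples: int = 4, chunk: int = 4096) -> str:
    """Hash the file information plus a few sampled chunks of the data."""
    if samples < 2:
        raise ValueError("samples must be at least 2")
    h = hashlib.sha256()
    # File information (type and size) is mixed in first.
    h.update(f"{file_type}:{len(data)}:".encode())
    if len(data) <= samples * chunk:
        h.update(data)  # small file: hash all of the data
    else:
        # Evenly spaced chunks; the first starts at offset 0 and the
        # last ends at (or just before) the end of the data.
        step = (len(data) - chunk) // (samples - 1)
        for i in range(samples):
            start = i * step
            h.update(data[start:start + chunk])
    return h.hexdigest()
```

Sampling trades accuracy for speed: under this sketch, two files of the same type and size that differ only in bytes outside the sampled chunks receive the same fingerprint, so a real system would need either a collision check or a sampling scheme suited to its file types.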

In a possible design, the feature information includes fingerprint information and/or a file identifier (ID). The feature information is unique to each file.

In a possible design, the feature information of the first file is determined in response to an instruction to close the first file. Through this method, the process of determining the feature information of the file can be performed during the file closing operation after the writing operation is completed, which is beneficial to reduce system overhead.

In a possible design, the feature information of the first file is determined; according to the feature information of the first file, the index directory is searched to determine whether a third file exists in it, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file. Through this method, by searching the index directory provided by the embodiment of the present application, it can be determined whether the first file is a duplicate file, which helps remove duplicate files more effectively.

In a possible design, when the third file does not exist in the index directory, a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file. Through this method, when the file in the write request is not a duplicate file, the index directory can be updated so that it includes the files that have been written to the disk, which helps determine more accurately whether a duplicate file exists in the system.
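The index directory lookup and update described in the two designs above can be sketched with ordinary files: each index entry is a small file whose name is the feature information and whose content is a storage address. The function names `lookup`, `register`, and `check_and_index`, and the choice to store the address as the entry's text content, are assumptions for illustration only.

```python
import tempfile
from pathlib import Path
from typing import Optional


def lookup(index_dir: Path, feature: str) -> Optional[str]:
    """Return the storage address recorded for `feature`, if any."""
    entry = index_dir / feature  # the entry's file name is the feature info
    return entry.read_text() if entry.is_file() else None


def register(index_dir: Path, feature: str, address: str) -> None:
    """Add an index entry named `feature` that records `address`."""
    (index_dir / feature).write_text(address)


def check_and_index(index_dir: Path, feature: str,
                    address: str) -> Optional[str]:
    """Return the address of an existing duplicate, or index a new file."""
    existing = lookup(index_dir, feature)
    if existing is None:
        register(index_dir, feature, address)
    return existing


# Usage: the second write with the same feature info finds the first.
with tempfile.TemporaryDirectory() as root:
    index = Path(root)
    print(check_and_index(index, "ab12", "/data/a.jpg"))
    print(check_and_index(index, "ab12", "/data/b.jpg"))
```

Because the lookup is a single path resolution in the file system, this sketch needs no separate in-memory table; it does assume that the feature information is a valid file name, such as a hexadecimal digest.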

In a possible design, prompt information is generated, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files. Through this method, the performance of file deduplication can be explicitly displayed to the user, thereby enhancing the user experience.

In a possible design, a record log is generated, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the identifier of the first file, data in the first storage space, the storage capacity freed by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files. Through this method, a debugging application programming interface (API) or a debugging log can be provided externally, which helps users debug the system.
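The statistics named in the prompt information and the record log (number of duplicates removed, storage freed, and file types) could be accumulated in a small record such as the one below. The class name `DedupReport`, its fields, and the wording of the prompt string are hypothetical; the application does not define a concrete format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DedupReport:
    """Running totals for removed duplicates (hypothetical structure)."""
    removed: int = 0
    bytes_freed: int = 0
    by_type: Counter = field(default_factory=Counter)

    def record(self, file_type: str, size: int) -> None:
        """Account for one removed duplicate of the given type and size."""
        self.removed += 1
        self.bytes_freed += size
        self.by_type[file_type] += 1

    def prompt(self) -> str:
        """Build a one-line summary that could be shown to the user."""
        types = ", ".join(f"{t} x{n}" for t, n in sorted(self.by_type.items()))
        return (f"Removed {self.removed} duplicate file(s), "
                f"freed {self.bytes_freed} bytes ({types})")
```

Calling `record` each time a duplicate is dropped on the write path keeps the totals current, so the same object can feed both a user-facing prompt and a debugging log.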

In a possible design, an instruction is obtained, which indicates enabling the file deduplication function; in response to the instruction, an operation of obtaining a write request is performed. Through this method, a file deduplication function switch can be provided to the user, and the user only needs to turn on the switch to realize automatic file deduplication, and the user does not need to participate in the file deduplication process, which optimizes user experience.

In a possible design, the overall process of implementing the file deduplication method in the first aspect may be embedded in the main flow of the file access process. With this method, there is no need to create an independent file deduplication thread; the process is instead embedded in an existing thread, which helps reduce overhead.

In a second aspect, an embodiment of the present application provides a file search method, which is implemented by a terminal device or a device deployed on a cloud. The terminal device or the device deployed on the cloud obtains the first file and determines the feature information of the first file; according to the feature information of the first file, it determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, the third file is associated with the storage address of the second file, and the second file is stored in the second storage space. In this method, the index directory is constructed in the form of files: the third file in the index directory corresponds to the second file stored in the second storage space, the feature information of the second file is used as the file name of the third file, and the third file is associated with the storage address of the second file; for example, the storage address of the second file may be stored in the third file. An index directory stored in the form of files requires little storage space, which greatly reduces storage overhead; and searching the index directory in this way is faster than in the prior art, which can greatly improve system performance.

In a possible design, the characteristic information of the first file is determined according to the sampling data of the first file; wherein, the sampling data is partial data obtained from the data of the first file through a sampling algorithm. Through this method, only a small amount of file data is sampled to obtain feature information, which is beneficial to reduce system overhead.

In a possible design, when the third file does not exist in the index directory, the first file is stored in the second storage space, and a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file. Through this method, when the file in the write request is not a duplicate file, the index directory can be updated so that it includes the files that have been written to the disk, which helps determine more accurately whether a duplicate file exists in the system.

In a possible design, when the third file exists in the index directory, the link identifier of the first file is associated with the storage address of the second file, and the link identifier of the first file is used to obtain the first file. Through this method, when the first file is a duplicate file and the duplicate file is deleted, if you need to access the corresponding file, you can access the storage address of the second file associated with the link identifier of the first file, thereby maintaining the normal file access.

In a possible design, the overall process of executing the file search method of the second aspect may be embedded in the main flow of the file access process. With this method, there is no need to create an independent file deduplication thread; the process is instead embedded in an existing thread, which helps reduce overhead.

In a third aspect, an embodiment of the present application provides a file deduplication apparatus, which includes a file operation module, a file cache module, and an information processing module. The file operation module is used to obtain a write request that includes the first file; the file cache module is used to store the first file in the first storage space in response to the write request; and the information processing module is used to determine whether a second file that is the same as the first file exists in the second storage space, where the second storage space and the first storage space are located at different layers of the storage system.

In a possible design, the file cache module is further configured to, when the second file does not exist, store the first file in the third storage space and perform a buffer operation on the first file in the third storage space; and, after the buffer operation is completed, store the first file in the second storage space.

In a possible design, the file cache module is further configured to, when the second file does not exist, perform a buffer operation on the first file in the first storage space; and, after performing the buffer operation, store the first file in the second storage space.

In a possible design, the information processing module is further configured to associate the link identifier of the first file with the second file when the second file exists, and the link identifier of the first file is used to obtain the first file; The file caching module is also used to delete the first file from the first storage space.

In a possible design, the feature information includes fingerprint information and/or a file ID. The feature information is unique to each file.

In a possible design, the information processing module is further configured to determine the characteristic information of the first file according to the sampled data of the first file, where the sampled data is part of the data obtained from the data of the first file through a sampling algorithm.

In a possible design, the information processing module is further configured to determine the feature information of the first file and, according to the feature information of the first file, determine whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.

In a possible design, the file deduplication apparatus further includes a prompt module, which is used to generate prompt information, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.

In a possible design, the file deduplication apparatus further includes a generation module, which is used to generate a record log, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the identifier of the first file, data in the first storage space, the storage capacity freed by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.

In a possible design, the device for deduplication of files further includes an execution module, and the execution module is configured to acquire an instruction indicating to enable the function of deduplication of files; in response to the instruction, perform an operation of acquiring a write request.

The module for implementing the file deduplication method provided in the above third aspect and any possible design thereof can also realize the beneficial effects of the file deduplication method provided in the first aspect.

In a fourth aspect, an embodiment of the present application provides a file search apparatus, which includes a file operation module and an information processing module. The file operation module is used to obtain the first file, and the information processing module is used to determine the feature information of the first file; the file operation module is further used to determine, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.

In a possible design, the information processing module is used to determine the feature information of the first file, including:

According to the sampling data of the first file, the feature information of the first file is determined; the sampling data is part of the data obtained from the data of the first file through a sampling algorithm.

In a possible design, the file search apparatus further includes a file cache module, which is used to, when the third file does not exist in the index directory, store the first file in the second storage space and add a fourth file to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.

In a possible design, the information processing module is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, where the link identifier of the first file is used to obtain the first file; the file cache module is further used to delete the first file from the first storage space.

The module for implementing the file search method provided in the above fourth aspect and any possible design thereof can also realize the beneficial effects of the file search method provided in the second aspect.

In a fifth aspect, the embodiment of the present application provides a device, and the device may be a terminal device or a device deployed on a cloud. Wherein, the device includes one or more processors and memory; the memory is coupled with one or more processors, and the memory stores a computer program, and when the one or more processors execute the computer program, the device performs the following operations:

Obtain a write request, where the write request includes the first file;

In response to the write request, store the first file, and the first file is stored in the first storage space;

It is determined whether a second file exists in the second storage space, the second file is the same as the first file, and the first storage space and the second storage space are located at different layers of the storage system.

For descriptions of the first storage space, the second storage space, the sampled data of the first file, the feature information of the first file, the association of the link identifier of the first file with the second file, the generation of prompt information, and the generation of record logs, refer to the corresponding descriptions in the first aspect; details are not repeated here.

In a sixth aspect, the embodiment of the present application provides a device, and the device may be a terminal device or a device deployed on a cloud. Wherein, the device includes one or more processors and memory; the memory is coupled with one or more processors, and the memory stores a computer program, and when the one or more processors execute the computer program, the device performs the following operations:

Obtain the first file, and determine the feature information of the first file;

According to the feature information of the first file, determine whether there is a third file in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.

For the introduction of the characteristic information of the first file, the third file, the sampling data of the first file, the link identifier of the first file associated with the second file, etc., please refer to the corresponding description in the second aspect, and details will not be repeated here.

In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the method described in the first aspect or the second aspect, or in any possible implementation thereof, is implemented.

In an eighth aspect, an embodiment of the present application provides a chip system, which includes a processor and may also include a memory, and which is used to implement the functions of the terminal device or the device deployed on the cloud in the method described in the first aspect or the second aspect. The chip system may consist of chips, or may include chips and other discrete devices.

In a ninth aspect, an embodiment of the present application provides a computer program product including instructions which, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect, or in any possible implementation thereof.

Brief description of the drawings

FIG. 1a is a schematic flow diagram of a user manually performing a file deduplication function;

FIG. 1b is a schematic diagram of a file abnormality after the user manually performs the file deduplication function;

FIG. 2 is a schematic diagram of a hardware structure of a terminal device provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a software structure of a terminal device provided in an embodiment of the present application;

FIG. 4a is a modular flow chart for implementing a file deduplication method provided by an embodiment of the present application;

FIG. 4b is another modular flow chart for implementing a file deduplication method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of an index directory provided by the embodiment of the present application;

FIG. 6 is a schematic flow diagram of implementing a file deduplication function for an application program in an Android system terminal provided by an embodiment of the present application;

FIG. 7a is a schematic diagram of a process for performing a write operation in the first storage space provided by an embodiment of the present application;

FIG. 7b is a schematic diagram of another process for performing a write operation in the first storage space provided by the embodiment of the present application;

FIG. 8 is a schematic diagram of determining feature information based on sampled data provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of associating a link identifier of a file with the same file provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of a link correspondence relationship provided by the embodiment of the present application;

FIG. 11 is a schematic diagram of an output file access authorization interface provided by the embodiment of the present application;

FIG. 12 is a schematic diagram of an external device calling a file deduplication function provided by an embodiment of the present application;

FIG. 13 is a schematic flowchart of a file deduplication method provided in the embodiment of the present application;

FIG. 14 is a schematic flowchart of a file search method provided in the embodiment of the present application;

FIG. 15 is a schematic diagram of a device provided by an embodiment of the present application;

FIG. 16 is a schematic diagram of a file deduplication device provided in the embodiment of the present application;

FIG. 17 is a schematic diagram of a file search device provided by an embodiment of the present application.

Detailed description

In the embodiments of this application, "/" may indicate an "or" relationship between the associated objects before and after it; for example, A/B may indicate A or B. "And/or" describes three possible relationships between associated objects; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. To facilitate the description of the technical solutions, words such as "first" and "second" may be used to distinguish technical features with the same or similar functions. The words "first" and "second" do not limit quantity or execution order, and do not necessarily mean that the features are different. Words such as "exemplary" or "for example" are used to present examples, illustrations, or explanations; any embodiment or design described as "exemplary" or "for example" should not be interpreted as being more preferred or more advantageous than other embodiments or designs. The use of such words is intended to present related concepts in a specific manner for ease of understanding.

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The storage space of terminal devices is consumed quickly, and insufficient storage space is one of the key reasons users replace their phones. With the widespread use of the mobile Internet and smart terminals, more and more duplicate files are generated during social interaction, taking up a lot of space. For example, according to some survey data, even among users who have the habit of cleaning up files, more than a quarter have duplicate files totaling more than 2 gigabytes (GB), with some reaching 16.49 GB or even more.

Therefore, to reduce the storage space occupied by duplicate files, on the one hand, some applications for file deduplication currently exist (such as various mobile phone cleaning tools). A mobile phone cleaning tool provides a user entry; after the user starts it manually, the tool scans the terminal device to identify duplicate files, obtains the scan results, and provides them to the user. The user then manually confirms and deletes the duplicate files one by one. For example, FIG. 1a shows the process when a user manually performs a file deduplication function. The display interface of the terminal device shows information such as the storage space currently occupied by the system, junk files, and duplicate files. The user can manually choose to clean up duplicate files, and the display interface then shows the duplicate files and their sources, as shown in FIG. 1a. However, this approach takes a long time to scan and requires the user to select and remove duplicate files one by one, which is also time-consuming; and since each file may correspond to an interaction window of social software, directly deleting duplicate files may cause the interaction window to display abnormally or make the dialog unavailable. For example, FIG. 1b shows a file abnormality after the user manually performs the file deduplication function. Because the user deleted the duplicate files directly when cleaning them up, when the user opens the social software interaction window again to look for a picture, the interaction window cannot display the original picture normally.

On the other hand, there is currently a solution that realizes file deduplication by providing an application programming interface (API). For example, the Apple File System (APFS) has a copy-on-write feature. If a user copies a file stored on APFS to another folder on the same APFS file system, APFS creates a new file marked "copy-on-write" that points to all of the original file's storage. However, in this scheme APFS does not try to determine whether an existing file, or a file copied from an external source, matches any file already on the file system. In addition, the solution requires an API to be provided and requires applications in the ecosystem to be modified to use it, which greatly limits the applicable scenarios.

Therefore, how to effectively remove duplicate files without the user and the application being aware becomes a problem to be solved.

To solve the above problem, an embodiment of the present application provides a file deduplication method that can effectively remove duplicate files and reduce storage space occupation. When the file deduplication method is applied to a terminal device, it is transparent to the application and does not require the user to perform complex operations, which reduces the processing overhead of the system.

The file deduplication method provided in the embodiment of the present application can be applied to a terminal device or deployed in a device on the cloud. Optionally, the method may also be applied to a scenario in which a terminal device controls the deduplication of files on the cloud. The exemplary terminal device provided in the following embodiments of the present application is introduced first.

FIG. 2 shows a schematic structural diagram of a terminal device 100. The terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.

It can be understood that the structure shown in the embodiment of the present application does not constitute a specific limitation on the terminal device 100. In other embodiments of the present application, the terminal device 100 may include more or fewer components than shown in the figure, combine certain components, separate certain components, or arrange the components differently. The illustrated components can be realized in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units, for example: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processing unit (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural network processor (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.

The controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.

In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, and the like.

The MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 110 communicates with the camera 193 through the CSI interface to realize the shooting function of the terminal device 100. The processor 110 communicates with the display screen 194 through the DSI interface to realize the display function of the terminal device 100.

The GPIO interface can be configured by software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface can be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and so on. The GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.

The USB interface 130 is an interface conforming to the USB standard specification, specifically, it can be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like. The USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones and play audio through them. This interface can also be used to connect other terminal devices, such as AR devices.

It can be understood that the interface connection relationships between the modules shown in the embodiment of the present application are only schematic and do not constitute a structural limitation on the terminal device 100. In other embodiments of the present application, the terminal device 100 may also adopt interface connection modes different from those in the foregoing embodiment, or a combination of multiple interface connection modes.

The terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving music, video, and other files in the external memory card.

The internal memory 121 may be used to store computer-executable program codes including instructions. The internal memory 121 may include an area for storing programs and an area for storing data. Wherein, the stored program area can store an operating system, at least one application program required by a function (such as a sound playing function, an image playing function, etc.) and the like. The storage data area can store data created during the use of the terminal device 100 (such as audio data, phonebook, etc.) and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (universal flash storage, UFS) and the like. The processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.

Based on the schematic diagram of the hardware structure of the terminal device 100 shown in FIG. 2, the software structure block diagram of the terminal device 100 in the embodiment of the present application is introduced below, as shown in FIG. 3.

The software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture. In this embodiment of the present application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the terminal device 100.

The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate through software interfaces. In some embodiments, the Android system is divided into four layers, which are respectively the application program layer, the application program framework layer, the Android runtime (Android runtime) and the system library, and the kernel layer from top to bottom.

The application layer can consist of a series of application packages.

As shown in Figure 3, the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, short message and multi-screen agent.

The application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions.

As shown in Figure 3, the application framework layer can include window managers, content providers, view systems, phone managers, resource managers, notification managers and multi-screen frameworks, etc.

A window manager is used to manage window programs. The window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.

Content providers are used to store and retrieve data and make it accessible to applications. Said data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebook, etc.

The view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. The view system can be used to build applications. A display interface can consist of one or more views. For example, a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.

The phone manager is used to provide the communication function of the terminal device 100 . For example, the management of call status (including connected, hung up, etc.).

The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.

The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify the user that a download is complete or to deliver message reminders. The notification manager can also present a notification in the top status bar of the system in the form of a chart or scroll-bar text, such as a notification of an application running in the background, or present a notification on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt sound is played, the terminal device vibrates, or the indicator light flashes.

The multi-screen framework is used to notify the "multi-screen agent" of the application layer of each event in which the terminal device 100 establishes a connection with a large-screen device, and can also, in response to instructions from the "multi-screen agent" of the application layer, assist the "multi-screen agent" in obtaining data information.

The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.

The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.

The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.

A system library can include multiple function modules. For example: surface manager (surface manager), media library (media libraries), 3D graphics processing library, 2D graphics engine, etc.

The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.

The media library supports playback and recording of various commonly used audio and video formats, as well as still image files, etc. The media library can support multiple audio and video encoding formats.

The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing, etc.

2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.

FIG. 4a is a modular flow chart of a method for implementing file deduplication provided by an embodiment of the present application. FIG. 4a is described by taking the internal modular flow of the terminal device as an example. It can be understood that when the file deduplication method provided by the embodiment of the present application is applied to the cloud, or to an interaction scenario between the terminal and the cloud, a modular flow similar to FIG. 4a also applies. The existing file access process in the terminal device is as follows: when an application initiates a file access request, the system directly writes the file in the request to the file cache in the VFS through a write operation (write), and then writes the file to the file system. Further, the file can also be written to the driver and the flash memory (flash). That is to say, in the existing file access process, files are written directly into the memory space and the external storage space through the write operation, so duplicate files cannot be identified and online file deduplication cannot be performed. The modular flow for implementing the file deduplication method shown in FIG. 4a mainly includes a file operation module, a file cache module, an information processing module, a file index module, and the VFS. Unlike the existing file access process, the file cache module shown in FIG. 4a is a new cache module in the memory space, which is used to intercept the write operation of the system and cache the file in the write operation. Together with the information processing module and the file index module, it calculates feature information for the cached file, determines from the feature information whether the file is a duplicate, and deduplicates duplicate files online.

In the modular flow shown in FIG. 4a, after the file cache module, the information processing module, and the file index module perform the above operations, non-duplicate files continue to be written into the VFS and then into the file system, block device layer, driver, and flash memory, completing the file access process. Adopting the file deduplication flow shown in FIG. 4a requires adding a cache space to the existing memory space to realize online file deduplication. It should be noted that the file cache module shown in FIG. 4a is mainly used to perform file comparison and file deduplication; the buffer operations in the file access process (such as setting flag bits, write checks, and space allocation) are still performed by the file cache of the VFS.

FIG. 4b is a modular flow chart of another method for implementing file deduplication provided by an embodiment of the present application. FIG. 4b is described by taking the internal modular flow of the terminal device as an example. It can be understood that when the file deduplication method provided by the embodiment of the present application is applied to the cloud, or to an interaction scenario between the terminal and the cloud, a modular flow similar to FIG. 4b also applies. Unlike the existing file access process, the file cache module shown in FIG. 4b enhances the original file cache, for example by adding functions such as calculating feature information for cached files, file comparison, and file deduplication, so as to realize online file deduplication. The buffer operations in the file access process (such as setting flag bits, write checks, and space allocation) are also performed by the file cache module shown in FIG. 4b, but they are executed later than in the existing file access process. That is to say, the file cache in the VFS shown in FIG. 4b no longer performs write-related operations (for example, it performs no buffer operations).

To sum up, in the modular flow shown in FIG. 4a or FIG. 4b, the file deduplication method provided by the embodiment of the present application can be embedded in the existing file access process and does not require an independent background thread, which helps reduce system write overhead. Moreover, the embodiment of the present application creates a new file cache module to realize online file deduplication.

For ease of understanding, related terms involved in the embodiments of the present application are introduced below.

1. File operation module: used to intercept the file access request of the application, call the file cache module to cache data, call the information processing module to identify duplicate files, and combine the file cache module and information processing module to remove duplicate files or save non-duplicate files.

2. File cache module: used to build an independent self-built file cache space, and cache intercepted files through the self-built file cache space. For example, use the method shown in Figure 4a to create a new cache space in the existing memory space, cache and store the intercepted file data; or use the self-built file cache space to replace the files in the VFS file cache in the way shown in Figure 4b Cache, used to store intercepted file data.

3. Information processing module: used to obtain file data from the file cache module and calculate feature information of the file, and also to initiate a feature information retrieval request or a request for adding feature information to the file index module.

4. File indexing module: used to construct and maintain the index directory, and retrieve target characteristic information in the index directory. Wherein, the index directory can be regarded as a kind of database, and the index directory does not occupy memory.

5. File directory: used to record files stored in the file system. The directory items in the file directory include but are not limited to the file name, the link identifier of the file, the number of repetitions of the file, and the like.

6. File feature information: information used to uniquely identify each file. The feature information of a file may include, but is not limited to, a fingerprint, a file ID, and so on. For example, for two files (file 1 and file 2), when the contents of file 1 and file 2 are different, fingerprint 1 of file 1 and fingerprint 2 of file 2 are different; that is, fingerprint 1 identifies file 1 and fingerprint 2 identifies file 2. Optionally, when file 1 and file 2 have the same content (including, but not limited to, the case where they have the same content and the same file name, and the case where they have the same content but different file names), file 1 and file 2 have the same fingerprint (for example, both have fingerprint 1).

7. Index directory: a data access mode in which a directory is created in the system to serve as an index. For example, the index directory in this embodiment of the present application may be an index table of feature information. The index directory is constructed and maintained by the file index module, in an indexing manner based on the file directory. The index directory includes one or more feature information indexes, for example, multiple fingerprint indexes. Each fingerprint index corresponds to a file in the index directory whose file name is the fingerprint and whose link identifier (inode) indicates the inode of the file corresponding to the fingerprint. For example, FIG. 5 is a schematic diagram of an index directory provided by an embodiment of the present application. The system includes file A, file B, and file C; the link identifier of file A is inode1, that of file B is inode2, and that of file C is inode3. When the index directory is built, for file A, the feature information of file A is calculated first (that is, the fingerprint of file A is calculated), generating fingerprint A1, which points to the link identifier inode1 of file A; a fingerprint index is then generated in the index directory: fingerprint A1-inode1. Similarly, for file B, file C, and other files, the other fingerprint indexes in the index directory are generated: fingerprint B2-inode2, fingerprint C3-inode3, and so on, as shown in FIG. 5. Because the file fingerprint is associated with the link identifier of the file, the location of the file can be obtained directly through the link identifier when the index directory is searched, which enables more efficient file lookup.
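The fingerprint-to-inode association of FIG. 5 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the class `IndexDirectory`, the function `fingerprint`, and the sample file contents are hypothetical, an in-memory dictionary stands in for the directory of fingerprint-named files, and a full-content SHA-256 digest stands in for the fingerprint.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(data: bytes) -> str:
    """Stand-in fingerprint: SHA-256 of the whole content (an assumption)."""
    return hashlib.sha256(data).hexdigest()


class IndexDirectory:
    """Maps a fingerprint (the entry's file name) to the inode of the stored file."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def add(self, fp: str, inode: str) -> None:
        # Keep the first inode recorded for a fingerprint.
        self._entries.setdefault(fp, inode)

    def lookup(self, fp: str) -> Optional[str]:
        # Returns the inode for a known fingerprint, or None if there is no entry.
        return self._entries.get(fp)


# Hypothetical contents standing in for file A, file B and file C of FIG. 5.
stored_files = {
    "inode1": b"contents of file A",
    "inode2": b"contents of file B",
    "inode3": b"contents of file C",
}

index = IndexDirectory()
for inode, data in stored_files.items():
    index.add(fingerprint(data), inode)
```

Looking up the fingerprint of content identical to file A returns `inode1` directly, without scanning the stored files, which is the lookup benefit described above.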

In the following, an application embodiment of the file access method applied to a terminal device of the Android system will be described in detail by taking the Android system as an example with reference to FIG. 4a and FIG. 4b.

FIG. 6 is a schematic flow diagram of implementing the file deduplication function for an application in a terminal device running the Android system, according to an embodiment of the present application. In this scenario, when an application in the terminal device requests to write a file, the terminal device can execute the file deduplication method during the file writing process. The process is implemented through interaction among the file operation module, the information processing module, the file cache module, and the file index module, and includes the following steps:

1. When an application program requests to write a file, the file operation module obtains the write request, and the write request includes the first file. The file operation module calls the file cache module to store the first file in the first storage space.

In one implementation, in the modular flow shown in FIG. 4a, when the file operation module detects a write request from the application, it can intercept the write request and cache the first file in the write request in the newly added file cache module (the first storage space). In the file cache module, operations such as calculating feature information, comparing duplicate files, and removing duplicate files are performed, as shown in FIG. 7a. After the file cache module completes the file deduplication operation, it uses the standard write system call to cache the file in the write request in the VFS (the third storage space), and the buffer operations continue in the VFS. The buffer operations in FIG. 7a refer to the write request operations that are not performed in the file cache module, including but not limited to setting flag bits, write checks and space allocation, and data write-back. They are the same as the buffer operations in an existing write request. For example, a file is divided into multiple pages, and setting flag bits, write checks and space allocation, data write-back, and similar operations are performed on each page. After these buffer operations have been performed on the multiple pages of the same file, the file is written to the disk, and the system releases the memory occupied by the file. It can be seen that the flow shown in FIG. 7a uses two caches in series and embeds the interception cache, calculation, and deduplication functions in the existing caching process. Based on the feature information of the file, duplicate files are deduplicated: they are no longer written to the system and are discarded directly from memory, while non-duplicate files continue to be written to the system.

In another implementation, in the modular flow shown in FIG. 4b, when the file operation module detects a write request from the application, the file operation module uses a custom system call, the caching function, and first builds a self-built file cache (the first storage space). Through the file cache module, based on the copy_from_user function, the intercepted first file is cached in the self-built file cache at one time, as shown in FIG. 7b. One-time caching means caching all the pages of the same file in the self-built file cache at once instead of caching the pages one by one. In the one-time caching implementation, the buffer operations are deferred and simplified. For example, for M pages, the buffer operations include setting flag bits M times, performing the write check and space allocation once, and writing data back N times. The feature information of the cached file can be calculated in the file cache module shown in FIG. 7b to determine whether the cached file is a duplicate. If it is a duplicate file, it is discarded from memory; if it is not, it continues to be written to the system. It can be seen that, in the flow shown in FIG. 7b, an independent self-built file cache is built to cache the file data, and the file feature information is calculated while the data is written down to the cache at one time, so that the entire deduplication operation involves only one data copy. At the same time, the buffer operations are deferred and optimized, and duplicate data is eventually discarded from memory without generating any external storage write, so that low-overhead file deduplication is completed within the file access path.

Optionally, in the implementation shown in FIG. 7b, operations such as the buffer operations, feature information calculation, and duplicate file removal may be performed during the close operation. The close operation is a file operation performed after the write operation. When the write operation (such as writing the file into the self-built file cache) is completed, the system can perform the close operation. Performing the buffer operations, feature information calculation, duplicate file removal, and other operations shown in FIG. 7b during the close operation helps reduce the overhead of system write operations.
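The idea of deferring fingerprint calculation and duplicate removal to the close operation can be sketched in user-space Python as follows. This is only an illustration under stated assumptions: the class `DeferredDedupWriter` and the two dictionaries are hypothetical, a `bytearray` stands in for the self-built file cache, and a SHA-256 digest stands in for the feature information; the kernel-side buffer operations of FIG. 7b are not modeled.

```python
import hashlib
from typing import Dict


class DeferredDedupWriter:
    """Buffers writes in memory and decides on deduplication only at close()."""

    def __init__(self, name: str, blobs: Dict[str, bytes], names: Dict[str, str]) -> None:
        self._name = name
        self._blobs = blobs    # fingerprint -> content (stands in for stored files)
        self._names = names    # file name -> fingerprint (stands in for the file directory)
        self._cache = bytearray()  # stands in for the self-built file cache

    def write(self, chunk: bytes) -> int:
        # The write path only copies data into the cache; no fingerprint work here.
        self._cache += chunk
        return len(chunk)

    def close(self) -> bool:
        """Returns True if new content was stored, False if it was a duplicate."""
        data = bytes(self._cache)
        self._cache.clear()
        fp = hashlib.sha256(data).hexdigest()
        is_new = fp not in self._blobs
        if is_new:
            self._blobs[fp] = data  # non-duplicate: continue writing downstream
        # Duplicate or not, the name is associated with the stored content.
        self._names[self._name] = fp
        return is_new
```

In this sketch, each `write` call is a plain memory copy, and the cost of identifying a duplicate is paid once per file at `close`, which mirrors the motivation given above for moving the work out of the write operation.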

2. The information processing module determines the feature information of the first file through a sampling algorithm. Specifically, the information processing module uses a sampling hash algorithm to obtain sampled data of the first file and determines the feature information of the first file from the sampled data. Because the information processing module only needs to sample a small amount of file data to obtain the feature information, system overhead is reduced. Optionally, the information processing module may also determine the feature information of the first file from both the sampled data of the first file and the file information of the first file. The feature information may include, but is not limited to, fingerprint information, a file ID, and so on, and the file information may include, but is not limited to, the file type, the file size, and so on. It can be understood that feature information calculated from both the sampled data and the file information of the first file better reflects the uniqueness of the first file.

For example, FIG. 8 is a schematic diagram of sampling and calculating feature information provided by an embodiment of the present application. The first storage space can be regarded as data organized in a tree structure, in which files are stored in pages. The information processing module can obtain the sampled data of a file through the sampling hash algorithm. For example, partial data sampled from page1, page3, and page5 respectively yield the head-segment cyclic redundancy check (CRC), the middle-segment CRC, and the tail-segment CRC of the sampled data, as shown in FIG. 8. These are combined with file information (for example, file type and file size) to determine the feature information, which is also called the fingerprint (FP) of the file. Through sampling calculation, the information processing module keeps the overhead of calculating feature information basically stable, thereby reducing the impact of sampling and calculating feature information on the write performance of the storage system.
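A sampled fingerprint of this kind might be sketched as follows. The sketch assumes a 4096-byte page, a 64-byte sample per page, CRC-32 from Python's `zlib` module, and a simple string layout for the result; none of these values is specified by the embodiment, and the name `sampled_fingerprint` is hypothetical.

```python
import zlib

PAGE_SIZE = 4096    # assumed page size
SAMPLE_BYTES = 64   # assumed number of bytes sampled from each chosen page


def sampled_fingerprint(data: bytes, file_type: str) -> str:
    """Combines head, middle and tail CRCs with the file type and file size."""
    # Split the file into pages; an empty file is treated as one empty page.
    pages = [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)] or [b""]
    # Pick the first, middle and last page (page1, page3, page5 for a 5-page file).
    picks = (pages[0], pages[len(pages) // 2], pages[-1])
    head_crc, mid_crc, tail_crc = (zlib.crc32(p[:SAMPLE_BYTES]) for p in picks)
    return "{}-{}-{:08x}{:08x}{:08x}".format(
        file_type, len(data), head_crc, mid_crc, tail_crc
    )
```

The work done per file is bounded by three small CRC computations regardless of file size, which corresponds to the stable overhead described above. Because only the sampled bytes are hashed, the sketch by itself cannot distinguish files that differ only in unsampled regions.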

3. The information processing module determines, according to the first file, whether a second file that is the same as the first file exists in the second storage space. In one implementation, the determination is made as follows: the information processing module determines the feature information of the first file and, according to that feature information, determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file and the third file is associated with the storage address of the second file in the second storage space. If a second file whose feature information is the same as that of the first file exists in the second storage space, the second file is the same as the first file, and the first file is a duplicate file. It should be noted that feature information is unique: when the feature information of the first file is the same as that of the second file, it can be determined that the first file and the second file are the same file.

4. If the second file exists, the file operation module associates the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file. That is, when the first file is a duplicate file, its link identifier is associated with the second file, so that a lookup of the first file returns the second file, which is identical to the first file. After the link identifier of the first file is associated with the second file, even if the first file is deleted, the same file (that is, the second file) can still be found through the link identifier of the first file, thereby ensuring that the file's access path remains correct.

For example, FIG. 9 is a schematic diagram of an operation process for duplicate files provided by an embodiment of the present application. The left part of FIG. 9 is a file access list, which shows the files included in write requests and the link identifiers of those files. The file access list has two columns: the first column is the file name, and the second column is the link identifier (inode) of the file, which is used to obtain the file. The right part of FIG. 9 shows some directory entries of the file directory (including the link identifier of each file and the number of times the file has been written). It can be understood that the file directory is stored in the second storage space. For example, a write request includes file A, whose link identifier is inode1. The terminal device stores file A in the first storage space and determines whether a second file that is the same as file A exists in the second storage space. For example, the information processing module determines, according to the feature information of file A, whether the second storage space contains a second file whose feature information is the same as that of file A. If no such second file exists, file A is not a duplicate file, and file A is written into the file directory. Since file A is written for the first time, its write count is 1.

A subsequent write request includes file D, and the link identifier of file D is inode1. The terminal device stores file D in the first storage space and determines whether a second file that is the same as file D exists in the second storage space. For example, the information processing module determines, according to the feature information of file D, whether the second storage space contains a second file whose feature information is the same as that of file D. If the feature information of file A and file D is the same, file D is the same as file A, and file D is a duplicate file. In this case, the file operation module associates the link identifier of file D with the link identifier of file A. For example, inode1 of file D points to the existing inode1. The write count corresponding to inode1 is then updated to 2, as shown in the second row and second column of the table on the right side of FIG. 9.
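The bookkeeping in FIG. 9 can be modelled with two small tables. The sketch below is only a toy model under assumptions: the class name, the `inodeN` naming scheme and the use of a plain string as feature information are hypothetical and are not taken from the embodiment.

```python
class FileDirectory:
    """Toy model of the file access list and directory entries in FIG. 9."""

    def __init__(self):
        self.access_list = {}    # file name -> link identifier (inode)
        self.write_count = {}    # link identifier -> number of times written
        self.by_feature = {}     # feature information -> link identifier
        self._next_inode = 1

    def write(self, name: str, feature: str) -> bool:
        """Record a write request; return True if the file is a duplicate."""
        inode = self.by_feature.get(feature)
        duplicate = inode is not None
        if duplicate:
            # Same feature information: reuse the stored inode, bump the count.
            self.write_count[inode] += 1
        else:
            inode = "inode{}".format(self._next_inode)
            self._next_inode += 1
            self.by_feature[feature] = inode
            self.write_count[inode] = 1
        self.access_list[name] = inode
        return duplicate
```

Writing file A and then file D with the same feature information leaves both names pointing at `inode1`, with a write count of 2 for that identifier, which mirrors the state described for FIG. 9.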

With this method, a substantive write operation does not need to be repeated. It is only necessary to associate the link identifier of the duplicate file with the identical stored file through a hard link, so that the stored file can be obtained through the link identifier in subsequent calls. For example, FIG. 10 shows the link correspondence after file deduplication. The repetition count of inode1 is 2, which means that the identical files are all linked to inode1, and the file system needs to store the file only once. In this case, the data of the duplicate file is eventually discarded from memory and no write to external storage is generated, so that low-overhead file deduplication is completed on the file access path. Moreover, the link correspondence shown in FIG. 10 still includes file D, so the deduplication is imperceptible to upper-layer applications. No additional data copy is made in the system, and the process does not compete with other processes for computing resources, which helps reduce file writing overhead. In addition, the deduplication process is completed on the input/output (I/O) path and does not require background threads or services to run offline.
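The effect of a hard link can be observed from user space. The following sketch is an illustration only: the embodiment performs the association inside the file system, whereas this example uses the standard `os.link` call on a temporary directory and assumes a file system that supports hard links. The file names A and D follow FIG. 9; the helper name is hypothetical.

```python
import os
import tempfile


def link_duplicate(stored_path: str, duplicate_path: str) -> None:
    """Make duplicate_path a hard link to the already stored file."""
    os.link(stored_path, duplicate_path)


with tempfile.TemporaryDirectory() as root:
    file_a = os.path.join(root, "A")
    file_d = os.path.join(root, "D")
    with open(file_a, "wb") as f:
        f.write(b"same content")

    link_duplicate(file_a, file_d)   # no second copy of the data is written

    same_inode = os.stat(file_a).st_ino == os.stat(file_d).st_ino
    link_count = os.stat(file_a).st_nlink

    os.remove(file_a)                # removing the name A leaves D readable
    with open(file_d, "rb") as f:
        content_after_delete = f.read()
```

Both names resolve to the same inode, and the data remains reachable through D after A is removed, which is the property the embodiment relies on to keep deduplication imperceptible to applications.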

In an implementation manner, the operations of the file index module on the index directory may include, but are not limited to, creating fingerprints, inserting fingerprints, retrieving fingerprints and deleting fingerprints. For example, when the index directory is created, a file is created in the index directory according to the feature information of a file, and the name of the created file is the fingerprint. For another example, for a non-duplicate file, a file is inserted into the index directory according to the feature information of the non-duplicate file, and the name of the inserted file is the fingerprint of the non-duplicate file.
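An index directory whose entries are named by fingerprint can be sketched with ordinary files. This is a minimal sketch under assumptions: each entry is modelled as a small file whose content is the storage address, the class and method names are hypothetical, the fingerprint is assumed to be a valid file name, and `/data/files/A` is a made-up address.

```python
import os
import tempfile
from typing import Optional


class IndexDirectory:
    """Directory in which each entry's file name is a fingerprint."""

    def __init__(self, path: str):
        self.path = path
        os.makedirs(path, exist_ok=True)   # create the index directory

    def _entry(self, fp: str) -> str:
        return os.path.join(self.path, fp)

    def insert(self, fp: str, storage_address: str) -> None:
        """Insert a fingerprint and associate it with a storage address."""
        with open(self._entry(fp), "w", encoding="utf-8") as f:
            f.write(storage_address)

    def retrieve(self, fp: str) -> Optional[str]:
        """Return the associated storage address, or None if absent."""
        try:
            with open(self._entry(fp), encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def delete(self, fp: str) -> None:
        """Delete a fingerprint; a missing entry is ignored."""
        try:
            os.remove(self._entry(fp))
        except FileNotFoundError:
            pass


with tempfile.TemporaryDirectory() as root:
    index = IndexDirectory(os.path.join(root, "index"))
    index.insert("A1", "/data/files/A")    # hypothetical storage address
    hit = index.retrieve("A1")
    miss = index.retrieve("D4")
    index.delete("A1")
    after_delete = index.retrieve("A1")
```

Because the fingerprint is the file name, retrieval is a single name lookup in the directory rather than a scan over stored files.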

In one implementation, in the operation process shown in FIG. 6, when the terminal device of the Android system executes the file deduplication method for social software, the above steps may specifically be:

1. In the Android kernel library, the code of the typical write operation is modified: the file operation module determines, according to the application ID of the process, whether the current write request was sent by social software. If it was, the file operation module intercepts the write request and calls the file cache module to create, in the kernel, a cache space (the first storage space) unique to the target file for caching its write data.

2. In the Android kernel library, the code of the typical close operation is modified: if the close request was sent by social software, the information processing module reads the sampled data of the first file from the first storage space to determine the feature information of the first file, and searches the index directory for feature information of a second file that is the same as the feature information of the first file. If the same feature information is found in the index directory, the first file is determined to be a duplicate file, and the file operation module performs the duplicate-file removal operation shown in FIG. 9. If the same feature information is not found in the index directory, the first file is determined not to be a duplicate file; the file operation module uses the first file in the first storage space to replace the cached data in the second storage space in the file system, and sets a flag bit so that the data of the first file can be synchronized back to flash memory by the background thread of the file system.
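The two modified operations above can be simulated in memory to make the control flow concrete. This is a sketch under stated assumptions, not kernel code: the application ID `10101`, the class and attribute names, and the use of a CRC over all cached data (instead of sampled pages) are hypothetical simplifications.

```python
import zlib

SOCIAL_APP_IDS = {10101}  # hypothetical application IDs treated as social software


class DedupFileSystem:
    """In-memory stand-in for the modified write and close paths."""

    def __init__(self):
        self.first_space = {}   # path -> bytearray: per-file cache (first storage space)
        self.second_space = {}  # storage address -> bytes: stands in for the file system
        self.index = {}         # feature information -> storage address
        self.links = {}         # path -> storage address it resolves to
        self.dirty = set()      # addresses flagged for background write-back

    def write(self, app_id, path, data):
        if app_id in SOCIAL_APP_IDS:
            # Intercept: buffer the data in a cache space unique to this file.
            self.first_space.setdefault(path, bytearray()).extend(data)
        else:
            self._store(path, self.second_space.get(path, b"") + data)

    def close(self, path):
        """Return True if the closed file was removed as a duplicate."""
        if path not in self.first_space:
            return False
        data = bytes(self.first_space.pop(path))
        # Simplification: CRC over all cached data instead of sampled pages.
        feature = "{}-{:08x}".format(len(data), zlib.crc32(data))
        address = self.index.get(feature)
        if address is not None:
            self.links[path] = address   # duplicate: link only, cached data is dropped
            return True
        self._store(path, data)
        self.index[feature] = path
        return False

    def _store(self, path, data):
        self.second_space[path] = data
        self.links[path] = path
        self.dirty.add(path)             # flag bit for the background thread

    def read(self, path):
        return self.second_space[self.links[path]]
```

In this model, the duplicate check runs at close time, once all writes for the file have been cached, so a duplicate never reaches the second storage space.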

The following analyzes and compares the effect of the file deduplication method provided by the embodiment of the present application on a terminal device. Table 1 is a storage space comparison table provided by the embodiment of the present application. Table 1 compares the space occupied on a device without deduplication and on a device with deduplication after multiple operations. The multiple operations may include, but are not limited to: sending files (video, PPT, picture files, etc.) multiple times using social software, saving files to system storage multiple times using a browser, and passing video, PPT or picture files multiple times from one application to other applications (for example, saving pictures from social software to the gallery, or loading files from the gallery into social software).

Table 1: Storage space comparison table

It can be seen that, with the method provided by the embodiment of the present application, when an application performs the same operation many times, the storage space occupied on the terminal device does not grow with each repetition, which helps reduce storage space occupation and has no impact on the application.

In an example, the operation process shown in FIG. 6 is the operation of the internal system of the terminal device, which is invisible to the user. However, in order to optimize user experience and present technical value, the terminal device can also display the effect of file deduplication to users through interface display or voice prompts.

In one implementation manner, the terminal device disables the file deduplication function by default, and the file deduplication function needs to be enabled after user authorization. The specific implementation manner may be to obtain an instruction, which indicates to enable the file deduplication function; in response to the instruction, perform an operation of obtaining a write request. For example, the terminal device provides a switch button for the file deduplication function in related operations such as system settings, or prompts the user whether to enable the file deduplication function during the installation and upgrade of a new system. If the user decides to enable the file deduplication function, the user can turn on the switch button of the file deduplication function in the system settings; for the terminal device, the user's operation is converted into an instruction, which instructs to enable the file deduplication function. In response to this instruction, an operation of acquiring a write request is performed.

In an implementation manner of enabling the file deduplication function, the terminal device may output a user prompt. For example, the user prompt is output in the interface in which the user authorizes enabling the file deduplication function, or in the system upgrade prompt interface. The user prompt may include, but is not limited to, a prompt that the system can automatically perform deduplication in real time (or at regular intervals), transparently to applications, without user participation and with extremely low overhead, in order to save storage, as shown in Figure 11. For another example, the terminal device can output the user prompt through voice broadcast, announcing to the user that the system can automatically perform the file deduplication function in real time (or at regular intervals).

In an implementation manner of enabling the file deduplication function, the terminal device can generate prompt information, which may include, but is not limited to: a prompt about deleted duplicate files, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, the file types of the duplicate files, etc. For example, the file deduplication prompt information is output in the interface in which the user authorizes enabling the file deduplication function. The prompt information includes, but is not limited to, statistics presented by the system on a cumulative, yearly, monthly or daily basis, for example: the system has automatically (without user participation) optimized 20 GB of storage space and 1000 groups of files with the same content, of the category video, etc., as shown in Figure 11.

In one example, the operation process shown in Figure 6 is an internal operation of the terminal device. To facilitate system and application development, the terminal device can also generate a record log. The record log includes, but is not limited to: the data in the index directory, the storage location corresponding to the first file identifier, the data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files. For example, the terminal device can generate a record log of the file deduplication function that includes the data in the index directory (for example, the feature information and file address of each of the one or more files included in the index directory; the feature information values and file address values can be provided directly, without displaying the data structure of the index directory), the specific value of the storage capacity released by deleting duplicate files (for example, 6 GB), the number of deleted duplicate files (for example, 1000 groups), etc.
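The statistics that feed the prompt information and the record log can be accumulated with a small counter object. The sketch below is illustrative only: the class name, field names and JSON layout are assumptions, since the embodiment does not specify a log format.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DedupStats:
    """Running totals for removed duplicate files (hypothetical structure)."""
    removed_files: int = 0
    released_bytes: int = 0
    file_types: Counter = field(default_factory=Counter)

    def record(self, file_type: str, size: int) -> None:
        """Account for one removed duplicate file."""
        self.removed_files += 1
        self.released_bytes += size
        self.file_types[file_type] += 1

    def to_log(self) -> str:
        """Serialize the totals as one JSON log line."""
        return json.dumps({
            "removed_files": self.removed_files,
            "released_bytes": self.released_bytes,
            "file_types": dict(self.file_types),
        }, sort_keys=True)
```

The same totals could back both a user-facing summary and a developer-facing log entry, which is why they are kept in one structure here.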

In an implementation manner, the terminal device provides an API to an external device, so that the external device can call the file deduplication function through the API. For example, to facilitate system and application development and the debugging of the file deduplication function, the terminal device provides a debugging API through which the external device can call the file deduplication function, for example by calling the file operation module and the information processing module through the API, so that the external device can execute the file deduplication function, as shown in Figure 12. It can be understood that when the external device invokes the functional modules of file deduplication through the API, the interaction among the file operation module, the information processing module, the file cache module and the file index module follows the description in the foregoing embodiment, which is not repeated here. The external device in this implementation manner can be, for example, a server. When the server calls the file deduplication function through the API, automatic file deduplication can be realized on the server, and duplicate files can be effectively removed.

The specific flow of the file deduplication method provided in the embodiment of the present application will be described in detail below.

FIG. 13 is a schematic flow diagram of a file deduplication method provided in an embodiment of the present application. The process of the file deduplication method is executed by a terminal device or a device deployed on the cloud, and includes the following steps:

S101. Obtain a write request, where the write request includes the first file.

The write request is used to request that a file be written. The request may be a file access request initiated by an application program, for example, a write operation performed through a function such as pwrite.
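From the application side, a write request of this kind is an ordinary positional write. The sketch below uses Python's `os.pwrite` and `os.pread` wrappers, which are available on Unix-like systems; the file name `first_file` and the payload are placeholders for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "first_file")   # placeholder file name
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # pwrite takes an explicit offset, so the writes can arrive in any order.
        os.pwrite(fd, b"world", 6)
        os.pwrite(fd, b"hello ", 0)
        content = os.pread(fd, 11, 0)
    finally:
        os.close(fd)
```

The application issues only these standard calls; any interception and caching of the written data happens below this interface.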

S102. In response to the write request, store the first file, where the first file is stored in the first storage space.

After the write request is intercepted, the first file included in the write request may be cached. For a specific implementation, refer to the corresponding description in FIG. 4a or FIG. 4b , which will not be repeated here.

S103. Determine whether a second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.

The first storage space and the second storage space being located at different layers of the storage system means that they belong to different levels of the storage hierarchy. For example, the first storage space is a memory space (such as a cache), and the second storage space is an external storage space (such as a disk). That is, during file access, the first file in the write request is temporarily stored in the memory space and is not written into the external storage space, which helps reduce the overhead of writing to the external storage space. After it is determined whether the first file is a duplicate file, if it is a duplicate file, the first file is deleted directly from the memory space, which realizes online file deduplication.

In one implementation, in order to reduce the loss of writing performance, in this embodiment of the application, the characteristic information of the file is determined by sampling part of the data of the file. The terminal device determines feature information of the first file according to the sampling data of the first file. For a specific implementation manner, refer to a method for determining characteristic information by sampling data shown in FIG. 8 , which will not be repeated here.

In an implementation manner, if the second file does not exist, the first file is stored in a third storage space, and a cache operation is performed on the first file in the third storage space; after the cache operation is completed, the first file is stored in the second storage space. For example, in the memory space shown in FIG. 4a, the first storage space refers to the cache space occupied by the file cache module, and the third storage space refers to the file cache in the VFS. The data structure of the first storage space is the same as that of the third storage space. For example, the first storage space adopts a cache data structure, so file caching operations can be performed in the first storage space; the third storage space also adopts a cache data structure, so file caching operations can also be performed in the third storage space. In this implementation, there are two serial data copies in the entire deduplication process. For the specific implementation, refer to the corresponding descriptions in FIG. 4a and FIG. 7a, which are not repeated here. After the cache operation is performed, the first file is written from the memory space to the external storage space to complete the file access process.

In an implementation manner, when the second file does not exist, a cache operation is performed on the first file in the first storage space; after the cache operation is performed, the first file is stored in the second storage space. For example, in the memory space shown in FIG. 4b, the first storage space includes the cache space occupied by the file cache module and the file cache in the VFS. In this implementation, there is only one data copy in the whole deduplication process. For the specific implementation, refer to the corresponding descriptions in FIG. 4b and FIG. 7b, which are not repeated here. After the cache operation is performed, the first file is written from the memory space to the external storage space to complete the file access process.

In an implementation manner, if the second file exists, the link identifier of the first file is associated with the second file, and the first file is deleted from the first storage space. Wherein, the link identifier of the first file is used to acquire the first file. For a specific implementation manner, refer to the corresponding description in FIG. 9 , which will not be repeated here.

In one implementation, after the feature information of the first file is determined, it is determined according to that feature information whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space. The index directory is shown in FIG. 5. For example, the feature information of the first file is calculated as fingerprint A1. By searching the index directory shown in FIG. 5, it is determined that fingerprint A1 exists in the index directory. This means that the feature information of the first file is the same as the file name of the third file, so it can be deduced that file A, which is associated with the third file, is the same file as the first file; that is, the first file is a duplicate file. When the third file exists in the index directory, the link identifier of the first file is associated with the second file. For a specific implementation manner, refer to the file association manner shown in FIG. 9, which is not repeated here.

In an implementation manner, when the third file does not exist in the index directory, the first file is written into the file system according to a normal file access process.

In one implementation, when the third file does not exist in the index directory, a fourth file is created in the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file in the second storage space. That is, when the first file is not a duplicate file, a new fingerprint can be inserted into the index directory, which facilitates the terminal device's subsequent judgment of other files. For example, when an intercepted write request includes a fifth file, it is determined whether the index directory contains a file whose name matches the feature information of the fifth file.

In an implementation manner, the file deduplication method further includes the following steps:

Prompt information is generated, and the prompt information includes one or more of the following: a prompt of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files. For a specific implementation manner, refer to the description of generating prompt information in the foregoing embodiments, and details are not repeated here.

A record log is generated, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files. For the specific implementation manner, refer to the description of generating the record log in the foregoing embodiments, and details are not repeated here.

Obtain an instruction, the instruction instructs to enable the file deduplication function;

In response to this instruction, an operation of acquiring a write request is performed.

For the specific implementation, refer to the description of the output file access authorization interface in FIG. 11 , which will not be repeated here.

The embodiment of the present application provides a file deduplication method. In the method, a write request is obtained, the first file in the write request is stored in the first storage space, and it is determined whether a second file that is the same as the first file exists in the second storage space. The method can effectively remove duplicate files on the terminal device and reduce storage space occupation; it is imperceptible to applications and does not require users to perform complicated operations, thereby reducing system processing overhead. Moreover, after the data of the first file is deleted, the identical second file can still be found through the link identifier of the first file, so that the file access process is not affected.

In one example, FIG. 14 is a schematic flow chart of a file search method provided in an embodiment of the present application. The file search method can also be executed by a terminal device or a device deployed on the cloud, and includes the following steps:

S201. Acquire a first file, and determine feature information of the first file.

Wherein, the first file in this embodiment may be a file included in the write request. For example, when a write request is detected in the online mode, the first file included in the write request is acquired. The first file may also be a file already written in the file system. For example, one or more files in the file system are detected in the offline mode, and respective characteristic information of the one or more files are respectively determined.

In an implementation manner, the characteristic information of the first file is determined according to the sampling data of the first file. Wherein, the sampled data is partial data obtained from the data of the first file through a sampling algorithm. For a specific implementation manner, refer to the descriptions of determining the characteristic information of the first file and the method for obtaining sampled data in the embodiment in FIG. 6 and FIG. 8 , and details will not be repeated here. It can be understood that acquiring the characteristic information of the first file by sampling is beneficial to reduce data processing overhead.

S202. Determine whether a third file exists in the index directory according to the feature information of the first file, and the file name of the third file is the same as the feature information of the first file.

The third file is a file in the index directory, and the third file is associated with the storage address of the second file in the second storage space, which means that the second file pointed to by the third file has been written to disk and already exists in the system. By searching the index directory, it can be determined whether a file identical to the first file already exists in the system.

In one implementation, when the third file does not exist in the index directory, the first file is stored in the second storage space, and a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file. For example, the feature information of the first file is calculated as fingerprint D4. By searching the index directory shown in FIG. 5, it is determined that fingerprint D4 does not exist in the index directory. This means that no file identical to the first file exists in the system, and the first file is a non-duplicate file. A fourth file is inserted into the index directory shown in FIG. 5; the file name of the fourth file is fingerprint D4, and the fourth file points to the storage address of the first file in the second storage space.

In one implementation manner, if the third file exists in the index directory, the link identifier of the first file is associated with the second file, and the first file is deleted from the first storage space. Wherein, the link identifier of the first file is used to obtain the first file. For specific implementation manners, refer to the corresponding descriptions in the embodiments in FIG. 9 and FIG. 10 , and details are not repeated here.
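Steps S201 and S202 can be sketched for the offline mode, in which files already written to the file system are checked one by one. This is a simplified illustration: the feature information here is the file size plus a CRC over the whole file rather than over sampled data, the index is a plain dictionary, and the function names and file contents are hypothetical.

```python
import os
import tempfile
import zlib


def feature_of(path: str) -> str:
    """Simplified feature information: file size plus CRC of the whole file."""
    with open(path, "rb") as f:
        data = f.read()
    return "{}-{:08x}".format(len(data), zlib.crc32(data))


def scan(paths):
    """Offline mode: return (index, duplicates) for files already stored."""
    index = {}        # feature information -> path of the first file seen
    duplicates = {}   # duplicate path -> path of the identical stored file
    for path in paths:
        feature = feature_of(path)        # S201: determine feature information
        if feature in index:              # S202: a matching index entry exists
            duplicates[path] = index[feature]
        else:
            index[feature] = path         # no match: add a new index entry
    return index, duplicates


with tempfile.TemporaryDirectory() as root:
    contents = {"A": b"photo", "B": b"video", "D": b"photo"}
    paths = []
    for name in sorted(contents):
        path = os.path.join(root, name)
        with open(path, "wb") as f:
            f.write(contents[name])
        paths.append(path)
    index, duplicates = scan(paths)
    duplicate_names = {os.path.basename(d): os.path.basename(s)
                       for d, s in duplicates.items()}
```

In this run, D is reported as a duplicate of A, and only two index entries are created for the three files. A real system would follow the match by associating the link identifier of the duplicate with the stored file, as described for FIG. 9.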

An embodiment of the present application provides a file search method. In the method, a first file is acquired and the feature information of the first file is determined; according to the feature information of the first file, it is determined whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file. Searching through the index directory helps simplify the process of searching for files. In addition, when the first file is a duplicate file and has been deleted, a later access to the corresponding file can reach the second file (the same file as the first file) that is linked through the feature information of the first file, so that normal file access is maintained.

To realize the functions in the method provided by the embodiment of the present application, the device or apparatus provided by the embodiment of the present application may include a hardware structure and/or a software module, and may realize the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is executed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution. The division of modules in the embodiments of the present application is schematic and is only a logical function division; there may be other division methods in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may exist physically separately, or two or more modules may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software function modules.

FIG. 15 is a device 1500 provided by an embodiment of the present application, which is used to implement the file deduplication function or file search function in the above method embodiments. The device may be a terminal device or a device deployed on the cloud, or a device in the terminal device or the device deployed on the cloud, or a device that can be matched and used with the terminal device or the device deployed on the cloud. Wherein, the device may be a system on a chip. The device 1500 includes at least one processor 1502, configured to implement the functions of the terminal device or the device deployed on the cloud in the file deduplication method or the file search method provided in the embodiment of the present application. Exemplarily, the processor 1502 may store the first file in the first storage space in response to the write request. For details, refer to the detailed description in the method example, and details are not repeated here. Device 1500 may also include at least one memory 1503 for storing program instructions and/or data. The memory 1503 is coupled to the processor 1502 . The coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules. Processor 1502 may cooperate with memory 1503 . Processor 1502 may execute program instructions stored in memory 1503 . At least one of the at least one memory may be included in the processor. The device 1500 may further include a communication interface 1501, which may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of implementing a sending and receiving function. 
The communication interface 1501 is used to communicate with other devices through a transmission medium, so that the apparatus in the device 1500 can communicate with other devices. Exemplarily, the other device may be a terminal. The processor 1502 uses the communication interface 1501 to send and receive data, and is used to implement the method executed by the terminal device or the device deployed on the cloud described in the embodiment corresponding to FIG. 13 or FIG. 14. The embodiment of the present application does not limit the specific connection medium among the communication interface 1501, the processor 1502 and the memory 1503. In FIG. 15, the memory 1503, the processor 1502 and the communication interface 1501 are connected through the bus 1504, which is represented by a thick line; the connection mode between other components is only a schematic illustration and is not limiting. The bus can be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in FIG. 15, but this does not mean that there is only one bus or one type of bus.

In this embodiment of the application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.

In the embodiments of the present application, the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory in the embodiments of the present application may also be a circuit or any other apparatus capable of implementing a storage function, and is used for storing program instructions and/or data.

Figure 16 shows a file deduplication device 1600 provided by an embodiment of the present application. The file deduplication device may be a terminal device or a device deployed on the cloud, an apparatus in a terminal device or in a device deployed on the cloud, or an apparatus that can be used together with a terminal device or a device deployed on the cloud. In one design, the file deduplication device may include modules that correspond one-to-one to the methods/operations/steps/actions described in the example corresponding to Figure 13, and each module may be implemented as a hardware circuit, as software, or as a hardware circuit combined with software. In one design, the device may include a file operation module 1601, a file cache module 1602, and an information processing module 1603. Exemplarily, the file operation module 1601 is configured to obtain a write request, where the write request includes the first file. The file caching module 1602 is configured to store the first file in response to the write request, where the first file is stored in the first storage space. The information processing module 1603 is configured to determine whether a second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.

Exemplarily, the file caching module 1602 is also used for:

If the second file does not exist, storing the first file in the third storage space, and performing a buffer operation on the first file in the third storage space;

After the buffer operation is performed, the first file is stored in the second storage space.

Exemplarily, the file caching module 1602 is also used for:

When the second file does not exist, perform a buffer operation on the first file in the second storage space;

The information processing module 1603 is further configured to associate the link identifier of the first file with the second file if the second file exists, and the link identifier of the first file is used to obtain the first file;

The file caching module 1602 is further configured to delete the first file from the first storage space.
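
The cooperation of the modules described above can be illustrated with a minimal sketch. It is not the implementation of the embodiments: the class `DedupStore`, its dictionaries, the `addr-N` address format, and the use of a full-content SHA-256 digest as the "feature information" are all assumptions made for illustration, and the buffer operation is omitted.

```python
import hashlib


class DedupStore:
    """Toy model of the modules in Figure 16; every name here is hypothetical."""

    def __init__(self):
        self.first_space = {}   # staging layer: link identifier -> bytes
        self.second_space = {}  # persistent layer: storage address -> bytes
        self.index = {}         # feature information -> storage address
        self.links = {}         # link identifier -> storage address

    @staticmethod
    def feature(data: bytes) -> str:
        # Assumption: a SHA-256 digest stands in for the feature information.
        return hashlib.sha256(data).hexdigest()

    def write(self, link_id: str, data: bytes) -> bool:
        """Handle a write request; return True if the file was a duplicate."""
        self.first_space[link_id] = data       # stage the file in the first storage space
        key = self.feature(data)
        address = self.index.get(key)          # is there a "second file"?
        duplicate = address is not None
        if not duplicate:
            address = f"addr-{len(self.second_space)}"
            self.second_space[address] = data  # store in the second storage space
            self.index[key] = address
        self.links[link_id] = address          # link identifier resolves to the stored copy
        # Assumption: the staged copy is dropped in both cases once it is resolved.
        del self.first_space[link_id]
        return duplicate

    def read(self, link_id: str) -> bytes:
        return self.second_space[self.links[link_id]]
```

In this sketch, writing the same content under two link identifiers leaves a single copy in `second_space`, while both identifiers remain readable through `read`.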

Exemplarily, the information processing module 1603 is also used to:

determining feature information of the first file;

determining, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
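
One way to picture the index directory is as an ordinary directory whose entries are named by feature information, so that the duplicate check reduces to a file-existence test. The sketch below assumes that each index entry stores the storage address as its text content; the embodiments only state that the third file is "associated with" the address, so this encoding, the helper names `lookup` and `register`, and the sample values are hypothetical.

```python
import os
import tempfile


def lookup(index_dir: str, feature: str):
    """Return the storage address recorded for `feature`, or None if absent."""
    entry = os.path.join(index_dir, feature)  # the index entry is named by the feature
    if not os.path.isfile(entry):
        return None
    with open(entry, "r", encoding="utf-8") as f:
        return f.read()


def register(index_dir: str, feature: str, address: str) -> None:
    """Add an index entry named by the feature whose body is the storage address."""
    with open(os.path.join(index_dir, feature), "w", encoding="utf-8") as f:
        f.write(address)


# Usage with made-up feature strings and a made-up storage address.
with tempfile.TemporaryDirectory() as index_dir:
    register(index_dir, "3a7bd3e2", "/data/store/0001")
    found = lookup(index_dir, "3a7bd3e2")
    missing = lookup(index_dir, "ffffffff")
```

Naming entries by feature information lets the file system's own name lookup act as the index, which is presumably why no separate database is needed in this design.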

Exemplarily, the file deduplication apparatus 1600 further includes a generation module 1604, which is configured to generate prompt information, where the prompt information includes one or more of the following: a prompt that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.

Exemplarily, the generation module 1604 is further configured to generate a record log, where the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
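
As an illustration of the kind of data the generation module might accumulate, the following sketch tracks the number of deleted duplicates, the capacity released, and the file types, and renders a prompt string. The class `DedupReport`, its fields, and the message wording are assumptions, not part of the embodiments.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DedupReport:
    """Hypothetical accumulator for prompt information and the record log."""
    removed: int = 0
    bytes_released: int = 0
    types: Counter = field(default_factory=Counter)

    def record(self, name: str, size: int) -> None:
        # Count one deleted duplicate and classify it by file extension.
        self.removed += 1
        self.bytes_released += size
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "unknown"
        self.types[ext] += 1

    def prompt(self) -> str:
        kinds = ", ".join(sorted(self.types))
        return (f"Removed {self.removed} duplicate file(s), "
                f"released {self.bytes_released} bytes ({kinds})")


report = DedupReport()
report.record("photo.JPG", 2048)
report.record("clip.mp4", 4096)
message = report.prompt()
```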

Exemplarily, the file deduplication apparatus 1600 further includes an execution module 1605, which is configured to obtain an instruction indicating that the file deduplication function is to be enabled and, in response to the instruction, perform the operation of obtaining a write request.

Figure 17 shows a file search device 1700 provided by an embodiment of the present application. The file search device may be a terminal device or a device deployed on the cloud, an apparatus in a terminal device or in a device deployed on the cloud, or an apparatus that can be used together with a terminal device or a device deployed on the cloud. In one design, the file search device may include modules that correspond one-to-one to the methods/operations/steps/actions described in the example corresponding to Figure 14, and each module may be implemented as a hardware circuit, as software, or as a hardware circuit combined with software. In one design, the device may include a file operation module 1701 and an information processing module 1702. Exemplarily, the file operation module 1701 is configured to acquire the first file. The information processing module 1702 is configured to determine feature information of the first file. The information processing module 1702 is further configured to determine, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.

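Exemplarily, that the information processing module 1702 determines the feature information of the first file includes: determining the feature information of the first file according to sampled data of the first file, where the sampled data is a part of the data of the first file obtained through a sampling algorithm.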

Exemplarily, the file search apparatus 1700 further includes a file cache module 1703, which is configured to, when the third file does not exist in the index directory, store the first file in the second storage space and add a fourth file to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.

Exemplarily, the information processing module 1702 is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, and the link identifier of the first file is used to obtain the first file;

The file caching module 1703 is further configured to delete the first file from the first storage space.
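
The claims state that the feature information may be determined from sampled data, that is, a part of the file's data obtained through a sampling algorithm. The application does not specify the algorithm in this section, so the following is only one plausible sketch: it hashes the file length together with head, middle and tail samples. The function name `sampled_feature`, the 4096-byte sample size, and the choice of SHA-256 are assumptions.

```python
import hashlib


def sampled_feature(data: bytes, sample_size: int = 4096) -> str:
    """Hash the file length plus head, middle and tail samples of `data`."""
    h = hashlib.sha256()
    h.update(len(data).to_bytes(8, "big"))  # length separates files of different sizes
    if len(data) <= 3 * sample_size:
        h.update(data)                      # small file: hash all of it
    else:
        mid = len(data) // 2 - sample_size // 2
        h.update(data[:sample_size])            # head sample
        h.update(data[mid:mid + sample_size])   # middle sample
        h.update(data[-sample_size:])           # tail sample
    return h.hexdigest()
```

Sampling reads far less data than hashing a whole file, at the cost that two files differing only outside the sampled regions would receive the same feature information; a design of this kind would therefore need some additional check before treating such files as identical.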

The technical solutions provided by the embodiments of the present application may be fully or partially implemented by software, hardware, firmware or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a terminal device or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium.
In the embodiments of the present application, provided that there is no logical contradiction, the various embodiments may refer to each other. For example, methods and/or terms may be cross-referenced between the method embodiments, functions and/or terms may be cross-referenced between the device embodiments, and functions and/or terms may be cross-referenced between the device embodiments and the method embodiments. Those skilled in the art can make various changes and modifications to the present application without departing from its scope. If these modifications and variations fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.

Claims

A file deduplication method, characterized by comprising:

obtaining a write request, where the write request includes a first file;

storing the first file in response to the write request, where the first file is stored in a first storage space; and

determining whether a second file exists in a second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of a storage system.
The method according to claim 1, further comprising:

If the second file does not exist, store the first file in a third storage space, and perform a buffer operation on the first file in the third storage space;

After the buffer operation is performed, the first file is stored in the second storage space.
The method according to claim 1, further comprising:

If the second file does not exist, perform a buffer operation on the first file in the first storage space;

After the buffer operation is performed, the first file is stored in the second storage space.
The method according to any one of claims 1 to 3, wherein the method further comprises:

If the second file exists, associate the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file, and delete the first file from the first storage space.
The method according to any one of claims 1 to 3, wherein the second file is the same as the first file, including:

The characteristic information of the second file is the same as the characteristic information of the first file.
The method according to claim 5, wherein the method further comprises:

The feature information of the first file is determined according to the sample data of the first file, where the sample data is part of the data obtained from the data of the first file through a sampling algorithm.
The method according to any one of claims 4 to 6, wherein the determining whether the second file exists in the second storage space comprises:

determining feature information of the first file;

determining, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
The method according to any one of claims 1 to 7, further comprising:

Prompt information is generated, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
The method according to any one of claims 1 to 7, further comprising:

Generate a record log, where the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
The method according to any one of claims 1 to 7, wherein before obtaining the write request, the method further comprises:

Obtaining an instruction, the instruction instructs to enable the file deduplication function;

In response to the instruction, an operation of obtaining a write request is performed.
A file search method, characterized by comprising:

acquiring a first file, and determining feature information of the first file; and

determining, according to the feature information of the first file, whether a third file exists in an index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of a second file in a second storage space.
The method according to claim 11, wherein said determining the feature information of said first file comprises:

According to the sampling data of the first file, the feature information of the first file is determined; the sampling data is part of the data obtained from the data of the first file through a sampling algorithm.
The method according to claim 11 or 12, characterized in that the method further comprises:

In the case that there is no third file in the index directory, store the first file in the second storage space, and add a fourth file in the index directory, the file name of the fourth file is characteristic information of the first file, and the fourth file is associated with the storage address of the first file.
The method according to claim 11 or 12, characterized in that the method further comprises:

If the third file exists in the index directory, associate the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file, and delete the first file from the first storage space.
A file deduplication device, characterized by comprising:

a file operation module, configured to obtain a write request, where the write request includes a first file;

a file caching module, configured to store the first file in response to the write request, where the first file is stored in a first storage space; and

an information processing module, configured to determine whether a second file exists in a second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of a storage system.
The device according to claim 15, wherein the file caching module is also used for:

If the second file does not exist, store the first file in a third storage space, and perform a buffer operation on the first file in the third storage space;

After the buffer operation is performed, the first file is stored in the second storage space.
The device according to claim 15, wherein the file caching module is also used for:

If the second file does not exist, perform a buffer operation on the first file in the second storage space;

After the buffer operation is performed, the first file is stored in the second storage space.
The device according to any one of claims 15 to 17, wherein the information processing module is further configured to associate the link identifier of the first file with the second file if the second file exists, where the link identifier of the first file is used to obtain the first file;

The file caching module is further configured to delete the first file from the first storage space.
The device according to any one of claims 15 to 17, wherein the information processing module is further used for:

According to the sampling data of the first file, the feature information of the first file is determined; the sampling data is part of the data obtained from the data of the first file through a sampling algorithm.
The device according to claim 18 or 19, wherein the information processing module is further used for:

determining feature information of the first file;

determining, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
The device according to any one of claims 15 to 20, characterized in that the device further comprises a generation module, the generation module is configured to generate prompt information, and the prompt information includes one or more of the following: a prompt that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.
The device according to any one of claims 15 to 20, wherein the generating module is further configured to generate a record log, and the record log includes one or more of the following: data in the index directory, The storage location corresponding to the first file identifier, the data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of deleted duplicate files.
A file search device, characterized in that it comprises:

A file operation module, configured to obtain the first file;

an information processing module, configured to determine feature information of the first file;

the information processing module is further configured to determine, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
The device according to claim 23, wherein the information processing module is used to determine the feature information of the first file, including:

According to the sampling data of the first file, the feature information of the first file is determined; the sampling data is part of the data obtained from the data of the first file through a sampling algorithm.
The device according to claim 23 or 24, characterized in that the device further comprises a file caching module, and the file caching module is configured to, when the third file does not exist in the index directory, store the first file in the second storage space and add a fourth file to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
The device according to claim 23 or 24, wherein the information processing module is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, where the link identifier of the first file is used to obtain the first file;

The file caching module is further configured to delete the first file from the first storage space.
A device, characterized in that the device comprises one or more processors and a memory; the memory is coupled to the one or more processors and stores a computer program; and when the one or more processors execute the computer program, the device is caused to perform the method according to any one of claims 1 to 14.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1 to 14.
A computer program product, characterized by comprising instructions, which, when run on a computer, cause the computer to execute the method according to any one of claims 1 to 14.