WO2023070462A1 - File deduplication method, apparatus, and device - Google Patents
- Publication number
- WO2023070462A1 (PCT/CN2021/127162)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- file
- storage space
- data
- files
- storage
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
Definitions
- the present application relates to the technical field of communication, and in particular to a file deduplication method, apparatus, and device.
- Embodiments of the present application provide a file deduplication method, apparatus, and device.
- the method can automatically remove duplicate files and reduce storage space occupation; it is transparent to applications and does not require users to perform complex operations, which reduces system processing overhead.
- the embodiment of the present application provides a file deduplication method, and the file deduplication method is implemented by a terminal device or a device deployed on a cloud.
- the terminal device or the device deployed on the cloud obtains a write request that includes the first file; in response to the write request, it stores the first file in the first storage space; it then determines whether a second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
- the first storage space is located in the memory space
- the second storage space is located in the external storage space (such as a disk).
- the first file included in the write request is stored in an independent storage space (the first storage space), and it is judged whether a file identical to the first file exists among the files already stored in the second storage space (that is, it is judged whether there is a duplicate file).
- This method performs the duplicate check while obtaining the write request and thus realizes online deduplication (also known as online file deduplication), so that the process is transparent to users and applications. Online file deduplication does not need to re-read files that have already been written to the external storage space (such as a disk) into the cache before deduplicating them, which reduces the number of repeated writes to the hard disk and avoids the hard disk write overhead caused by duplicate files.
- after the user turns on the file deduplication function, a duplicate check can be performed every time a write request is received, which spares the user from repeatedly performing deduplication operations manually and can improve user experience.
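The write path described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class `OnlineDedupStore`, its dictionaries, and the use of a full SHA-256 digest as the duplicate test are all assumptions made for the example (the application itself describes feature information derived from sampled data).

```python
import hashlib


class OnlineDedupStore:
    """Toy two-layer store. All names here are hypothetical.

    `staging` stands in for the first storage space (memory layer) and
    `persistent` for the second storage space (disk layer).
    """

    def __init__(self):
        self.staging = {}      # path -> bytes
        self.persistent = {}   # address -> bytes
        self.by_digest = {}    # digest -> address of the stored copy
        self.links = {}        # path -> address, how a path finds its data
        self.disk_writes = 0

    def write(self, path, data):
        # Step 1: the write request lands in the first storage space.
        self.staging[path] = data
        # Step 2: check the second storage space for an identical file.
        # Assumption: identity is tested with a full SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()
        address = self.by_digest.get(digest)
        if address is None:
            # No duplicate: persist the file once and record its digest.
            address = len(self.persistent)
            self.persistent[address] = data
            self.by_digest[digest] = address
            self.disk_writes += 1
        # Step 3: point the path at the stored copy and drop the staged one.
        self.links[path] = address
        del self.staging[path]
        return address

    def read(self, path):
        return self.persistent[self.links[path]]
```

In this sketch a second write of identical content never reaches the persistent layer, which mirrors the claim that duplicate files cause no additional disk writes.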
- the file deduplication method provided in the first aspect may be applied to a scenario where an application program of a terminal device performs a write operation.
- the terminal device obtains the write request of the application program, and the write request includes the first file; in response to the write request, it stores the first file in the first storage space; it then determines whether a second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
- the terminal device can implement online file deduplication during the writing operation performed by the application program, thereby reducing the storage space occupied.
- the file deduplication process is transparent to the application, does not require cooperation from the application ecosystem of the terminal device, and does not require the user to perform complicated operations, so the system overhead is low.
- the first storage space is used to perform file duplicate-checking operations, thereby realizing online file deduplication; the first storage space is compatible with the existing file cache space, which facilitates engineering implementation; and buffer operations such as space allocation are simplified and deferred, which helps reduce system overhead.
- the link identifier of the first file is associated with the second file, where the link identifier of the first file is used to obtain the first file, and the first file is then deleted from the first storage space.
- the system can directly delete the duplicate file from the cache without generating additional data copies, which helps reduce system overhead; and because the link identifier of the first file is associated with a file in the second storage space, the file can still be found.
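One way to picture "associating the link identifier with the second file" is a file-system hard link. The sketch below is an illustration under that assumption only; the application does not state that the link identifier is a hard link, and the helper `replace_with_link` is hypothetical. For simplicity both files live on disk here, whereas in the described method the duplicate sits in the cache.

```python
import os
import tempfile


def replace_with_link(duplicate_path, existing_path):
    """Delete the duplicate and make its name a hard link to the existing file."""
    os.remove(duplicate_path)
    os.link(existing_path, duplicate_path)


def demo():
    with tempfile.TemporaryDirectory() as root:
        existing = os.path.join(root, "stored.bin")
        duplicate = os.path.join(root, "incoming.bin")
        for path in (existing, duplicate):
            with open(path, "wb") as f:
                f.write(b"same bytes")
        replace_with_link(duplicate, existing)
        # The old name still resolves, but to the single stored copy.
        with open(duplicate, "rb") as f:
            content = f.read()
        return content, os.path.samefile(existing, duplicate)
```

After `replace_with_link`, opening the duplicate's name still works, yet only one copy of the data remains.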
- the second file is the same as the first file, which means that the characteristic information of the second file is the same as the characteristic information of the first file.
- the feature information of the first file is determined according to the sampling data of the first file.
- the sampled data is partial data obtained from the data of the first file through a sampling algorithm.
- the feature information of the first file is determined according to the sampling data and file information of the first file.
- the file information includes information such as file type and file size.
- the feature information includes fingerprint information and/or file identification ID.
- the feature information of a file is unique to that file.
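A minimal sketch of feature information built from sampled data plus file information is shown below. The sampling scheme (four evenly spaced 16-byte windows), the choice of SHA-256, and the function names `sample_bytes` and `feature_info` are assumptions for illustration; the application does not specify a particular sampling algorithm or hash.

```python
import hashlib


def sample_bytes(data, window=16, count=4):
    """Return up to `count` windows of `window` bytes spread evenly over data."""
    if len(data) <= window * count:
        return data  # small files are used whole
    step = (len(data) - window) // (count - 1)
    return b"".join(data[i * step:i * step + window] for i in range(count))


def feature_info(data, file_type):
    """Hash the sampled bytes together with the file type and file size."""
    h = hashlib.sha256()
    h.update(file_type.encode("utf-8"))
    h.update(len(data).to_bytes(8, "big"))
    h.update(sample_bytes(data))
    return h.hexdigest()
```

Hashing only a sample keeps the cost roughly independent of file size. Note that a sketch like this cannot by itself distinguish two files of equal type and size that differ only outside the sampled windows, so a real design would need a sampling strategy or a follow-up comparison that addresses this.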
- the feature information of the first file is determined in response to an instruction to close the first file.
- the process of determining the feature information of the file can be performed during the file closing operation after the writing operation is completed, which is beneficial to reduce system overhead.
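Computing the feature information at close time can be sketched with an in-memory stream that fingerprints its contents when it is closed. The class name `FingerprintOnClose` and the use of a full SHA-256 digest are assumptions for this example only.

```python
import hashlib
import io


class FingerprintOnClose(io.BytesIO):
    """In-memory file that computes a fingerprint once, when it is closed."""

    def close(self):
        if not self.closed:
            # All writes are finished, so the content is final at this point.
            self.fingerprint = hashlib.sha256(self.getvalue()).hexdigest()
        super().close()
```

Because the digest is computed once after the last write, intermediate writes pay no fingerprinting cost.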
- the feature information of the first file is determined; according to the feature information of the first file, it is determined through the index directory whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file.
- the index directory can be updated so that the index directory includes files that have been written to the disk, which is beneficial to more accurately judging whether there is a duplicate file in the system.
- prompt information is generated, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
- a record log is generated, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
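The prompt information listed above could be accumulated in a small record such as the following. The class `DedupReport`, its fields, and the message wording are hypothetical and only illustrate the kind of data involved.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DedupReport:
    """Running totals for deleted duplicates (hypothetical structure)."""

    freed_bytes: int = 0
    removed: int = 0
    types: Counter = field(default_factory=Counter)

    def record(self, file_type, size):
        # Called once per deleted duplicate file.
        self.freed_bytes += size
        self.removed += 1
        self.types[file_type] += 1

    def prompt(self):
        kinds = ", ".join(sorted(self.types))
        return (f"Removed {self.removed} duplicate file(s) ({kinds}), "
                f"freed {self.freed_bytes} bytes")
```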
- an instruction is obtained, which indicates enabling the file deduplication function; in response to the instruction, an operation of obtaining a write request is performed.
- a file deduplication function switch can be provided to the user, and the user only needs to turn on the switch to realize automatic file deduplication, and the user does not need to participate in the file deduplication process, which optimizes user experience.
- the overall process of implementing the file deduplication method in the first aspect may be embedded in the main process of the file access process.
- the embodiment of the present application provides a file search method, and the file search method is implemented by a terminal device or a device deployed on a cloud.
- the terminal device or the device deployed on the cloud obtains the first file and determines the feature information of the first file; according to the feature information of the first file, it determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, the third file is associated with the storage address of the second file, and the second file is stored in the second storage space.
- the index directory is constructed in the form of files: the third file in the index directory corresponds to the second file stored in the second storage space, the feature information of the second file is used as the file name of the third file, and the third file is associated with the storage address of the second file; for example, the storage address of the second file may be stored in the third file.
- the storage space required for an index directory stored in the form of files is small, which greatly reduces the storage overhead; and the search speed of the index directory under this method is faster than that of the prior art, which can greatly improve system performance.
- the characteristic information of the first file is determined according to the sampling data of the first file; wherein, the sampling data is partial data obtained from the data of the first file through a sampling algorithm.
- the first file is stored in the second storage space, and a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
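The file-based index directory can be sketched with ordinary directory operations: each index entry is a small file whose name is the feature information and whose content is the storage address. The helper names `lookup` and `add_entry`, the example feature string, and the example path are assumptions for illustration.

```python
import os
import tempfile


def lookup(index_dir, feature):
    """Return the stored address for `feature`, or None if no entry exists."""
    entry = os.path.join(index_dir, feature)
    if not os.path.exists(entry):
        return None
    with open(entry, "r", encoding="utf-8") as f:
        return f.read()


def add_entry(index_dir, feature, storage_address):
    """Create an index file named after the feature, holding the address."""
    with open(os.path.join(index_dir, feature), "w", encoding="utf-8") as f:
        f.write(storage_address)


def demo():
    with tempfile.TemporaryDirectory() as index_dir:
        miss = lookup(index_dir, "ab12")            # no duplicate known yet
        add_entry(index_dir, "ab12", "/data/photos/img001.jpg")
        hit = lookup(index_dir, "ab12")             # duplicate found by name
        return miss, hit
```

In this sketch the duplicate check reduces to a file-name lookup, which the file system already performs through its own directory structures.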
- the link identifier of the first file is associated with the storage address of the second file, and the link identifier of the first file is used to obtain the first file.
- prompt information is generated, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
- a record log is generated, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
- an instruction is obtained, which indicates enabling the file deduplication function; in response to the instruction, an operation of obtaining a write request is performed.
- a file deduplication function switch can be provided to the user, and the user only needs to turn on the switch to realize automatic file deduplication, and the user does not need to participate in the file deduplication process, which optimizes user experience.
- the overall process of executing the file search method of the second aspect may be embedded in the main flow of the file access process.
- the embodiment of the present application provides a file deduplication device, and the file deduplication device includes a file operation module, a file cache module and an information processing module.
- the file operation module is used to obtain the write request, and the write request includes the first file
- the file cache module is used to store the first file in response to the write request, and the first file is stored in the first storage space
- the information processing module is used to determine whether a second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
- the file caching module is further configured to store the first file in the third storage space when the second file does not exist, and perform a buffer operation on the first file in the third storage space; after the buffer operation is completed, it stores the first file in the second storage space.
- the file caching module is further configured to perform a buffer operation on the first file in the second storage space when the second file does not exist; after the buffer operation is completed, it stores the first file in the second storage space.
- the information processing module is further configured to associate the link identifier of the first file with the second file when the second file exists, and the link identifier of the first file is used to obtain the first file;
- the file caching module is also used to delete the first file from the first storage space.
- the feature information includes fingerprint information and/or file ID.
- the feature information of a file is unique to that file.
- the information processing module is further configured to determine the characteristic information of the first file according to the sampled data of the first file, where the sampled data is part of the data obtained from the data of the first file through a sampling algorithm.
- the information processing module is also used to determine the feature information of the first file and, according to the feature information of the first file, determine whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
- the file deduplication device also includes a prompt module, which is used to generate prompt information; the prompt information includes one or more of the following: a reminder of deleted duplicate files, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.
- the file deduplication device further includes a generation module, which is used to generate a record log; the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
- the device for deduplication of files further includes an execution module, and the execution module is configured to acquire an instruction indicating to enable the function of deduplication of files; in response to the instruction, perform an operation of acquiring a write request.
- the module for implementing the file deduplication method provided in the above third aspect and any possible design thereof can also realize the beneficial effects of the file deduplication method provided in the first aspect.
- the embodiment of the present application provides a file search device, and the file search device includes a file operation module and an information processing module.
- the file operation module is used to obtain the first file
- the information processing module is used to determine the characteristic information of the first file
- the file operation module is also used to determine, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
- the information processing module being used to determine the feature information of the first file includes: determining the feature information of the first file according to the sampling data of the first file, where the sampling data is part of the data obtained from the data of the first file through a sampling algorithm.
- the file search device further includes a file cache module, which is used to store the first file in the second storage space when the third file does not exist in the index directory, and to add a fourth file to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
- the information processing module is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, where the link identifier of the first file is used to obtain the first file; the file cache module is also used to delete the first file from the first storage space.
- the module for implementing the file search method provided in the above fourth aspect and any possible design thereof can also realize the beneficial effects of the file search method provided in the second aspect.
- the embodiment of the present application provides a device, and the device may be a terminal device or a device deployed on a cloud.
- the device includes one or more processors and memory; the memory is coupled with one or more processors, and the memory stores a computer program, and when the one or more processors execute the computer program, the device performs the following operations:
- it is determined whether a second file exists in the second storage space, where the second file is the same as the first file, and the first storage space and the second storage space are located at different layers of the storage system.
- for the introduction of the first storage space, the second storage space, the sampling data of the first file, the feature information of the first file, the association of the link identifier of the first file with the second file, the generation of prompt information, and the generation of record logs, please refer to the corresponding description in the first aspect, which is not repeated here.
- the embodiment of the present application provides a device, and the device may be a terminal device or a device deployed on a cloud.
- the device includes one or more processors and memory; the memory is coupled with one or more processors, and the memory stores a computer program, and when the one or more processors execute the computer program, the device performs the following operations:
- the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
- the embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the method described in the first aspect or the second aspect, or in any of their possible implementations, is realized.
- the embodiment of the present application provides a chip system
- the chip system includes a processor, and may also include a memory, and is used to implement the functions of the terminal device or the device deployed on the cloud in the method described in the first aspect or the second aspect.
- the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
- the embodiments of the present application provide a computer program product including instructions which, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect, or in any of their possible implementations.
- FIG. 1a is a schematic flow diagram of a user manually performing a file deduplication function
- FIG. 1b is a schematic diagram of a file abnormality after the user manually performs the file deduplication function
- FIG. 2 is a schematic diagram of a hardware structure of a terminal device provided in an embodiment of the present application
- FIG. 3 is a schematic diagram of a software structure of a terminal device provided in an embodiment of the present application.
- FIG. 4a is a modular flow chart for implementing a file deduplication method provided by the embodiment of the present application
- FIG. 4b is another modular flow chart for implementing a file deduplication method provided by the embodiment of the present application.
- FIG. 5 is a schematic diagram of an index directory provided by the embodiment of the present application.
- FIG. 6 is a schematic flow diagram of implementing a file deduplication function for an application program in an Android system terminal provided by an embodiment of the present application;
- FIG. 7a is a schematic diagram of a process for performing a write operation in the first storage space provided by an embodiment of the present application.
- FIG. 7b is a schematic diagram of another process for performing a write operation in the first storage space provided by the embodiment of the present application.
- FIG. 8 is a schematic diagram of determining feature information based on sampled data provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of associating a link identifier of a file with the same file provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of a link correspondence relationship provided by the embodiment of the present application.
- FIG. 11 is a schematic diagram of an output file access authorization interface provided by the embodiment of the present application.
- FIG. 12 is a schematic diagram of an external device calling a file deduplication function provided by an embodiment of the present application.
- FIG. 13 is a schematic flowchart of a file deduplication method provided in the embodiment of the present application.
- FIG. 14 is a schematic flowchart of a file search method provided in the embodiment of the present application.
- FIG. 15 is a schematic diagram of a device provided by an embodiment of the present application.
- FIG. 16 is a schematic diagram of a file deduplication device provided in the embodiment of the present application.
- FIG. 17 is a schematic diagram of a file search device provided by an embodiment of the present application.
- words such as “exemplary” or “for example” are used to indicate examples, illustrations, or explanations, and any embodiment or design described as “exemplary” or “for example” should not be interpreted as more preferred or more advantageous than other embodiments or design solutions.
- the use of words such as “exemplary” or “for example” is intended to present related concepts in a specific manner for easy understanding.
- the mobile phone cleaning tool can provide a user entry; after the user manually starts it, the tool scans for and identifies duplicate files in the terminal device and presents the scan result to the user. The user then manually confirms and deletes the duplicate files one by one.
- FIG. 1a shows a process when a user manually performs a file deduplication function.
- the display interface of the terminal device will display information such as storage space occupied by the system, junk files, duplicate files, etc. at present.
- FIG. 1b shows a situation where a file becomes abnormal after the user manually executes the file deduplication function. Because the user directly deletes the duplicate files when cleaning them up, when the user later opens the interaction window of the social software to look for a picture again, the interaction window cannot display the original picture normally.
- API application program interface
- APFS Apple file system
- APFS has a copy-on-write feature. If a user copies a file stored on APFS to another folder on the same APFS file system, APFS creates a new file marked "copy-on-write" that points to all of the original file's storage.
- APFS does not try to determine whether an existing file or a file copied from an external source matches any file already on the file system.
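The copy-on-write behaviour described above can be pictured with a toy model in which a clone shares the original's storage until one of them is written. This is a conceptual sketch only and does not reflect how APFS is implemented; the class `CowFile` is hypothetical, and for brevity it copies the whole block list on the first write rather than individual blocks.

```python
class CowFile:
    """Toy file whose block list may be shared with a clone until written."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.shared = False

    def clone(self):
        other = CowFile(self.blocks)   # share storage, copy nothing
        self.shared = other.shared = True
        return other

    def write_block(self, index, data):
        if self.shared:
            self.blocks = list(self.blocks)  # private copy on first write
            self.shared = False
        self.blocks[index] = data
```

Note that sharing arises only from an explicit clone; two files that merely happen to have equal content are never merged, which is the limitation the preceding text points out.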
- the solution needs to provide an API, and applications must be modified in cooperation with the application ecosystem to use it, which greatly limits the application scenarios.
- an embodiment of the present application provides a file deduplication method, which can effectively remove duplicate files and reduce storage space occupation; when the file deduplication method is applied to a terminal device, it is transparent to applications and does not require users to perform complex operations, which reduces the processing overhead of the system.
- the file deduplication method provided in the embodiment of the present application can be applied to a terminal device, or deployed in a device on the cloud.
- the method for deduplication of files may also be applied to a scenario of deduplication of files on the cloud controlled by a terminal device.
- the exemplary terminal devices provided in the following embodiments of the present application are firstly introduced below.
- FIG. 2 shows a schematic structural diagram of a terminal device 100 .
- the terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
- the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, bone conduction sensor 180M, etc.
- the structure shown in the embodiment of the present application does not constitute a specific limitation on the terminal device 100 .
- the terminal device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
- the illustrated components can be realized in hardware, software or a combination of software and hardware.
- the processor 110 may include one or more processing units, for example: an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices, or may be integrated in one or more processors.
- the controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.
- a memory may also be provided in the processor 110 for storing instructions and data.
- the memory in processor 110 is a cache memory.
- the memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can fetch them directly from this memory, which avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
- processor 110 may include one or more interfaces.
- the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
- the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
- The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), etc.
- the processor 110 communicates with the camera 193 through a CSI interface to realize the shooting function of the terminal device 100 .
- the processor 110 communicates with the display screen 194 through the DSI interface to realize the display function of the terminal device 100 .
- the GPIO interface can be configured by software.
- the GPIO interface can be configured as a control signal or as a data signal.
- the GPIO interface can be used to connect the processor 110 with the camera 193 , the display screen 194 , the wireless communication module 160 , the audio module 170 , the sensor module 180 and so on.
- the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
- the USB interface 130 is an interface conforming to the USB standard specification, specifically, it can be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
- the USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones and play audio through them. This interface can also be used to connect other terminal devices, such as AR devices.
- the interface connection relationship between the modules shown in the embodiment of the present application is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 .
- the terminal device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
- the terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
- the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
- Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
- the display screen 194 is used to display images, videos and the like.
- the display screen 194 includes a display panel.
- The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), etc.
- the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
- the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100.
- The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example saving music, video, and other files in the external memory card.
- the internal memory 121 may be used to store computer-executable program codes including instructions.
- the internal memory 121 may include an area for storing programs and an area for storing data.
- The program storage area can store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like.
- The data storage area can store data created during the use of the terminal device 100 (such as audio data and the phonebook) and the like.
- the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (universal flash storage, UFS) and the like.
- the processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
- the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
- an Android system with a layered architecture is taken as an example to illustrate the software structure of the terminal device 100 .
- the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate through software interfaces.
- the Android system is divided into four layers, which are respectively the application program layer, the application program framework layer, the Android runtime (Android runtime) and the system library, and the kernel layer from top to bottom.
- the application layer can consist of a series of application packages.
- the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, short message and multi-screen agent.
- the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
- the application framework layer includes some predefined functions.
- the application framework layer can include window managers, content providers, view systems, phone managers, resource managers, notification managers and multi-screen frameworks, etc.
- a window manager is used to manage window programs.
- the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
- Content providers are used to store and retrieve data and make it accessible to applications.
- Said data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebook, etc.
- the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on.
- the view system can be used to build applications.
- a display interface can consist of one or more views.
- a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
- The phone manager provides the communication functions of the terminal device 100, for example the management of call status (including connected, hung up, etc.).
- the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
- the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a short stay without user interaction.
- The notification manager is used to announce download completion, message reminders, and the like.
- The notification manager can also present a notification in the status bar at the top of the system in the form of a chart or scroll-bar text, such as a notification from an application running in the background, or a notification on the screen in the form of a dialog window. For example, text is displayed in the status bar, a prompt sound is played, the terminal device vibrates, or the indicator light flashes.
- The multi-screen framework notifies the "multi-screen agent" of the application layer of each event in which the terminal device 100 establishes a connection with a large-screen device, and can also, in response to instructions from the "multi-screen agent" of the application layer, assist the "multi-screen agent" in obtaining data.
- the Android Runtime includes core library and virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
- The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
- The application layer and the application framework layer run in virtual machines.
- The virtual machine executes the Java files of the application layer and the application framework layer as binary files.
- the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
- a system library can include multiple function modules. For example: surface manager (surface manager), media library (media libraries), 3D graphics processing library, 2D graphics engine, etc.
- the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
- the media library supports playback and recording of various commonly used audio and video formats, as well as still image files, etc.
- the media library can support multiple audio and video encoding formats.
- the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing, etc.
- 2D graphics engine is a drawing engine for 2D drawing.
- the kernel layer is the layer between hardware and software.
- the kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
- Fig. 4a is a modular flow chart of a method for implementing file deduplication provided by the embodiment of the present application.
- FIG. 4a is described by taking the internal modularization process of the terminal device as an example. It can be understood that when the file deduplication method provided by the embodiment of the present application is applied to the cloud, or is applied to the interaction scenario between the terminal and the cloud, there is also a modular process similar to FIG. 4a.
- The existing file access process in the terminal device is as follows: when an application initiates a file access request, the system directly writes the file in the request to the file cache in the VFS through a write operation (write), and then writes the file to the file system.
- the modular process for implementing the file deduplication method shown in FIG. 4a mainly includes a file operation module, a file cache module, an information processing module, a file index module, and a VFS.
- The file cache module shown in Figure 4a is a new cache module in the memory space. It intercepts the system's write operation and caches the file carried by that operation; together with the information processing module and the file index module, it calculates feature information for the cached file, judges from the feature information whether the file is a duplicate, and deduplicates duplicate files online.
- Non-duplicate files continue to be written into the VFS and then into the file system, block device layer, driver, and flash memory, completing the file access process.
- The file cache module shown in Figure 4a is mainly used to perform file comparison and file deduplication; the cache area operations in the file access process (such as setting flags, write checks, and space allocation) are still performed by the file cache in the VFS.
- FIG. 4b is a modular flow chart of another method for implementing file deduplication provided by the embodiment of the present application.
- FIG. 4b is described by taking the internal modularization process of the terminal device as an example. It can be understood that when the file deduplication method provided by the embodiment of the present application is applied to the cloud, or is applied to the interaction scenario between the terminal and the cloud, there is also a modular process similar to FIG. 4b.
- the file cache module shown in Figure 4b has enhanced the original file cache, such as adding functions such as calculating feature information for cached files, file comparison, and file deduplication.
- Buffer operations in the file access process are also performed by the file cache module shown in Figure 4b, but they are executed later than in the existing file access process. That is to say, the file cache in the VFS shown in FIG. 4b no longer performs write-related operations (for example, it performs no buffer operations).
- The file deduplication process provided by the embodiment of the present application can be embedded in the existing file access process and does not require an independent background thread, which helps reduce system write overhead.
- the embodiment of the present application creates a new file caching module, which is used to realize deduplication of online files.
- File operation module: used to intercept the file access request of the application, call the file cache module to cache data, call the information processing module to identify duplicate files, and, together with the file cache module and the information processing module, remove duplicate files or save non-duplicate files.
- File cache module: used to build an independent, self-built file cache space and cache intercepted files in it. For example, in the way shown in Figure 4a, a new cache space is created in the existing memory space to cache the intercepted file data; or, in the way shown in Figure 4b, the self-built file cache space replaces the file cache in the VFS and stores the intercepted file data.
- Information processing module: used to obtain file data from the file cache module and calculate the feature information of the file, and also to send feature information retrieval requests or feature information insertion requests to the file index module.
- File index module: used to construct and maintain the index directory and to retrieve target feature information in the index directory.
- the index directory can be regarded as a kind of database, and the index directory does not occupy memory.
- File directory: used to record the files stored in the file system.
- The directory entries in the file directory include, but are not limited to, the file name, the link identifier of the file, and the number of repetitions of the file.
- File feature information: information that uniquely identifies each file.
- The feature information of a file may include, but is not limited to, a fingerprint, a file ID, and so on.
- For example, when file 1 and file 2 have different content, fingerprint 1 of file 1 and fingerprint 2 of file 2 are different; that is, fingerprint 1 identifies file 1 and fingerprint 2 identifies file 2.
- When file 1 and file 2 have the same content (whether their file names are the same or different), file 1 and file 2 have the same fingerprint (for example, both are fingerprint 1).
- Index directory: a data access mode in which a directory created in the system serves as the index.
- the index directory in this embodiment of the present application may be an index table of feature information.
- the index directory is constructed and maintained by the file index module in an indexing manner based on the file directory.
- the index directory includes one or more feature information indexes, for example, includes multiple fingerprint indexes.
- Each fingerprint index corresponds to a file in the index directory: the file name is the fingerprint, and the link identifier (inode) of that file indicates the inode of the file that the fingerprint identifies.
- FIG. 5 is a schematic diagram of an index directory provided by the embodiment of the present application.
- the system includes file A, file B and file C, the link identifier of file A is inode1, the link identifier of file B is inode2, and the link identifier of file C is inode3.
- For file A, the feature information is calculated first (that is, the fingerprint of file A is calculated), producing fingerprint A1, which points to the link identifier inode1 of file A; a fingerprint index is then generated in the index directory: fingerprint A1-inode1.
- The other fingerprint indexes in the index directory are generated in the same way: fingerprint B2-inode2, fingerprint C3-inode3, etc., as shown in FIG. 5.
- the location of the file can be obtained directly through the link identifier when searching the index directory, which is beneficial to realize more efficient file search.
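The fingerprint-to-inode mapping described for FIG. 5 can be illustrated with a minimal sketch. It assumes an in-memory dictionary in place of a real directory; the class name `IndexDirectory` and its methods are hypothetical and do not appear in the application.

```python
from typing import Dict, Optional


class IndexDirectory:
    """Hypothetical in-memory stand-in for the index directory.

    Each entry plays the role of a file whose name is a fingerprint and
    whose link identifier (inode) is that of the stored file.
    """

    def __init__(self) -> None:
        self._entries: Dict[str, int] = {}

    def insert(self, fingerprint: str, inode: int) -> None:
        # Keep the first inode recorded for a fingerprint.
        self._entries.setdefault(fingerprint, inode)

    def lookup(self, fingerprint: str) -> Optional[int]:
        # A hit yields the inode directly, so no further search is needed.
        return self._entries.get(fingerprint)

    def delete(self, fingerprint: str) -> None:
        self._entries.pop(fingerprint, None)


index = IndexDirectory()
index.insert("A1", 1)  # fingerprint A1 -> inode1
index.insert("B2", 2)  # fingerprint B2 -> inode2
index.insert("C3", 3)  # fingerprint C3 -> inode3
```

A lookup for a fingerprint that is present returns the inode of the stored file; a miss returns `None`, which in this sketch signals a non-duplicate file.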
- FIG. 6 is a schematic flow diagram of implementing a file deduplication function for an application program in a terminal device using an Android system according to an embodiment of the present application.
- The terminal device can execute the file deduplication method during the file writing process.
- The specific process is carried out through interaction among the file operation module, the information processing module, the file cache module, and the file index module, and includes the following steps:
- the file operation module obtains the write request, and the write request includes the first file.
- the file operation module calls the file cache module to store the first file in the first storage space.
- When the file operation module detects a write request from an application, it can intercept the write request and cache the first file in the write request in the newly added file cache module (the first storage space). Operations such as calculating feature information, comparing duplicate files, and removing duplicate files are performed in the file cache module, as shown in Figure 7a. After the file cache module has performed the file deduplication operation, it uses the standard write system call to cache the file in the write request in the VFS (the third storage space), and the buffer operations continue in the VFS.
- The cache area operations in FIG. 7a are the write request operations that are not performed in the file cache module, including but not limited to setting flag bits, write checks and space allocation, and data write-back.
- The cache area operations in Figure 7a are the same as those in the existing write request. For example, a file is divided into multiple pages, and for each page the flags are set, the write check and space allocation are performed, and the data is written back. Once these buffer operations have been performed on the pages of a file, the file is written to disk and the system releases the memory occupied by the file.
- The process shown in Figure 7a caches the file twice in series and embeds the interception cache, calculation, and deduplication functions into the existing caching path. Duplicate files are deduplicated according to their feature information: they are no longer written on to the system but are discarded directly from memory, while non-duplicate files continue to be written to the system.
- When the file operation module detects a write request from an application, the file operation module defines a system call to a caching function and first builds a self-built file cache (the first storage space); the file cache module then, based on the copy_from_user function, caches the intercepted first file in the self-built file cache in one pass, as shown in Figure 7b.
- one-time caching refers to caching all the pages of the same file to the self-built file cache instead of caching each page one by one.
- buffer operations are deferred and simplified.
- the buffer operation includes setting flags M times, writing check and space allocation once, and writing data back N times.
- the characteristic information of the cached files can be calculated in the file caching module shown in FIG. 7b, so as to judge whether the cached files are duplicate files. If it is a duplicate file, discard the duplicate file from the memory; if it is a non-duplicate file, continue to write to the system.
- Operations such as buffer operations, feature information calculation, and duplicate file removal may be performed during the close operation.
- The close operation is a file operation performed after the write operation. After the write operation (such as writing the file into the self-built file cache) completes, the system can perform the close operation.
- Performing the cache area operations, feature information calculation, and duplicate file removal shown in Figure 7b during the close operation helps reduce the overhead of system write operations.
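The idea of deferring fingerprinting and duplicate removal to the close operation can be sketched as follows. This is only an illustration under stated assumptions: the class `SelfBuiltFileCache` is hypothetical, plain dictionaries stand in for the index directory and the file system, and a whole-file CRC32 is used as the fingerprint purely for brevity (it is not the sampling scheme of the application).

```python
import zlib
from typing import Dict, List


class SelfBuiltFileCache:
    """Hypothetical stand-in for the self-built file cache (first storage space)."""

    def __init__(self, index: Dict[int, str], storage: Dict[str, bytes]) -> None:
        self.index = index      # fingerprint -> name of the stored file
        self.storage = storage  # stand-in for the file system
        self.pending: Dict[str, List[bytes]] = {}

    def write(self, name: str, data: bytes) -> None:
        # The write only caches data; no fingerprinting or write-back happens here.
        self.pending.setdefault(name, []).append(data)

    def close(self, name: str) -> str:
        # Deferred work: fingerprint the cached file, then drop it or write it on.
        data = b"".join(self.pending.pop(name))
        fingerprint = zlib.crc32(data)  # whole-file CRC, an assumption for brevity
        if fingerprint in self.index:
            return "discarded"          # duplicate: never reaches storage
        self.index[fingerprint] = name
        self.storage[name] = data
        return "written"


cache = SelfBuiltFileCache(index={}, storage={})
cache.write("a.mp4", b"abc")
cache.write("a.mp4", b"def")
first = cache.close("a.mp4")

cache.write("copy.mp4", b"abcdef")
second = cache.close("copy.mp4")
```

In this sketch each `write` call is cheap because all deduplication work happens once per file in `close`, which mirrors the motivation given in the text for deferring the buffer operations.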
- The information processing module determines the feature information of the first file through a sampling algorithm. Specifically, the information processing module uses a sampling hash algorithm to obtain sampled data of the first file and determines the feature information of the first file from that sampled data. The information processing module therefore only needs to sample a small amount of file data to obtain the feature information, which helps reduce system overhead.
- The information processing module may also determine the feature information of the first file from both the sampled data of the first file and the file information of the first file.
- The feature information may include, but is not limited to, fingerprint information, a file ID, etc.
- The file information may include, but is not limited to, the file type, the file size, etc. It can be understood that feature information calculated from both the sampled data and the file information of the first file better reflects the uniqueness of the first file.
- FIG. 8 is a schematic diagram of sampling and calculating characteristic information provided by an embodiment of the present application.
- the first storage space can be regarded as data in a tree structure, and files are stored in pages.
- the information processing module can obtain the sampling data of the file through the sampling hash algorithm.
- For example, partial data sampled from page1, page3 and page5 yields the head-segment cyclic redundancy check (CRC), the middle-segment CRC, and the tail-segment CRC of the sampled data, respectively, as shown in FIG. 8.
- From the sampled data, the feature information is determined; it is also called the fingerprint (FP) of the file.
- the information processing module keeps the overhead of calculating characteristic information basically stable through sampling calculation, thereby reducing the impact of sampling and calculating characteristic information on the writing performance of the storage system.
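A sampled fingerprint of the kind described for FIG. 8 might look like the sketch below. The page size, the number of bytes sampled per page, the choice of first, middle and last page, and the way the three CRCs are combined with the file size are all assumptions made for illustration; the application does not specify them.

```python
import zlib

PAGE_SIZE = 4096   # assumed page size
SAMPLE_BYTES = 64  # assumed number of bytes sampled from each chosen page


def sampled_fingerprint(data: bytes) -> str:
    """Fingerprint built from head, middle and tail samples plus the file size."""
    pages = max(1, (len(data) + PAGE_SIZE - 1) // PAGE_SIZE)
    # First, middle and last page; for a five-page file these are
    # page1, page3 and page5 in the numbering used by the text.
    sampled_pages = (0, pages // 2, pages - 1)
    crcs = []
    for page in sampled_pages:
        start = page * PAGE_SIZE
        crcs.append(zlib.crc32(data[start:start + SAMPLE_BYTES]))
    # Mixing in the file size follows the text's suggestion to combine
    # sampled data with file information.
    return "-".join(f"{crc:08x}" for crc in crcs) + f"-{len(data)}"


five_pages = bytes(range(256)) * 80  # 20480 bytes, exactly five pages
fp = sampled_fingerprint(five_pages)
```

The cost of this function depends only on `SAMPLE_BYTES`, not on the file size, which is the property the text relies on. The trade-off in this particular sketch is that a change outside the sampled bytes that leaves the size unchanged produces the same fingerprint, so a practical design would need a sampling policy or a follow-up comparison that addresses this.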
- The information processing module judges, according to the first file, whether a second file identical to the first file exists in the second storage space.
- Specifically, the information processing module determines the feature information of the first file and, according to that feature information, determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file and the third file is associated with the storage address of the second file in the second storage space.
- If a second file whose feature information is the same as that of the first file exists in the second storage space, the second file is the same as the first file, and the first file is a duplicate file.
- Feature information is unique information. When the feature information of the first file is the same as that of the second file, it can be determined that the first file and the second file are the same file.
- The file operation module associates the link identifier of the first file with the second file; the link identifier of the first file is used to obtain the first file. That is to say, when the first file is a duplicate file, its link identifier is associated with the second file, so that a search for the first file returns the second file, which is identical to it. After this association, even if the first file is deleted, the identical file (that is, the second file) can still be found through the link identifier of the first file, which ensures that the file path still resolves correctly.
- FIG. 9 is a schematic diagram of an operation process for duplicate files provided by the embodiment of the present application.
- the left part in FIG. 9 is a file access list, which shows the files included in the write request and the link identifiers of the files.
- the file access list includes two columns, the first column is the file name, and the second column is the link identifier (inode) of the file.
- the link identifier of the file is used to obtain the file.
- The right part of FIG. 9 shows some directory entries of the file directory (including the link identifier of each file and the number of times the file has been written). It can be understood that the file directory is stored in the second storage space. For example, the write request includes file A, whose link identifier is inode1.
- The terminal device stores file A in the first storage space and judges whether a second file identical to file A exists in the second storage space.
- Specifically, for example, the information processing module judges, according to the feature information of file A, whether the second storage space contains a second file whose feature information is the same as that of file A. If there is no such second file, file A is not a duplicate file.
- A subsequent write request includes file D, and the link identifier of file D is inode1.
- The terminal device stores file D in the first storage space and judges whether a second file identical to file D exists in the second storage space.
- Specifically, for example, the information processing module judges, according to the feature information of file D, whether the second storage space contains a second file whose feature information is the same as that of file D. If the feature information of file A and file D is the same, file D is the same as file A, and file D is a duplicate file.
- The file operation module associates the link identifier of file D with the link identifier of file A. For example, inode1 of file D points to the repeated inode1. At this time, the number of write repetitions corresponding to inode1 is updated to 2, as shown in the second row and second column of the table on the right side of FIG. 9.
- FIG. 10 shows a link correspondence after file deduplication.
- the number of repetitions of inode1 is 2, which means that the same files are all linked to inode1.
- the file system only needs to store the same file once.
- the duplicate files will eventually be discarded from the memory, and no external storage write operations will be generated, so that low-overhead file deduplication can be completed in the file access path.
- The link correspondence shown in FIG. 10 still includes file D, so the deduplication is transparent to upper-layer applications.
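The FIG. 9 and FIG. 10 walk-through (file A stored, file D linked to the same inode, repetition count raised to 2) can be modelled with a small sketch. The class `FileDirectory`, its fields, and the fingerprint string `"fp1"` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FileDirectory:
    """Hypothetical model of the file directory plus fingerprint index."""
    names: Dict[str, int] = field(default_factory=dict)         # file name -> inode
    repeats: Dict[int, int] = field(default_factory=dict)       # inode -> repetition count
    fingerprints: Dict[str, int] = field(default_factory=dict)  # fingerprint -> inode
    next_inode: int = 1

    def write(self, name: str, fingerprint: str) -> bool:
        """Record a write; return True if the data had to be stored."""
        inode = self.fingerprints.get(fingerprint)
        if inode is not None:
            # Duplicate: link the new name to the existing inode and bump the count.
            self.names[name] = inode
            self.repeats[inode] += 1
            return False
        inode = self.next_inode
        self.next_inode += 1
        self.fingerprints[fingerprint] = inode
        self.names[name] = inode
        self.repeats[inode] = 1
        return True


directory = FileDirectory()
stored_a = directory.write("A", "fp1")  # first copy: data is stored
stored_d = directory.write("D", "fp1")  # same fingerprint: only a link is added
```

After both writes, the names A and D resolve to the same inode, so an application that asks for file D still finds it even though its data was never written a second time.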
- The operations of the file index module on the index directory may include, but are not limited to, creating fingerprints, inserting fingerprints, retrieving fingerprints, and deleting fingerprints.
- When the index directory is created, a file is created in the index directory according to the feature information of a file, and its file name is the fingerprint.
- For a non-duplicate file, a file is inserted into the index directory according to the feature information of the non-duplicate file, and its file name is the fingerprint of the non-duplicate file.
- the above steps may specifically be:
- The file operation module judges, according to the application ID of the process, whether the current write request was sent by social software; if so, the file operation module intercepts the write request and calls the file cache module to create a dedicated cache space (the first storage space) in the kernel for the target file, in which the written data is cached.
- The information processing module reads the sampled data of the first file from the first storage space to determine the feature information of the first file, and searches the index directory for feature information of a second file that is the same as the feature information of the first file. If the same feature information is found in the index directory, the first file is determined to be a duplicate file, and the file operation module removes the duplicate file as shown in FIG. 9.
- The file operation module uses the first file in the first storage space to replace the cached data in the second storage space in the file system, and sets the flag bit so that the data of the first file can be synchronized back to the flash memory by the background thread of the file system.
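The three steps above can be strung together in a minimal sketch. Everything named here is an assumption: the application ID `com.example.social`, the function `handle_write`, and the dictionaries standing in for the index directory, the link table and the stored data are hypothetical, and a whole-file CRC32 plus size replaces the sampled feature information for brevity.

```python
import zlib
from typing import Dict

# Hypothetical application IDs whose write requests are intercepted.
DEDUP_APP_IDS = {"com.example.social"}


def handle_write(app_id: str, name: str, data: bytes,
                 index: Dict[str, str], links: Dict[str, str],
                 blocks: Dict[str, bytes]) -> str:
    """Route one write request and report which branch was taken."""
    if app_id not in DEDUP_APP_IDS:
        # Not from a targeted application: follow the ordinary write path.
        links[name] = name
        blocks[name] = data
        return "bypassed"

    first_storage = bytes(data)  # dedicated cache space for the target file
    # Whole-file CRC plus size stands in for the sampled feature information.
    fingerprint = f"{zlib.crc32(first_storage):08x}-{len(first_storage)}"

    existing = index.get(fingerprint)
    if existing is not None:
        links[name] = existing  # duplicate: link to the stored file, write nothing
        return "deduplicated"

    index[fingerprint] = name
    links[name] = name
    blocks[name] = first_storage  # non-duplicate: hand the data on for write-back
    return "stored"


index: Dict[str, str] = {}
links: Dict[str, str] = {}
blocks: Dict[str, bytes] = {}
r1 = handle_write("com.example.social", "a.jpg", b"photo", index, links, blocks)
r2 = handle_write("com.example.social", "b.jpg", b"photo", index, links, blocks)
r3 = handle_write("com.example.browser", "c.jpg", b"photo", index, links, blocks)
```

Gating on the application ID keeps the extra caching and fingerprinting cost away from write requests that are unlikely to carry duplicate files, at the price of missing duplicates written by other applications.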
- Table 1 is a storage space comparison table provided by the embodiment of the present application. It compares the space occupied on a device without deduplication and on a device with deduplication after multiple operations. The operations may include, but are not limited to: sending files (video, PPT, pictures, etc.) multiple times with social software, saving files to system storage multiple times with a browser, and passing videos, PPTs, or pictures from one application to another multiple times (such as saving pictures from social software to the gallery, or opening gallery files in social software).
- Table 1 Storage space comparison table
- the operation process shown in FIG. 6 is the operation of the internal system of the terminal device, which is invisible to the user.
- the terminal device can also display the effect of file deduplication to users through interface display or voice prompts.
- the terminal device disables the file deduplication function by default, and the file deduplication function needs to be enabled after user authorization.
- the specific implementation manner may be to obtain an instruction, which indicates to enable the file deduplication function; in response to the instruction, perform an operation of obtaining a write request.
- the terminal device provides a switch button for the file deduplication function in related operations such as system settings, or prompts the user whether to enable the file deduplication function during the installation and upgrade of a new system. If the user decides to enable the file deduplication function, the user can turn on the switch button of the file deduplication function in the system settings; for the terminal device, the user's operation is converted into an instruction, which instructs to enable the file deduplication function. In response to this instruction, an operation of acquiring a write request is performed.
- the terminal device may output a user prompt.
- the user prompt may include, but is not limited to: a prompt that the system can automatically perform deduplication in real time (or at regular intervals), transparently to applications, without user participation, and with extremely low overhead, to implement functions related to storage saving, as shown in Figure 11.
- the terminal device can output user prompts through voice broadcast, and the broadcast system can automatically realize the file deduplication function in real time (or regularly) to the user.
- the terminal device can generate prompt information, which may include but not limited to: prompts for deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, duplicate file file type, etc.
- the file deduplication prompt information is output in the interface where the user authorizes the file deduplication function to be enabled.
- the file deduplication prompt information includes but is not limited to: statistics presented by the system cumulatively or by year, month, day, etc., for example, that the system has automatically (without user participation) optimized 20GB of storage space and 1000 groups of files with the same content, and that the category is video, etc., as shown in Figure 11.
- the operation process shown in Figure 6 is the operation of the internal system of the terminal device.
- the terminal device can also generate a record log.
- the record log includes but is not limited to: the data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of deleted duplicate files.
- the terminal device can generate a record log of the file deduplication function.
- the record log includes data in the index directory (for example, the respective characteristic information and file addresses of one or more files included in the index directory, which can directly provide the characteristic information value and the file address value without displaying the data structure of the index directory);
- the specific value of the storage capacity released by deleted duplicate files (for example, the storage capacity released by deleted duplicate files is 6GB);
- the number of deleted duplicate files (for example, 1000 groups of deleted duplicate files).
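As a rough illustration of the record log items listed above, the following sketch keeps them in a small data class and serializes them to JSON. The class `DedupRecordLog`, its field names, and the JSON output format are assumptions made for illustration; the source does not specify a log format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DedupRecordLog:
    """Hypothetical record log; field names are illustrative only."""
    index_entries: dict = field(default_factory=dict)     # characteristic info -> file address
    released_bytes: int = 0                                # capacity released by deleted duplicates
    deleted_count: int = 0                                 # number of deleted duplicate files
    deleted_file_types: list = field(default_factory=list)

    def record_deletion(self, size_bytes, file_type):
        # Accumulate the statistics for one deleted duplicate file.
        self.released_bytes += size_bytes
        self.deleted_count += 1
        if file_type not in self.deleted_file_types:
            self.deleted_file_types.append(file_type)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)


log = DedupRecordLog(index_entries={"A1": "/data/media/fileA"})
log.record_deletion(size_bytes=4 * 1024**2, file_type="video")
log.record_deletion(size_bytes=1 * 1024**2, file_type="picture")
```

Exposing the index data as plain fingerprint/address pairs follows the remark above that the values can be provided directly without revealing the data structure of the index directory.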
- the terminal device provides an API to the external device, so that the external device can call the file deduplication function through the API.
- the terminal device provides a debugging API so that external devices can call the file deduplication function, such as calling the file operation module and information processing module through the API, so that the external device can execute the file deduplication function, as shown in Figure 12.
- the external device in this implementation manner can be, for example, a server. When the server calls the file deduplication function through the API, automatic file deduplication can be realized on the server, and duplicate files can be effectively removed.
- FIG. 13 is a schematic flow diagram of a file deduplication method provided in an embodiment of the present application.
- the process of the file deduplication method is executed by a terminal device or a device deployed on the cloud, and includes the following steps:
- the write request is used to request to write a file
- the method of requesting to write a file may be that an application program initiates a file access request, for example, a write operation is performed through a control signal such as a pwrite function.
- the first file included in the write request may be cached.
- the first storage space and the second storage space are located at different layers of the storage system, which means that the first storage space and the second storage space are different in levels.
- the first storage space is a memory space (such as a cache)
- the second storage space is an external storage space (such as a disk). That is to say, during the file access process, the first file in the write request is temporarily stored in the memory space and not written into the external storage space, which is beneficial to reduce the overhead of writing to the external storage space. And after judging whether the first file is a duplicate file, if it is a duplicate file, the first file is directly deleted from the memory space to realize online file deduplication.
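The write path described above (hold the incoming file in memory, check the external storage for identical content, then either drop the cached copy or flush it) can be sketched with a toy two-tier store. Everything here is an assumption for illustration: the class `TwoTierStore`, the use of dictionaries for both tiers, and the linear full-content comparison, which stands in for the index-directory lookup that the embodiments describe.

```python
class TwoTierStore:
    """Toy model: `cache` plays the first storage space (memory),
    `disk` plays the second storage space (external storage)."""

    def __init__(self):
        self.cache = {}   # path -> bytes, files held temporarily
        self.disk = {}    # path -> bytes, persisted files
        self.links = {}   # path of a removed duplicate -> path of the kept file

    def _find_identical_on_disk(self, data):
        # Linear scan with a full comparison; a real system would use an index instead.
        for path, stored in self.disk.items():
            if stored == data:
                return path
        return None

    def write(self, path, data):
        # Step 1: hold the incoming file in the cache only.
        self.cache[path] = data
        # Step 2: check whether the same content already exists on disk.
        existing = self._find_identical_on_disk(data)
        if existing is not None:
            # Duplicate: delete it from the cache and remember where the content lives.
            del self.cache[path]
            self.links[path] = existing
            return "deduplicated"
        # Not a duplicate: move the file from the cache to the disk.
        self.disk[path] = self.cache.pop(path)
        return "written"

    def read(self, path):
        # A deduplicated name is resolved through its link to the kept file.
        return self.disk[self.links.get(path, path)]


store = TwoTierStore()
r1 = store.write("/gallery/cat.jpg", b"JPEGDATA")
r2 = store.write("/chat/cat_copy.jpg", b"JPEGDATA")
```

The second write never reaches `disk`, which is the saving the text attributes to online deduplication, while `read` still serves the duplicate's name.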
- the characteristic information of the file is determined by sampling part of the data of the file.
- the terminal device determines feature information of the first file according to the sampling data of the first file. For a specific implementation manner, refer to a method for determining characteristic information by sampling data shown in FIG. 8 , which will not be repeated here.
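One way to derive characteristic information from sampled data is sketched below. The sampling scheme is an assumption, since FIG. 8 is not reproduced in this section: the function `sampled_fingerprint`, its `chunk` and `samples` parameters, the evenly spaced offsets, and the choice of SHA-256 are all hypothetical.

```python
import hashlib


def sampled_fingerprint(data: bytes, chunk: int = 4096, samples: int = 4) -> str:
    """Hash the length plus a few evenly spaced chunks instead of the whole content."""
    h = hashlib.sha256()
    h.update(len(data).to_bytes(8, "big"))   # include the size so lengths must match
    if len(data) <= chunk * samples:
        h.update(data)                        # small file: hash everything
    else:
        # Evenly spaced offsets; the last chunk ends at or before the end of the data.
        step = (len(data) - chunk) // (samples - 1)
        for i in range(samples):
            offset = i * step
            h.update(data[offset:offset + chunk])
    return h.hexdigest()


a = bytes(100_000)
b = bytes(100_000)
c = bytes(99_999) + b"\x01"   # differs from `a` only in the last byte, which is sampled
```

Only about 16 KB of each 100 KB buffer is hashed here, which is where the reduced processing overhead comes from. The trade-off in this particular sketch is that a difference in an unsampled region goes unnoticed, so a fingerprint match may need to be treated as a candidate rather than as proof of identity.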
- in the absence of the second file, the first file is stored in the third storage space, and a buffer operation is performed on the first file in the third storage space; after the buffer operation is completed, the first file is stored in the second storage space.
- the first storage space refers to the cache space occupied by the file cache module
- the third storage space refers to the file cache in the VFS.
- the data structure of the first storage space is the same as the data structure of the third storage space.
- the first storage space adopts a cache data structure, and operations of caching files can be performed in the first storage space;
- the third storage space also adopts a cache data structure, and operations of caching files can also be performed in the third storage space.
- the first storage space includes the cache space occupied by the file cache module and the file cache in the VFS.
- there is only one data copy in the whole deduplication operation process. For the specific implementation mode, refer to the corresponding descriptions in FIG. 4b and FIG. 7b, which will not be repeated here.
- after the buffer operation is performed, the first file is written from the memory space to the external storage space to complete the file access process.
- the link identifier of the first file is associated with the second file, and the first file is deleted from the first storage space.
- the link identifier of the first file is used to acquire the first file.
- the file name of the third file is the same as the feature information of the first file
- the third file is associated with the storage address of the second file in the second storage space.
- the index directory is shown in FIG. 5 .
- the feature information of the first file is calculated as fingerprint A1.
- the fingerprint A1 exists in the index directory. This means that the file name of the third file is the same as the feature information of the first file, so it can be deduced that the file A associated with the third file is the same file as the first file, that is, the first file is a duplicate file.
- the link identifier of the first file is associated with the second file.
- for a specific implementation manner, refer to the manner of file association shown in FIG. 9, which will not be repeated here.
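Associating the link identifier of the first file with the second file can be pictured with an ordinary hard link. This is only one possible mechanism and is an assumption here, since FIG. 9 is not reproduced in this section; the file names are hypothetical.

```python
import os
import tempfile

root = tempfile.mkdtemp(prefix="dedup_link_")
second_file = os.path.join(root, "gallery_photo.jpg")   # copy that is already stored
first_file = os.path.join(root, "chat_photo.jpg")       # name requested by the new write

with open(second_file, "wb") as f:
    f.write(b"JPEGDATA")

# Instead of writing the duplicate content again, bind the new name to the stored data.
os.link(second_file, first_file)

with open(first_file, "rb") as f:
    content_via_link = f.read()

# Both names now refer to the same underlying file object.
same_inode = os.stat(first_file).st_ino == os.stat(second_file).st_ino
```

With a plain hard link, a later modification through either name is visible through both, so a complete design would also need a rule for writes to a deduplicated file; that aspect is outside this sketch.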
- the first file is written into the file system according to a normal file access process.
- a fourth file is newly created in the index directory
- the file name of the fourth file is the characteristic information of the first file
- the fourth file is associated with the storage address of the first file in the second storage space. That is to say, when the first file is not a duplicate file, a new fingerprint can be inserted into the index directory, thereby facilitating subsequent judgment of other files by the terminal device.
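The index directory lookup and insertion described above can be sketched with a real directory in which each entry's file name is a fingerprint. Storing the associated storage address as the text content of the entry is an assumption made for portability (FIG. 5 is not reproduced here); the helper names `lookup` and `insert`, the fingerprint `B2`, and the paths are hypothetical.

```python
import os
import tempfile


def lookup(index_dir, fingerprint):
    """Return the storage address recorded under `fingerprint`, or None if absent."""
    entry = os.path.join(index_dir, fingerprint)
    if not os.path.exists(entry):
        return None
    with open(entry, "r", encoding="utf-8") as f:
        return f.read()


def insert(index_dir, fingerprint, storage_address):
    """Create an index entry whose file name is the fingerprint."""
    with open(os.path.join(index_dir, fingerprint), "w", encoding="utf-8") as f:
        f.write(storage_address)


index_dir = tempfile.mkdtemp(prefix="dedup_index_")
insert(index_dir, "A1", "/data/media/fileA")   # entry for an already stored file

hit = lookup(index_dir, "A1")    # fingerprint present: the incoming file is a duplicate
miss = lookup(index_dir, "B2")   # fingerprint absent: the incoming file is new
if miss is None:
    insert(index_dir, "B2", "/data/media/fileB")   # register the new file for later checks
```

Because the fingerprint is the file name, the duplicate check reduces to a single name lookup in one directory, which is the simplification the text attributes to the index directory.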
- the intercepted write request includes the fifth file
- the file deduplication method further includes the following steps:
- Prompt information is generated, and the prompt information includes one or more of the following: a prompt of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
- the file deduplication method further includes the following steps:
- Generate a record log, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
- the file deduplication method further includes the following steps:
- the instruction instructs to enable the file deduplication function
- the embodiment of the present application provides a file deduplication method.
- the file deduplication method obtains a write request, stores the first file in the write request in the first storage space, and judges whether there is a second file in the second storage space, where the second file is the same as the first file.
- the method can effectively remove duplicate files of the terminal device and reduce storage space occupation; it is insensitive to applications and does not require users to perform complicated operations, thereby reducing system processing overhead.
- the same second file can also be queried through the link identifier of the first file, so that the access process of the file is not affected.
- FIG. 14 is a schematic flow chart of a file search method provided in an embodiment of the present application.
- the file search method can also be executed by a terminal device or a device deployed on the cloud, and includes the following steps:
- the first file in this embodiment may be a file included in the write request.
- the first file included in the write request is acquired.
- the first file may also be a file already written in the file system.
- one or more files in the file system are detected in the offline mode, and respective characteristic information of the one or more files are respectively determined.
- the characteristic information of the first file is determined according to the sampling data of the first file.
- the sampled data is partial data obtained from the data of the first file through a sampling algorithm.
- for determining the characteristic information of the first file and the method for obtaining sampled data, refer to the embodiments in FIG. 6 and FIG. 8; details will not be repeated here. It can be understood that acquiring the characteristic information of the first file by sampling is beneficial to reduce data processing overhead.
- the third file is a file in the index directory, and the third file is associated with the storage address of the second file in the second storage space, which means that the second file pointed to by the third file has been written into the disk and is a file that already exists in the system.
- by searching the index directory, it can be found whether a file identical to the first file already exists in the system.
- when the third file does not exist in the index directory, the first file is stored in the second storage space, and a fourth file is added to the index directory; the file name of the fourth file is the characteristic information of the first file, and the fourth file is associated with the storage address of the first file.
- the feature information of the first file is calculated as fingerprint D4.
- by searching the index directory as shown in FIG. 5, it is determined that the fingerprint D4 does not exist in the index directory. This means that no file identical to the first file exists in the system, and the first file is a non-duplicate file.
- a fourth file is inserted into the index directory as shown in FIG. 5 , the file name of the fourth file is fingerprint D4, and the fourth file points to the storage address of the first file in the second storage space.
- the link identifier of the first file is associated with the second file, and the first file is deleted from the first storage space.
- the link identifier of the first file is used to obtain the first file.
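For the offline mode mentioned in the steps above, where files already written to the file system are examined, a scan that groups files by fingerprint might look like the sketch below. The function `scan_for_duplicates` and the sample file names are hypothetical, and a full-content SHA-256 is used in place of the sampled characteristic information purely to keep the example short.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def scan_for_duplicates(root):
    """Walk `root`, fingerprint every file, and group paths by fingerprint."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Full-content hash for simplicity, not the sampled feature information.
                fingerprint = hashlib.sha256(f.read()).hexdigest()
            groups[fingerprint].append(path)
    # Keep only fingerprints shared by more than one file.
    return {fp: paths for fp, paths in groups.items() if len(paths) > 1}


root = tempfile.mkdtemp(prefix="dedup_scan_")
for name, payload in [("a.bin", b"same"), ("b.bin", b"same"), ("c.bin", b"other")]:
    with open(os.path.join(root, name), "wb") as f:
        f.write(payload)

duplicates = scan_for_duplicates(root)
```

Each returned group is a set of candidates for which one copy could be kept and the remaining names linked to it.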
- An embodiment of the present application provides a file search method.
- the file search method acquires a first file and determines the characteristic information of the first file; according to the characteristic information of the first file, it is determined whether there is a third file in the index directory, where the file name of the third file is the same as the feature information of the first file.
- the method of searching through the index directory is conducive to simplifying the process of searching for files. In addition, when the first file is a duplicate file and has been deleted, a later access to that file can be served by the second file (the same file as the first file) linked to the feature information of the first file, so that normal file access is maintained.
- the device provided by the embodiment of the present application may include a hardware structure and/or a software module, and may realize the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above-mentioned functions is executed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division. There may be other division methods in actual implementation.
- each functional module in each embodiment of the present application can be integrated into one processor, can exist physically separately, or two or more modules can be integrated into one module.
- the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
- FIG. 15 is a device 1500 provided by an embodiment of the present application, which is used to implement the file deduplication function or file search function in the above method embodiments.
- the device may be a terminal device or a device deployed on the cloud, or a device in the terminal device or the device deployed on the cloud, or a device that can be matched and used with the terminal device or the device deployed on the cloud.
- the device may be a system on a chip.
- the device 1500 includes at least one processor 1502, configured to implement the functions of the terminal device or the device deployed on the cloud in the file deduplication method or the file search method provided in the embodiment of the present application.
- the processor 1502 may store the first file in the first storage space in response to the write request.
- Device 1500 may also include at least one memory 1503 for storing program instructions and/or data.
- the memory 1503 is coupled to the processor 1502 .
- the coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
- Processor 1502 may cooperate with memory 1503 .
- Processor 1502 may execute program instructions stored in memory 1503 . At least one of the at least one memory may be included in the processor.
- the device 1500 may further include a communication interface 1501, which may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of implementing a sending and receiving function.
- the communication interface 1501 is used to communicate with other devices through a transmission medium, so that the device 1500 can communicate with other devices.
- the other device may be a terminal.
- the processor 1502 uses the communication interface 1501 to send and receive data, and is used to implement the method executed by the terminal device or the device deployed on the cloud described in the embodiment corresponding to FIG. 13 or FIG. 14 .
- the embodiment of the present application does not limit the specific connection medium among the communication interface 1501, the processor 1502, and the memory 1503.
- the bus is represented by a thick line in FIG. 15; the connection mode between other components is shown only for schematic illustration and is not limiting.
- the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 15 , but it does not mean that there is only one bus or one type of bus.
- the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application.
- a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
- the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), and may also be a volatile memory, such as random-access memory (RAM).
- the memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
- the memory in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.
- Figure 16 shows a file deduplication device 1600 provided by the embodiment of the present application.
- the file deduplication device can be a terminal device or a device deployed on the cloud, or it can be a device in a terminal device or in a device deployed on the cloud.
- the file deduplication device may include modules in one-to-one correspondence with the methods/operations/steps/actions described in the example corresponding to Figure 13, and each module may be implemented as a hardware circuit, as software, or as a hardware circuit combined with software.
- the device may include a file operation module 1601 , a file cache module 1602 , and an information processing module 1603 .
- the file operation module 1601 is configured to obtain a write request, where the write request includes the first file.
- the file caching module 1602 is configured to store the first file in response to the write request, and the first file is stored in the first storage space.
- the information processing module 1603 is configured to determine whether there is a second file in the second storage space, the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
- the file caching module 1602 is also used to store the first file in the second storage space.
- the information processing module 1603 is further configured to associate the link identifier of the first file with the second file if the second file exists, and the link identifier of the first file is used to obtain the first file;
- the file caching module 1602 is further configured to delete the first file from the first storage space.
- the information processing module 1603 is also used to determine the feature information of the first file according to the sampling data of the first file; the sampling data is partial data obtained from the data of the first file through a sampling algorithm.
- the information processing module 1603 is also used to determine whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
- the file deduplication apparatus 1600 further includes a generation module 1604, and the generation module 1604 is used to generate prompt information, and the prompt information includes one or more of the following: a prompt of deleted duplicate files, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of duplicate files.
- the generation module 1604 is also used to generate a record log, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of deleted duplicate files.
- the file deduplication apparatus 1600 further includes an execution module 1605, and the execution module 1605 is configured to obtain an instruction, the instruction indicates enabling the file deduplication function; in response to the instruction, perform an operation of obtaining a write request.
- Figure 17 shows a file search device 1700 provided by the embodiment of the present application.
- the file search device may be a terminal device or a device deployed on the cloud, or a device in a terminal device or a device deployed on the cloud. Or it is a device that can be matched with terminal devices or devices deployed on the cloud.
- the file search device may include modules in one-to-one correspondence with the methods/operations/steps/actions described in the example corresponding to Figure 14, and each module may be implemented as a hardware circuit, as software, or as a hardware circuit combined with software.
- the device may include a file operation module 1701 and an information processing module 1702 . Exemplarily, the file operation module 1701 is used to acquire the first file and determine the characteristic information of the first file.
- the information processing module 1702 is configured to determine feature information of the first file.
- the information processing module 1702 is also used to determine whether there is a third file in the index directory according to the feature information of the first file.
- the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
- the information processing module 1702 being used to determine the feature information of the first file includes: determining the feature information of the first file according to the sampling data of the first file, where the sampling data is partial data obtained from the data of the first file through a sampling algorithm.
- the file search apparatus 1700 further includes a file cache module 1703, and the file cache module 1703 is configured to, when the third file does not exist in the index directory, store the first file in the second storage space and add a fourth file to the index directory; the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
- the information processing module 1702 is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, and the link identifier of the first file is used to obtain the first file;
- the file caching module 1703 is further configured to delete the first file from the first storage space.
- the technical solutions provided by the embodiments of the present application may be fully or partially implemented by software, hardware, firmware or any combination thereof.
- when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- when the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
- the computer may be a general computer, a special computer, a computer network, a network device, a terminal device or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, radio, microwave, etc.) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
- the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (digital video disc, DVD)), or a semiconductor medium.
- the various embodiments may refer to each other; for example, methods and/or terms may be cross-referenced between the method embodiments, functions and/or terms may be cross-referenced between the device embodiments, and functions and/or terms may be cross-referenced between the apparatus embodiments and the method embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
According to an embodiment, the present invention relates to a file deduplication method and apparatus, and a device. In the method, during file writing, in response to a write request, a file in the write request is temporarily stored in a first storage space, and the file in the write request is compared with a file in a second storage space, so as to determine whether the file in the write request is a duplicate file. By means of the method, duplicate files can be automatically removed during file writing, and storage space occupation is reduced; a user does not need to actively initiate a file deduplication request, so that performance overhead is reduced.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/127162 WO2023070462A1 (fr) | 2021-10-28 | 2021-10-28 | Procédé et appareil de déduplication de fichiers et dispositif |
CN202180103614.0A CN118120212A (zh) | 2021-10-28 | 2021-10-28 | 一种文件去重方法、装置和设备 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/127162 WO2023070462A1 (fr) | 2021-10-28 | 2021-10-28 | Procédé et appareil de déduplication de fichiers et dispositif |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023070462A1 true WO2023070462A1 (fr) | 2023-05-04 |
Family
ID=86160400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/127162 WO2023070462A1 (fr) | 2021-10-28 | 2021-10-28 | Procédé et appareil de déduplication de fichiers et dispositif |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118120212A (fr) |
WO (1) | WO2023070462A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118410020A (zh) * | 2024-03-18 | 2024-07-30 | 荣耀终端有限公司 | 一种文件处理方法及电子设备 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118074869B (zh) * | 2024-02-04 | 2024-10-25 | 深圳市奇迅新游科技股份有限公司 | 重复数据的处理方法、终端设备及计算机可读存储介质 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101630290A (zh) * | 2009-08-17 | 2010-01-20 | 成都市华为赛门铁克科技有限公司 | 重复数据处理方法和装置 |
CN103177111A (zh) * | 2013-03-29 | 2013-06-26 | 西安理工大学 | 重复数据删除系统及其删除方法 |
CN103324552A (zh) * | 2013-06-06 | 2013-09-25 | 西安交通大学 | 两阶段单实例去重数据备份方法 |
US9189414B1 (en) * | 2013-09-26 | 2015-11-17 | Emc Corporation | File indexing using an exclusion list of a deduplicated cache system of a storage system |
CN105630834A (zh) * | 2014-11-07 | 2016-06-01 | 中兴通讯股份有限公司 | 一种实现重复数据删除的方法及装置 |
CN106649676A (zh) * | 2016-12-15 | 2017-05-10 | 北京锐安科技有限公司 | 一种基于hdfs存储文件的去重方法及装置 |
US9679040B1 (en) * | 2010-05-03 | 2017-06-13 | Panzura, Inc. | Performing deduplication in a distributed filesystem |
-
2021
- 2021-10-28 CN CN202180103614.0A patent/CN118120212A/zh active Pending
- 2021-10-28 WO PCT/CN2021/127162 patent/WO2023070462A1/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN118120212A (zh) | 2024-05-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21961828 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180103614.0 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21961828 Country of ref document: EP Kind code of ref document: A1 |