CN116414782A - Method for identifying repeated file and electronic equipment

Info

Publication number: CN116414782A
Application number: CN202310688769.6A
Authority: CN
Inventors: 江加国
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-06-12
Filing date: 2023-06-12
Publication date: 2023-07-11
Anticipated expiration: 2043-06-12
Also published as: CN116414782B

Abstract

The application relates to the technical field of terminals and provides a method for identifying duplicate files and an electronic device. The method comprises: scanning files in a storage space; dividing scanned files with the same file size into the same group to obtain one or more first file groups; removing each first file group that comprises a single file, calculating head-to-tail page hash values of all files in the remaining first file groups, and dividing files with the same head-to-tail page hash values in the same first file group into the same group to obtain one or more second file groups; removing each second file group that comprises a single file, and calculating a file hash value of each file in the remaining second file groups; and identifying duplicate files in the remaining second file groups based on the file hash values of the files in those groups. This scheme reduces the amount of computation the terminal device performs when identifying duplicate files and increases its duplicate file identification speed.

Description

Method for identifying repeated file and electronic equipment

Technical Field

The present disclosure relates to the field of terminal technologies, and in particular, to a method and an electronic device for identifying a duplicate file.

Background

Currently, to meet people's growing data storage demands, manufacturers provide terminal devices with ever larger storage space. As the storage space of terminal devices grows and more data is stored on them, the number of duplicate files in a terminal device keeps increasing, which reduces both the use efficiency of the storage space and the data management efficiency of the terminal device. For this reason, identifying and displaying the duplicate files in a terminal device gives the user a way to improve the data management efficiency of the terminal device and the use efficiency of its storage space.

The common duplicate file identification method is to scan the files in the storage space through a third party application program, calculate the hash values of the scanned files, compare the hash values of the scanned files, and identify whether the files are duplicate files based on whether the hash values of the files are the same. However, since the calculation amount of the hash value of the file is generally large, it takes a lot of time to calculate the hash value of each scanned file, which not only reduces the duplicate file identification speed of the terminal device but also greatly increases the power consumption of the terminal device.

Disclosure of Invention

The embodiment of the application provides a method for identifying repeated files and electronic equipment, which can reduce the calculated amount when the terminal equipment identifies the repeated files, improve the repeated file identification speed of the terminal equipment and reduce the power consumption of the terminal equipment.

In a first aspect, an embodiment of the present application provides a method for identifying a duplicate file, including:

scanning files in the storage space;

dividing the scanned files with the same file size into the same group to obtain one or more first file groups;

removing a first file group comprising single files, calculating head-to-tail page hash values of all files in the remaining first file groups, and dividing files with the same head-to-tail page hash values in the same first file group into the same group to obtain one or more second file groups;

removing a second file group comprising single files, and calculating a file hash value of each file in the remaining second file group;

identifying duplicate files in the remaining second file groups based on the file hash values of the files in the remaining second file groups.

Optionally, in the case that the terminal device only supports the built-in flash memory or the solid state disk, but does not support the external expansion memory card, the storage space may only include the storage space provided by the built-in flash memory or the solid state disk of the terminal device. Optionally, in the case that the terminal device supports both the built-in flash memory or the solid state disk and the external expansion memory card, the storage space may include a storage space provided by the built-in flash memory or the solid state disk of the terminal device and a storage space provided by the external expansion memory card.

Scanning files in the storage space may specifically refer to traversing and checking the files in the storage space to obtain file attributes of each file in the storage space.

The file attributes may include, for example, file name and file type.

The first file group may include only one file, or may include a plurality of files. The file sizes of all files in the same first file group are equal. The file sizes of the files in the different first file groups are not equal.

The first file group including a single file refers to a first file group having a number of files of 1.

After the terminal device removes the first file groups comprising a single file, each of the remaining first file groups comprises a plurality of files, i.e. the number of files in each remaining first file group is greater than 1.

The head-to-tail page hash values of a file may include a hash value of the first page of the file and a hash value of the last page of the file.

Wherein, the first page of the file refers to the first memory page of the file, and the last page of the file refers to the last memory page of the file.

Two files have equal head-to-tail page hash values when the hash values of their first memory pages are equal and the hash values of their last memory pages are equal.
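The head-to-tail page hash described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source names neither a hash algorithm nor a page size, so SHA-256 and a 4096-byte page are assumptions, and `head_tail_page_hash` is a hypothetical helper name.

```python
import hashlib
import os

PAGE_SIZE = 4096  # assumed page size in bytes; the source does not specify one


def head_tail_page_hash(path, page_size=PAGE_SIZE):
    """Return (first-page hash, last-page hash) for the file at `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        first = f.read(page_size)
        # The last page starts at the final page-aligned offset of the file.
        last_start = ((size - 1) // page_size) * page_size if size else 0
        f.seek(last_start)
        last = f.read(page_size)
    # SHA-256 is an assumption; any content hash would serve the same role.
    return hashlib.sha256(first).hexdigest(), hashlib.sha256(last).hexdigest()
```

Only two pages are read per file regardless of file size, which is why this check is much cheaper than hashing the whole file. Files that agree on both pages may still differ in between, so the result is a filter rather than a final verdict.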

The second file group may include only one file or may include a plurality of files. The hash values of the head and the tail pages of all files in the same second file group are equal, and the hash values of the head and the tail pages of the files in different second file groups are unequal.

The second file group including a single file refers to a second file group having a number of files of 1.

After the terminal device removes the second file groups comprising a single file, each remaining second file group comprises a plurality of files, i.e. the number of files in each remaining second file group is greater than 1.

The file hash value of a file refers to the hash value of all memory pages of the file. The file hash value differs from the hash value of the first memory page or the hash value of the last memory page in that the latter is the hash value of a single memory page of the file, whereas the file hash value is the hash value corresponding to the entire contents of the file.

Alternatively, only one thread may be created to calculate the file hash value for each file in all the remaining second file groups.

Alternatively, multiple threads may be created to calculate the file hash value for each file in the remaining respective second file group.

According to the method for identifying duplicate files, all scanned files are grouped a first time based on file size and a second time based on head-to-tail page hash values, and file groups comprising a single file are removed after each grouping, so that some non-duplicate files are filtered out and the files that may be duplicates are roughly screened out; one or more threads are then created to calculate the file hash value of each file in the remaining file groups, and duplicate files in the remaining file groups are identified based on these file hash values. Because the terminal device filters out some non-duplicate files, it does not spend computation on calculating their file hash values, which reduces the overall amount of computation when identifying duplicate files and improves the duplicate file identification speed. In addition, when the file hash values of the files in the remaining file groups are calculated by a plurality of threads, the file hash value calculation speed is improved, which further improves the duplicate file identification speed. The improved identification speed can in turn reduce the power consumption of the terminal device when identifying duplicate files and save its memory resources.
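The three-stage screening summarized above can be sketched in a single-threaded form as follows. This is an illustrative sketch under stated assumptions rather than the patent's implementation: SHA-256, the 4096-byte page size, and all function names (`find_duplicates`, `_regroup`, and the other helpers) are hypothetical choices not given in the source.

```python
import hashlib
import os
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size; not specified in the source


def _sha256_of_range(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).hexdigest()


def _head_tail_key(path):
    size = os.path.getsize(path)
    last_start = ((size - 1) // PAGE_SIZE) * PAGE_SIZE if size else 0
    return (_sha256_of_range(path, 0, PAGE_SIZE),
            _sha256_of_range(path, last_start, PAGE_SIZE))


def _file_hash(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _regroup(groups, key_fn):
    """Split each group by key_fn and drop resulting groups of a single file."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key_fn(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(root):
    """Return lists of paths under `root` whose contents hash identically."""
    paths = [os.path.join(d, n) for d, _, names in os.walk(root) for n in names]
    first_groups = _regroup([paths], os.path.getsize)       # same file size
    second_groups = _regroup(first_groups, _head_tail_key)  # same head/tail pages
    return _regroup(second_groups, _file_hash)              # same full-file hash
```

Each stage only ever splits an existing group, so the expensive full-file hash runs solely on files that already share both a size and their first and last pages.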

In an optional implementation manner of the first aspect, the calculating a file hash value of each file in the remaining second file group includes:

creating a plurality of threads;

and distributing the files in the remaining second file group to each thread according to the principle that the calculated amount of each thread is equal or approximately equal, so that the total file sizes of the files distributed to each thread are equal or approximately equal.

The total file size of the file to which the thread is assigned refers to the sum of the file sizes of all the files to which the thread is assigned.

According to the method for identifying the repeated file, provided by the embodiment of the application, the file hash value of each file in the residual file group is calculated through a plurality of threads, so that the file hash value calculation speed of each file remaining in the repeated file identification process can be improved, and the repeated file identification speed of the terminal equipment is further improved.
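As a rough illustration of hashing with a plurality of threads, the sketch below takes one list of file paths per thread and computes the file hash values concurrently. It assumes the per-thread assignment has already been decided; `hash_assignments` and `file_hash` are hypothetical names, and SHA-256 is an assumed hash algorithm.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_hash(path, chunk_size=1 << 20):
    """Hash the entire contents of a file, reading it chunk by chunk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_assignments(assignments):
    """assignments: one list of paths per thread. Returns {path: file hash}."""
    def work(paths):
        return {p: file_hash(p) for p in paths}

    results = {}
    # One worker per assignment list, so each list is processed by its own thread.
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        for partial in pool.map(work, assignments):
            results.update(partial)
    return results
```

Because the threads finish at roughly the same time only when their lists carry similar total file sizes, the quality of the assignment determines how much the extra threads actually help.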

In an optional implementation manner of the first aspect, the creating a plurality of threads includes:

and under the condition that the total file size of the files included in all the remaining second file groups is smaller than or equal to the first file size threshold value, creating a preset number of threads, wherein the preset number is constant and is larger than 1.

The first file size threshold may be determined according to a preset number and a maximum calculated amount of each thread. Specifically, the first file size threshold may be a product of a preset number and a maximum calculated amount per thread.

creating R threads, where R = S + up_round((D1 - YZ)/D2), in the case that the total file size of the files included in all the remaining second file groups is larger than the first file size threshold;

wherein S is the preset number of threads, D1 is the total file size of the files included in all the remaining second file groups, YZ is the first file size threshold, D2 is the maximum calculated amount of each thread, and up_round () is an upward rounding function.

In the case where (D1-YZ)/D2 is a non-integer, up_round ((D1-YZ)/D2) is the integer part of the result obtained by dividing D1-YZ by D2 plus 1; in the case where (D1-YZ)/D2 is an integer, up_round ((D1-YZ)/D2) is the result obtained by dividing D1-YZ by D2.
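The thread-count rule above can be written out as a short function. This is a sketch assuming the first file size threshold YZ is the product of the preset number S and the maximum calculated amount per thread D2, as the text suggests it may be; `thread_count` and its parameter names are hypothetical.

```python
def thread_count(total_size, preset_threads, max_per_thread):
    """Number of hashing threads for `total_size` bytes of candidate files.

    total_size     -> D1, preset_threads -> S, max_per_thread -> D2.
    """
    threshold = preset_threads * max_per_thread  # YZ, assumed to be S * D2
    if total_size <= threshold:
        return preset_threads
    excess = total_size - threshold
    # Integer ceiling division implements up_round() without floating point.
    return preset_threads + (excess + max_per_thread - 1) // max_per_thread


print(thread_count(8, 4, 2))   # 4: at the threshold, the preset number is used
print(thread_count(9, 4, 2))   # 5: one extra thread for the excess
print(thread_count(12, 4, 2))  # 6
```

Under the assumption YZ = S × D2, the rule for totals above the threshold reduces to up_round(D1/D2), i.e. just enough threads that none exceeds its maximum calculated amount.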

According to the method for identifying the repeated file, the number of threads for calculating the file hash value is dynamically adjusted based on the total file size of the repeated file to be identified, so that the method can still ensure the rapid repeated file identification speed under the condition that the total file size of the repeated file is large, and the application range of the method is enlarged.

In an optional implementation manner of the first aspect, the allocating files in the remaining second file group to each thread according to a rule that a calculation amount of each thread is equal or approximately equal includes:

determining a first ratio of the total file size of the files included in all the remaining second file groups to the total number of threads as an average calculated amount of each thread, wherein the average calculated amount is smaller than or equal to the maximum calculated amount of the threads;

based on the average calculated amount, taking a single second file group as a minimum allocation unit, and sequentially allocating the rest second file groups into threads according to the sequence from small total file sizes of the second file groups to large total file sizes of the second file groups, so that the total file size of the files allocated to each thread is equal to or approximately equal to the average calculated amount.

Wherein the total file size of the files included in all the remaining second file groups refers to the sum of the file sizes of all the files included in all the remaining second file groups.

The total number of threads refers to the total number of threads created by the terminal device for calculating file hash values of the files included in all the remaining second file groups.

The principle that the terminal equipment allocates files for each thread by taking a single second file group as a minimum allocation unit is to ensure that the total file size of the files allocated to each thread is equal to or approximately equal to the average calculated amount of each thread.

Wherein the total file size of the file to which the thread is assigned refers to the sum of the file sizes of all the files to which the thread is assigned.

When the terminal equipment distributes files for each thread, the terminal equipment can distribute file groups of small files first and redistribute file groups of large files. Wherein, the file group of the small file refers to the file group with relatively smaller file size of the single file in the file group, and the file group of the large file refers to the file group with relatively larger file size of the single file in the file group.

Specifically, the terminal device may first calculate the total file size of each second file group, and then sequentially group the remaining second file groups into each thread according to the order of the total file sizes of the second file groups from small to large based on the average calculated amount of each thread and the total file sizes of the second file groups. Wherein the total file size of the second file group refers to the sum of the file sizes of all files included in the second file group.

When files are assigned to the threads with a whole second file group as the minimum allocation unit, the total file size of the files assigned to each thread cannot be guaranteed to equal the average calculated amount of the threads exactly. However, since the average calculated amount of a thread is usually on the order of GB while the deviation from the average calculated amount is usually on the order of KB, the deviation generally has no great influence on the overall file identification speed of all threads.

In this implementation, since the files in the same second file group are the ones more likely to be duplicates of one another, assigning all files of a second file group to the same thread can improve the accuracy of identifying duplicate files.
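One possible reading of the group-based allocation above is sketched below. The source does not specify exactly when to move on to the next thread, so the rule used here (advance once the current thread reaches the average calculated amount) is an assumption, as are the name `allocate_groups` and the representation of a file as a `(path, size)` pair.

```python
def allocate_groups(groups, num_threads):
    """Assign whole second file groups to threads, smallest group first.

    groups: list of groups, each a list of (path, size) pairs.
    Returns (threads, loads): the groups given to each thread and the
    total file size each thread received.
    """
    totals = [sum(size for _, size in g) for g in groups]
    average = sum(totals) / num_threads  # average calculated amount per thread
    order = sorted(range(len(groups)), key=lambda i: totals[i])

    threads = [[] for _ in range(num_threads)]
    loads = [0] * num_threads
    t = 0
    for i in order:
        threads[t].append(groups[i])  # a group is never split across threads
        loads[t] += totals[i]
        # Assumed rule: move to the next thread once this one reaches the average.
        if loads[t] >= average and t < num_threads - 1:
            t += 1
    return threads, loads
```

For example, three groups with total sizes 2, 4 and 6 split across two threads give loads of 6 and 6, with the two smaller groups sharing the first thread.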

calculating a second ratio of the total number of files included in all remaining second file groups to the total number of threads;

assigning to each thread a number of files equal to the second ratio if the second ratio is an integer;

in the case that the second ratio is a non-integer, first assigning N files to each thread, and then assigning one more file to each of M threads among the plurality of threads; where N = floor(X1/Y1), M = X1 mod Y1, X1 is the total number of files included in all remaining second file groups, Y1 is the total number of threads, floor() is a round-down function, and mod is a remainder function;

and carrying out file allocation adjustment on the target threads needing the file allocation adjustment based on the total file size of the files allocated to each thread, so that the total file sizes of the files allocated to each thread are equal or approximately equal.

floor (X1/Y1) is the integer part of the result obtained by dividing X1 by Y1.

X1 mod Y1 is the remainder of the result obtained by dividing X1 by Y1.

In this implementation, the number of files to which different threads are assigned differs by at most 1.
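The count-based initial assignment above can be sketched as follows. Which M threads receive the extra file is not stated in the source; giving it to the first M threads is an assumption, and `distribute_by_count` is a hypothetical name. The later size-based adjustment step is not covered here.

```python
def distribute_by_count(files, num_threads):
    """Give each thread N = floor(X1/Y1) files, plus one more for M = X1 mod Y1 threads."""
    n, m = divmod(len(files), num_threads)
    threads, start = [], 0
    for t in range(num_threads):
        count = n + (1 if t < m else 0)  # assumed: the first M threads get the extra file
        threads.append(files[start:start + count])
        start += count
    return threads


print(distribute_by_count(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```

With 7 files and 3 threads, N = 2 and M = 1, so one thread holds three files and the other two hold two each.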

The target thread that needs to perform file allocation adjustment is a thread in which the total file size of the allocated file is larger than the average value of the total file sizes of the threads. The average value of the total file sizes of the threads may be a ratio of the total file sizes of all files included in the remaining second file group to the total number of threads, in other words, the average value of the total file sizes of the threads, that is, the average calculated amount of the threads.

According to the method for identifying the repeated file, all files in the remaining second file group are distributed to each thread in average, so that the consumed time when each thread calculates the file hash value is approximately equal, the terminal equipment can complete the calculation process of the file hash value in the fastest time, and the repeated file identification speed of the terminal equipment is improved.

Determining the ratio of the sum of the total file sizes of the files to which all threads are allocated to the total number of threads as an average value of the total file sizes of the threads;

calculating the difference value between the total file size of the files allocated to each thread and the average value;

determining the thread with the difference value larger than a first difference value threshold as a target thread;

and carrying out file allocation adjustment on the target threads, so that the difference value between the total file size of the files finally allocated to each target thread and the average value is smaller than or equal to the first difference value threshold value.

The sum of the total file sizes of the files to which all threads are allocated, i.e. the total file size of all files comprised by the remaining second file group.

The difference between the total file size of the files assigned to a thread and the average value of the total file sizes of the threads is taken as a non-negative value, namely the absolute value of the total file size of the files assigned to the thread minus the average value. Specifically, if the total file size of the files assigned to a thread is smaller than the average value, the difference is the average value minus that total file size; if the total file size of the files assigned to a thread is greater than the average value, the difference is that total file size minus the average value.

The first difference threshold may be determined based on the data processing rate of each thread and a preset time difference. For example, the first difference threshold may be the product of the data processing rate of each thread and the preset time difference.

The data processing rate of a thread may be, for example, the rate at which the thread calculates file hash values. The preset time difference may be, for example, the maximum difference allowed between the times spent by different threads in calculating file hash values.

The target thread may include a first type of thread having a total file size of the assigned files that is greater than an average of the total file sizes of the threads, and a second type of thread having a total file size of the assigned files that is less than an average of the total file sizes of the threads.
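The selection of target threads described above can be sketched as follows. This is an illustration under assumptions: the threshold is taken as the product of the per-thread data processing rate and the allowed time difference, which the text gives only as an example, and `find_target_threads`, `rate`, and `max_time_gap` are hypothetical names.

```python
def find_target_threads(loads, rate, max_time_gap):
    """Classify threads whose load deviates too far from the average.

    loads: total file size assigned to each thread (bytes).
    rate: assumed hashing throughput of one thread (bytes per second).
    max_time_gap: allowed difference in finishing time (seconds).
    Returns (threshold, first_type, second_type) with thread indices.
    """
    threshold = rate * max_time_gap  # first difference threshold (example rule)
    average = sum(loads) / len(loads)
    # First type: load exceeds the average by more than the threshold.
    first_type = [i for i, load in enumerate(loads) if load - average > threshold]
    # Second type: load falls short of the average by more than the threshold.
    second_type = [i for i, load in enumerate(loads) if average - load > threshold]
    return threshold, first_type, second_type


print(find_target_threads([100, 40, 70, 70], rate=10, max_time_gap=2))
# (20, [0], [1])
```

Here the average load is 70, so thread 0 (30 above) and thread 1 (30 below) both exceed the threshold of 20 and become candidates for moving files between them.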

Optionally, if moving one or more files from the first type of threads to the second type of threads is sufficient to make the difference between the total file size of the files finally assigned to each target thread and the average value of the total file sizes of the threads smaller than or equal to the first difference threshold, the terminal device may only select one or more files from the first type of threads and reassign them to the second type of threads.

Optionally, if moving one or more files from the first type of threads to the second type of threads alone cannot make the difference between the total file size of the files finally assigned to each target thread and the average value of the total file sizes of the threads smaller than or equal to the first difference threshold, the terminal device may, in addition to selecting one or more files from the first type of threads and reassigning them to the second type of threads, also select one or more files from the second type of threads and reassign them to the first type of threads.

In an optional implementation manner of the first aspect, the identifying, based on the file hash values of the files in the remaining second file group, duplicate files in the remaining second file group includes:

and identifying the files with the same file hash values in the remaining second file group as repeated files.

After the terminal device identifies all the repeated files in the storage space, the terminal device can display all the identified repeated files and cleaning controls for the repeated files. Therefore, the user can select to clean the repeated file according to the actual requirement. After the repeated files are cleaned, the use efficiency and the data management efficiency of the storage space of the terminal equipment can be improved.

Optionally, in the repeated file identification process, the terminal device may display the number of the identified repeated files in real time. Thus, the user can know the repeated file identification progress of the terminal equipment and the number of the repeated files contained in the storage space in real time.

In a second aspect, an embodiment of the present application provides an electronic device, including: one or more processors; one or more memories; the one or more memories store one or more computer-executable programs comprising instructions that, when executed by the one or more processors, cause the electronic device to perform steps in a method of identifying duplicate files as described in any of the implementations of the first aspect above.

In a third aspect, embodiments of the present application provide a computer-readable storage medium storing a computer-executable program which, when invoked by a computer, causes the computer to perform the steps of a method of identifying duplicate files as described in any one of the implementations of the first aspect above.

In a fourth aspect, embodiments of the present application provide a computer-executable program product which, when run on an electronic device, causes the electronic device to perform the steps in the method of identifying duplicate files described in any of the implementations of the first aspect above.

In a fifth aspect, embodiments of the present application provide a chip system, including a processor, where the processor is coupled to a memory, and the processor executes a computer executable program stored in the memory to implement steps in a method for identifying duplicate files according to any implementation of the first aspect. The chip system can be a single chip or a chip module composed of a plurality of chips.

It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.

Drawings

FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a method for identifying duplicate files according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a process for screening scanned files according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a file reading method according to an embodiment of the present application;

FIG. 5 is a flowchart of a specific implementation of S204 in a method for identifying duplicate files according to an embodiment of the present application;

FIG. 6 is a flowchart of a specific implementation of S2042 in a method for identifying duplicate files according to an embodiment of the present application;

FIG. 7 is a flowchart of a specific implementation of S2042 in a method for identifying duplicate files according to another embodiment of the present application;

FIG. 8 is a schematic diagram of a software architecture of an electronic device according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a user interface involved in an implementation process of a method for identifying duplicate files according to an embodiment of the present application;

FIG. 10 is a schematic diagram of the interaction timing between the software modules in a terminal device when the terminal device implements a method for identifying duplicate files according to an embodiment of the present application.

Detailed Description

It should be noted that the terms used in the implementation section of the embodiments of the present application are only used to explain the specific embodiments of the present application, and are not intended to limit the present application. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, in the description of the embodiments of the present application, unless otherwise indicated, "a plurality" means two or more, and "at least one" and "one or more" mean one, two or more.

The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a definition of "a first", "a second" feature may explicitly or implicitly include one or more of such features.

Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

The storage space of a terminal device (e.g., a mobile phone or computer, etc.) refers to the available space provided by the terminal device for storing an operating system, application programs, media files, documents, or other user data. Under the condition that the terminal equipment does not support an external expansion memory card, the memory space of the terminal equipment can only comprise the memory space of the flash memory or the solid state disk built in the terminal equipment, and under the condition, the size of the memory space of the terminal equipment can refer to the capacity of the flash memory or the solid state disk built in the terminal equipment. Under the condition that the terminal equipment supports the external expansion memory card, the memory space of the terminal equipment can also comprise the memory space of the external expansion memory card besides the memory space of the built-in flash memory or the solid state disk, and under the condition, the size of the memory space of the terminal equipment can be the sum of the capacity of the built-in flash memory or the solid state disk of the terminal equipment and the capacity of the external expansion memory card.

Currently, in order to meet the increasing data storage demands of people, the storage space of terminal equipment provided by manufacturers of terminal equipment for users is also increased. With the increasing storage space of the terminal devices and the increasing data stored on the terminals, there are more and more duplicate files in the terminal devices. Repeated files are typically generated in the following scenario:

Scenario 1, forwarding or sharing of files may produce duplicate files. Specifically, when a user forwards or shares a file to others on a terminal device, the terminal device typically creates a copy of the file and sends the copy of the file to the intended recipient, so that both the original file and the copy of the file are stored in the terminal device, resulting in the creation of duplicate files.

Scenario 2, duplicate files may be generated while an application is running. Specifically, some applications in the terminal device may generate temporary files, log files, cache files, and the like at runtime. If these applications do not have an appropriate file management mechanism, the temporary files, log files, cache files, etc. may be generated repeatedly at different points in time while the application is running, resulting in duplicate files.

Scenario 3, duplicate files may be generated while the operating system is running. Specifically, when multiple application programs need to share the same file while the operating system is running, the operating system creates a copy of the shared file under the storage path corresponding to each of these application programs, resulting in duplicate files.

The increase of repeated files in the terminal device not only reduces the use efficiency of the storage space of the terminal device, but also reduces the data management efficiency of the terminal device. Based on the above, in order to improve the data management efficiency of the terminal device and the use efficiency of the storage space of the terminal device, a method for identifying the duplicate file can be provided, and the duplicate file in the terminal device is identified and displayed to the user, so that the user can select whether to clean the duplicate file based on the actual requirement, thereby providing a way for improving the data management efficiency of the terminal device and the use efficiency of the storage space for the user.

In general, the speed at which duplicate files in a storage space are identified is affected by factors including, but not limited to, the following:

1) Capacity and usage of storage space. Specifically, the larger the capacity of the storage space and the higher its usage, the more files are stored in it. The total number of duplicate files in the storage space may therefore be larger, the amount of calculation needed to identify them may be larger, and more duplicate files may finally be identified.

2) File size of a single file. Specifically, the larger the file size of a single file, the longer the calculation time of the file hash value, and the slower the identification speed of the duplicate file.

The common duplicate file identification method is to scan files in a storage space of the terminal device through a third party application program, calculate hash values of the scanned files, compare the hash values of the scanned files, and identify whether the files are duplicate files based on whether the hash values of the files are identical. For example, if the hash values of two files are equal, it means that the two files are identical files; if the hash values of the two files are different, it means that the two files are different files.

However, calculating the hash value of a file generally requires reading each content chunk of the file one by one and calculating the hash value corresponding to each content chunk separately before the hash value of the file is finally obtained. Calculating the hash value of a file is therefore computationally intensive. Calculating the hash value of every scanned file consumes a great deal of time, which reduces the duplicate file identification speed of the terminal device, increases its power consumption, and occupies more of its memory resources.
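The common approach described above can be sketched as follows. This is only an illustration under assumptions that the source does not state: the SHA-256 algorithm, the 1 MiB chunk size, the incremental digest update, and the names `file_hash` and `find_duplicates_naive` are all hypothetical choices. The sketch shows that every byte of every scanned file is read and hashed, whether or not the file has any duplicate.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1 << 20  # assumed 1 MiB chunk; the source does not specify a size


def file_hash(path, algorithm="sha256"):
    """Hash a whole file by feeding it to the digest one chunk at a time."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates_naive(root):
    """Hash every regular file under root and group files with equal hashes."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[file_hash(path)].append(path)
    # Only hash values shared by two or more files indicate duplicates.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

The cost of this sketch grows with the total number of bytes stored, which is the behaviour the preceding paragraphs identify as the bottleneck.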

In addition, the file access rights of a third party application program are generally limited. That is, the third party application program can only scan or read user files (such as files under the data/user directory) and cannot scan all files in the storage space; for example, it cannot scan all library files in the storage space. The file identification range of the common duplicate file identification method is therefore small, and the user cannot clean all duplicate files in the terminal device. In other words, the common duplicate file identification method can result in incomplete cleaning of the duplicate files in the storage space.

In view of this, the embodiments of the present application provide a method and an electronic device for identifying duplicate files. All scanned files are first grouped by file size and then grouped a second time by the hash value of the head and tail pages of each file, and file groups that include only a single file are rejected after each grouping. In this way, some non-duplicate files are filtered out, that is, files that may be duplicate files are roughly screened out. One or more threads are then created to calculate the file hash value of each file in the remaining file groups, and duplicate files in the remaining file groups are identified based on those file hash values. Because the terminal device filters out some non-duplicate files, it does not need to calculate file hash values for them, which reduces the overall amount of calculation when the terminal device identifies duplicate files and improves its duplicate file identification speed. In addition, calculating the file hash values of the files in the remaining file groups with multiple threads increases the file hash value calculation speed of the terminal device, further improving its duplicate file identification speed. The faster identification can in turn reduce the power consumption of the terminal device when identifying duplicate files and save its memory resources.
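The staged screening described above might look like the following minimal sketch. It is not the implementation of the embodiments: the 4096-byte page size, SHA-256, the thread pool with four workers, and the names `split_groups`, `head_tail_hash`, `full_hash`, and `find_duplicates` are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumed page size, chosen for illustration only


def split_groups(paths, key):
    """Regroup paths by key(path) and reject groups that hold a single file."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [g for g in groups.values() if len(g) > 1]


def head_tail_hash(path):
    """Hash only the first page and the last page of a file."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(PAGE_SIZE))
        if size > PAGE_SIZE:  # smaller files are fully covered by the head page
            f.seek(size - PAGE_SIZE)
            digest.update(f.read(PAGE_SIZE))
    return digest.hexdigest()


def full_hash(path):
    """Hash the entire file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, workers=4):
    # First grouping: files with the same size.
    first_groups = split_groups(paths, os.path.getsize)
    # Second grouping: same head/tail page hash within each first group.
    second_groups = []
    for group in first_groups:
        second_groups.extend(split_groups(group, head_tail_hash))
    # Full hashes are computed only for surviving candidates, on worker threads.
    candidates = [p for g in second_groups for p in g]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = dict(zip(candidates, pool.map(full_hash, candidates)))
    duplicates = []
    for group in second_groups:
        duplicates.extend(split_groups(group, hashes.__getitem__))
    return [sorted(g) for g in duplicates]
```

In this sketch a file with a unique size is never opened at all, and a file whose head and tail pages differ from those of every same-sized file costs at most two page reads, so the expensive full hash is reserved for files that are still plausible duplicates.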

The method for identifying duplicate files provided by the embodiment of the application can be applied to electronic devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (augmented reality, AR)/Virtual Reality (VR) devices, notebook computers, ultra-mobile personal computer (UMPC), netbooks, personal digital assistants (personal digital assistant, PDA) and the like, and the embodiment of the application does not limit the specific types of the electronic devices.

The structure of an electronic device to which the method for identifying duplicate files provided in the embodiments of the present application is applicable is described below, taking a mobile phone as an example of the electronic device. Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. Among them, the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, and a bone conduction sensor 180M, etc.

The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.

The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory. This avoids repeated accesses and reduces the latency of the processor 110, thereby improving the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a SIM interface, and/or a USB interface, among others.

The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (serial data line, SDA) and a serial clock line (serial clock line, SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, a charger, a flash, the camera 193, etc., respectively, through different I2C bus interfaces. For example: the processor 110 may be coupled to the touch sensor 180K through an I2C interface, such that the processor 110 communicates with the touch sensor 180K through an I2C bus interface to implement a touch function of the electronic device 100.

The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering a call through the bluetooth headset.

PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement a function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.

The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.

The MIPI interface may be used to connect the processor 110 to peripheral devices such as a display 194, a camera 193, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the photographing functions of electronic device 100. The processor 110 and the display 194 communicate via a DSI interface to implement the display functionality of the electronic device 100.

The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.

The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. It may also be used to connect a headset and play audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.

It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.

The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charge management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.

The power management module 141 is used to connect the battery 142, the charge management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to power the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor parameters such as battery capacity, battery cycle number, and battery health (leakage, impedance). In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.

The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.

The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.

The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.

The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, Wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.

In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The wireless communication techniques may include the Global System for Mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).

The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flex light-emitting diode, FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.

The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.

The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.

The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.

The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, or the like.

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, etc.

The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.

The internal memory 121 may be used to store computer executable program code including instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.

The electronic device 100 may implement audio functions, such as music playing and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.

The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.

The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 may be used to listen to music or to a hands-free call through the speaker 170A.

The receiver 170B, also referred to as an "earpiece", is used to convert the audio electrical signal into a sound signal. When the electronic device 100 is answering a telephone call or voice message, the voice may be heard by placing the receiver 170B in close proximity to the human ear.

The microphone 170C, also referred to as a "mike" or a "mic", is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can speak with the mouth close to the microphone 170C, inputting a sound signal to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, and may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to enable collection of sound signals, noise reduction, identification of sound sources, directional recording functions, etc.

The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, a 3.5mm open mobile terminal platform (open mobile terminal platform, OMTP) standard interface, or a cellular telecommunications industry association of the USA (cellular telecommunications industry association of the USA, CTIA) standard interface.

The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may comprise at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the touch operation intensity according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example, when a touch operation whose touch operation intensity is less than a first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed. When a touch operation whose touch operation intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.

The gyro sensor 180B may be used to determine a motion posture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through the reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used in navigation and motion-sensing game scenarios.

The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude from barometric pressure values measured by barometric pressure sensor 180C, aiding in positioning and navigation.

The magnetic sensor 180D includes a hall sensor. The electronic device 100 may detect the opening and closing of a flip leather case using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip cover according to the magnetic sensor 180D. Features such as automatic unlocking when the cover is flipped open are then set according to the detected opening and closing state of the leather case or of the flip cover.

The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The acceleration sensor 180E may also be used to recognize the posture of the electronic device, and is applied in landscape/portrait screen switching, pedometers, and other applications.

The distance sensor 180F is used to measure distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, the electronic device 100 may range using the distance sensor 180F to achieve quick focus.

The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light outward through the light emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it may be determined that there is an object in the vicinity of the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there is no object in the vicinity of the electronic device 100. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear, so as to automatically turn off the screen to save power. The proximity light sensor 180G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.

The ambient light sensor 180L is used to sense ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust white balance when taking a photograph. Ambient light sensor 180L may also cooperate with proximity light sensor 180G to detect whether electronic device 100 is in a pocket to prevent false touches.

The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.

The temperature sensor 180J is used to detect temperature. In some embodiments, the electronic device 100 executes a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 reduces the performance of a processor located near the temperature sensor 180J in order to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to prevent the low temperature from causing the electronic device 100 to shut down abnormally. In still other embodiments, when the temperature is below a further threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.

The touch sensor 180K is also referred to as a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touch screen, also called a "touch-control screen". The touch sensor 180K is used to detect a touch operation acting on or near it. The touch sensor may pass the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display screen 194.

The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire the vibration signal of the bone vibrated by the human vocal part. The bone conduction sensor 180M may also contact the human pulse to receive a blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be provided in a headset to form a bone conduction headset. The audio module 170 may parse a voice signal from the vibration signal of the bone vibrated by the vocal part, as acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor may parse heart rate information from the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement a heart rate detection function.

The keys 190 include a power key, a volume key, etc. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function controls of the electronic device 100.

The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. For example, touch operations acting on different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also correspond to different vibration feedback effects by touching different areas of the display screen 194. Different application scenarios (such as time reminding, receiving information, alarm clock, game, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.

The indicator 192 may be an indicator light, and may be used to indicate a state of charge or a change in battery level, or to indicate a message, a missed call, a notification, etc.

The SIM card interface 195 is used to connect a SIM card. The SIM card may be inserted into the SIM card interface 195 or removed from the SIM card interface 195 to come into contact with or be separated from the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support Nano SIM cards, Micro SIM cards, and the like. Multiple cards may be inserted into the same SIM card interface 195 simultaneously. The types of the multiple cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards and with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, i.e., an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.

It is to be understood that the structure illustrated in fig. 1 does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

Referring to fig. 2, a schematic flowchart of a method for identifying duplicate files is provided in an embodiment of the present application. As shown in FIG. 2, the method for identifying duplicate files may include S201-S205, which are described in detail below:

S201, scanning files in the storage space.

For example, the storage space may be a flash memory or a solid state disk built into the terminal device, an external expansion memory card of the terminal device, or a memory that uses a magnetic disk as its storage medium.

In this embodiment, scanning the files in the storage space may specifically refer to traversing and checking the files in the storage space to obtain file attributes of each file in the storage space.

The file attributes may include, for example, file name, file type, file size, creation time, modification time, and the like.

By way of example, file types may include, but are not limited to, text files, image files, audio files, video files, library files, or installation files, among others. More specifically, the image files may include, for example, different types of image files such as joint photographic experts group (JPEG), portable network graphics (PNG), or graphics interchange format (GIF). The audio files may include, for example, different types of audio files such as moving picture experts group audio layer III (MP3), waveform audio file format (WAV), or free lossless audio codec (FLAC). The video files may include, for example, different types of video files such as moving picture experts group-4 (MP4), audio video interleave (AVI), or Windows media video (WMV).

In particular implementations, different types of files may be distinguished by their suffix names. The suffix name of a file may consist of the character "." followed by a file type identifier; for example, ".txt" typically represents a text file, and ".jpg" typically represents a JPEG-type image file. The suffix name is typically the last part of the file name.

In an alternative implementation manner, in order to increase the scanning speed of the files, the terminal device may only acquire the file size and the file type of each file in the storage space when scanning the files in the storage space.

In another alternative implementation manner, in order to further increase the scanning speed of the files, the terminal device may only acquire the file size of each file in the storage space when scanning the files in the storage space.
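As a rough illustration of a size-only scan, the following Python sketch walks a directory tree and records only each file's path and size. The function name `scan_file_sizes` and the use of `os.walk` and `os.lstat` are assumptions made for illustration and are not taken from the embodiment.

```python
import os
import stat

def scan_file_sizes(root):
    """Walk `root` and return (path, size_in_bytes) pairs for regular files."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reads metadata only; the file content is never opened.
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                results.append((path, info.st_size))
    return results
```

Because only metadata is read, the cost of this step depends on the number of files rather than on their total size.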

S202, dividing the files with the same file size in the scanned files into the same group to obtain one or more first file groups.

In this embodiment, the first file group may include only one file, or may include a plurality of files. The file sizes of all files in the same first file group are equal. The file sizes of the files in the different first file groups are not equal.

Referring to fig. 3, a schematic diagram of a process for screening scanned files according to an embodiment of the present application is provided. For example, the list of file names of all the files scanned by the terminal device from the storage space may be the list 31 shown in fig. 3. After the terminal device groups all the scanned files based on their file sizes, the obtained one or more first file groups may be 31_1 to 31_n respectively, where n is a positive integer. Illustratively, the files included in the first first file group 31_1 are all equal in file size, for example, 321 kilobytes (KB); the files included in the second first file group 31_2 are all equal in file size, for example, 500 KB; and the file size of the files included in the first first file group 31_1 is not equal to the file size of the files included in the second first file group 31_2.

Optionally, in another embodiment of the present application, in order to improve accuracy of duplicate file identification of the terminal device, the terminal device may divide files with equal file sizes and identical file types in the scanned files into the same group, to obtain one or more first file groups. In this embodiment, all files in the same first file group have the same file size and the same file type. The files in the different first file groups are not equal in file size and/or are not of the same file type.
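The grouping in S202, together with the removal of single-file groups that opens S203, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `group_by_size` and `drop_singletons` are hypothetical, and the file type is approximated by the lowercased suffix name.

```python
import os
from collections import defaultdict

def group_by_size(files, use_type=False):
    """Group (path, size) pairs into first file groups.

    With use_type=True the key is (size, suffix), so files must match
    in both file size and file type to land in the same group.
    """
    groups = defaultdict(list)
    for path, size in files:
        if use_type:
            key = (size, os.path.splitext(path)[1].lower())
        else:
            key = size
        groups[key].append(path)
    return dict(groups)

def drop_singletons(groups):
    """Discard groups with one file: such a file has no possible duplicate."""
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

For example, three 10-byte files `a.txt`, `b.jpg`, and `c.txt` share one group when grouped by size alone, whereas grouping by size and type leaves `b.jpg` in a single-file group that is then removed.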

S203, eliminating the first file groups that include a single file, calculating the head-to-tail page hash values of all files in the remaining first file groups, and dividing the files with the same head-to-tail page hash values in the same first file group into the same group to obtain one or more second file groups.

Wherein the first file group including a single file refers to a first file group having a number of files of 1.

Optionally, when the first file groups are obtained by dividing based on file size, a first file group that includes only one file indicates that no other file in the storage space has the same file size as that file, that is, no other file in the storage space duplicates that file. The terminal device can therefore reject the file directly without performing the file hash value calculation on it, which reduces the calculation amount of the terminal device and improves the identification speed of duplicate files.

Optionally, when the first file groups are obtained by dividing based on file size and file type, a first file group that includes only one file indicates that no other file in the storage space has both the same file size and the same file type as that file, that is, no other file in the storage space duplicates that file. The terminal device can therefore reject the file directly without performing the file hash value calculation on it, which not only reduces the calculation amount of the terminal device and improves the identification speed of duplicate files, but also improves the accuracy of duplicate file identification.

For example, please continue to refer to fig. 3, assuming that the n-1 st first file group 31_n-1 and the n-th first file group 31_n are both first file groups including a single file, the terminal device eliminates the n-1 st first file group 31_n-1 and the n-th first file group 31_n, and then the remaining first file groups 31_1 to 31_n-2 include a plurality of files.

In the embodiment of the application, the head-to-tail page hash values of a file may include the hash value of the first memory page of the file and the hash value of the last memory page of the file.

It will be appreciated that, to obtain the content of a file in the storage space, the terminal device needs to load the file from the storage space into the memory. Because the minimum unit of memory management of the terminal device is a memory page, when a file is loaded into the memory, as shown in fig. 4, the terminal device divides the content of the file into a plurality of memory pages of the same size, and then sequentially reads each memory page of the file from the storage space until the entire content of the file has been read.

Illustratively, the memory page may be 4KB or 8KB in size, etc.

It should be noted that two files having equal head-to-tail page hash values specifically means that the hash values of their first memory pages are equal and the hash values of their last memory pages are equal. Illustratively, if the hash value of the first memory page of the first file file1 is equal to the hash value of the first memory page of the second file file2, and the hash value of the last memory page of file1 is equal to the hash value of the last memory page of file2, then the head-to-tail page hash values of file1 are equal to the head-to-tail page hash values of file2.

In a specific embodiment, the terminal device may first calculate the hash value of the first memory page of each file in each remaining first file group, and compare the first-memory-page hash values of the files in the same first file group. For files whose first-memory-page hash values are equal, the terminal device may then calculate the hash values of the last memory pages of these files, compare them, and finally divide the files whose first-memory-page hash values and last-memory-page hash values are both equal into the same second file group. For a file whose first-memory-page hash value is not equal to that of any other file in the same first file group, the terminal device may directly reject the file. In this embodiment, the terminal device only needs to calculate the last-memory-page hash values of files whose first-memory-page hash values are equal, and does not need to calculate them for files whose first-memory-page hash values are unequal, so that the calculation amount of the terminal device can be reduced and the duplicate file identification speed of the terminal device can be improved.
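The first-page-then-last-page variant described above can be sketched in Python as follows. The 4 KB page size, the choice of SHA-256, and the names `page_hash` and `split_by_head_tail` are assumptions made for illustration; the last page is taken to start at the largest multiple of the page size that lies below the file size.

```python
import hashlib
import os
from collections import defaultdict

PAGE_SIZE = 4096  # assumed memory page size (4 KB)

def page_hash(path, last=False):
    """Hash only the first page of a file, or only its last page."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if last and size > 0:
            # Start offset of the final (possibly partial) page.
            f.seek(((size - 1) // PAGE_SIZE) * PAGE_SIZE)
        return hashlib.sha256(f.read(PAGE_SIZE)).hexdigest()

def split_by_head_tail(first_group):
    """Split one first file group into second file groups."""
    by_head = defaultdict(list)
    for path in first_group:
        by_head[page_hash(path)].append(path)

    second_groups = []
    for same_head in by_head.values():
        if len(same_head) < 2:
            continue  # unique first page: reject without reading the tail
        by_tail = defaultdict(list)
        for path in same_head:
            by_tail[page_hash(path, last=True)].append(path)
        second_groups.extend(by_tail.values())
    return second_groups
```

Second file groups returned here may still contain a single file; those are the groups removed at the start of S204.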

In another specific embodiment, the terminal device may first calculate the hash value of the last memory page of each file in each remaining first file group, and compare the last-memory-page hash values of the files in the same first file group. For files whose last-memory-page hash values are equal, the terminal device may then calculate the hash values of the first memory pages of these files, compare them, and finally divide the files whose first-memory-page hash values and last-memory-page hash values are both equal into the same second file group. For a file whose last-memory-page hash value is not equal to that of any other file in the same first file group, the terminal device may directly reject the file. In this embodiment, the terminal device only needs to calculate the first-memory-page hash values of files whose last-memory-page hash values are equal, and does not need to calculate them for files whose last-memory-page hash values are unequal, so that the calculation amount of the terminal device can be reduced and the duplicate file identification speed of the terminal device can be improved.

It will be appreciated that files from different first file groups cannot be duplicates of one another, because the file sizes of the files included in different first file groups are different. Based on this, the head-to-tail page hash values are compared only between files in the same first file group, and are not compared between files from different first file groups.

In this embodiment, the second file group may include only one file, or may include a plurality of files. The hash values of the head and the tail pages of all files in the same second file group are equal, and the hash values of the head and the tail pages of the files in different second file groups are unequal. The file sizes of the files included in all the second file groups from the same first file group are equal.

For example, referring to fig. 3, after the terminal device performs secondary grouping on the files in each remaining first file group (i.e., the first file groups 31_1 to 31_n-2) based on their head-to-tail page hash values, the obtained second file groups may include: the first second file group 32_1_1, the second second file group 32_1_2, the third second file group 32_2_1, the fourth second file group 32_3_1, the fifth second file group 32_3_2, … …, and the mth second file group 32_n-2_1, where m is a positive integer. The first second file group 32_1_1 and the second second file group 32_1_2 come from the first first file group, the third second file group 32_2_1 comes from the second first file group, the fourth second file group 32_3_1 and the fifth second file group 32_3_2 come from the third first file group, and the mth second file group 32_n-2_1 comes from the (n-2)th first file group. Illustratively, the files within the first second file group 32_1_1 have equal head-to-tail page hash values, as do the files within the second second file group 32_1_2 and the files within the third second file group 32_2_1, respectively. The head-to-tail page hash values of the files in the first second file group 32_1_1, the second second file group 32_1_2, and the third second file group 32_2_1 differ from one group to another. Further, the file sizes of the files in the first second file group 32_1_1 are equal to the file sizes of the files in the second second file group 32_1_2.

For example, the terminal device may calculate the hash value of the first memory page and the hash value of the last memory page of a file using a preset hash function. By way of example, the preset hash function may include a message digest algorithm 5 (MD5) function, a secure hash algorithm 1 (SHA-1) function, or a secure hash algorithm 256-bit (SHA-256) function, and the like. The specific type of the preset hash function is not particularly limited in the embodiments of the present application.

S204, eliminating the second file group comprising the single file, and calculating the file hash value of each file in the remaining second file group.

Wherein the second file group including a single file refers to a second file group having a number of files of 1.

It can be understood that if a certain second file group includes only one file, it indicates that there are no other files in the storage space that are equal to the hash value of the first memory page and the hash value of the last memory page of the file, that is, that there are no files in the storage space that are identical to the content of the file, that is, there are no other files in the storage space that are repeated with the file, so that the terminal device can directly reject the file, without performing file hash value calculation operation on the file, so that the calculation amount of the terminal device can be further reduced, and the recognition speed of the repeated file can be improved.

After the terminal device eliminates the second file groups that include a single file, each remaining second file group includes a plurality of files, that is, the number of files in each remaining second file group is greater than 1. For example, with continued reference to fig. 3, assuming that the second second file group 32_1_2, the fourth second file group 32_3_1, and the fifth second file group 32_3_2 each include only one file, then after the terminal device eliminates these three second file groups, each of the remaining second file groups (i.e., the first second file group 32_1_1, the third second file group 32_2_1, … …, and the mth second file group 32_n-2_1) includes a plurality of files.

The file hash value of a file refers to the hash value of all memory pages of the file. The file hash value differs from the hash value of the first memory page or of the last memory page in that the latter is the hash value of a single memory page of the file, whereas the file hash value is the hash value corresponding to the entire content of the file.

In a specific embodiment, the terminal device may calculate the file hash value of the file using a preset hash function. For the specific type of the hash function, reference may be made to the relevant description in the foregoing embodiments, which will not be repeated here.
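A minimal sketch of a whole-file hash, assuming SHA-256 as the preset hash function and reading the file in fixed-size chunks so that large files are not loaded into memory at once. The function name `file_hash` and the 1 MiB default chunk size are illustrative assumptions.

```python
import hashlib

def file_hash(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the entire content of `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file reached
            digest.update(chunk)
    return digest.hexdigest()
```

Unlike the head-to-tail page hash, this reads every byte of the file, which is why the earlier steps try to shrink the set of files that reach it.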

It should be noted that, since the calculation of the file hash value is the prior art, the specific calculation process of the file hash value for the file may refer to the related description in the prior art, and will not be described in detail here.

In one embodiment of the present application, the terminal device may create only one thread to calculate the file hash value of each file in all the remaining second file groups. Because the terminal device has filtered some non-duplicate files based on the file size and the head-to-tail page hash values in S202 and S203, only one thread is created to calculate the file hash value of each file in all remaining second file groups, so that compared with the common duplicate file identification method which needs to calculate the file hash values of all scanned files, the calculation amount of the terminal device can be reduced to a certain extent, and the duplicate file identification speed of the terminal device can be improved.

In another embodiment of the present application, in order to further increase the duplicate file identification speed of the terminal device, the terminal device may create a plurality of threads to calculate a file hash value of each file in the remaining respective second file group. Based on this, S204 may specifically include S2041 to S2042 shown in fig. 5, which is described in detail below:

S2041, a plurality of threads are created.

In an alternative implementation, the terminal device may create a preset number of threads in case the total file size of the files comprised by all remaining second file groups is smaller than or equal to the first file size threshold.

The preset number is a constant greater than 1. For example, the preset number may be 3; that is, in a case where the total file size of the files included in all remaining second file groups is less than or equal to the first file size threshold, the terminal device may create 3 threads to calculate the file hash value of each file in the remaining second file groups.

The first file size threshold may be determined according to a preset number and a maximum calculated amount of each thread. Specifically, the first file size threshold may be a product of a preset number and a maximum calculated amount per thread. Illustratively, assuming a maximum computational effort per thread of 2.5 Gigabytes (GB), the first file size threshold may be 2.5×3=7.5 GB.

In another optional implementation manner, in a case that the total file size of the files included in all the remaining second file groups is greater than the first file size threshold, the terminal device may dynamically increase the threads on the basis of the preset number of threads according to the total file size of the files included in all the remaining second file groups.

Specifically, the terminal device may create R threads, where R = S + up_round((D1 - YZ) / D2). Here, S is the preset number, D1 is the total file size of the files included in all the remaining second file groups, YZ is the first file size threshold, D2 is the maximum calculation amount of each thread, and up_round() is the round-up (ceiling) function. That is, when (D1 - YZ) / D2 is not an integer, up_round((D1 - YZ) / D2) is the integer part of the quotient of D1 - YZ divided by D2, plus 1; when (D1 - YZ) / D2 is an integer, up_round((D1 - YZ) / D2) is the quotient itself.

For example, assuming a preset number of threads of 3, the first file size threshold is 7.5GB and the maximum computational load per thread is 2.5GB. Based on this, in the case where the total file size of the files included in all the remaining second file groups is 11GB, the terminal device can create 3+up_round ((11-7.5)/2.5) =5 threads; in the case where the total file size of the files included in all the remaining second file groups is 10GB, the terminal device may create 3+up_round ((10-7.5)/2.5) =4 threads.
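The thread-count rule can be checked with a short Python sketch. Sizes are expressed in megabytes as integers so that the ceiling division is exact; the function name `thread_count` and the default values (3 threads, 2560 MB per thread, taken from the example) are illustrative assumptions.

```python
def thread_count(total_mb, preset=3, max_per_thread_mb=2560):
    """R = S when D1 <= YZ, otherwise R = S + up_round((D1 - YZ) / D2)."""
    threshold_mb = preset * max_per_thread_mb  # YZ = S * D2
    if total_mb <= threshold_mb:
        return preset
    excess = total_mb - threshold_mb
    # Integer ceiling division: -(-a // b) equals ceil(a / b) for positive b.
    return preset + -(-excess // max_per_thread_mb)
```

With these defaults, `thread_count(11 * 1024)` gives 5 and `thread_count(10 * 1024)` gives 4, matching the two cases in the example.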

S2042, distributing the files in the remaining second file group to each thread according to the principle that the calculated amount of each thread is equal or approximately equal, so that the total file sizes of the files distributed to each thread are equal or approximately equal.

In an alternative implementation manner, the terminal device allocates the files in the remaining second file group to each thread according to the principle that the calculation amounts of each thread are equal or approximately equal, which may specifically include S601 to S602 shown in fig. 6, which is described in detail as follows:

S601, determining a first ratio of the total file size of the files included in all the remaining second file groups to the total number of threads as an average calculated amount of each thread.

It will be appreciated that since the terminal device creates a plurality of threads based on the principle that the actual calculation amount of each thread is smaller than or equal to the maximum calculation amount of the thread in S2041, the average calculation amount of each thread calculated in this step is smaller than or equal to the maximum calculation amount of the thread.

S602, based on the average calculated amount of each thread, using a single second file group as a minimum allocation unit, allocating the remaining second file groups to each thread in sequence from small to large according to the total file size of each second file group, so that the total file size of the file allocated to each thread is equal to or approximately equal to the average calculated amount.

In this implementation manner, the terminal device allocates a file for each thread by using the entire second file group as a minimum allocation unit.

The total file size of the files allocated to a thread refers to the sum of the file sizes of all the files allocated to that thread. For example, assuming that each thread is allocated 5 files, and the file sizes of the 5 files allocated to a certain thread are 300 KB, 400 KB, 600 KB, 900 KB, and 500 KB respectively, the total file size of the files allocated to that thread is 300 + 400 + 600 + 900 + 500 = 2700 KB.

Optionally, when the terminal device allocates files to the threads, it may allocate the file groups of small files first and then allocate the file groups of large files. Here, a file group of small files refers to a file group in which the file size of a single file is relatively small, and a file group of large files refers to a file group in which the file size of a single file is relatively large.

For example, assume that the average calculation amount of each thread is 1 GB, and that sorting all the second file groups in ascending order of their total file sizes gives: the first second file group 32_1_1 (total file size 200 megabytes (MB)), the third second file group 32_2_1 (total file size 324 MB), the seventh second file group 32_5_1 (total file size 495 MB), … …, and the mth second file group 32_n-2_1 (total file size 1 GB). The total file sizes of the first second file group 32_1_1, the third second file group 32_2_1, and the seventh second file group 32_5_1 sum to 1019 MB, which differs from the average calculation amount of each thread by only 5 MB, and the total file size of every second file group arranged after the seventh second file group 32_5_1 is necessarily at least 495 MB. The terminal device can therefore allocate the first second file group 32_1_1, the third second file group 32_2_1, and the seventh second file group 32_5_1 to the first thread, allocate files to the second thread starting from the second file groups arranged after the seventh second file group 32_5_1, and so on.

It will be appreciated that allocating files to each thread with an entire second file group as the minimum allocation unit cannot guarantee that the total file size of the files allocated to each thread is exactly equal to the average calculated amount of the threads. However, since the average calculated amount of the threads is typically on the order of GB, while the deviation from the average calculated amount is typically on the order of KB, this deviation will not usually have a significant effect on the overall file identification speed of all threads.

In addition, since the whole second file group is used as the minimum allocation unit for file allocation, the file allocation time can be saved, and the identification time of repeated files can be shortened.
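One possible reading of S601 and S602 is a sequential greedy fill, sketched below in Python. The embodiment does not spell out the exact stopping rule, so the rule used here (move to the next thread when adding a group would take the current thread farther from the average than stopping would) is an assumption, as are the name `allocate_groups` and the dictionary input format.

```python
def allocate_groups(group_sizes, num_threads):
    """Assign whole second file groups to threads, smallest group first.

    group_sizes maps a group id to the total file size of that group.
    Returns (buckets, totals): the group ids per thread and their sums.
    """
    average = sum(group_sizes.values()) / num_threads  # S601
    ordered = sorted(group_sizes, key=group_sizes.get)  # small to large
    buckets = [[] for _ in range(num_threads)]
    totals = [0] * num_threads
    t = 0
    for gid in ordered:
        size = group_sizes[gid]
        overshoots = abs(totals[t] + size - average) > abs(totals[t] - average)
        # Advance only if the current thread already holds a group and
        # another thread is still available.
        if buckets[t] and overshoots and t < num_threads - 1:
            t += 1
        buckets[t].append(gid)
        totals[t] += size
    return buckets, totals
```

On a reduced input with only the four groups named in the example (200, 324, 495, and 1024 MB) and two threads, the first thread receives the three smaller groups (1019 MB in total) and the second thread receives the 1024 MB group.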

In another alternative implementation manner, the terminal device allocates the files in the remaining second file group to each thread according to the principle that the calculation amounts of each thread are equal or approximately equal, which may specifically include S701 to S704 shown in fig. 7, which is described in detail as follows:

S701, calculating a second ratio of the total number of files included in all remaining second file groups to the total number of threads.

S702, in a case where the second ratio is an integer, allocating to each thread a number of files equal to the second ratio.

In this implementation, the number of files to which each thread is assigned is equal.

For example, if the total number of files included in all remaining second file groups is 333 and the total number of threads is 3, the terminal device may allocate 333/3=111 files to each thread.

S703, in a case where the second ratio is a non-integer, first allocating N files to each thread, and then allocating one additional file to each of M threads among the plurality of threads.

Here, N = floor(X1 / Y1) and M = X1 mod Y1. X1 is the total number of files included in all remaining second file groups, and Y1 is the total number of threads. floor is the round-down function, and floor(X1 / Y1) is the integer part of the result obtained by dividing X1 by Y1; mod is the remainder function, and X1 mod Y1 is the remainder obtained by dividing X1 by Y1.

For example, if the total number of files included in all the remaining second file groups is 335 and the total number of threads created by the terminal device is 3, the terminal device may first allocate floor(335 / 3) = 111 files to each thread, and then allocate one additional file to each of any 335 mod 3 = 2 threads among the plurality of threads.

S702 and S703 are mutually exclusive steps, and the terminal device does not execute S703 when S702 is executed, and the terminal device does not execute S702 when S703 is executed.
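The count-based split of S701 to S703 reduces to one integer division with remainder, as the following Python sketch shows. The function name `files_per_thread` is hypothetical, and giving the extra files to the first M threads is an arbitrary choice made here, since the text allows any M threads.

```python
def files_per_thread(total_files, num_threads):
    """Return how many files each thread receives under S702/S703."""
    n, m = divmod(total_files, num_threads)  # N = floor(X1/Y1), M = X1 mod Y1
    # When M == 0 this is the S702 case: every thread receives N files.
    return [n + 1 if i < m else n for i in range(num_threads)]
```

For 333 files and 3 threads this yields 111 files per thread; for 335 files and 3 threads, two threads receive 112 files and one receives 111.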

It will be appreciated that allocating files to each thread based only on the total number of files included in all remaining second file groups and the total number of threads may leave the total file sizes of the files allocated to the threads unbalanced. For example, some threads may be allocated mostly small files while others are allocated mostly large files, so that the calculation amounts of different threads differ and the time different threads take to calculate the file hash values of all their allocated files differs as well. Specifically, a thread allocated mostly small files may need only a short time to calculate the file hash values of all its files, while a thread allocated mostly large files may need a longer time. Since the time taken to calculate the file hash values of all the files included in the remaining second file groups is determined by the thread that finishes last, allocating files to each thread based only on the total number of files cannot guarantee that the terminal device calculates the file hash values of all the files included in the remaining second file groups in the shortest time.

Based on this, after the terminal device allocates the files for each thread based on the second ratio, it may determine whether there is a target thread that needs to perform file allocation adjustment according to the total file size of the files to which each thread is allocated.

The target thread that needs file allocation adjustment is a thread for which the total file size of the allocated files is larger than the average value of the thread total file sizes. The average value of the thread total file sizes may be the ratio of the total file size of all files included in the remaining second file groups to the total number of threads; in other words, it is the average amount of computation per thread. Illustratively, assuming that the total file size of all files included in the remaining second file groups is 2.4GB and the total number of threads is 3, the average value of the thread total file sizes is 2.4/3 = 0.8GB.

In an alternative implementation, the terminal device may execute S704 when the terminal device determines that there is a target thread that needs to make a file allocation adjustment according to the total file size of the file to which each thread is allocated.

In another alternative implementation, the terminal device may execute S205 when the terminal device determines that there is no target thread that needs to make a file allocation adjustment according to the total file size of the files to which each thread is allocated.

S704, based on the total file size of the files allocated to each thread, file allocation adjustment is performed on the target thread, so that the total file sizes of the files finally allocated to each thread are equal or approximately equal.

The terminal device performs file allocation adjustment on the target threads so that the total file sizes of the files finally allocated to the threads are equal or approximately equal, thereby balancing the amount of computation across the threads.

In a specific implementation, S704 may include steps 1 to 4, which are described in detail below:

Step 1, determining the ratio of the sum of the total file sizes of the files allocated to all threads to the total number of threads as the average value of the thread total file sizes.

Step 2, calculating the difference between the total file size of the files allocated to each thread and the average value of the thread total file sizes.

In this embodiment, the difference between the total file size of the files allocated to a thread and the average value of the thread total file sizes is greater than or equal to 0. The raw difference is the total file size of the files allocated to the thread minus the average value of the thread total file sizes. If the total file size of the files allocated to a thread is greater than the average value, the difference is this raw value; if it is smaller than the average value, the difference is the absolute value of the raw value.

Step 3, determining each thread whose difference is larger than the first difference threshold as a target thread.

Illustratively, the first difference threshold may be determined based on the data processing rate of each thread and a preset time differential. For example, the first difference threshold may be the product of the data processing rate of each thread and the preset time differential.

The data processing rate of a thread may be, for example, the rate at which the thread calculates file hash values. The preset time differential may be, for example, the maximum difference allowed between the times that different threads spend calculating file hash values.

Step 4, performing file allocation adjustment on the target threads, so that the difference between the total file size of the files finally allocated to each target thread and the average value of the thread total file sizes is smaller than or equal to the first difference threshold.
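Steps 1 to 3 can be sketched as follows. This is an illustrative sketch only: the function name find_target_threads, the units, and the sample numbers are assumptions, and the threshold is computed as the product of rate and time differential, which the description gives as one example.

```python
def find_target_threads(thread_totals, rate, max_time_gap):
    """Return (average, threshold, indices of target threads).

    thread_totals: total size of the files allocated to each thread (here in KB).
    rate: assumed hashing rate of a thread, in KB per second.
    max_time_gap: preset time differential, in seconds.
    """
    average = sum(thread_totals) / len(thread_totals)            # step 1
    diffs = [abs(total - average) for total in thread_totals]    # step 2 (non-negative)
    threshold = rate * max_time_gap                              # first difference threshold
    targets = [i for i, d in enumerate(diffs) if d > threshold]  # step 3
    return average, threshold, targets


average, threshold, targets = find_target_threads([2600, 2200, 2400], rate=200, max_time_gap=0.5)
print(average, threshold, targets)  # 2400.0 100.0 [0, 1]
```

Here the first two threads deviate from the average by 200KB, which exceeds the 100KB threshold, so both are selected for adjustment; the third thread already matches the average.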

In an alternative implementation, if screening one or more files from the first type of threads and reassigning them to the second type of threads is by itself sufficient to make the difference between the total file size of the files finally allocated to each target thread and the average value of the thread total file sizes smaller than or equal to the first difference threshold, the terminal device may only screen one or more files from the first type of threads and reassign them to the second type of threads.

Illustratively, assume that the average value of the thread total file sizes is 2400KB. The file sizes of the 5 files allocated to a certain first type thread are 200KB, 300KB, 500KB, 700KB and 900KB, respectively, that is, the total file size of the files allocated to the first type thread is 2600KB; the file sizes of the 5 files allocated to a certain second type thread are 100KB, 200KB, 400KB, 600KB and 900KB, respectively, that is, the total file size of the files allocated to the second type thread is 2200KB. Because the first type thread has a file whose size is 200KB, the terminal device only needs to screen out that 200KB file from the first type thread and reassign it to the second type thread, so that the total file sizes of the files finally allocated to the first type thread and the second type thread are both 2400KB (2600KB - 200KB and 2200KB + 200KB).

In another alternative implementation, if screening one or more files from the first type of threads and reassigning them to the second type of threads cannot by itself make the difference between the total file size of the files finally allocated to each target thread and the average value of the thread total file sizes smaller than or equal to the first difference threshold, the terminal device may screen one or more files from the first type of threads and reassign them to the second type of threads, and also screen one or more files from the second type of threads and reassign them to the first type of threads.

Illustratively, assume that the average value of the thread total file sizes is 2400KB. The file sizes of the 5 files allocated to a certain first type thread are 200KB, 300KB, 400KB, 700KB and 900KB, respectively, that is, the total file size of the files allocated to the first type thread is 2500KB; the file sizes of the 5 files allocated to a certain second type thread are 100KB, 300KB, 400KB, 600KB and 900KB, respectively, that is, the total file size of the files allocated to the second type thread is 2300KB. Because the first type thread has no file whose size is 100KB, the terminal device needs to screen out the 200KB file from the first type thread and reassign it to the second type thread, and screen out the 100KB file from the second type thread and reassign it to the first type thread, so that the total file sizes of the files finally allocated to the first type thread and the second type thread are both 2400KB.
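The two implementations of step 4 can be sketched for a single pair of threads as follows. This is a simplified sketch under stated assumptions: the function name rebalance_pair is hypothetical, only one file is moved or one pair of files is swapped (the description allows one or more files), and the search order is an arbitrary choice.

```python
from itertools import product


def rebalance_pair(heavy, light, average, threshold):
    """Return new (heavy, light) file-size lists, or None if no single move or swap fits.

    heavy: file sizes of a thread whose total exceeds the average.
    light: file sizes of a thread whose total is below the average.
    """
    def within(a, b):
        return abs(sum(a) - average) <= threshold and abs(sum(b) - average) <= threshold

    # First implementation: move one file from the heavy thread to the light thread.
    for i, size in enumerate(heavy):
        new_heavy = heavy[:i] + heavy[i + 1:]
        new_light = light + [size]
        if within(new_heavy, new_light):
            return new_heavy, new_light

    # Second implementation: exchange one file of each thread.
    for (i, big), (j, small) in product(enumerate(heavy), enumerate(light)):
        new_heavy = heavy[:i] + heavy[i + 1:] + [small]
        new_light = light[:j] + light[j + 1:] + [big]
        if within(new_heavy, new_light):
            return new_heavy, new_light
    return None


# First example: moving the 200KB file is enough.
h, l = rebalance_pair([200, 300, 500, 700, 900], [100, 200, 400, 600, 900], 2400, 0)
print(sum(h), sum(l))  # 2400 2400

# Second example: the 200KB and 100KB files are exchanged.
h, l = rebalance_pair([200, 300, 400, 700, 900], [100, 300, 400, 600, 900], 2400, 0)
print(sum(h), sum(l))  # 2400 2400
```

A threshold of 0 is used here only to reproduce the two examples exactly; with a positive first difference threshold, approximately equal totals are accepted as well.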

S205, based on the file hash values of the respective files in the remaining second file group, duplicate files in the remaining second file group are identified.

The terminal device may identify files with equal file hash values among all the remaining second file groups as duplicate files.
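Identification by file hash value can be sketched as follows. The hash algorithm is an assumption: SHA-256 is used here only because it is available in the Python standard library, and the function names file_hash and find_duplicates are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_hash(path, chunk_size=1 << 20):
    """Hash the whole file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()  # assumed algorithm
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(second_file_groups):
    """Return lists of paths whose file hash values are equal."""
    by_hash = defaultdict(list)
    for group in second_file_groups:
        for path in group:
            by_hash[file_hash(path)].append(path)
    # Only hash values shared by at least two files indicate duplicates.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned list is one set of mutually identical files; a file whose hash value is unique is not reported.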

Optionally, after identifying all the repeated files in the storage space, the terminal device may display all the identified repeated files and a cleaning control for each repeated file. Therefore, the user can select to clean the repeated file according to the actual requirement. After the repeated files are cleaned, the use efficiency and the data management efficiency of the storage space of the terminal equipment can be improved.

In an actual application scenario, the duplicate file identification method can be implemented through one or more software modules in the terminal device. The software system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, taking an Android system with a layered architecture as an example, a software structure of an electronic device is illustrated.

Exemplary, please refer to fig. 8, which is a software architecture block diagram of an electronic device according to an embodiment of the present application.

The layered architecture divides the software into several layers, each with a distinct role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime (Android Runtime) and system libraries, and the kernel layer.

The application layer may include a series of application packages.

As shown in fig. 8, the application packages may include applications such as camera, gallery, calendar, phone, map, navigation, WLAN, Bluetooth, music, short message, and system manager.

The system manager application may be, for example, an application preloaded in the terminal device at the time of shipment of the terminal device.

In the embodiment of the present application, the difference between the system manager application and the third party security application is that the file access rights of the system manager application are not limited, i.e. the system manager application can access all files in the storage space of the terminal device, and the file access rights of the third party security application are generally limited to a certain extent.

The system manager application can be used for realizing the functions of cleaning acceleration, flow management, harassment interception, electric quantity monitoring, application starting management, virus searching and killing and the like.

The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.

As shown in fig. 8, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.

The window manager is used for managing window programs. The window manager can acquire the size of the display screen, determine whether a status bar exists, lock the screen, capture the screen, and the like.

The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.

The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.

The phone manager provides the communication functions of the electronic device, such as management of call status (including connected, hung up, and the like).

The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.

The notification manager allows an application to display notification information in the status bar. It can be used to convey notification-type messages that disappear automatically after a short stay and require no user interaction. For example, the notification manager is used to notify the user that a download is complete, to provide message alerts, and the like. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, such as a notification of an application running in the background, or present notifications on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is played, the electronic device vibrates, or an indicator light blinks.

The Android Runtime includes a core library and virtual machines. The Android Runtime is responsible for scheduling and management of the Android system.

The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.

The application layer and the application framework layer run in a virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used for performing functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.

The system library may include a plurality of functional modules, for example: a surface manager, media libraries (Media Libraries), a three-dimensional graphics processing library (e.g., OpenGL ES), a 2D graphics engine (e.g., SGL), a duplicate file identification module, and the like.

The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.

The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library may support a variety of audio, video, and image encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.

The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.

The 2D graphics engine is a drawing engine for 2D drawing.

The repeated file identification module may be configured to identify repeated files in a storage space of the terminal device.

The kernel layer is a layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.

The workflow of the software and hardware of the electronic device is illustrated below in connection with a photo capture scenario.

When the touch sensor 180K (shown in FIG. 1) of the electronic device receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the time stamp of the touch operation). The raw input event is stored at the kernel layer. The application framework layer acquires the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking as an example a touch operation that is a click operation whose corresponding control is the camera application icon, the camera application calls an interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera 193 (shown in fig. 1).

Based on the software architecture of the electronic device provided in the foregoing embodiment, the embodiment of the present application further provides a detailed description of an interaction process between each software module in the terminal device in the process of implementing the foregoing method for identifying a duplicate file by the terminal device.

For ease of understanding, the user interfaces that may be involved in implementing the method of identifying duplicate files are described before describing the interaction process between the various software modules in the terminal device. For example, referring to fig. 9, a schematic user interface involved in an implementation process of a method for identifying duplicate files according to an embodiment of the present application is shown.

As shown in (a) of fig. 9, a system manager application 91 may be installed in the terminal device. When the user needs to clean up the duplicate files in the storage space of the terminal device, the user may perform a first operation with respect to the system manager application 91. The first operation may be, for example, a click operation. The terminal device may display a main interface 911 of the system manager application 91 as shown in (b) in fig. 9 in response to a first operation performed by the user for the system manager application 91.

As shown in (b) of fig. 9, the main interface 911 of the system manager application 91 may include, for example, a plurality of function controls such as a cleaning acceleration control 9110, a flow management control, a harassment interception control, a power management control, an application start management control, and a virus killing control. The user may enter the detail interface of the corresponding function by clicking any one of the function controls in the main interface 911 of the system manager application 91. For example, when the user needs to clean up duplicate files in the storage space, the user may perform a second operation with respect to the cleaning acceleration control 9110. The second operation may be, for example, a click operation. The terminal device may display a detail interface 912 corresponding to the cleaning acceleration function as shown in (c) of fig. 9 in response to the second operation.

Optionally, after entering the detail interface 912 corresponding to the cleaning acceleration function, the terminal device may automatically perform the duplicate file identification operation, that is, perform the steps in the method for identifying duplicate files shown in fig. 2.

Optionally, after entering the detail interface 912 corresponding to the cleaning acceleration function, the terminal device may execute the repeated file identification operation after detecting the third operation executed by the user on the detail interface 912 corresponding to the cleaning acceleration function. The third operation may be, for example, a pull-down operation.

For example, referring to fig. 10, a schematic diagram of interaction timing between each software module in a terminal device when the terminal device implements a method for identifying duplicate files according to an embodiment of the present application is shown.

The terminal device is also provided with a duplicate file identification module, which may be located at the system layer of the software architecture of the terminal device. Based on this, when the terminal device implements the method for identifying duplicate files, the interaction between the software modules in the terminal device may include S1001 to S1008 as shown in fig. 10, which are described in detail as follows:

S1001, the system manager application transmits a duplicate file identification instruction to the duplicate file identification module in response to a target operation input by the user.

The target operation may be a second operation performed by the user with respect to the cleaning acceleration control 9110 in fig. 9 (b), or may be a third operation performed by the user with respect to the detail interface 912 corresponding to the cleaning acceleration function shown in fig. 9 (c).

S1002, the repeated file identification module scans the files in the storage space after receiving the repeated file identification instruction.

It should be noted that, the step is the same as step S201 in the embodiment corresponding to fig. 2, and specific reference may be made to the description related to step S201 in the embodiment corresponding to fig. 2, which is not repeated here. In reference, the execution body in step S201 is only required to be refined to the duplicate file identification module.

S1003, the repeated file identification module divides the files with the same file size in the scanned files into the same group to obtain one or more first file groups.

It should be noted that, the step is the same as step S202 in the embodiment corresponding to fig. 2, and specific reference may be made to the description related to step S202 in the embodiment corresponding to fig. 2, which is not repeated here. In reference, the execution body in step S202 is only required to be refined to the duplicate file identification module.

S1004, the repeated file identification module eliminates a first file group comprising single files, calculates the hash values of the head and tail pages of all files in the remaining first file groups, and divides the files with the same hash values of the head and tail pages in the same first file group into the same group to obtain one or more second file groups.

It should be noted that, the step is the same as step S203 in the embodiment corresponding to fig. 2, and specific reference may be made to the description of step S203 in the embodiment corresponding to fig. 2, which is not repeated here. At the time of reference, the execution subject in step S203 is only required to be refined to the duplicate file identification module.

S1005, the repeated file identification module eliminates the second file group comprising the single file and calculates the file hash value of each file in the remaining second file group.

It should be noted that, the step is the same as step S204 in the embodiment corresponding to fig. 2, and specific reference may be made to the description related to step S204 in the embodiment corresponding to fig. 2, which is not repeated here. In reference, the execution body in step S204 is only required to be refined to the duplicate file identification module.

S1006, the repeated file identification module identifies repeated files in the remaining second file group based on the file hash values of the files in the remaining second file group.

It should be noted that, the step is the same as step S205 in the embodiment corresponding to fig. 2, and specific reference may be made to the description related to step S205 in the embodiment corresponding to fig. 2, which is not repeated here. At the time of reference, the execution body in step S205 is only required to be refined to the duplicate file identification module.

S1007, the duplicate file identification module returns the duplicate file identification result to the system manager application.

For example, the duplicate file identification result may include information such as the number of duplicate files, the total file size of the duplicate files, the name of the duplicate files, the storage address of the duplicate files, and the modification time of the duplicate files.

S1008, the system manager application displays the duplicate file identification result.

In an alternative implementation, as shown in (c) in fig. 9, after receiving the duplicate file identification result from the duplicate file identification module, the system manager application may display information such as the number of duplicate files and the total file size of the duplicate files in the detail interface 912 corresponding to the cleaning acceleration function. In addition, the terminal device may further set the duplicate file cleaning control 9121 in the detail interface 912 corresponding to the cleaning acceleration function to an available state. Based on this, the terminal device can display the repeated file detail interface 913 shown in (d) in fig. 9 upon detecting the fourth operation performed by the user for the repeated file cleaning control 9121. The user can clean up the duplicate files in the duplicate file detail interface 913 according to the actual requirements.
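The filtering performed by the duplicate file identification module in S1002 to S1006 can be sketched end to end as follows. This is a minimal single-threaded sketch under stated assumptions: the 4096-byte page size, the SHA-256 algorithm, and all function names are illustrative, and unreadable files or broken links are not handled.

```python
import hashlib
import os
from collections import defaultdict

PAGE = 4096  # assumed page size in bytes


def head_tail_hash(path):
    """Hash only the first page and the last page of a file."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(PAGE))
        f.seek(max(0, size - PAGE))
        digest.update(f.read(PAGE))
    return digest.hexdigest()


def full_hash(path):
    """Hash the whole file in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def regroup(groups, key):
    """Split each group by key(path) and drop groups left with a single file."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def identify_duplicates(root):
    """Return groups of duplicate file paths found under root."""
    paths = [os.path.join(d, n) for d, _, names in os.walk(root) for n in names]  # S1002
    first_groups = regroup([paths], os.path.getsize)      # S1003: same file size
    second_groups = regroup(first_groups, head_tail_hash)  # S1004: same head and tail pages
    return regroup(second_groups, full_hash)               # S1005 and S1006: same file hash
```

Each stage is cheaper than the next, so the full file hash is computed only for files that already share both a size and a head-to-tail page hash value.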

It should be noted that, because the duplicate file identification module in the embodiment of the present application is a software module of the software system of the terminal device, the duplicate file identification module has the authority to access all files in the storage space, and based on this, the duplicate file identification module scans the files in the storage space and performs duplicate file identification operation on the scanned files, so that the identification range of duplicate files can be enlarged, and the user can clean all duplicate files in the storage space.

Based on the same technical concept, the embodiment of the application also provides an electronic device, which may include: one or more processors; one or more memories; the one or more memories store one or more computer programs comprising instructions that, when executed by the one or more processors, cause the electronic device to perform one or more steps of any of the method embodiments described above.

Based on the same technical idea, the embodiments of the present application further provide a computer-readable storage medium storing a computer-executable program, which when called by a computer, causes the computer to perform one or more steps of any of the method embodiments described above.

Based on the same technical concept, the embodiments of the present application further provide a chip system, including a processor, where the processor is coupled to a memory, and the processor executes a computer executable program stored in the memory, so as to implement one or more steps of any of the method embodiments described above. The chip system can be a single chip or a chip module composed of a plurality of chips.

Based on the same technical idea, the embodiments of the present application further provide a computer executable program product, which when run on an electronic device, causes the electronic device to perform one or more steps of any of the method embodiments described above.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or illustrated in detail in a given embodiment, reference may be made to the related descriptions of other embodiments. It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application in any way.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

Those of ordinary skill in the art will appreciate that all or part of the processes of the above method embodiments may be accomplished by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when executed, the program may include the processes of the above method embodiments. The aforementioned storage medium includes: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.

The foregoing is merely a specific implementation of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto, and any changes or substitutions within the technical scope disclosed in the embodiments of the present application should be covered by the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of identifying duplicate files, comprising:

scanning files in the storage space;

2. The method of claim 1, wherein the calculating a file hash value for each file in the remaining second set of files comprises:

creating a plurality of threads;

3. The method of identifying duplicate files of claim 2, wherein said creating a plurality of threads comprises:

4. The method of identifying duplicate files of claim 2, wherein said creating a plurality of threads comprises:

creating R threads, where R = S + up_round((D1 - YZ)/D2), in the case where the total file size of the files included in all remaining second file groups is greater than the first file size threshold;

5. The method for identifying duplicate files according to claim 2, wherein the assigning files in the remaining second file group to each thread according to the rule that the calculation amounts of each thread are equal or approximately equal comprises:

6. The method for identifying duplicate files according to claim 2, wherein the assigning files in the remaining second file group to each thread according to the rule that the calculation amounts of each thread are equal or approximately equal comprises:

7. The method for identifying duplicate files according to claim 6, wherein said performing a file allocation adjustment on a target thread requiring a file allocation adjustment based on a total file size of the file to which each thread is allocated comprises:

8. The method for identifying duplicate files according to any one of claims 1-7, wherein said identifying duplicate files in the remaining second set of files based on file hash values of individual files in the remaining second set of files comprises:

9. An electronic device, comprising:

one or more processors;

one or more memories;

the one or more memories store one or more computer-executable programs comprising instructions that, when executed by the one or more processors, cause the electronic device to perform the steps in the method of identifying duplicate files of any of claims 1-8.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer executable program which, when called by a computer, causes the computer to perform the steps of the method of identifying duplicate files according to any one of claims 1-8.

11. A system on a chip, comprising a processor coupled to a memory for storing computer program instructions that, when executed by the processor, cause the system on a chip to perform the steps of the method of identifying duplicate files of any one of claims 1-8.