CN115499426A - Method, device, equipment and medium for transmitting mass small files - Google Patents


Publication number
CN115499426A
Authority
CN
China
Prior art keywords
directory
target
subdirectory
size
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210910755.XA
Other languages
Chinese (zh)
Inventor
张婷
屠志丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd
Priority to CN202210910755.XA
Publication of CN115499426A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1074 Peer-to-peer [P2P] networks for supporting data block transmission mechanisms
    • H04L67/1078 Resource delivery mechanisms
    • H04L67/108 Resource delivery mechanisms characterised by resources being split in blocks or fragments

Abstract

The invention provides a method, a device, equipment and a medium for transmitting massive small files. The transmission method comprises the following steps: scanning a server to obtain directory information of a first target directory in which massive small files are stored, wherein the directory information at least comprises a directory tree structure of the first target directory; splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory; and concurrently transmitting the first number of directory subsets to a receiving device so that the receiving device obtains the massive small files. The method makes full use of the concurrency capability of the server during transmission, which helps shorten the transmission time of massive small files and improve transmission efficiency.

Description

Method, device, equipment and medium for transmitting mass small files
Technical Field
The invention relates to the technical field of data transmission, in particular to a method, a device, equipment and a medium for transmitting massive small files.
Background
Full transmission of massive small files suffers from low utilization of the Central Processing Unit (CPU) and of the bandwidth, and from long transmission times.
In the related art, to solve the above technical problem, the files are packaged before transmission and decompressed after transmission. However, this approach tends to occupy excessive disk space on the servers at both the source end and the destination end, and the decompression after transmission consumes a great deal of time and cost, so the transmission efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a medium for transmitting a large number of small files, so as to solve the problem of low efficiency in transmitting a large number of small files.
According to a first aspect, an embodiment of the present invention provides a method for transmitting a large number of small files, where the method includes:
scanning a server to obtain directory information of a first target directory in which massive small files are stored, wherein the directory information at least comprises a directory tree structure of the first target directory;
splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory;
and concurrently transmitting the first number of directory subsets to receiving equipment so that the receiving equipment can obtain the mass small files.
In the method, the concurrency capability of the server can be fully utilized in the transmission process, so that the transmission time for transmitting the mass small files is shortened, and the transmission efficiency is improved.
With reference to the first aspect, in a first embodiment of the first aspect, the directory information further includes a directory size of the first target directory; the splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, includes:
determining a set reference size of the directory subset according to a quotient between the directory size and the number of concurrent transmissions;
determining a split reference size of the directory subset according to a product between the set reference size and a specified split factor;
respectively determining the directory size of each subdirectory in the directory tree structure;
and splitting the first target directory based on the set reference size, the split reference size and the directory size of each subdirectory to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory.
With reference to the first embodiment of the first aspect, in a second embodiment of the first aspect, the splitting the first target directory based on the set reference size, the split reference size, and the directory size of each sub-directory to obtain a first number of directory subsets includes:
respectively determining the node type of each subdirectory in the directory tree structure;
determining target subdirectories according to the node type and/or the directory size of each subdirectory to obtain a second number of target subdirectories, wherein the target subdirectories are subdirectories of which the node type is a leaf node type, or the directory size is smaller than the splitting reference size;
and combining the second number of target subdirectories to obtain a first number of directory subsets based on the set reference size and the directory size of each target subdirectory.
With reference to the second embodiment of the first aspect, in a third embodiment of the first aspect, the first number of directory subsets comprises a first directory subset and/or a second directory subset; the combining the second number of target subdirectories based on the set reference size and the directory size of each target subdirectory to obtain a first number of directory subsets, comprising:
sorting the second number of target subdirectories from large to small according to the directory size of each target subdirectory;
traversing the second number of target subdirectories according to the sorting result, and if the directory size of the current target subdirectory is larger than or equal to the set reference size, dividing the current target subdirectory separately to form a first directory subset;
and if the directory size of the current target subdirectory is smaller than the set reference size, determining a second target subdirectory combined with the current target subdirectory from the rest target subdirectories based on the difference value between the directory size of the current target subdirectory and the set reference size to obtain a second directory subset.
With reference to the third embodiment of the first aspect, in a fourth embodiment of the first aspect, the determining, from remaining target subdirectories, a second target subdirectory combined with the current target subdirectory based on a difference between the directory size of the current target subdirectory and the set reference size, to obtain a second directory subset, includes:
traversing the remaining target subdirectories, based on the difference, to determine whether a first subdirectory whose directory size equals the difference exists;
if the remaining target subdirectories comprise a first subdirectory, determining that the first subdirectory is a second target subdirectory, and combining the first subdirectory and the current target subdirectory to obtain a second directory subset;
if the first subdirectory is not included in the remaining target subdirectories, acquiring, from the remaining target subdirectories, a second subdirectory whose directory size is closest to the difference;
obtaining a first variance value according to the square of the difference between the size of the second subdirectory and the difference value;
obtaining a second variance value according to the square of the difference between the directory size of the current target subdirectory and the difference value;
and determining the second target subdirectory from the rest of target subdirectories based on the comparison result between the set reference size and the first variance value and the second variance value, and combining the second target subdirectory with the current target subdirectory to obtain a second directory subset.
With reference to the first aspect, in a fifth embodiment of the first aspect, the method further comprises:
in response to the received transmission failure instruction, determining at least one third directory subset to be transmitted again concurrently, wherein the third directory subset is a directory subset in which concurrent transmission fails in the first number of directory subsets;
re-concurrently transmitting the at least one third subset of directories to the receiving device.
In the method, the transmission condition can be monitored in real time in the process of concurrent transmission, and then the third directory subset which fails in concurrent transmission can be retransmitted when the third directory subset exists, so that the effectiveness of transmission is guaranteed, and the transmission efficiency is improved.
With reference to the first aspect, in a sixth embodiment of the first aspect, the method further comprises:
if the concurrent transmission is finished, reading a second target directory received by the receiving equipment;
and carrying out consistency check on the second target directory and the first target directory to determine whether the second target directory is complete.
In the method, the consistency check can be automatically performed on the second target directory received by the receiving device after the concurrent transmission is finished, so that the integrity of data transmission is ensured, and the directory subset which is not transmitted or fails to be transmitted can be timely found.
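A consistency check of this kind can be sketched as follows. This is only an illustrative sketch under assumptions not stated in the source: both directories are assumed to be reachable as local paths (for example through a mount), and consistency is judged by relative file path and file size only, not by file content. The function names `list_files` and `is_consistent` are hypothetical.

```python
import os


def list_files(root):
    """Map each file's path relative to root to its size in bytes."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            listing[os.path.relpath(path, root)] = os.path.getsize(path)
    return listing


def is_consistent(first_target, second_target):
    """Return True if both directories hold the same files with the same sizes.

    Assumption: comparing relative paths and sizes is enough for this sketch;
    a stricter check could also compare checksums.
    """
    return list_files(first_target) == list_files(second_target)
```

A mismatch in this comparison would indicate a file that was not transmitted or was transmitted incompletely.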
According to a second aspect, an embodiment of the present invention provides a device for transmitting a large number of small files, where the device includes:
the scanning unit is used for scanning a server to obtain directory information of a first target directory in which massive small files are stored, wherein the directory information at least comprises a directory tree structure of the first target directory;
the splitting unit is used for splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory;
a transmission unit, configured to concurrently transmit the first number of directory subsets to a receiving device, so that the receiving device obtains the massive small files.
According to a third aspect, the embodiments of the present invention further provide a computer device, which includes a memory and a processor, where the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions, so as to execute the method for transmitting the mass small files according to any one of the first aspect and the optional embodiments thereof.
According to a fourth aspect, the embodiments of the present invention further provide a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause the computer to execute the method for transmitting the mass small files according to the first aspect and any one of the optional embodiments thereof.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for transmitting a large number of small files according to an exemplary embodiment.
Fig. 2 is a schematic diagram of a directory tree structure according to an exemplary embodiment.
Fig. 3 is a diagram of a sorting result according to an exemplary embodiment.
Fig. 4 is a diagram illustrating a partitioning result of a directory subset according to an exemplary embodiment.
Fig. 5 is a flow chart of another proposed method for transferring a large number of small files according to an example embodiment.
Fig. 6 is a flowchart of another method for transmitting a large amount of small files according to an exemplary embodiment.
Fig. 7 is a flowchart of yet another method for transferring a large number of small files according to an exemplary embodiment.
Fig. 8 is a block diagram of a transmission apparatus for a large amount of small files according to an exemplary embodiment.
Fig. 9 is a hardware configuration diagram of a computer device according to an exemplary embodiment.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the related art, when massive small files are transmitted, the files are packaged before transmission and decompressed after transmission. However, this approach tends to occupy excessive disk space on the servers at both the source end and the destination end, and the decompression after transmission consumes a great deal of time and cost, so the transmission efficiency is low.
To solve the above problems, an embodiment of the present invention provides a method for transmitting massive small files. The method is used in a computer device, and its execution subject may be a transmission apparatus for massive small files, which may be implemented as part or all of the computer device in software, hardware, or a combination of the two. The computer device may be a terminal, a client, or a server; the server may be a single server or a server cluster formed by multiple servers, and the terminal may be an intelligent hardware device such as a smart phone, a personal computer, a tablet computer, a wearable device, or an intelligent robot. In the following method embodiments, a computer device is taken as the execution subject by way of example.
The computer device of this embodiment is applied to an application scenario in which full transmission is performed on a large number of small files, for example: I/O intensive scenarios. According to the transmission method of the massive small files, the massive small files can be split based on the directory information of the first target directory for storing the massive small files and the number of concurrent transmissions of the server, and then the concurrent transmissions are carried out according to the splitting result, so that the concurrent capability of the server can be fully utilized in the transmission process, the transmission time is shortened, and the transmission efficiency is improved.
Fig. 1 is a flowchart of a method for transmitting a large number of small files according to an exemplary embodiment. As shown in fig. 1, the transmission method of a large number of small files includes the following steps S101 to S103.
In step S101, the server is scanned to obtain directory information of a first target directory in which a large number of small files are stored.
In the embodiment of the present invention, the server may be understood as the source end. The massive small files to be sent (a file volume at the TB/PB level) are stored in the server in the form of a directory. To transmit the massive small files required by the receiving device to the receiving device, the server is scanned to acquire directory information of a first target directory in which the massive small files are stored. The first target directory may be understood as the target directory stored in the server that is to be transmitted to the receiving device, and the directory information at least includes a directory tree structure of the first target directory. In an example, the scan can be performed while the server is idle, so that the scan does not affect the service the server provides to other devices.
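The scan in step S101 can be illustrated with a short sketch. It assumes the first target directory is a local path on the server and that the directory size means the total size in bytes of all regular files beneath a directory; the function name `scan_directory` is hypothetical and not taken from the source.

```python
import os


def scan_directory(root):
    """Return {directory path: total size in bytes of all files beneath it}."""
    sizes = {}
    # Walk bottom-up so every child's total is known before its parent's.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # assumption: symlinks are skipped
                total += os.path.getsize(path)
        for name in dirnames:
            # Symlinked directories are not walked, so they contribute 0.
            total += sizes.get(os.path.join(dirpath, name), 0)
        sizes[dirpath] = total
    return sizes
```

The keys of the returned mapping describe the directory tree, and the values give the directory size of each subdirectory used in the later splitting steps.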
In step S102, the first target directory is split based on the number of concurrent transmissions of the server and the directory tree structure, so as to obtain a first number of directory subsets.
In the embodiment of the invention, the number of concurrent transmissions is obtained based on the concurrency capability of the server during file transmission. The association between different small files can be determined from the directory tree structure of the first target directory, so that the integrity of the files can be guaranteed during splitting.
Therefore, in order to improve the transmission efficiency, the first target directory is split based on the concurrent transmission number and the directory tree structure of the server, so that a directory subset with a first number is obtained, and in the concurrent transmission process, the transmission can be performed by taking the directory subset as a unit, so that the number of small files transmitted at one time is reduced, and the time occupied by one-time transmission is shortened. The number of concurrent transmissions of the server may be determined based on the number of threads that can be provided by the thread pool. The first number may be understood to be the total number of directory subsets.
In step S103, a first number of directory subsets are concurrently transmitted to the receiving device, so that the receiving device obtains a large number of small files.
By the embodiment, the concurrency capability of the server can be fully utilized in the transmission process, so that the transmission time for transmitting massive small files is shortened, and the transmission efficiency is improved.
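The concurrent transmission in step S103 can be sketched with a thread pool. This is a minimal sketch, not the implementation described by the source: the callable `send` that transmits one directory subset is a hypothetical placeholder supplied by the caller, and a subset is assumed to be a list of subdirectory paths.

```python
from concurrent.futures import ThreadPoolExecutor


def transmit_concurrently(subsets, send, max_workers):
    """Call send(subset) for every subset in parallel.

    Returns a list of (subset, ok) pairs in the original order, where ok is
    False if send raised an exception for that subset.
    """
    def attempt(subset):
        try:
            send(subset)
            return True
        except Exception:
            return False

    # max_workers plays the role of the number of concurrent transmissions.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(attempt, subsets))
    return list(zip(subsets, outcomes))
```

Because each worker handles one directory subset at a time, balanced subset sizes keep all workers busy for roughly the same duration.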
The following example will specifically illustrate the process of obtaining the first number of directory subsets.
In the present invention, the directory information further includes a directory size of the first target directory. The directory size gives the disk space occupied by the first target directory. To make full use of the concurrency capability of the server and to determine the amount of data each thread would transmit in the ideal case, the set reference size of the directory subset is determined according to the quotient between the directory size and the number of concurrent transmissions. For example: if the number of concurrent transmissions is N and the directory size is X, then the set reference size for each directory subset is X / N.
In one example, the number of concurrent transmissions may be determined according to the number of cores of the Central Processing Unit (CPU) of the server. For example: the number of concurrent transmissions N = n × the number of CPU cores, where n may take the value 2.
According to the directory tree structure, different subdirectories may have different directory sizes. To balance the set sizes of the directory subsets and avoid large differences between them, the split reference size of the directory subset is determined according to the product of the set reference size and a specified splitting factor, and the number of subdirectories in each directory subset is then determined according to the split reference size. The smaller the specified splitting factor θ (0 < θ ≤ 1), the more even the set sizes of the directory subsets, and the more complex the splitting process. In one example, the specified splitting factor θ may be 0.8. In another example, the specified splitting factor may be determined based on a resource occupancy ceiling of the server.
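The three quantities introduced so far (number of concurrent transmissions, set reference size, split reference size) can be computed as in the following sketch. The function name `reference_sizes` and its parameter names are hypothetical; the default multiplier 2 and splitting factor 0.8 are the example values given in the text.

```python
import os


def reference_sizes(directory_size, cpu_cores=None, multiplier=2, theta=0.8):
    """Return (concurrent transmissions, set reference size, split reference size)."""
    if not 0 < theta <= 1:
        raise ValueError("theta must satisfy 0 < theta <= 1")
    # Assumption: fall back to the local core count when none is given.
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    concurrency = multiplier * cores          # N = multiplier x CPU cores
    set_ref = directory_size / concurrency    # X / N
    split_ref = theta * set_ref               # theta x (X / N)
    return concurrency, set_ref, split_ref
```

For instance, a directory of size 1600 on a 4-core server gives N = 8, a set reference size of 200 and a split reference size of about 160.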
The directory size of each subdirectory in the directory tree structure is determined separately. And further splitting the first target directory based on the determined set reference size, the split reference size and the directory size of each subdirectory to obtain a first number of directory subsets. Wherein each directory subset includes at least one target subdirectory therein.
In one embodiment, the first number of directory subsets may be obtained as follows: according to the directory tree structure shown in fig. 2, the node type of each subdirectory in the directory tree structure is determined. The node types comprise a leaf node type, a child node type and a root node type. In fig. 2, different letters denote different subdirectories, and the number under each subdirectory denotes its directory size. The target subdirectories are determined according to the node type of each subdirectory, the directory size of each subdirectory, or both, to obtain a second number of target subdirectories, and the second number of target subdirectories are combined based on the set reference size and the directory size of each target subdirectory to obtain a first number of directory subsets. A target subdirectory is a subdirectory whose node type is the leaf node type or whose directory size is smaller than the split reference size. The second number may be understood as the total number of target subdirectories.
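One way to read the selection rule is as a top-down traversal that stops at the first node that is a leaf or is smaller than the split reference size. The sketch below follows that reading; the `DirNode` class and the function `select_targets` are hypothetical names, and files stored directly in a directory that gets split further are not handled here.

```python
from dataclasses import dataclass, field


@dataclass
class DirNode:
    name: str
    size: int                      # directory size, including all descendants
    children: list = field(default_factory=list)


def select_targets(node, split_ref):
    """Return the target subdirectories found beneath (or at) node."""
    # A leaf, or a directory already smaller than the split reference size,
    # is kept whole as one target subdirectory.
    if not node.children or node.size < split_ref:
        return [node]
    targets = []
    for child in node.children:
        targets.extend(select_targets(child, split_ref))
    return targets
```

With a split reference size of 50, a root of size 90 containing A (size 60, with leaves A1 = 40 and A2 = 20) and B (size 30, with one leaf) yields the targets A1, A2 and B.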
In the present invention, the first number of directory subsets may be composed of a plurality of first directory subsets, may be composed of a plurality of second directory subsets, and may be composed of a plurality of first directory subsets and second directory subsets. Wherein the first subset of directories is to be understood as a subset of directories comprising only one target subdirectory. The second subset of directories may be understood as a subset of directories comprising at least two target subdirectories.
Specifically, in combination with the directory tree structure shown in fig. 2, the second number of target subdirectories are sorted from large to small according to the directory size of each target subdirectory, giving the sorting result shown in fig. 3. The second number of target subdirectories are traversed according to the sorting result, and if the directory size of the current target subdirectory is greater than or equal to the set reference size, the current target subdirectory is placed in a set by itself to form a first directory subset. If the directory size of the current target subdirectory is smaller than the set reference size, a second target subdirectory to be combined with the current target subdirectory is determined from the remaining target subdirectories based on the difference between the directory size of the current target subdirectory and the set reference size, to obtain a second directory subset, so that the set size of each directory subset is equal or close to the set reference size.
In one example, determining a second target subdirectory from the remaining target subdirectories to combine with the current target subdirectory based on the difference between the directory size of the current target subdirectory and the set reference size, and obtaining the second subset of directories may be as follows:
Based on the difference, the remaining target subdirectories are traversed to determine whether a first subdirectory whose directory size equals the difference exists. If the remaining target subdirectories include the first subdirectory, the first subdirectory is determined to be the second target subdirectory, and the first subdirectory and the current target subdirectory are combined to obtain a second directory subset. If the first subdirectory is not included in the remaining target subdirectories, a second subdirectory whose directory size is closest to the difference is acquired from the remaining target subdirectories. The square of the difference between the size of the second subdirectory and the difference is determined, giving a first variance value. The square of the difference between the directory size of the current target subdirectory and the difference is determined, giving a second variance value. A second target subdirectory is then determined from the remaining target subdirectories based on the comparison between the set reference size and the first and second variance values, and is combined with the current target subdirectory to obtain a second directory subset. The number of second target subdirectories is not necessarily one.
In an implementation scenario, the directory size of each target subdirectory is compared with the set reference size one by one according to the sorting result shown in fig. 3. If the directory size Size(Y_i) of the current target subdirectory Y_i is greater than or equal to the set reference size X / N, the target subdirectory Y_i is placed in a set by itself to form a first directory subset.
If the directory size Size(Y_i) of the current target subdirectory Y_i is smaller than the set reference size X / N, the difference Δ between Size(Y_i) and the set reference size X / N is determined. If a first subdirectory Y_j whose directory size equals the difference Δ exists among the remaining target subdirectories, the current target subdirectory Y_i and the first subdirectory Y_j are combined in the same set to form a second directory subset.
If no first subdirectory Y_j whose directory size equals the difference Δ exists among the remaining target subdirectories, the second subdirectory Y_{j+1} whose directory size is closest to the difference Δ is obtained from the remaining target subdirectories. A first variance value is obtained from the square of the difference between the size Size(Y_{j+1}) of the second subdirectory Y_{j+1} and the difference Δ. A second variance value is obtained from the square of the difference between the directory size Size(Y_i) of the current target subdirectory Y_i and the difference Δ. If the first variance value is less than or equal to the second variance value, the second subdirectory Y_{j+1} is determined as a second target subdirectory that forms the second directory subset together with the current target subdirectory Y_i, and further second target subdirectories continue to be obtained from the remaining target subdirectories on the same principle until the set size of the resulting second directory subset is equal or close to the set reference size. If the first variance value is greater than the second variance value, the second variance value is temporarily stored, and, based on the difference Δ' between the size Size(Y_{j+1}) of the second subdirectory and the difference Δ, a third subdirectory whose directory size is closest to Δ' is obtained from the remaining target subdirectories. A third variance value is obtained from the square of the difference between the directory size of the third subdirectory and the difference Δ'. A fourth variance value is obtained from the square of the difference between the size Size(Y_{j+1}) of the second subdirectory and the difference Δ'.
According to the third variance value and the fourth variance value, the subdirectory with the smallest variance value is determined as the second target subdirectory and is combined with the current target subdirectory Y_i in the same set. These steps are repeated until a second directory subset whose set size is equal or close to the set reference size is obtained.
The first number of directory subsets obtained from the first directory subsets and the second directory subsets may be as shown in fig. 4, where each row is a directory subset.
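The combination step can be approximated with a greedy closest-fit procedure. The sketch below is a simplified interpretation and not the exact variance comparison described above: for each subset it repeatedly picks the remaining target subdirectory whose size is closest to the remaining gap, and accepts it only if the squared deviation of the subset size from the set reference size does not grow. The function name `combine_targets` is hypothetical, and the number of subsets produced is not forced to equal the number of concurrent transmissions.

```python
def combine_targets(sizes, set_ref):
    """Group target subdirectories into directory subsets.

    sizes: {subdirectory name: directory size}
    Returns a list of subsets, each a list of subdirectory names.
    """
    # Sort from large to small (stable for equal sizes).
    remaining = sorted(sizes, key=lambda k: sizes[k], reverse=True)
    subsets = []
    while remaining:
        current = remaining.pop(0)
        subset, total = [current], sizes[current]
        # A subdirectory at or above the reference size stays alone.
        while total < set_ref and remaining:
            gap = set_ref - total
            best = min(remaining, key=lambda k: abs(sizes[k] - gap))
            # Stop if adding the candidate would move the subset size
            # further from the set reference size.
            if (total + sizes[best] - set_ref) ** 2 > (total - set_ref) ** 2:
                break
            subset.append(best)
            total += sizes[best]
            remaining.remove(best)
        subsets.append(subset)
    return subsets
```

With sizes A = 50, B = 30, C = 20, D = 20, E = 10, F = 5 and a set reference size of 45, this sketch produces the subsets [A], [B, C] and [D, E, F], whose set sizes are 50, 50 and 35.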
Fig. 5 is a flowchart of another proposed method for transferring a large amount of small files according to an exemplary embodiment. As shown in fig. 5, the method for transmitting a large number of small files includes the following steps.
In step S501, a server is scanned to obtain directory information of a first target directory in which a large number of small files are stored.
In step S502, the first target directory is split based on the number of concurrent transmissions of the server and the directory tree structure, resulting in a first number of directory subsets.
In step S503, a first number of directory subsets are concurrently transmitted to the receiving device, so that the receiving device obtains a large number of small files.
In step S504, at least one third subset of directories to be re-concurrently transmitted is determined in response to the received transmission failure instruction.
In the embodiment of the present invention, the server may execute the concurrent transmission task through a data mirroring backup tool (e.g., rsync), which monitors the transmission status of each directory subset during transmission. If the tool detects that the concurrent transmission of at least one third directory subset has failed, it sends a transmission failure instruction to the server to notify the server of the failure. In response to the received transmission failure instruction, the server determines the at least one third directory subset to be concurrently transmitted again, where a third directory subset is a directory subset, among the first number of directory subsets, whose concurrent transmission failed.
In step S505, at least one third subset of directories is re-concurrently transmitted to the receiving device.
Through the embodiment, the transmission condition can be monitored in real time in the process of concurrent transmission, and then the third directory subset which fails in concurrent transmission can be retransmitted, so that the effectiveness of transmission is guaranteed, and the transmission efficiency is improved.
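The retransmission of failed directory subsets can be sketched as follows. This is an illustrative sketch only: `transmit_with_retry`, `transfer`, and `max_rounds` are hypothetical names, and `transfer` stands for any callable that sends one directory subset (for example, by invoking a tool such as rsync) and returns True on success.

```python
from concurrent.futures import ThreadPoolExecutor


def transmit_with_retry(subsets, transfer, max_workers, max_rounds=3):
    """Send directory subsets concurrently and resend the ones that fail.

    Returns the subsets that still failed after max_rounds rounds.
    """
    pending = list(subsets)
    for _ in range(max_rounds):
        if not pending:
            break
        # One worker per concurrent transmission slot.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(transfer, pending))
        # Keep only the subsets whose transmission failed (the "third" subsets).
        pending = [s for s, ok in zip(pending, results) if not ok]
    return pending
```

Limiting the number of rounds is a design choice of this sketch, made so that a permanently failing subset cannot cause an endless loop; the source does not specify a retry limit.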
Fig. 6 is a flowchart of another method for transmitting a large amount of small files according to an exemplary embodiment. As shown in fig. 6, the method for transmitting a large number of small files includes the following steps.
In step S601, the server is scanned to obtain directory information of a first target directory in which a large number of small files are stored.
In step S602, the first target directory is split based on the number of concurrent transmissions of the server and the directory tree structure, so as to obtain a first number of directory subsets.
In step S603, a first number of directory subsets are concurrently transmitted to the receiving device, so that the receiving device obtains a large number of small files.
In step S604, if the concurrent transmission is ended, the second target directory received by the receiving device is read.
In step S605, the second target directory is checked for consistency with the first target directory to determine whether the received second target directory is complete.
In the embodiment of the present invention, the consistency check may be performed by comparing the directory sizes of the second target directory and the first target directory. Alternatively, it may be performed by comparing the directory sizes together with the timestamp at which the second target directory was received, or by comparing the message digest (e.g., MD5) of the second target directory with that of the first target directory. In one example, if the target subdirectory is a key directory, the consistency check may compare the message digests (e.g., MD5) of the second target directory and the first target directory. If the target subdirectory is a secondary directory, the consistency check may compare the directory sizes and the receiving timestamp of the second target directory, which saves check time.
Through the embodiment, the consistency check can be automatically carried out on the second target directory received by the receiving equipment after the concurrent transmission is finished, so that the integrity of data transmission is ensured, and the directory subset which is not transmitted or fails to be transmitted can be found in time.
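The two levels of consistency check described above can be sketched as follows. The sketch assumes, for illustration only, that both directories are reachable as local paths and omits the timestamp comparison; the function names `dir_size`, `dir_digest`, and `is_consistent` are hypothetical.

```python
import hashlib
import os


def dir_size(root):
    """Total size in bytes of all files under root."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def dir_digest(root):
    """MD5 digest over relative paths and file contents, in a stable order."""
    md5 = hashlib.md5()
    for dirpath, dirs, files in os.walk(root):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            md5.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    md5.update(chunk)
    return md5.hexdigest()


def is_consistent(first_dir, second_dir, key_directory):
    """Digest comparison for key directories, size comparison otherwise."""
    if key_directory:
        return dir_digest(first_dir) == dir_digest(second_dir)
    return dir_size(first_dir) == dir_size(second_dir)
```

The size comparison is cheaper but cannot detect a change that leaves the size unchanged, which is why the digest comparison is reserved for key directories.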
In an embodiment, to prevent the concurrent transmission from occupying too many server resources and degrading server performance, the actual resource occupancy rate of the server is detected periodically during concurrent transmission to determine whether to start server resource protection. If the actual resource occupancy rate is greater than the resource occupancy rate upper limit, server resource protection is started: at least one target subdirectory with a higher resource occupancy rate in the concurrent transmission is suspended to ensure normal operation of the server, and the suspended target subdirectory is resumed once the actual resource occupancy rate falls below the resource occupancy rate upper limit. In one example, the resource occupancy rate upper limit may include an upper CPU threshold (e.g., 80%), an upper memory threshold (e.g., 80%), and a maximum transmission rate in MB/s (e.g., limited to 0.8 × bandwidth / 8).
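The resource protection decision can be sketched as a pair of pure functions. This is a hedged illustration: the names `Limits`, `max_rate_mb_s`, `needs_protection`, and `choose_to_suspend` are hypothetical, and the sketch assumes that the bandwidth is given in Mbit/s, so that dividing by 8 yields MB/s. How the occupancy figures are measured is outside the scope of the sketch.

```python
from dataclasses import dataclass


def max_rate_mb_s(bandwidth_mbit_s):
    """Maximum transmission rate in MB/s: 0.8 x bandwidth / 8 (assumed Mbit/s)."""
    return 0.8 * bandwidth_mbit_s / 8


@dataclass(frozen=True)
class Limits:
    cpu: float = 0.80        # upper CPU threshold (80%)
    memory: float = 0.80     # upper memory threshold (80%)
    rate_mb_s: float = 100.0  # maximum transmission rate in MB/s


def needs_protection(cpu, memory, rate_mb_s, limits):
    """True if any measured occupancy exceeds its upper limit."""
    return (cpu > limits.cpu
            or memory > limits.memory
            or rate_mb_s > limits.rate_mb_s)


def choose_to_suspend(usage_by_subdir):
    """Pick the target subdirectory with the highest resource occupancy."""
    return max(usage_by_subdir, key=usage_by_subdir.get)
```

A periodic monitor could call `needs_protection` with freshly sampled values and suspend the transmission returned by `choose_to_suspend` until the values fall back below the limits.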
In one implementation scenario, the process of transferring a large number of small files may be as shown in FIG. 7. Fig. 7 is a flowchart of yet another method for transferring a large number of small files according to an exemplary embodiment.
In step S701, it is determined whether or not the pre-scanning and the task optimization processing are performed.
In step S702, the server is scanned to obtain directory information of a first target directory in which a large number of small files are stored.
In step S703, the number of concurrent transmissions, the set reference size of the directory subset, and the split reference size are determined.
In step S704, the first target directory is split, resulting in a first number of directory subsets.
In step S705, a first number of subsets of directories are concurrently transmitted to the receiving device.
In step S706, it is determined whether to perform a consistency check according to the concurrent transmission state.
In step S707, if the concurrent transmission is ended, the second target directory received by the receiving device is checked for consistency according to the first target directory.
In step S708, during the concurrent transmission, the actual resource occupancy rate of the server is periodically detected to determine whether to start the server resource protection.
In step S709, if the actual resource occupancy is greater than the resource occupancy upper limit, the server resource protection is started.
Through the embodiment, the method adapts to the resources of the server and optimizes a single target task into a set of subtasks with uniform workloads, so that the concurrency capability of the server is fully utilized, the time needed to transmit a large number of small files is shortened, and the transmission efficiency is improved. In addition, the method provided by the invention does not occupy additional disk space, which helps save resources.
Based on the same inventive concept, the invention also provides a transmission device for the mass small files.
The transmission device for the massive small files comprises a directory scanning module, a self-adaptive task planning module, a concurrent transmission module, a server resource protection module and a consistency checking module. The directory scanning module is used for scanning the first target directory during the idle service period of the server to acquire directory information of the first target directory. The self-adaptive task planning module is used for splitting the first target directory to obtain a first number of directory subsets to be transmitted concurrently. The concurrent transmission module is to transmit a first number of the subset of directories to the receiving device. The server resource protection module is used for protecting server resources and avoiding the resources of the server from being excessively occupied in the concurrent transmission process. The consistency check module is used for carrying out consistency check on the first target directory and the second target directory.
Based on the same inventive concept, the invention also provides another transmission device for the mass small files.
Fig. 8 is a block diagram of a transmission apparatus for a large amount of small files according to an exemplary embodiment. As shown in fig. 8, the transmission apparatus for a large number of small files includes a scanning unit 801, a splitting unit 802, and a transmission unit 803.
A scanning unit 801, configured to scan a server to obtain directory information of a first target directory in which a large number of small files are stored, where the directory information at least includes a directory tree structure of the first target directory;
a splitting unit 802, configured to split the first target directory based on the number of concurrent transmissions of the server and the directory tree structure, to obtain directory subsets of a first number, where each directory subset includes at least one target subdirectory;
a transmitting unit 803, configured to concurrently transmit the first number of directory subsets to a receiving device, so that the receiving device obtains a large amount of small files.
In one embodiment, the directory information further includes a directory size of the first target directory; the splitting unit 802 includes: a first determining unit, configured to determine an aggregate reference size of the directory subset according to a quotient between the directory size and the number of concurrent transmissions. And the second determining unit is used for determining the splitting reference size of the directory subset according to the product between the set reference size and the specified splitting factor. And a third determining unit, configured to determine the directory size of each sub-directory in the directory tree structure. The splitting subunit is configured to split the first target directory based on the set reference size, the splitting reference size, and the directory size of each sub-directory, to obtain a first number of directory subsets, where each directory subset includes at least one target sub-directory.
In another embodiment, the splitting subunit includes: a fourth determining unit, configured to determine the node type of each subdirectory in the directory tree structure; a fifth determining unit, configured to determine target subdirectories according to the node type and/or the directory size of each subdirectory to obtain a second number of target subdirectories, where a target subdirectory is a subdirectory whose node type is a leaf node type or whose directory size is smaller than the splitting reference size; and a combining unit, configured to combine the second number of target subdirectories based on the set reference size and the directory size of each target subdirectory to obtain the first number of directory subsets.
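The reference sizes and the selection of target subdirectories can be sketched as follows. The names `DirNode`, `reference_sizes`, and `select_targets` are hypothetical, as is the splitting factor value used in the example; the source does not state a concrete factor. The sketch also ignores files stored directly in a non-leaf directory, which a real implementation would need to handle.

```python
from dataclasses import dataclass, field


@dataclass
class DirNode:
    name: str
    size: int                      # total bytes under this directory
    children: list = field(default_factory=list)


def reference_sizes(total_size, concurrency, split_factor):
    """Set reference size = size / concurrency; split reference = set x factor."""
    set_ref = total_size / concurrency
    return set_ref, set_ref * split_factor


def select_targets(root, split_ref):
    """Collect subdirectories that are leaves or smaller than split_ref."""
    targets = []
    stack = list(root.children)
    while stack:
        node = stack.pop()
        if not node.children or node.size < split_ref:
            targets.append(node)          # leaf node or small enough: keep whole
        else:
            stack.extend(node.children)   # too large: descend into children
    return targets


tree = DirNode("root", 1000, [
    DirNode("A", 600, [
        DirNode("A1", 100),
        DirNode("A2", 500, [DirNode("A2a", 300), DirNode("A2b", 200)]),
    ]),
    DirNode("B", 100),
    DirNode("C", 300),
])
set_ref, split_ref = reference_sizes(tree.size, concurrency=4, split_factor=0.5)
target_names = sorted(n.name for n in select_targets(tree, split_ref))
```

Here directory `A` and its child `A2` exceed the split reference size and are not leaves, so the traversal descends into them and returns only their leaf descendants together with `B` and `C`.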
In yet another embodiment, the first number of subsets of directories comprises a first subset of directories and/or a second subset of directories. The combination unit includes: and the sorting unit is used for sorting the second number of target subdirectories from large to small according to the directory size of each target subdirectory. And the first combination unit is used for traversing a second number of target subdirectories according to the sorting result, and if the directory size of the current target subdirectory is larger than or equal to the set reference size, the current target subdirectory is divided separately to form a first directory subset. And the second combining unit is used for determining a second target subdirectory combined with the current target subdirectory from the rest target subdirectories based on the difference between the directory size of the current target subdirectory and the set reference size to obtain a second directory subset if the directory size of the current target subdirectory is smaller than the set reference size.
In yet another embodiment, the second combination unit includes: and the traversing unit is used for traversing whether the first subdirectory with the directory size equal to the difference exists in the remaining target subdirectories or not based on the difference. And the first sub-directory combining unit is used for determining that the first sub-directory is a second target sub-directory if the remaining target sub-directories comprise the first sub-directory, and combining the first sub-directory with the current target sub-directory to obtain a second directory subset. And the searching unit is used for acquiring a second subdirectory with the directory size closest to the difference value from the remaining target subdirectories based on the difference value if the remaining target subdirectories do not comprise the first subdirectory. And the first calculation unit is used for obtaining a first variance value according to the square of the difference between the directory size of the second subdirectory and the difference value. And the second calculation unit is used for obtaining a second variance value according to the square of the difference between the directory size of the current target subdirectory and the difference value. And the second subdirectory combination unit is used for determining a second target subdirectory from the rest target subdirectories based on the comparison result between the set reference size and the first variance value and the second variance value, and combining the second target subdirectory with the current target subdirectory to obtain a second directory subset.
In yet another embodiment, the apparatus further comprises: a sixth determining unit, configured to determine, in response to the received transmission failure instruction, at least one third directory subset to be concurrently transmitted again, where the third directory subset is a directory subset that fails to be concurrently transmitted among the first number of directory subsets. A transmitting unit 803, further configured to transmit the at least one third subset of directories to the receiving device again and concurrently.
In yet another embodiment, the apparatus further comprises: and the reading unit is used for reading the second target directory received by the receiving equipment if the concurrent transmission is ended. And the checking unit is used for carrying out consistency check on the second target directory and the first target directory so as to determine whether the second target directory is complete.
The specific limitations and beneficial effects of the transmission device for the mass small files can be referred to the limitations of the transmission method for the mass small files, and are not described herein again. The various modules described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 9 is a hardware configuration diagram of a computer device according to an exemplary embodiment. As shown in fig. 9, the device includes one or more processors 960 and a memory 920, where the memory 920 includes a persistent memory, a volatile memory, and a hard disk; one processor 960 is taken as an example in fig. 9. The device may further include an input device 930 and an output device 940.
The processor 960, the memory 920, the input device 930, and the output device 940 may be connected by a bus or other means, such as by a bus connection in fig. 9.
Processor 960 may be a Central Processing Unit (CPU). The Processor 960 may also be any chip including a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 920 is a non-transitory computer readable storage medium, including a persistent memory, a volatile memory, and a hard disk, and can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the method for transmitting a large number of small files in the embodiments of the present application. The processor 960 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 920, thereby implementing any one of the above methods for transmitting a large number of small files.
The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data used as needed or desired, and the like. Further, the memory 920 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 920 may optionally include memory located remotely from processor 960, which may be connected to a data processing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control. The output device 940 may include a display device such as a display screen.
One or more modules are stored in the memory 920 and, when executed by the one or more processors 960, perform the methods illustrated in fig. 1-7.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. Details of the technique not described in detail in the present embodiment may be specifically referred to the related description in the embodiments shown in fig. 1 to fig. 7.
Embodiments of the present invention further provide a non-transitory computer storage medium storing computer-executable instructions that can execute the method for transmitting a large number of small files in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A method for transmitting a large number of small files is characterized by comprising the following steps:
the method comprises the steps that a server is scanned to obtain directory information of a first target directory in which massive small files are stored, wherein the directory information at least comprises a directory tree structure of the first target directory;
splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory;
and concurrently transmitting the first number of directory subsets to receiving equipment so that the receiving equipment can obtain the mass small files.
2. The method of claim 1, wherein the directory information further comprises a directory size of the first target directory; the splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, includes:
determining a set reference size of the directory subset according to a quotient value between the directory size and the concurrent transmission quantity;
determining a split reference size of the directory subset according to a product between the set reference size and a specified split factor;
respectively determining the directory size of each subdirectory in the directory tree structure;
and splitting the first target directory based on the set reference size, the split reference size and the directory size of each subdirectory to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory.
3. The method of claim 2, wherein splitting the first target directory based on the set reference size, the split reference size, and the directory size of each subdirectory to obtain a first number of directory subsets comprises:
respectively determining the node type of each subdirectory in the directory tree structure;
determining target subdirectories according to the node type and/or the directory size of each subdirectory to obtain a second number of target subdirectories, wherein the target subdirectories are subdirectories of which the node type is a leaf node type, or the directory size is smaller than the splitting reference size;
and combining the second number of target subdirectories to obtain a first number of directory subsets based on the set reference size and the directory size of each target subdirectory.
4. The method of claim 3, wherein the first number of subsets of directories comprises a first subset of directories and/or a second subset of directories; the combining the second number of target subdirectories based on the set reference size and the directory size of each target subdirectory to obtain a first number of directory subsets, comprising:
sorting the second number of target subdirectories from large to small according to the directory size of each target subdirectory;
traversing the second number of target subdirectories according to the sorting result, and if the directory size of the current target subdirectory is larger than or equal to the set reference size, independently dividing the current target subdirectory to form a first directory subset;
and if the directory size of the current target subdirectory is smaller than the set reference size, determining a second target subdirectory combined with the current target subdirectory from the rest target subdirectories based on the difference value between the directory size of the current target subdirectory and the set reference size to obtain a second directory subset.
5. The method of claim 4, wherein determining a second target subdirectory from the remaining target subdirectories to combine with the current target subdirectory based on a difference between the directory size of the current target subdirectory and the set reference size, resulting in a second subset of directories, comprises:
traversing whether a first sub-directory with the directory size equal to the difference exists in the rest target sub-directories based on the difference;
if the remaining target subdirectories comprise a first subdirectory, determining that the first subdirectory is a second target subdirectory, and combining the first subdirectory and the current target subdirectory to obtain a second directory subset;
if the first subdirectory is not included in the remaining target subdirectory, acquiring a second subdirectory with the size which is the closest to the difference value from the remaining target subdirectory based on the difference value;
obtaining a first variance value according to the square of the difference between the directory size of the second subdirectory and the difference value;
obtaining a second variance value according to the square of the difference between the directory size of the current target subdirectory and the difference value;
and determining the second target subdirectory from the rest of target subdirectories based on the comparison result between the set reference size and the first variance value and the second variance value, and combining the second target subdirectory with the current target subdirectory to obtain a second directory subset.
6. The method of claim 1, further comprising:
in response to the received transmission failure instruction, determining at least one third directory subset to be transmitted again concurrently, wherein the third directory subset is a directory subset in which concurrent transmission fails in the first number of directory subsets;
re-concurrently transmitting the at least one third subset of directories to the receiving device.
7. The method of claim 1 or 6, further comprising:
if the concurrent transmission is finished, reading a second target directory received by the receiving equipment;
and carrying out consistency check on the second target directory and the first target directory to determine whether the second target directory is complete.
8. A transmission device for massive small files is characterized in that the device comprises:
the scanning unit is used for scanning the server to obtain directory information of a first target directory in which massive small files are stored, wherein the directory information at least comprises a directory tree structure of the first target directory;
the splitting unit is used for splitting the first target directory based on the number of concurrent transmissions of the server and the directory tree structure to obtain a first number of directory subsets, wherein each directory subset comprises at least one target subdirectory;
and the transmission unit is used for transmitting the first number of directory subsets to receiving equipment concurrently so that the receiving equipment can obtain the mass small files.
9. A computer device, comprising a memory and a processor, wherein the memory and the processor are communicatively connected, the memory stores computer instructions, and the processor executes the computer instructions to perform the method for transferring mass small files according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing the computer to execute the transmission method of mass small files according to any one of claims 1 to 7.
CN202210910755.XA 2022-07-29 2022-07-29 Method, device, equipment and medium for transmitting mass small files Pending CN115499426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210910755.XA CN115499426A (en) 2022-07-29 2022-07-29 Method, device, equipment and medium for transmitting mass small files

Publications (1)

Publication Number Publication Date
CN115499426A true CN115499426A (en) 2022-12-20

Family

ID=84466246

Country Status (1)

Country Link
CN (1) CN115499426A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794182A (en) * 2015-04-10 2015-07-22 中国科学院计算技术研究所 Small file asynchronous pre-reading device and method for parallel network file system
CN108958659A (en) * 2018-06-29 2018-12-07 郑州云海信息技术有限公司 A kind of small documents polymerization, device and the medium of distributed memory system
CN109240999A (en) * 2018-08-24 2019-01-18 浪潮电子信息产业股份有限公司 A kind of automation polymerization packaging method and system based on small documents
US10223377B1 (en) * 2015-03-23 2019-03-05 EMC IP Holding Company LLC Efficiently seeding small files with certain localities
CN110784528A (en) * 2019-10-22 2020-02-11 北京天融信网络安全技术有限公司 File downloading method and device and storage medium
CN112434000A (en) * 2020-11-20 2021-03-02 苏州浪潮智能科技有限公司 Small file merging method, device and equipment based on HDFS
CN113076298A (en) * 2021-04-15 2021-07-06 上海卓钢链科技有限公司 Distributed small file storage system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
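Yin Ying; Lin Qing; Lin Hanyang: "Method for efficiently storing small files in HDFS", Computer Engineering and Design, no. 02, 16 February 2015 (2015-02-16) *
Jin Hai, Guan Xiangshan, Wu Song, Xie Chao: "Design and implementation of file transfer optimization in distributed storage systems", Journal of Huazhong University of Science and Technology (Natural Science Edition), no. 01, 30 January 2005 (2005-01-30) *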

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination