WO2024103752A1 - File transmission method, apparatus and system, electronic device, and storage medium - Google Patents

File transmission method, apparatus and system, electronic device, and storage medium

Info

Publication number
WO2024103752A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
slice
metadata
slices
original
Prior art date
Application number
PCT/CN2023/103618
Other languages
French (fr)
Chinese (zh)
Inventor
官祥臻
王钰涵
赵天武
桂林
Original Assignee
工赋(青岛)科技有限公司
卡奥斯工业智能研究院(青岛)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 工赋(青岛)科技有限公司, 卡奥斯工业智能研究院(青岛)有限公司 filed Critical 工赋(青岛)科技有限公司
Publication of WO2024103752A1 publication Critical patent/WO2024103752A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/06Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to data transmission technology, for example, to a file transmission method, device, system, electronic device and storage medium.
  • Kafka is a distributed message publishing and subscription system with the advantages of high throughput, low latency, and high availability.
  • Kafka generally transmits structured data such as logs.
  • By default, a single message transmitted by Kafka may not exceed 1MB, which makes it impossible to transmit relatively large binary files (such as videos, pictures, compressed packages, etc.) directly through Kafka.
  • the present application provides a file transmission method, device, system, electronic device and storage medium, which can realize the transmission of large files by using a message publishing and subscription system, thereby improving the file transmission efficiency.
  • the present application provides a file transmission method, which is applied to a data consumption end, and the method comprises:
  • a file processing stream is created, and the plurality of file slices are merged according to the metadata of each file slice by using the file processing stream to obtain a target file.
  • the present application provides a file transmission method, which is applied to a data production end, and the method includes:
  • the multiple file slices are sent to the topic partition of the message publishing and subscription system, and the metadata of each file slice is stored in a distributed database, so that after the data consumption end obtains the metadata of each file slice from the distributed database and obtains the multiple file slices from the topic partition, it merges the multiple file slices according to the metadata of each file slice to obtain the target file.
  • the present application provides a file transmission device, which is applied to a data consumption end, and the device includes:
  • a first acquisition module is configured to acquire metadata of each file slice in a plurality of file slices from a distributed database, wherein the plurality of file slices and the metadata of each file slice are obtained by segmenting the original file at the data production end;
  • a second acquisition module is configured to acquire the multiple file slices from a topic partition of a message publishing and subscription system according to the metadata of each file slice;
  • the merging module is configured to create a file processing stream, and use the file processing stream to merge the multiple file slices according to the metadata of each file slice to obtain a target file.
  • the present application provides a file transmission device, which is applied to a data production end, and the device includes:
  • a segmentation module configured to obtain an original file and segment the original file to obtain a plurality of file slices and metadata of each file slice;
  • a sending module configured to send the plurality of file slices to a topic partition of a message publishing and subscription system;
  • the storage module is configured to store the metadata of each file slice into a distributed database, so that after the data consumption end obtains the metadata of each file slice from the distributed database and obtains the multiple file slices from the topic partition, the multiple file slices are merged according to the metadata of each file slice to obtain the target file.
  • the present application provides a file transfer system, comprising a data consumption end and a data production end for executing the file transfer method described in any embodiment of the present application.
  • the present application provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the file transfer method as described in any embodiment of the present application is implemented.
  • the present application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the file transfer method as described in any embodiment of the present application.
  • FIG1 is a schematic diagram of a flow chart of a file transmission method provided by the present application.
  • FIG2 is another schematic diagram of the process of the file transmission method provided by the present application.
  • FIG3 is another schematic diagram of the process of file transmission provided by the present application.
  • FIG4 is an exemplary flow chart of a file transmission method provided by the present application.
  • FIG5 is a schematic diagram of a structure of a file transmission device provided by the present application.
  • FIG6 is another schematic diagram of the structure of the file transmission device provided by the present application.
  • FIG. 7 is a schematic diagram of the structure of the electronic device provided by the present application.
  • FIG1 is a flowchart of a file transfer method provided by the present application, which can be performed by a file transfer device provided by the present application, and the device can be implemented in software and/or hardware.
  • the device can be integrated into a data consumption end, for example, it can be integrated into an electronic device at the data consumption end, and the electronic device can be a computer. The following embodiments will be described by taking the device integrated into an electronic device at the data consumption end as an example.
  • the files to be transmitted at the data production end may be large files such as videos, pictures, compressed packages, etc.
  • before these files are transmitted using the message publishing and subscription system, each file can be sliced according to the system's limit on the size of transmitted data, for example by cutting each file into slices smaller than 1MB. This yields multiple file slices for each file, and each file slice can have metadata, which is descriptive data about the file slice.
  • After each file is cut into multiple file slices, the multiple file slices can be uploaded to the message publishing and subscription system.
  • the message publishing and subscription system stores data in a classified manner, that is, each file slice has a category in the message publishing and subscription system, which is called a topic. Physically, data of different topics are stored separately.
  • Each topic includes one or more partitions, that is, after each file slice is uploaded to the message publishing and subscription system, it will be stored in the corresponding topic partition. After storage, the offset of the file slice in the topic partition will be obtained.
  • the topic partition stores file slices as a queue, and the offset indicates the position of the file slice relative to the head of the queue.
  • the serial number of each file slice can also be recorded, and the serial number of each file slice and the offset of each file slice in the topic partition of the message publishing and subscription system are used as the metadata of the corresponding file slice.
  • the metadata of the file slice can also include other information, such as the file name of the file, the information summary code of the file, etc., which are not limited here.
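  • As a minimal sketch of the slicing step described above (not the application's actual implementation), the following Python splits a byte string into slices under an assumed size limit and builds a metadata record for each slice. The name slice_file, the dictionary keys and the SLICE_SIZE margin are illustrative assumptions. The offset is left empty because it is only known after the slice has been stored in the topic partition.

```python
import hashlib

# Assumed margin: stay slightly under a 1 MB message limit.
SLICE_SIZE = 1024 * 1024 - 1024


def slice_file(data: bytes, file_name: str, slice_size: int = SLICE_SIZE):
    """Split data into slices and return a list of (chunk, metadata) pairs."""
    fmd5 = hashlib.md5(data).hexdigest()  # digest of the whole original file
    chunks = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    result = []
    for serial, chunk in enumerate(chunks, start=1):
        meta = {
            "serial": serial,                       # position in the original file
            "md5": hashlib.md5(chunk).hexdigest(),  # digest of this slice
            "fmd5": fmd5,
            "file_name": file_name,
            "offset": None,  # filled in once the slice is stored in the partition
        }
        result.append((chunk, meta))
    return result


parts = slice_file(b"abcdefghij", "demo.bin", slice_size=4)
```

  • Concatenating the chunks in serial order reproduces the input, which is the property the merging step on the consumption end relies on.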
  • the file slice and the metadata of the file slice can be processed separately, the file slice is pushed to the message publishing and subscription system, and the metadata of the file slice is uploaded to the distributed database.
  • the use of the distributed database to store metadata can facilitate the subsequent fast and efficient query of the metadata of the file slice.
  • the distributed database can be a document database based on distributed file storage.
  • the distributed database can be MongoDB.
  • MongoDB is a product between relational databases and non-relational databases. Among non-relational databases, it is the most feature-rich and the most similar to a relational database. The data structure it supports is very loose, so it can store relatively complex data types.
  • Step 101: obtain metadata of each file slice in multiple file slices from a distributed database, where the multiple file slices and the metadata of each file slice are obtained by the data production end by segmenting the original file.
  • the original file can be any one of the multiple files transmitted by the data production end, depending on the consumption needs of the data consumption end, that is, the original file is also a large file such as video, picture, compressed package, etc. Since the data production end stores the metadata of each file slice in the distributed database after splitting the file, the metadata of each file slice of the required file can be obtained from the distributed database first during actual consumption.
  • Step 102 According to the metadata of each file slice, a matching file slice is obtained from a topic partition of a message publishing and subscription system to obtain a plurality of file slices.
  • the metadata includes the serial number of the file slice and the offset of the file slice in the topic partition. The serial number of the file slice refers to the position of the file slice in the original binary file. For example, if a 1GB file is cut into 1024 file slices, the serial numbers of the file slices can be 1 to 1024. Each piece of data stored in the message publishing and subscription system is divided into two parts: an index and a log.
  • the index records the offset information of the data, and the log stores the data itself.
  • the consumer side searches for the offset of the file slice in the topic partition according to the serial number of each file slice, and obtains the matching file slice from the topic partition according to the offset of the file slice in the topic partition for consumption.
  • for example, when the data consumption end consumes for the first time, it starts from the file slice with an offset of 0 and consumes up to offset 8, and the offset is recorded as 8. On the next consumption, it can start either from the beginning or from the last recorded position.
  • the maximum value that the data consumer side can consume is the maximum offset value written by the data producer side. When the maximum value of consumption is reached, it indicates that multiple file slices have been obtained.
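  • The lookup from serial number to offset can be sketched as follows. This is an illustration only: a Python list stands in for the topic partition (its index plays the role of the offset), and the metadata record shape and the name fetch_slices are assumptions rather than anything defined in the application.

```python
def fetch_slices(metadata, partition_log):
    """Return (serial, payload) pairs in serial order.

    metadata: iterable of dicts with "serial" and "offset" keys (assumed shape).
    partition_log: sequence indexed by offset, standing in for a topic partition.
    """
    offset_by_serial = {m["serial"]: m["offset"] for m in metadata}
    return [(serial, partition_log[offset_by_serial[serial]])
            for serial in sorted(offset_by_serial)]


# Offsets need not follow serial order if other messages share the partition.
log = [b"other", b"BB", b"AA"]
meta = [{"serial": 2, "offset": 1}, {"serial": 1, "offset": 2}]
pairs = fetch_slices(meta, log)
```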
  • Step 103 Create a file processing stream, and use the file processing stream to merge multiple file slices according to the metadata of each file slice to obtain a target file.
  • the file processing stream may include a file reading stream and a file writing stream.
  • the file reading stream may be used to read data from multiple file slices in sequence according to the serial number, and the file writing stream may be used to write the read data into a specified file, thereby merging multiple file slices by encoding the file stream to obtain the target file.
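  • A minimal sketch of the merge with a read stream and a write stream is shown below. It assumes the slices are already held in memory, keyed by serial number; merge_slices and the 64 KB block size are illustrative choices, and any writable binary stream (for example a file opened in "wb" mode) could serve as the output.

```python
import io


def merge_slices(slices_by_serial: dict, out_stream) -> int:
    """Write slices to out_stream in serial order; return bytes written."""
    written = 0
    for serial in sorted(slices_by_serial):
        reader = io.BytesIO(slices_by_serial[serial])  # read stream over one slice
        while block := reader.read(64 * 1024):
            written += out_stream.write(block)
    return written


out = io.BytesIO()  # stands in for the designated target file
count = merge_slices({2: b"world", 1: b"hello "}, out)
```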
  • the data production end slices the original file to obtain multiple file slices, pushes the multiple file slices to the message publishing and subscription system, and saves the metadata corresponding to each file slice to the distributed database. The data consumption end obtains the metadata corresponding to each file slice from the distributed database and, according to the metadata of each file slice, obtains the multiple file slices matching the metadata from the topic partition of the message publishing and subscription system. A file processing stream is then created and used to merge the multiple file slices according to the metadata of each file slice to obtain the target file. That is, the present application can slice large files, transmit the file slices through the message publishing and subscription system, and store the metadata of the file slices in a distributed database. With the help of file segmentation technology and a distributed database, large files can be transmitted using the message publishing and subscription system, thereby improving the efficiency of file transmission.
  • FIG2 is another flowchart of the file transmission method provided by the present application, which illustrates the file transmission method provided by the present application.
  • the method can be integrated in an electronic device at a data consumption end, and the electronic device can be a computer.
  • the following embodiments will be described by taking the device integrated in an electronic device at a data consumption end as an example. As shown in FIG2 , the method can include the following steps:
  • Step 201 obtaining information summary codes of multiple files by querying a distributed database.
  • the message digest code refers to a 128-bit feature code obtained by digitally transforming the original information according to the public message digest algorithm. This feature code is irreversible and highly discrete, which can ensure the uniqueness of the file or file slice.
  • the message digest code can be an md5 code, and each file has a unique md5 code (i.e. file md5, abbreviated as fmd5).
  • Step 202 write the information summary codes of multiple files into a preset set to obtain an original summary code set.
  • An empty data set can be constructed, and the md5 codes of each file queried from the distributed database can be written into the empty data set to obtain an original summary code set, that is, the original summary code set includes multiple fmd5s.
  • Step 203 perform deduplication processing on the original summary code set to obtain a target summary code set.
  • Deduplication avoids saving duplicate data in the database, which would otherwise produce a large amount of redundant data.
  • After deduplication, a new data set is obtained, that is, the set of target fmd5 codes.
  • the number of fmd5 codes contained in the set is the number of files transmitted through Kafka. Due to the uniqueness of the target summary code, the data consumer can obtain multiple file slices of a file in the topic partition through the target summary code.
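  • Steps 201 to 203 can be sketched as follows, assuming each metadata record returned by the database carries the file digest under an "fmd5" key (an assumed field name). The sketch keeps first-seen order while deduplicating; a plain set would also work if order does not matter.

```python
def target_digest_set(metadata_records):
    """Collect every file digest, then drop duplicates."""
    original = [record["fmd5"] for record in metadata_records]  # original set
    return list(dict.fromkeys(original))  # deduplicated, first-seen order


# Two slices of one file plus one slice of another file.
records = [{"fmd5": "a1"}, {"fmd5": "a1"}, {"fmd5": "b2"}]
targets = target_digest_set(records)
```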
  • Step 204 Identify the information summary code of the original file from the target summary code set.
  • the correspondence information between the information digest code of the file and the file identifier can be pre-stored.
  • the file identifier can be the file name
  • the information digest code can be the md5 code (i.e., fmd5 code) of the file.
  • the information digest code of the original file is identified from the target summary code set according to the correspondence information.
  • the original file refers to the file currently needed by the consumer.
  • the information summary code of the original file can also be determined directly based on the corresponding relationship information between the information summary code of the pre-stored file and the file identifier.
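  • Identifying the digest of the required file from the pre-stored correspondence can be sketched as a dictionary lookup. The mapping from file name to fmd5 and the name find_original_digest are hypothetical; the application does not specify how the correspondence information is stored.

```python
def find_original_digest(file_name, name_to_fmd5, target_digests):
    """Return the fmd5 for file_name if it is among the transmitted files."""
    fmd5 = name_to_fmd5.get(file_name)
    return fmd5 if fmd5 in target_digests else None


names = {"video.mp4": "a1", "photo.png": "b2"}
found = find_original_digest("video.mp4", names, {"a1", "b2"})
missing = find_original_digest("notes.txt", names, {"a1", "b2"})
```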
  • Step 205 Obtain metadata of each file slice in the plurality of file slices from a distributed database according to the information summary code of the original file.
  • the metadata of all file slices of the original file are obtained from the distributed database at one time.
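  • The one-time retrieval in Step 205 amounts to filtering the stored records by the file digest. The sketch below uses an in-memory list in place of the distributed database, with assumed key names; with MongoDB this would typically be a single query filtered on the file digest field.

```python
def metadata_for_file(all_records, fmd5):
    """Return the metadata of every slice of one file, sorted by serial number."""
    matching = [r for r in all_records if r["fmd5"] == fmd5]
    return sorted(matching, key=lambda r: r["serial"])


stored = [
    {"fmd5": "a1", "serial": 2, "offset": 5},
    {"fmd5": "b2", "serial": 1, "offset": 4},
    {"fmd5": "a1", "serial": 1, "offset": 3},
]
wanted = metadata_for_file(stored, "a1")
```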
  • the metadata of each file slice may also include the following content:
  • Serial number: indicates the position of the file slice in the binary original file. For example, if the original file size is 1GB and the file is cut into 1024 file slices, the serial numbers of the 1024 file slices can be 1-1024.
  • Information summary code: includes the information summary code of the original file (fmd5 code) and the information summary code of each file slice (md5 code).
  • Offset: indicates the location of the file slice in the topic partition of the message publishing and subscription system.
  • End status: indicates whether the current slice is the last slice.
  • File name: the name and extension of the file.
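  • The metadata fields listed above could be modeled as a small record type. This is one possible representation, not the application's schema; the class name and field names are assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SliceMetadata:
    serial: int      # position of the slice in the original file, starting at 1
    fmd5: str        # information summary code of the original file
    md5: str         # information summary code of this slice
    offset: int      # location of the slice in the topic partition
    is_last: bool    # end status: True only for the final slice
    file_name: str   # name and extension of the original file


record = SliceMetadata(serial=1024, fmd5="f" * 32, md5="0" * 32,
                       offset=2047, is_last=True, file_name="video.mp4")
document = asdict(record)  # plain dict, convenient for a document database
```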
  • Step 206, find the offset of the file slice in the topic partition according to the serial number of the file slice.
  • different file slices can be distinguished by serial numbers, and in a topic partition, different file slices are distinguished by offsets. Therefore, after obtaining the metadata of a file slice, the offset of the file slice in the topic partition can be found according to the serial number of the file slice in the metadata.
  • Each message in a topic partition has its own unique offset, which is used to indicate the location information of the message in the partition.
  • Step 207, obtaining the file slice from the topic partition according to the offset of the file slice in the topic partition.
  • Step 208 Verify each file slice in the plurality of file slices according to the information summary code of the file slice included in the metadata.
  • the obtained file slice can first be verified according to the serial number, that is, it can be determined whether the serial number of the obtained file slice is the same as the serial number of the file slice to be obtained. If the serial numbers are the same, the serial number verification passes. Next, the file slice can be verified according to the information digest code: the information digest code of the file slice obtained from the topic partition is calculated and compared with the information digest code of the file slice in the metadata. If the two are the same, the verification passes; if they are different, the verification fails. This ensures that the file slice transmitted through the topic partition is correct and has not been tampered with.
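  • The two checks can be sketched as below. The sketch assumes the serial number of the received slice is available alongside its payload (for example carried with the message), which the application does not spell out; verify_slice and the key names are illustrative.

```python
import hashlib


def verify_slice(payload: bytes, received_serial: int, meta: dict) -> bool:
    """Check the serial number first, then compare MD5 digests."""
    if received_serial != meta["serial"]:
        return False
    return hashlib.md5(payload).hexdigest() == meta["md5"]


meta = {"serial": 3, "md5": hashlib.md5(b"slice-3").hexdigest()}
ok = verify_slice(b"slice-3", 3, meta)
wrong_serial = verify_slice(b"slice-3", 4, meta)
tampered = verify_slice(b"slice-X", 3, meta)
```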
  • Step 209 determine whether the verification is passed. If the verification is passed, execute step 210. If the verification is not passed, return to execute step 207.
  • Step 210 obtaining the next file slice, and executing steps 208 and 209, until multiple file slices are obtained.
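  • The loop formed by steps 207 to 210 can be sketched as follows. The retry limit is an added assumption (the steps above simply return to step 207 on failure), and read_at is a hypothetical callable that returns the payload stored at a given offset.

```python
import hashlib


def pull_verified(metadata, read_at, max_attempts=3):
    """Fetch each slice by offset, re-fetching until its digest matches."""
    slices = {}
    for meta in sorted(metadata, key=lambda m: m["serial"]):
        for _ in range(max_attempts):
            payload = read_at(meta["offset"])                    # step 207
            if hashlib.md5(payload).hexdigest() == meta["md5"]:  # steps 208-209
                slices[meta["serial"]] = payload
                break                                            # step 210
        else:
            raise IOError(f"slice {meta['serial']} failed verification")
    return slices


# A reader that returns a corrupted payload on its first call only.
calls = {"count": 0}


def flaky_read(offset):
    calls["count"] += 1
    return b"bad" if calls["count"] == 1 else b"good"


meta = [{"serial": 1, "offset": 0, "md5": hashlib.md5(b"good").hexdigest()}]
pulled = pull_verified(meta, flaky_read)
```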
  • Step 211 create a file processing stream, and use the file processing stream to write the data in multiple file slices into a designated file according to the serial number of each file slice to obtain a target file.
  • When merging multiple file slices, a file processing stream can be created.
  • the file processing stream may include a file reading stream and a file writing stream.
  • the file slices may be sorted according to the serial numbers so that they are the same as the order of the files before data transmission.
  • the file reading stream is used to read data from multiple file slices in sequence according to the sorting, and the file writing stream is used to write the read data into a specified file, thereby merging multiple file slices by encoding the file stream to obtain the target file.
  • Step 212 verify the target file according to the information summary code of the original file.
  • the information summary code of the target file can be calculated, and the information summary code of the original file can be compared with the information summary code of the target file. If the information summary codes are consistent, the target file verification passes and the file transfer process is completed.
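  • The final check can be sketched by hashing the merged file in blocks, so that a large target file does not need to be loaded into memory at once. The function names and block size are assumptions; the in-memory stream stands in for the target file.

```python
import hashlib
import io


def file_md5(stream, block_size=1 << 20) -> str:
    """Compute the MD5 digest of a binary stream block by block."""
    digest = hashlib.md5()
    while block := stream.read(block_size):
        digest.update(block)
    return digest.hexdigest()


def verify_target(stream, fmd5: str) -> bool:
    """Compare the digest of the merged file with the original file's digest."""
    return file_md5(stream) == fmd5


original_fmd5 = hashlib.md5(b"hello world").hexdigest()
passed = verify_target(io.BytesIO(b"hello world"), original_fmd5)
failed = verify_target(io.BytesIO(b"hello w0rld"), original_fmd5)
```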
  • the data consumer obtains the metadata of each file slice in multiple file slices from a distributed database.
  • the multiple file slices and the metadata of each file slice are obtained by the data producer through segmentation processing of the original file; multiple file slices are obtained from the topic partition of the message publishing and subscription system according to the metadata of each file slice; a file processing stream is created, and the file processing stream is used to merge the multiple file slices according to the metadata of each file slice to obtain the target file.
  • the present application can slice large files, transmit the file slices through the message publishing and subscription system, and store the metadata of the file slices in a distributed database. With the help of file segmentation technology and the distributed database, large files can be transferred using the message publishing and subscription system, which improves the efficiency of file transfer.
  • FIG3 is another flowchart of the file transmission method provided by the present application.
  • the method can be integrated in an electronic device at the data production end, and the electronic device can be, for example, a computer.
  • the following embodiments will be described by taking the file transmission device integrated in the electronic device at the data production end as an example. As shown in FIG3, the method can include the following steps:
  • Step 301 obtain an original file, and segment the original file to obtain multiple file slices and metadata of each file slice.
  • Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive logs. It is a tool that can collect data resources such as logs and events, and centralize and store these huge amounts of data from various data resources.
  • By pulling the binary file through Flume, the slice size can be customized to fit Kafka's message size limit and performance tuning. For example, the original file can be split into multiple file slices no larger than 1MB to meet Kafka's transmission conditions, and the metadata of each file slice is obtained.
  • Step 302 Send multiple file slices to the topic partition of the message publishing and subscription system, and store the metadata of each file slice in a distributed database.
  • the metadata includes the information summary code of the file slice, the offset of the file slice in the topic partition, the serial number of the file slice, the end status of the file slice, the full name of the original file and the information summary code of the original file.
  • the end status of the file slice is a flag that is only in the metadata of the last file slice; the full name of the original file and the information summary code of the original file are used to identify the file and ensure the uniqueness of the original file.
  • the file slice is sent to the topic partition of the message publishing and subscription system. Since the file slices are no larger than 1MB, they can be stably transmitted in the topic partition, and the metadata of each file slice is stored in the distributed database. For subsequent data processing, please refer to the previous embodiment, which will not be repeated here.
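  • The production-end flow of Step 302 can be sketched with in-memory stand-ins: an append-only list plays the topic partition and a plain list plays the distributed database. The class and function names are hypothetical. The sketch records the metadata after each slice is stored, because the offset is only known at that point.

```python
import hashlib


class TopicPartition:
    """In-memory stand-in for one topic partition: an append-only log."""

    def __init__(self):
        self._log = []

    def append(self, payload: bytes) -> int:
        self._log.append(payload)
        return len(self._log) - 1  # offset of the message just stored

    def read(self, offset: int) -> bytes:
        return self._log[offset]


def publish(slices, partition, metadata_store, file_name, fmd5):
    """Send each slice, then record its metadata with the returned offset."""
    for serial, chunk in enumerate(slices, start=1):
        offset = partition.append(chunk)
        metadata_store.append({
            "fmd5": fmd5, "file_name": file_name, "serial": serial,
            "md5": hashlib.md5(chunk).hexdigest(),
            "offset": offset, "is_last": serial == len(slices),
        })


partition = TopicPartition()
partition.append(b"unrelated message")  # offsets need not start at 0 for a file
store = []
publish([b"aa", b"bb"], partition, store, "demo.bin", "example-fmd5")
```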
  • the scheme of the present application is that the data production end obtains the original file and divides the original file into multiple file slices and metadata of each file slice; the multiple file slices are sent to the topic partition of the message publishing and subscription system, and the metadata of each file slice is stored in the distributed database, so that the data consumption end obtains the metadata of each file slice from the distributed database, obtains the multiple file slices from the topic partition, and then merges the multiple file slices according to the metadata of each file slice to obtain the target file.
  • the present application can slice large files, transmit the file slices through the message publishing and subscription system, store the metadata of the file slices in a distributed database, and realize the transmission of large files using the message publishing and subscription system with the help of file segmentation technology and distributed database, thereby improving the efficiency of file transmission.
  • Figure 4 is an exemplary flow chart of the file transfer method provided by the present application, and the flow is as follows: the data production end first slices the original file according to the size of each file slice not exceeding 1MB, and saves the metadata of each file slice in a distributed database, and then sends the file slice to the topic partition of the message publishing and subscription system for transmission.
  • the data consumer can query the distributed database to obtain the information digest code of each file, write the obtained information digest code of each file into a preset set and automatically deduplicate, and the number of information digest codes in the deduplicated set (i.e., the target digest code set) is equal to the number of files; according to the correspondence relationship information between the information digest code of the pre-stored file and the file identifier, the information digest code of the original file currently required to be downloaded is identified from the target digest code set, and the metadata of each file slice of the original file is obtained from the distributed database according to the information digest code of the original file, and the offset of the corresponding file slice is found according to the serial number of the file slice in the metadata, and the consumer consumes according to the queried offset, and pulls the file slice from the message publishing and subscription system.
  • the file processing stream is used to write the data in the pulled file slice into the preset file, and the next file slice is then pulled. After all the file slices of the original file are pulled and written into the preset file, the file processing stream is closed to obtain the target file. The information digest code of the original file can then be compared with the information digest code calculated for the target file to verify the target file. If the verification passes, the file transfer is completed.
  • FIG5 is a schematic diagram of a structure of a file transmission device provided by the present application, which is suitable for executing the file transmission method provided by the present application and is applied to a data consumer.
  • the device may include:
  • the first acquisition module 501 is configured to obtain metadata of each file slice in a plurality of file slices from a distributed database, wherein the plurality of file slices and the metadata of each file slice are obtained by segmenting the original file at the data production end;
  • the second acquisition module 502 is configured to obtain matching file slices from the topic partition of the message publishing and subscription system according to the metadata of each file slice, thereby obtaining the plurality of file slices;
  • the merging module 503 is configured to create a file processing stream, and use the file processing stream to merge the multiple file slices according to the metadata of each file slice to obtain a target file.
  • the device also includes a set acquisition module, which is configured to: obtain information summary codes of multiple files by querying the distributed database, write the information summary codes of the multiple files into a preset set to obtain an original summary code set; and deduplicate the original summary code set to obtain a target summary code set.
  • the first acquisition module 501 is configured to: identify the information summary code of the original file from the target summary code set; and acquire metadata of each file slice in multiple file slices from the distributed database according to the information summary code of the original file.
  • the metadata includes the serial number of the file slice and the offset of the file slice in the topic partition
  • the second acquisition module 502 is configured to: search for the offset of the file slice in the topic partition according to the serial number of the file slice; and obtain the file slice from the topic partition according to the offset of the file slice in the topic partition until the multiple file slices are obtained.
  • the metadata also includes an information summary code of the corresponding file slice
  • the device also includes: a file slice verification module, configured to verify each of the multiple file slices according to the information summary code of the corresponding file slice included in the metadata before the file processing stream is used to merge the multiple file slices according to the metadata of each file slice; and a trigger module, configured to trigger the merging module 503 to execute the step of merging the multiple file slices according to the metadata of each file slice using the file processing stream when all the multiple file slices have passed the verification.
  • the merging module 503 is configured to: write the data in the multiple file slices into a designated file according to the serial number of each file slice by using the file processing stream to obtain the target file.
  • the device further includes: a file verification module, configured to verify the target file according to the information summary code of the original file.
  • the device of the present application obtains the metadata of each file slice in multiple file slices from a distributed database.
  • the multiple file slices and the metadata of each file slice are obtained by slicing the original file at the data production end; according to the metadata of each file slice, multiple matching file slices are obtained from the topic partition of the message publishing and subscription system; a file processing stream is created, and the file processing stream is used to merge the multiple file slices according to the metadata of each file slice to obtain the target file. That is, the present application can slice a large file, transmit the file slices through the message publishing and subscription system, and store the metadata of the file slices in the distributed database. With the help of file slicing technology and the distributed database, large files can be transmitted using the message publishing and subscription system, which improves the efficiency of file transmission.
  • FIG6 is another structural schematic diagram of the file transmission device provided by the present application, which is suitable for executing the file transmission method provided by the present application and is applied to the data production end.
  • the device may include:
  • the segmentation module 601 is configured to obtain an original file and segment the original file to obtain multiple file slices and metadata of each file slice;
  • the sending module 602 is configured to send the multiple file slices to the topic partition of the message publishing and subscription system;
  • the storage module 603 is configured to store the metadata of each file slice in a distributed database, so that after the data consumer obtains the metadata of each file slice from the distributed database and obtains the multiple file slices from the topic partition, the multiple file slices are merged according to the metadata of each file slice to obtain the target file.
  • the device of the present application obtains the original file, and divides the original file into multiple file slices and metadata of each file slice; sends the multiple file slices to the topic partition of the message publishing and subscription system, and stores the metadata of each file slice in a distributed database, so that after the data consumer obtains the metadata of each file slice from the distributed database and obtains the multiple file slices from the topic partition, the multiple file slices are merged according to the metadata of each file slice to obtain the target file. That is, the present application can slice large files, transmit the file slices through the message publishing and subscription system, and store the metadata of the file slices in the distributed database. With the help of file segmentation technology and distributed database, it realizes the transmission of large files using the message publishing and subscription system, and improves the efficiency of file transmission.
  • the present application also provides a file transfer system, including a data consumption end and a data production end for executing the file transfer method described in any embodiment of the present application.
  • the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the file transfer method provided in any of the above embodiments when executing the program.
  • the present application also provides a computer-readable medium having a computer program stored thereon, and when the program is executed by a processor, the file transmission method provided in any of the above embodiments is implemented.
  • referring to FIG. 7, a schematic diagram of a computer system 700 suitable for implementing the electronic device of the present application is shown.
  • the electronic device shown in Figure 7 is only an example and should not bring any limitation to the function and scope of use of the present application.
  • the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to the program stored in the read-only memory (ROM) 702 or the program loaded from the storage part 708 to the random access memory (RAM) 703.
  • in the RAM 703, various programs and data required for the operation of the system 700 are also stored.
  • the CPU 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
  • the input/output (I/O) interface 705 is also connected to the bus 704.
  • the following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, etc.; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 708 including a hard disk, etc.; and a communication section 709 including a network interface card such as a local area network (LAN) card, a modem, etc. The communication section 709 performs communication processing via a network such as the Internet.
  • a drive 710 is also connected to the I/O interface 705 as needed.
  • a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 710 as needed so that a computer program read therefrom is installed into the storage section 708 as needed.
  • the process described above with reference to the flowchart can be implemented as a computer software program.
  • the embodiments disclosed in the present application include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program includes a program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication part 709, and/or installed from the removable medium 711.
  • when the computer program is executed by the central processing unit (CPU) 701, the above-mentioned functions defined in the system of the present application are executed.
  • the computer-readable medium described in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • the computer-readable storage medium may include, but is not limited to, an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory, or a computer-readable medium.
  • a computer-readable storage medium may be any tangible medium containing or storing a program, which can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries a computer-readable program code. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.
  • each box in the flow chart or block diagram can represent a module, a program segment or a part of a code, and the above-mentioned module, program segment or a part of a code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the box can also occur in a different order from the order marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each box in the block diagram or flow chart, and the combination of the boxes in the block diagram or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the modules and/or units described in this application may be implemented in software or hardware.
  • the modules and/or units described may also be provided in a processor, for example, may be described as: a processor applied to a data consumption end, including a first acquisition module, a second acquisition module, and a merging module.
  • or, for example, may be described as: a processor applied to a data production end, including a segmentation module, a sending module, and a storage module.
  • the names of these modules do not constitute limitations on the modules themselves in some cases.
  • the present application further provides a computer-readable medium, which may be included in the device described in the above embodiment; or may exist independently without being assembled into the device.
  • the computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the device is caused to perform the following:
  • the metadata of each file slice in the multiple file slices is obtained from the distributed database.
  • the multiple file slices and the metadata of each file slice are obtained by slicing the original file at the data production end; according to the metadata of each file slice, the matching file slice is obtained from the topic partition of the message publishing and subscription system to obtain multiple file slices; a file processing stream is created, and the file processing stream is used to merge the multiple file slices according to the metadata of each file slice to obtain the target file.
  • alternatively, the computer-readable medium carries one or more programs, and when the one or more programs are executed by a device, the device is caused to perform the following:
  • An original file is obtained, and the original file is sliced to obtain multiple file slices and metadata of each file slice; the multiple file slices are sent to a topic partition of a message publishing and subscription system, and the metadata of each file slice is stored in a distributed database, so that after a data consumer obtains the metadata of each file slice from the distributed database and obtains the multiple file slices from the topic partition, the multiple file slices are merged according to the metadata of each file slice to obtain a target file.
  • the metadata of each file slice in the multiple file slices can be obtained from the distributed database, and the multiple file slices and the metadata of each file slice are obtained by slicing the original file at the data production end; according to the metadata of each file slice, the matching multiple file slices are obtained from the topic partition of the message publishing and subscription system; a file processing stream is created, and the file processing stream is used to merge the multiple file slices according to the metadata of each file slice to obtain the target file. That is, the present application can slice large files, transmit the file slices through the message publishing and subscription system, and store the metadata of the file slices in the distributed database. With the help of file slicing technology and distributed database, the transmission of large files using the message publishing and subscription system is realized, thereby improving the efficiency of file transmission.
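The slice verification, merging, and target-file verification performed by the consumer-side modules described above can be illustrated with a minimal Python sketch. It assumes MD5 as the information summary code and represents slices and metadata as plain in-memory values; the function names and the metadata keys `serial` and `slice_md5` are hypothetical and do not come from the application.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 information summary code of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def verify_slices(slices: dict, metadata: list) -> bool:
    """Check every slice against the slice-level code stored in its metadata."""
    return all(md5_hex(slices[m["serial"]]) == m["slice_md5"] for m in metadata)


def merge_if_verified(slices: dict, metadata: list, file_md5: str) -> bytes:
    """Merge only when all slices pass, then verify the merged target file."""
    if not verify_slices(slices, metadata):
        raise ValueError("at least one file slice failed verification")
    ordered = sorted(metadata, key=lambda m: m["serial"])
    target = b"".join(slices[m["serial"]] for m in ordered)
    if md5_hex(target) != file_md5:
        raise ValueError("target file failed verification")
    return target


# Usage with hypothetical data: a 10-byte "file" cut into three slices.
original = b"0123456789"
parts = {1: b"0123", 2: b"4567", 3: b"89"}
meta = [{"serial": s, "slice_md5": md5_hex(d)} for s, d in parts.items()]
assert merge_if_verified(parts, meta, md5_hex(original)) == original
```

Verifying each slice before merging means a corrupted slice is detected before any data is written, and the final check against the code of the original file catches missing or misordered slices.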

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A file transmission method, apparatus and system, an electronic device, and a storage medium. The method is applied to a data consumer, and comprises: acquiring metadata of each of a plurality of file slices from a distributed database, the plurality of file slices and the metadata of each file slice being obtained by splitting an original file by a data producer (101); according to the metadata of each file slice, acquiring matched file slices from a topic partition of a message publishing and subscribing system to obtain a plurality of file slices (102); and creating a file processing stream, and, according to the metadata of each file slice, merging the plurality of file slices by using the file processing stream to obtain a target file (103).

Description

File transmission method, device, system, electronic device and storage medium
This application claims priority to Chinese patent application No. 202211434252.6, filed with the China Patent Office on November 16, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to data transmission technology, for example, to a file transmission method, device, system, electronic device and storage medium.
Background
Kafka is a distributed message publishing and subscription system with the advantages of high throughput, low latency, and high availability. Kafka generally transmits structured data such as logs. By default, each message transmitted by Kafka does not exceed 1MB, so relatively large binary files (such as videos, pictures, and compressed packages) cannot be transmitted through Kafka.
Summary of the Invention
The present application provides a file transmission method, device, system, electronic device and storage medium, which make it possible to transmit large files using a message publishing and subscription system and improve file transmission efficiency.
In a first aspect, the present application provides a file transmission method applied to a data consumption end, the method comprising:
obtaining metadata of each file slice in a plurality of file slices from a distributed database, wherein the plurality of file slices and the metadata of each file slice are obtained by a data production end slicing an original file;
obtaining the plurality of file slices from a topic partition of a message publishing and subscription system according to the metadata of each file slice; and
creating a file processing stream, and using the file processing stream to merge the plurality of file slices according to the metadata of each file slice to obtain a target file.
In a second aspect, the present application provides a file transmission method applied to a data production end, the method comprising:
obtaining an original file, and slicing the original file to obtain a plurality of file slices and metadata of each file slice; and
sending the plurality of file slices to a topic partition of a message publishing and subscription system, and storing the metadata of each file slice in a distributed database, so that a data consumption end, after obtaining the metadata of each file slice from the distributed database and obtaining the plurality of file slices from the topic partition, merges the plurality of file slices according to the metadata of each file slice to obtain a target file.
In a third aspect, the present application provides a file transmission device applied to a data consumption end, the device comprising:
a first acquisition module, configured to obtain metadata of each file slice in a plurality of file slices from a distributed database, wherein the plurality of file slices and the metadata of each file slice are obtained by a data production end slicing an original file;
a second acquisition module, configured to obtain the plurality of file slices from a topic partition of a message publishing and subscription system according to the metadata of each file slice; and
a merging module, configured to create a file processing stream and use the file processing stream to merge the plurality of file slices according to the metadata of each file slice to obtain a target file.
In a fourth aspect, the present application provides a file transmission device applied to a data production end, the device comprising:
a segmentation module, configured to obtain an original file and slice the original file to obtain a plurality of file slices and metadata of each file slice;
a sending module, configured to send the plurality of file slices to a topic partition of a message publishing and subscription system; and
a storage module, configured to store the metadata of each file slice in a distributed database, so that a data consumption end, after obtaining the metadata of each file slice from the distributed database and obtaining the plurality of file slices from the topic partition, merges the plurality of file slices according to the metadata of each file slice to obtain a target file.
In a fifth aspect, the present application provides a file transmission system, comprising a data consumption end and a data production end for executing the file transmission method described in any embodiment of the present application.
In a sixth aspect, the present application provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the file transmission method described in any embodiment of the present application.
In a seventh aspect, the present application provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the file transmission method described in any embodiment of the present application.
Brief Description of the Drawings
In order to illustrate the technical solution of the present application more clearly, the drawings used in the embodiments are briefly introduced below.
FIG. 1 is a schematic flowchart of a file transmission method provided by the present application;
FIG. 2 is another schematic flowchart of the file transmission method provided by the present application;
FIG. 3 is another schematic flowchart of the file transmission method provided by the present application;
FIG. 4 is an exemplary flowchart of the file transmission method provided by the present application;
FIG. 5 is a schematic structural diagram of a file transmission device provided by the present application;
FIG. 6 is another schematic structural diagram of the file transmission device provided by the present application;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
The terms "first", "second", etc. in the specification, claims, and drawings of the present application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
FIG. 1 is a schematic flowchart of a file transmission method provided by the present application. The method can be performed by the file transmission device provided by the present application, and the device can be implemented in software and/or hardware. In an exemplary embodiment, the device can be integrated in the data consumption end, for example in an electronic device at the data consumption end, and the electronic device can be a computer. The following embodiments are described by taking the device integrated in an electronic device at the data consumption end as an example.
Before the processing on the data consumption end is introduced, the processing on the data production end is introduced as follows:
In this embodiment, the files to be transmitted at the data production end may be large files such as videos, pictures, and compressed packages, and there may be multiple files to be transmitted. These files are usually larger than 1MB, while each message transmitted by the message publishing and subscription system Kafka generally does not exceed 1MB, so these files cannot be transmitted to the demand side by using the message publishing and subscription system directly. To address this, in this embodiment, before the files are transmitted using the message publishing and subscription system, each file can be sliced according to the system's limit on the size of transmitted data, for example into slices smaller than 1MB, thereby obtaining multiple file slices for each file. Each file slice can have metadata, which is descriptive data of the file slice.
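The slicing step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the slice size of 900 KiB is an assumed value chosen to stay below the 1MB limit, and the function name `slice_stream` is hypothetical rather than taken from the application.

```python
import io

MAX_SLICE_BYTES = 900 * 1024  # assumed slice size, kept below the 1MB limit


def slice_stream(stream, slice_bytes=MAX_SLICE_BYTES):
    """Yield (serial_number, data) pairs; serial numbers start at 1."""
    serial = 1
    while True:
        data = stream.read(slice_bytes)
        if not data:  # end of the file reached
            break
        yield serial, data
        serial += 1


# Usage: a 2.5 MiB in-memory "file" is cut into slices below 1MB each.
payload = bytes(5 * 512 * 1024)
slices = list(slice_stream(io.BytesIO(payload)))
```

Reading from a binary stream rather than loading the whole file keeps memory use bounded by one slice, which matters for the large files the embodiment targets.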
After each file is cut into multiple file slices, the file slices can be uploaded to the message publishing and subscription system. The message publishing and subscription system stores data by category: each file slice has a category in the system, called a topic. Data of different topics are physically stored separately, and each topic includes one or more partitions. After a file slice is uploaded to the message publishing and subscription system, it is stored in the corresponding topic partition, and the offset of the file slice in the topic partition is obtained. A topic partition stores file slices as a queue, and the offset indicates the position of the file slice relative to the head of the queue. In addition, when a file is sliced, the serial number of each file slice can be recorded, and the serial number of each file slice together with its offset in the topic partition is used as the metadata of that file slice. The metadata of a file slice can also include other information, such as the file name and the information summary code of the file, which is not limited here.
Because the message publishing and subscription system is generally not used for long-term data storage, has limited storage space, and has no concept of metadata storage, in this embodiment the file slices and their metadata can be handled separately: the file slices are pushed to the message publishing and subscription system, and the metadata of the file slices is uploaded to a distributed database. Storing the metadata in a distributed database makes subsequent queries of the metadata fast and efficient. The distributed database can be a document database based on distributed file storage. For example, the distributed database can be MongoDB. MongoDB is a product between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most similar to a relational database. The data structures it supports are very loose, so it can store relatively complex data types.
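The producer-side flow described above (send each slice to a topic partition, record the returned offset, and store the metadata separately) can be sketched as follows. To stay self-contained, the sketch uses in-memory stand-ins: `InMemoryPartition` is a hypothetical append-only log, not the Kafka client API, and `metadata_store` is a plain list standing in for a collection in the distributed database. The metadata field names are likewise assumptions.

```python
import hashlib


class InMemoryPartition:
    """Hypothetical stand-in for one topic partition: an append-only log."""

    def __init__(self):
        self._log = []

    def append(self, value: bytes) -> int:
        """Store a message and return its offset (position from the log head)."""
        self._log.append(value)
        return len(self._log) - 1

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        return self._log[offset]


def produce(file_name, data, partition, metadata_store, slice_bytes):
    """Send each slice to the partition and record its metadata separately."""
    fmd5 = hashlib.md5(data).hexdigest()  # information summary code of the file
    for serial, start in enumerate(range(0, len(data), slice_bytes), start=1):
        chunk = data[start:start + slice_bytes]
        offset = partition.append(chunk)
        metadata_store.append({
            "file_name": file_name,
            "fmd5": fmd5,
            "serial": serial,
            "offset": offset,
            "slice_md5": hashlib.md5(chunk).hexdigest(),
        })
    return fmd5


partition = InMemoryPartition()
metadata_store = []  # stand-in for a document collection in the distributed database
produce("demo.bin", b"0123456789", partition, metadata_store, slice_bytes=4)
```

Each metadata record carries both the serial number and the offset, so the consumer can later map a position in the original file to a position in the partition.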
The processing on the data consumption end is described below. With continued reference to FIG. 1, it may include the following steps:
Step 101: obtain metadata of each file slice in multiple file slices from a distributed database, where the multiple file slices and the metadata of each file slice are obtained by the data production end slicing the original file.
For example, the original file can be any one of the multiple files transmitted by the data production end, depending on the consumption needs of the data consumption end; that is, the original file is also a large file such as a video, picture, or compressed package. Because the data production end stores the metadata of each file slice in the distributed database after slicing a file, at consumption time the metadata of each file slice of the required file can first be obtained from the distributed database.
Step 102: according to the metadata of each file slice, obtain the matching file slice from a topic partition of the message publishing and subscription system, so as to obtain the multiple file slices.
The metadata includes the serial number of the file slice and the offset of the file slice in the topic partition. The serial number of a file slice refers to the position of the file slice in the original binary file. For example, if a 1GB file is cut into 1024 file slices, the serial numbers of the file slices can be 1 to 1024. The offset corresponding to each piece of data in the message publishing and subscription system is divided into two parts, an index and a log: the index records the offset information of the data, and the log stores the data itself. The data consumption end looks up the offset of a file slice in the topic partition according to the serial number of the file slice, and obtains the matching file slice from the topic partition according to that offset for consumption. For example, when the data consumption end consumes for the first time, it starts from the file slice with offset 0 and consumes up to offset 8, and the offset is recorded as 8; at the next consumption, it can start again from the beginning or continue from the last position. The maximum offset that the data consumption end can consume is the maximum offset written by the data production end; reaching that maximum indicates that the multiple file slices have all been obtained.
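The lookup in step 102 (serial number to offset, then offset to slice) can be sketched as follows. The partition is modelled as a plain Python list whose index plays the role of the offset; this is an assumption for illustration only, and the function name and metadata keys are hypothetical.

```python
def fetch_slices(partition_log, metadata):
    """Return {serial: slice_bytes} by reading each slice at its recorded offset."""
    offset_by_serial = {m["serial"]: m["offset"] for m in metadata}
    return {serial: partition_log[offset]
            for serial, offset in sorted(offset_by_serial.items())}


# Hypothetical partition holding slices of two files, interleaved.
partition_log = [b"AAA", b"xx", b"BBB", b"yy", b"C"]
metadata_for_file = [
    {"serial": 1, "offset": 0},
    {"serial": 2, "offset": 2},
    {"serial": 3, "offset": 4},
]
fetched = fetch_slices(partition_log, metadata_for_file)
```

Because the offsets come from the metadata, slices belonging to other files in the same partition are simply never read.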
Step 103: create a file processing stream, and use the file processing stream to merge the multiple file slices according to the metadata of each file slice to obtain the target file.
The file processing stream may include a file reading stream and a file writing stream. The file reading stream can be used to read data from the multiple file slices in order of serial number, and the file writing stream can be used to write the read data into a designated file, so that the multiple file slices are merged by way of encoded file streams to obtain the target file.
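A minimal sketch of the stream-based merge in step 103 follows. It assumes the slices are already available as bytes keyed by serial number; an in-memory `io.BytesIO` plays the role of the designated file here, whereas a real consumer would more likely pass a file opened with `open(path, "wb")`. The function name is hypothetical.

```python
import io
import shutil


def merge_slices(slices, writer):
    """Copy slice data into writer in ascending serial-number order."""
    for serial in sorted(slices):
        with io.BytesIO(slices[serial]) as reader:  # read stream over one slice
            shutil.copyfileobj(reader, writer)      # write stream receives the data


# Usage: slices arrive out of order but are written in serial-number order.
slices = {3: b"!", 1: b"hello ", 2: b"world"}
target = io.BytesIO()
merge_slices(slices, target)
```

Sorting by serial number before writing is what restores the original byte order regardless of the order in which slices were fetched.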
In the solution of the present application, the data production end slices the original file to obtain multiple file slices, pushes the file slices to the message publishing and subscription system, and saves the metadata corresponding to each file slice in the distributed database. The data consumption end obtains the metadata corresponding to each file slice from the distributed database; obtains, according to the metadata of each file slice, the multiple file slices matching the metadata from the topic partition of the message publishing and subscription system; creates a file processing stream; and uses the file processing stream to merge the multiple file slices according to the metadata of each file slice to obtain the target file. That is, the present application can slice large files, transmit the file slices through the message publishing and subscription system, and store the metadata of the file slices in a distributed database. With the help of file slicing technology and the distributed database, large files can be transmitted using the message publishing and subscription system, thereby improving the efficiency of file transmission.
FIG. 2 is another schematic flowchart of the file transmission method provided by the present application. The method can be integrated in an electronic device at the data consumption end, and the electronic device can be a computer. The following embodiment is described by taking the device integrated in an electronic device at the data consumption end as an example. As shown in FIG. 2, the method can include the following steps:
Step 201: obtain the information summary codes of multiple files by querying the distributed database.
The information summary code (message digest) is a 128-bit feature code obtained by digitally transforming the original information according to a public message digest algorithm. The feature code is irreversible and highly discrete, so it can be used to identify a file or file slice uniquely. For example, the information summary code can be an md5 code; each file has a unique md5 code (file md5, abbreviated fmd5).
步骤202,将多个文件的信息摘要码写入预设集合,得到原始摘要码集合。Step 202, write the information summary codes of multiple files into a preset set to obtain an original summary code set.
可以构建一个空数据集,将从分布式数据库中查询到的各个文件的md5码写入该空数据集,得到原始摘要码集合,即原始摘要码集合中包括多个fmd5。An empty data set can be constructed, and the md5 codes of each file queried from the distributed database can be written into the empty data set to obtain an original summary code set, that is, the original summary code set includes multiple fmd5s.
步骤203,对原始摘要码集合进行去重处理,得到目标摘要码集合。Step 203: perform deduplication processing on the original summary code set to obtain a target summary code set.
去重处理可以避免将重复性的数据保存到数据库中以造成大量的冗余性数据,去重处理后可以得到新的数据集合,即目标fmd5码的集合,集合所包含的fmd5码的数量就是通过kafka传输的文件数量。目标摘要码因其唯一性而使得数据消费端可以通过目标摘要码获取主题分区中某个文件的多个文件切片。Deduplication can avoid saving duplicate data in the database to cause a large amount of redundant data. After deduplication, a new data set can be obtained, that is, the set of target fmd5 codes. The number of fmd5 codes contained in the set is the number of files transmitted through Kafka. Due to the uniqueness of the target summary code, the data consumer can obtain multiple file slices of a file in the topic partition through the target summary code.
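Steps 201 to 203 can be sketched as follows, assuming the database query returns one metadata row per file slice so that the same fmd5 appears once per slice. The row layout and the name `target_digest_set` are hypothetical.

```python
def target_digest_set(rows):
    """Collect the fmd5 of every metadata row and drop duplicates."""
    original = [row["fmd5"] for row in rows]  # original digest code set
    return set(original)                      # target digest code set


# Assumed query result: two slices of one file, one slice of another.
rows = [
    {"fmd5": "aaa", "seq": 1},
    {"fmd5": "aaa", "seq": 2},
    {"fmd5": "bbb", "seq": 1},
]
targets = target_digest_set(rows)
file_count = len(targets)  # number of distinct files
```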
Step 204: Identify the message digest code of the original file from the target digest code set.
For example, correspondence information between file message digest codes and file identifiers may be stored in advance. The file identifier may be the file name, and the message digest code may be the MD5 code of the file (the fmd5 code). The message digest code of the original file is identified from the target digest code set according to this correspondence information, where the original file is the file that the consumer currently needs. Alternatively, the message digest code of the original file may be determined directly from the pre-stored correspondence information.
Step 205: Obtain the metadata of each of the multiple file slices from the distributed database according to the message digest code of the original file.
That is, the metadata of all file slices of the original file is obtained from the distributed database in a single query, according to the message digest code of the original file. For example, the metadata of each file slice may include the following:
Serial number: the position of the file slice in the binary original file. For example, if an original file of 1 GB is cut into 1024 file slices, the serial numbers of these slices may be 1 to 1024.
Message digest code: the message digest code of the original file (fmd5 code) and the message digest code of the file slice (md5 code).
Offset: the position of the file slice in the topic partition of the message publishing and subscription system.
End status: whether the current slice is the last slice.
File name: the name of the file, including its extension.
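The metadata fields listed above could be modeled as a small record type. The class name and field names below are assumptions chosen for illustration; the source only names the fields in prose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceMetadata:
    seq: int         # serial number: position of the slice in the original file
    fmd5: str        # message digest code of the original file
    md5: str         # message digest code of this slice
    offset: int      # position of the slice in the topic partition
    is_last: bool    # end status: True only for the last slice
    file_name: str   # file name including extension


# Placeholder digest strings; real values would be 32-character MD5 hex digests.
meta = SliceMetadata(seq=1024, fmd5="f" * 32, md5="0" * 32,
                     offset=1023, is_last=True, file_name="data.bin")
```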
Step 206: Look up the offset of the file slice in the topic partition according to the serial number of the file slice.
In the distributed database, file slices are distinguished by serial number; in the topic partition, they are distinguished by offset. Therefore, once the metadata of a file slice has been obtained, the offset of that slice in the topic partition can be looked up from the serial number in the metadata. Each message in a topic partition has its own unique offset, which indicates the position of the message in the partition.
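A minimal sketch of this lookup, assuming each metadata row is a dictionary with hypothetical `seq` and `offset` keys:

```python
def offsets_by_seq(metadata_rows):
    """Map the serial number of each slice to its offset in the topic partition."""
    return {row["seq"]: row["offset"] for row in metadata_rows}


# Offsets need not be contiguous: other messages may share the partition.
rows = [{"seq": 1, "offset": 40}, {"seq": 2, "offset": 41}, {"seq": 3, "offset": 57}]
lookup = offsets_by_seq(rows)
third_offset = lookup[3]
```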
Step 207: Obtain the file slice from the topic partition according to the offset of the file slice in the topic partition.
Step 208: Verify each of the multiple file slices according to the message digest code of the file slice included in the metadata.
The obtained file slice may first be verified by serial number: the serial number of the obtained slice is compared with the serial number of the slice that was requested, and the serial-number check passes if they are the same. The slice may then be verified by message digest code: the message digest code of the slice obtained from the topic partition is calculated and compared with the message digest code of the slice recorded in the metadata. If the two are the same, verification passes; if they differ, verification fails. This ensures that the file slice transmitted through the topic partition is correct and has not been tampered with.
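The two checks can be sketched with Python's standard `hashlib` module. The function name `verify_slice` and the metadata dictionary keys are assumptions made for this illustration.

```python
import hashlib


def verify_slice(payload, got_seq, meta):
    """Check the serial number first, then the MD5 of the payload."""
    if got_seq != meta["seq"]:
        return False  # not the slice that was requested
    return hashlib.md5(payload).hexdigest() == meta["md5"]


payload = b"slice-bytes"
meta = {"seq": 7, "md5": hashlib.md5(payload).hexdigest()}

ok = verify_slice(payload, 7, meta)
tampered = verify_slice(b"slice-bytez", 7, meta)  # one byte changed
wrong_seq = verify_slice(payload, 8, meta)        # wrong serial number
```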
Step 209: Determine whether verification passes. If it passes, perform step 210; if it fails, return to step 207.
That is, if verification passes, the next file slice is obtained; if it fails, the corresponding file slice is pulled again and re-verified, until all file slices of the original file have been obtained.
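The pull-and-verify loop of steps 207 to 209 might look like the sketch below. The callable `fetch` (offset to bytes) is a hypothetical stand-in for reading the topic partition, and the attempt limit is an added assumption; the text itself simply pulls the slice again until verification passes.

```python
import hashlib


def fetch_verified(fetch, meta, max_attempts=3):
    """Pull the slice at meta['offset'] until its MD5 matches meta['md5']."""
    for _ in range(max_attempts):
        payload = fetch(meta["offset"])
        if hashlib.md5(payload).hexdigest() == meta["md5"]:
            return payload
    raise IOError(f"slice {meta['seq']} failed verification")


# Simulated partition whose first read returns corrupted bytes.
good = b"chunk"
replies = [b"chunX", good]
calls = []


def flaky_fetch(offset):
    calls.append(offset)
    return replies[len(calls) - 1]


meta = {"seq": 1, "offset": 5, "md5": hashlib.md5(good).hexdigest()}
result = fetch_verified(flaky_fetch, meta)
```

Bounding the retries keeps a permanently corrupted slice from blocking the consumer forever, which is a design choice of this sketch rather than something the source specifies.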
Step 210: Obtain the next file slice and perform steps 208 and 209, until the multiple file slices have been obtained.
Step 211: Create a file processing stream and use it to write the data in the multiple file slices into a specified file according to the serial number of each file slice, obtaining the target file.
When all file slices have passed verification, data transmission is complete and no errors or omissions occurred during transmission. When the end status is detected in the metadata, all file slices have been obtained. The step of merging the file slices according to their metadata by means of the file processing stream is then triggered.
To merge the file slices, a file processing stream may be created, which may include a file read stream and a file write stream. The file slices are sorted by serial number so that their order matches the order of the file before transmission. The read stream reads data from the file slices in that sorted order, and the write stream writes the data that was read into a specified file. Merging the file slices through file streams in this way yields the target file.
Step 212: Verify the target file according to the message digest code of the original file.
The message digest code of the target file may be calculated and compared with the message digest code of the original file. If the two are identical, the target file passes verification and the file transmission process is complete.
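The final check can be sketched as below. The helper names `file_md5` and `verify_target` are hypothetical, and the chunked read is an assumption made so that a large target file does not have to fit in memory.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_target(path, fmd5):
    """Compare the digest of the merged file with the original file's fmd5."""
    return file_md5(path) == fmd5


# A temporary file stands in for the merged target file.
data = b"abc" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
passed = verify_target(tmp.name, hashlib.md5(data).hexdigest())
os.remove(tmp.name)
```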
In the solution of this application, the data consumer obtains the metadata of each of the multiple file slices from the distributed database, where the file slices and their metadata are produced by the data producer slicing the original file. The data consumer obtains the file slices from the topic partition of the message publishing and subscription system according to the metadata of each file slice, creates a file processing stream, and uses it to merge the file slices according to their metadata to obtain the target file. In other words, this application slices a large file, transmits the file slices through the message publishing and subscription system, and stores the metadata of the file slices in a distributed database. File slicing together with the distributed database makes it possible to transmit large files over a message publishing and subscription system, which improves file transmission efficiency.
FIG. 3 is another schematic flowchart of the file transmission method provided by this application. The method may be integrated in an electronic device at the data producer, and the electronic device may be, for example, a computer. The following embodiment takes integration of the file transmission apparatus in an electronic device at the data producer as an example. As shown in FIG. 3, the method may include the following steps:
Step 301: Obtain the original file and slice it to obtain multiple file slices and the metadata of each file slice.
At the data producer, the log collection system Flume may be used to pull the binary original file. Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transmitting massive volumes of logs; it gathers data such as logs and events from many data sources and stores it centrally. When the binary file is pulled through Flume, the slice size can be customized to fit Kafka's message size and performance tuning. For example, the original file is split into file slices no larger than 1 MB to meet Kafka's transmission conditions, and the metadata of each file slice is obtained at the same time.
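The slicing step can be sketched in plain Python, independent of Flume. The function name `slice_stream`, the metadata keys, and reading the whole stream into memory are simplifying assumptions; only the 1 MB upper bound comes from the text.

```python
import hashlib
import io

MAX_SLICE = 1024 * 1024  # 1 MB upper bound on slice size, as in the text


def slice_stream(stream, file_name, slice_size=MAX_SLICE):
    """Split a binary stream into slices and build metadata for each one."""
    data = stream.read()  # simplification: the whole file is held in memory
    fmd5 = hashlib.md5(data).hexdigest()
    chunks = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    metadata = [
        {"seq": n, "fmd5": fmd5, "md5": hashlib.md5(c).hexdigest(),
         "is_last": n == len(chunks), "file_name": file_name}
        for n, c in enumerate(chunks, start=1)
    ]
    return chunks, metadata


chunks, metadata = slice_stream(io.BytesIO(b"x" * 2_500_000), "demo.bin")
```

The offset is left out of the metadata here because it is only known once a slice has been written to the topic partition.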
Step 302: Send the multiple file slices to the topic partition of the message publishing and subscription system, and store the metadata of each file slice in the distributed database.
The metadata includes the message digest code of the file slice, the offset of the file slice in the topic partition, the serial number of the file slice, the end status of the file slice, the full name of the original file, and the message digest code of the original file. The end status is a flag that is set only in the metadata of the last file slice; the full name and message digest code of the original file are used to identify the file and guarantee the uniqueness of the original file. The file slices are then sent to the topic partition of the message publishing and subscription system. Because no slice is larger than 1 MB, the slices can be transmitted stably in the topic partition, and the metadata of each file slice is stored in the distributed database at the same time. For the subsequent data processing, refer to the preceding embodiments; details are not repeated here.
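The producer side of step 302 can be sketched with in-memory stand-ins. `InMemoryPartition` is a toy replacement for a Kafka topic partition and a plain list replaces the distributed database; both, along with the function name `publish`, are assumptions made so that the example runs without external services.

```python
import hashlib


class InMemoryPartition:
    """Toy stand-in for a topic partition: an append-only list of messages."""

    def __init__(self):
        self.messages = []

    def append(self, payload):
        self.messages.append(payload)
        return len(self.messages) - 1  # offset assigned to this message


def publish(chunks, file_name, partition, metadata_store):
    """Send each slice to the partition and record its metadata."""
    fmd5 = hashlib.md5(b"".join(chunks)).hexdigest()
    for seq, chunk in enumerate(chunks, start=1):
        offset = partition.append(chunk)
        metadata_store.append({
            "seq": seq, "offset": offset, "fmd5": fmd5,
            "md5": hashlib.md5(chunk).hexdigest(),
            "is_last": seq == len(chunks), "file_name": file_name,
        })


partition = InMemoryPartition()
partition.append(b"unrelated message")  # the partition may already hold data
store = []
publish([b"aa", b"bb"], "demo.bin", partition, store)
```

Recording the offset returned by the append is what later lets the consumer go from a serial number to the exact message in the partition.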
In the solution of this application, the data producer obtains the original file and slices it to obtain multiple file slices and the metadata of each file slice. It sends the file slices to the topic partition of the message publishing and subscription system and stores the metadata of each file slice in the distributed database, so that the data consumer, after obtaining the metadata of each file slice from the distributed database and the file slices from the topic partition, merges the file slices according to their metadata to obtain the target file. In other words, this application slices a large file, transmits the file slices through the message publishing and subscription system, and stores the metadata of the file slices in a distributed database. File slicing together with the distributed database makes it possible to transmit large files over a message publishing and subscription system, which improves file transmission efficiency.
FIG. 4 is an exemplary flowchart of the file transmission method provided by this application. The flow is as follows. The data producer first slices the original file so that no file slice exceeds 1 MB, saves the metadata of each file slice to the distributed database, and then sends the file slices to the topic partition of the message publishing and subscription system for transmission. When the data consumer needs to consume data, it queries the distributed database to obtain the message digest code of each file, writes these codes into a preset set, and deduplicates them automatically; the number of message digest codes in the deduplicated set (the target digest code set) equals the number of files. Using the pre-stored correspondence between file message digest codes and file identifiers, the consumer identifies, from the target digest code set, the message digest code of the original file that currently needs to be downloaded. It then obtains the metadata of each file slice of that original file from the distributed database according to this message digest code, and looks up the offset of each file slice from the serial number in the metadata. The consumer consumes at the offset it found, pulling the file slice from the message publishing and subscription system. Once a pulled file slice passes verification, the file processing stream writes its data into a preset file, and the next file slice is pulled. After all file slices of the original file have been pulled and written into the preset file, the file processing stream is closed and the target file is obtained. The target file can then be verified by comparing the message digest code of the original file with the calculated message digest code of the target file; if verification passes, the file transmission is complete.
FIG. 5 is a schematic structural diagram of a file transmission apparatus provided by this application. The apparatus is suitable for executing the file transmission method provided by this application and is applied to the data consumer. As shown in FIG. 5, the apparatus may include:
a first acquisition module 501, configured to obtain the metadata of each of multiple file slices from a distributed database, where the file slices and the metadata of each file slice are produced by the data producer slicing the original file; a second acquisition module 502, configured to obtain matching file slices from the topic partition of the message publishing and subscription system according to the metadata of each file slice, thereby obtaining the multiple file slices; and a merging module 503, configured to create a file processing stream and use it to merge the multiple file slices according to the metadata of each file slice, obtaining the target file.
In one embodiment, the apparatus further includes a set acquisition module, configured to: obtain the message digest codes of multiple files by querying the distributed database; write the message digest codes of the multiple files into a preset set to obtain the original digest code set; and deduplicate the original digest code set to obtain the target digest code set.
In one embodiment, the first acquisition module 501 is configured to: identify the message digest code of the original file from the target digest code set; and obtain the metadata of each of the multiple file slices from the distributed database according to the message digest code of the original file.
In one embodiment, the metadata includes the serial number of the file slice and the offset of the file slice in the topic partition, and the second acquisition module 502 is configured to: look up the offset of the file slice in the topic partition according to the serial number of the file slice; and obtain the file slice from the topic partition according to that offset, until the multiple file slices have been obtained.
In one embodiment, the metadata further includes the message digest code of the corresponding file slice, and the apparatus further includes: a file slice verification module, configured to verify each of the multiple file slices according to the message digest code of the corresponding file slice included in the metadata, before the file processing stream is used to merge the file slices according to their metadata; and a trigger module, configured to trigger the merging module 503, when all of the file slices have passed verification, to perform the step of merging the file slices according to the metadata of each file slice by means of the file processing stream.
In one embodiment, the merging module 503 is configured to: use the file processing stream to write the data in the multiple file slices into a specified file according to the serial number of each file slice, obtaining the target file.
In one embodiment, the apparatus further includes a file verification module, configured to verify the target file according to the message digest code of the original file.
The apparatus of this application obtains the metadata of each of multiple file slices from a distributed database, where the file slices and their metadata are produced by the data producer slicing the original file; obtains the matching file slices from the topic partition of the message publishing and subscription system according to the metadata of each file slice; and creates a file processing stream and uses it to merge the file slices according to their metadata to obtain the target file. In other words, this application slices a large file, transmits the file slices through the message publishing and subscription system, and stores the metadata of the file slices in a distributed database. File slicing together with the distributed database makes it possible to transmit large files over a message publishing and subscription system, which improves file transmission efficiency.
FIG. 6 is another schematic structural diagram of the file transmission apparatus provided by this application. The apparatus is suitable for executing the file transmission method provided by this application and is applied to the data producer. As shown in FIG. 6, the apparatus may include:
a slicing module 601, configured to obtain the original file and slice it to obtain multiple file slices and the metadata of each file slice; a sending module 602, configured to send the multiple file slices to the topic partition of the message publishing and subscription system; and a storage module 603, configured to store the metadata of each file slice in the distributed database, so that the data consumer, after obtaining the metadata of each file slice from the distributed database and the file slices from the topic partition, merges the file slices according to their metadata to obtain the target file.
The apparatus of this application obtains the original file and slices it to obtain multiple file slices and the metadata of each file slice. It sends the file slices to the topic partition of the message publishing and subscription system and stores the metadata of each file slice in the distributed database, so that the data consumer, after obtaining the metadata of each file slice from the distributed database and the file slices from the topic partition, merges the file slices according to their metadata to obtain the target file. In other words, this application slices a large file, transmits the file slices through the message publishing and subscription system, and stores the metadata of the file slices in a distributed database. File slicing together with the distributed database makes it possible to transmit large files over a message publishing and subscription system, which improves file transmission efficiency.
This application further provides a file transmission system, including a data consumer and a data producer for executing the file transmission method described in any embodiment of this application.
This application further provides an electronic device, including a memory, a processor, and a computer program that is stored in the memory and executable on the processor. When the processor executes the program, the file transmission method provided by any of the above embodiments is implemented.
This application further provides a computer-readable medium on which a computer program is stored. When the program is executed by a processor, the file transmission method provided by any of the above embodiments is implemented.
Reference is now made to FIG. 7, which shows a schematic structural diagram of a computer system 700 suitable for implementing the electronic device of this application. The electronic device shown in FIG. 7 is only an example and should not limit the functions or scope of use of this application in any way.
As shown in FIG. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores the various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a local area network (LAN) card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
According to the embodiments disclosed in this application, the process described above with reference to the flowcharts can be implemented as a computer software program. For example, the embodiments disclosed in this application include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above functions of the devices in the system of this application are performed.
The computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. The computer-readable storage medium may include, but is not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and any combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules and/or units described in the present application may be implemented in software or in hardware. The described modules and/or units may also be provided in a processor. For example, this may be described as: a processor applied to a data consumption end, including a first acquisition module, a second acquisition module, and a merging module. Alternatively, it may be described as: a processor applied to a data production end, including a segmentation module, a sending module, and a storage module. In some cases, the names of these modules do not constitute a limitation on the modules themselves.
As another aspect, the present application further provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the device is caused to perform the following:
obtaining, from a distributed database, the metadata of each file slice in a plurality of file slices, where the plurality of file slices and the metadata of each file slice are obtained by the data production end slicing an original file; obtaining, according to the metadata of each file slice, the matching file slice from a topic partition of a message publishing and subscription system, to obtain the plurality of file slices; and creating a file processing stream and using the file processing stream to merge the plurality of file slices according to the metadata of each file slice, to obtain a target file.
Alternatively, the computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the device is caused to perform the following:
obtaining an original file, and slicing the original file to obtain a plurality of file slices and the metadata of each file slice; and sending the plurality of file slices to a topic partition of a message publishing and subscription system and storing the metadata of each file slice in a distributed database, so that a data consumption end, after obtaining the metadata of each file slice from the distributed database and obtaining the plurality of file slices from the topic partition, merges the plurality of file slices according to the metadata of each file slice to obtain a target file.
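By way of non-limiting illustration only, the data production end steps above may be sketched in Python as follows. The sketch rests on several assumptions that are not part of the application: a plain list stands in for the topic partition (its index plays the role of the offset), a dictionary keyed by the file's information summary code stands in for the distributed database, SHA-256 is used as the information summary code, and the names `SliceMeta` and `produce` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SliceMeta:
    file_digest: str   # summary code of the whole original file
    seq: int           # serial number of the slice within the file
    offset: int        # position of the slice in the topic partition
    slice_digest: str  # summary code of this slice's bytes


def produce(data: bytes, slice_size: int, partition: list, metadata_db: dict) -> str:
    """Slice `data`, append each slice to `partition`, and record its metadata.

    `partition` (a list) and `metadata_db` (a dict) are in-memory stand-ins
    for a topic partition and a distributed database (assumptions).
    Returns the summary code of the original file.
    """
    file_digest = hashlib.sha256(data).hexdigest()
    metas = []
    for seq, start in enumerate(range(0, len(data), slice_size)):
        chunk = data[start:start + slice_size]
        offset = len(partition)  # offset the log will assign to this record
        partition.append(chunk)
        metas.append(
            SliceMeta(file_digest, seq, offset, hashlib.sha256(chunk).hexdigest())
        )
    metadata_db[file_digest] = metas
    return file_digest
```

Recording the offset next to the serial number is what would later allow a data consumption end to fetch exactly the records belonging to one file, even when the partition also holds unrelated messages.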
According to the technical solution of the present application, the metadata of each file slice in a plurality of file slices can be obtained from a distributed database, where the plurality of file slices and the metadata of each file slice are obtained by the data production end slicing an original file; the matching file slices are obtained from a topic partition of a message publishing and subscription system according to the metadata of each file slice; and a file processing stream is created and used to merge the plurality of file slices according to the metadata of each file slice, to obtain a target file. In other words, the present application slices a large file, transmits the file slices through the message publishing and subscription system, and stores the metadata of the file slices in the distributed database. With the help of file slicing and the distributed database, large files can be transmitted over the message publishing and subscription system, thereby improving file transmission efficiency.
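By way of non-limiting illustration only, the data consumption end flow summarized above may be sketched in Python as follows. As assumptions that are not part of the application, a list stands in for the topic partition, a dictionary stands in for the distributed database, each metadata record is a dictionary with the hypothetical keys `seq`, `offset`, and `slice_digest`, and SHA-256 is used as the information summary code. The function name `consume` is likewise hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def consume(file_digest: str, metadata_db: dict, partition: list, out: BinaryIO) -> None:
    """Fetch, verify, and merge the slices of one file into the stream `out`."""
    # Obtain the metadata of every slice, ordered by serial number.
    metas = sorted(metadata_db[file_digest], key=lambda m: m["seq"])

    # Obtain each slice by its recorded offset and verify its summary code.
    slices = []
    for meta in metas:
        chunk = partition[meta["offset"]]
        if hashlib.sha256(chunk).hexdigest() != meta["slice_digest"]:
            raise ValueError(f"slice {meta['seq']} failed verification")
        slices.append(chunk)

    # Merge only after every slice has passed verification, then check the
    # merged content against the summary code of the original file.
    whole = hashlib.sha256()
    for chunk in slices:
        out.write(chunk)
        whole.update(chunk)
    if whole.hexdigest() != file_digest:
        raise ValueError("merged file failed verification")


if __name__ == "__main__":
    data = b"hello world!"
    parts = [b"hello ", b"world!"]
    digest = hashlib.sha256(data).hexdigest()
    db = {digest: [
        {"seq": i, "offset": i, "slice_digest": hashlib.sha256(p).hexdigest()}
        for i, p in enumerate(parts)
    ]}
    target = io.BytesIO()
    consume(digest, db, parts, target)
    print(target.getvalue() == data)
```

In this sketch, slices are written strictly in serial-number order regardless of where they sit in the partition, so the order in which records arrived does not affect the merged result.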
It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the result expected from the technical solution of the present application can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present application.

Claims (13)

  1. A file transmission method, applied to a data consumption end, the method comprising:
    obtaining, from a distributed database, metadata of each file slice in a plurality of file slices, wherein the plurality of file slices and the metadata of each file slice are obtained by a data production end slicing an original file;
    obtaining, according to the metadata of each file slice, a matching file slice from a topic partition of a message publishing and subscription system, to obtain the plurality of file slices; and
    creating a file processing stream, and merging the plurality of file slices according to the metadata of each file slice by using the file processing stream, to obtain a target file.
  2. The file transmission method according to claim 1, further comprising, before obtaining the metadata of each file slice in the plurality of file slices from the distributed database:
    obtaining information summary codes of a plurality of files by querying the distributed database;
    writing the information summary codes of the plurality of files into a preset set to obtain an original summary code set; and
    deduplicating the original summary code set to obtain a target summary code set.
  3. The file transmission method according to claim 2, wherein obtaining the metadata of each file slice in the plurality of file slices from the distributed database comprises:
    identifying the information summary code of the original file from the target summary code set; and
    obtaining, according to the information summary code of the original file, the metadata of each file slice in the plurality of file slices from the distributed database.
  4. The file transmission method according to claim 1, wherein the metadata includes a serial number of the file slice and an offset of the file slice in the topic partition; and
    obtaining, according to the metadata of each file slice, the matching file slice from the topic partition of the message publishing and subscription system, to obtain the plurality of file slices, comprises:
    looking up the offset of the file slice in the topic partition according to the serial number of the file slice; and
    obtaining the file slice from the topic partition according to the offset of the file slice in the topic partition, until the plurality of file slices are obtained.
  5. The file transmission method according to claim 4, wherein the metadata further includes an information summary code of the file slice; and
    before merging the plurality of file slices according to the metadata of each file slice by using the file processing stream, the method further comprises:
    verifying each file slice in the plurality of file slices according to the information summary code of the file slice included in the metadata; and
    when all of the plurality of file slices pass the verification, triggering execution of the step of merging the plurality of file slices according to the metadata of each file slice by using the file processing stream.
  6. The file transmission method according to claim 4, wherein merging the plurality of file slices according to the metadata of each file slice by using the file processing stream, to obtain the target file, comprises:
    writing, by using the file processing stream, the data in the plurality of file slices into a designated file according to the serial number of each file slice, to obtain the target file.
  7. The file transmission method according to claim 3, the method further comprising:
    verifying the target file according to the information summary code of the original file.
  8. A file transmission method, applied to a data production end, the method comprising:
    obtaining an original file, and slicing the original file to obtain a plurality of file slices and metadata of each file slice; and
    sending the plurality of file slices to a topic partition of a message publishing and subscription system, and storing the metadata of each file slice in a distributed database, so that a data consumption end, after obtaining the metadata of each file slice from the distributed database and obtaining the plurality of file slices from the topic partition, merges the plurality of file slices according to the metadata of each file slice to obtain a target file.
  9. A file transmission apparatus, applied to a data consumption end, the apparatus comprising:
    a first acquisition module, configured to obtain, from a distributed database, metadata of each file slice in a plurality of file slices, wherein the plurality of file slices and the metadata of each file slice are obtained by a data production end slicing an original file;
    a second acquisition module, configured to obtain, according to the metadata of each file slice, a matching file slice from a topic partition of a message publishing and subscription system, to obtain the plurality of file slices; and
    a merging module, configured to create a file processing stream and to merge the plurality of file slices according to the metadata of each file slice by using the file processing stream, to obtain a target file.
  10. A file transmission apparatus, applied to a data production end, the apparatus comprising:
    a segmentation module, configured to obtain an original file and slice the original file to obtain a plurality of file slices and metadata of each file slice;
    a sending module, configured to send the plurality of file slices to a topic partition of a message publishing and subscription system; and
    a storage module, configured to store the metadata of each file slice in a distributed database, so that a data consumption end, after obtaining the metadata of each file slice from the distributed database and obtaining the plurality of file slices from the topic partition, merges the plurality of file slices according to the metadata of each file slice to obtain a target file.
  11. A file transmission system, comprising a data consumption end configured to execute the file transmission method according to any one of claims 1 to 7, and a data production end configured to execute the file transmission method according to claim 8.
  12. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the file transmission method according to any one of claims 1 to 7, or implements the file transmission method according to claim 8.
  13. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the file transmission method according to any one of claims 1 to 7, or implements the file transmission method according to claim 8.
PCT/CN2023/103618 2022-11-16 2023-06-29 File transmission method, apparatus and system, electronic device, and storage medium WO2024103752A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211434252.6A CN115801765A (en) 2022-11-16 2022-11-16 File transmission method, device, system, electronic equipment and storage medium
CN202211434252.6 2022-11-16

Publications (1)

Publication Number Publication Date
WO2024103752A1 (en) 2024-05-23

Family

ID=85438160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103618 WO2024103752A1 (en) 2022-11-16 2023-06-29 File transmission method, apparatus and system, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115801765A (en)
WO (1) WO2024103752A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801765A (en) * 2022-11-16 2023-03-14 工赋(青岛)科技有限公司 File transmission method, device, system, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109361629A (en) * 2018-10-26 2019-02-19 江苏大学 One kind being based on the big message method for reliable transmission of Kafka and system
US10887253B1 (en) * 2014-12-04 2021-01-05 Amazon Technologies, Inc. Message queuing with fan out
WO2021258831A1 (en) * 2020-06-23 2021-12-30 华为技术有限公司 Data processing method and system
CN114077518A (en) * 2020-08-21 2022-02-22 湖南福米信息科技有限责任公司 Data snapshot method, device, equipment and storage medium
CN115250181A (en) * 2022-07-22 2022-10-28 中国电信股份有限公司 Kafka-based file verification transmission method, device, equipment and storage
CN115801765A (en) * 2022-11-16 2023-03-14 工赋(青岛)科技有限公司 File transmission method, device, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115801765A (en) 2023-03-14

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23890194; Country of ref document: EP; Kind code of ref document: A1)