CN112019605A - Data distribution method and system of data stream - Google Patents

Data distribution method and system of data stream

Info

Publication number
CN112019605A
Authority
CN
China
Prior art keywords
file
data
files
fragmented
queue
Prior art date
Legal status
Granted
Application number
CN202010814942.9A
Other languages
Chinese (zh)
Other versions
CN112019605B (en
Inventor
周虓岗
张明磊
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202010814942.9A priority Critical patent/CN112019605B/en
Publication of CN112019605A publication Critical patent/CN112019605A/en
Application granted granted Critical
Publication of CN112019605B publication Critical patent/CN112019605B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1074 Peer-to-peer [P2P] networks for supporting data block transmission mechanisms
    • H04L67/1078 Resource delivery mechanisms
    • H04L67/108 Resource delivery mechanisms characterised by resources being split in blocks or fragments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the application provides a data distribution method for a data stream, comprising: acquiring a plurality of data files of a target data stream; fragmenting the plurality of data files to obtain a plurality of fragmented files; writing each fragmented file into a file queue; merging the fragmented files in the file queue to obtain a target merged file; and outputting the target merged file. The embodiment addresses IO (input/output) blocking during data distribution and improves the data distribution capability and efficiency of the data distribution layer.

Description

Data distribution method and system of data stream
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a data distribution method, a system, equipment and a computer readable storage medium for data streams.
Background
A streaming data transmission link is composed of a data transmission layer (such as a gateway), a data caching layer, a data distribution layer (controller), and a data storage terminal. When a data source has data to report, the reported data flows through the data transmission layer, the data caching layer, and the data distribution layer, and finally into the data storage terminal. At present, the sizes of data files in a streaming data transmission link may differ by a factor of tens or even hundreds, with the following consequences: larger data files are prone to IO (input/output) blocking during data distribution, and the large number of smaller data files makes concurrent distribution more difficult. How to resolve IO blocking and the difficulty of concurrency during data distribution, and thereby improve data distribution efficiency, is therefore one of the technical problems currently to be solved.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data distribution method, a system, a computer device, and a computer-readable storage medium for data streams, which are used to solve the technical problems of IO congestion in data distribution and high concurrency difficulty in data distribution in a data distribution layer.
One aspect of the embodiments of the present application provides a data distribution method for a data stream, including: acquiring a plurality of data files of a target data stream; fragmenting the plurality of data files to obtain a plurality of fragmented files; writing each fragmented file into a file queue; merging the fragmented files in the file queue to obtain a target merged file; and outputting the target merged file.
Optionally, fragmenting the plurality of data files to obtain a plurality of fragmented files includes: fragmenting the plurality of data files at a preset time frequency to obtain the plurality of fragmented files.
Optionally, writing each fragmented file into the file queue includes: writing each fragmented file into the file queue in the order of the fragmentation times of the fragmented files.
Optionally, merging the fragmented files in the file queue to obtain a target merged file includes: monitoring the file queue; when a new fragmented file is detected as written into the file queue, examining all fragmented files in the file queue to obtain the file size of each; calculating the sum of the file sizes of all fragmented files in the file queue; and if the sum is not smaller than a preset value, merging one or more fragmented files in the file queue to obtain the target merged file.
Optionally, each fragmented file in the file queue carries the time at which it was written into the file queue; merging one or more fragmented files in the file queue to obtain the target merged file includes: detecting whether the time interval between the write time of the earliest fragmented file in the file queue and the current time is greater than a preset duration; and if so, merging one or more fragmented files in the file queue to obtain the target merged file.
Optionally, the method further includes: after the target merged file is obtained, deleting from the file queue the one or more fragmented files corresponding to the target merged file.
Optionally, the method further includes: generating corresponding target event information according to the target merged file; and deleting from the file queue, according to the target event information, the one or more fragmented files corresponding to the target merged file.
An aspect of an embodiment of the present application further provides a data distribution system for data streams, including: the receiving module is used for acquiring a plurality of data files of the target data stream; the fragmentation module is used for carrying out fragmentation processing on the plurality of data files to obtain a plurality of fragmentation files; the writing module is used for writing each fragment file into a file queue; the merging module is used for performing merging operation on the fragment files in the file queue to obtain a target merged file; and the output module is used for outputting the target merging file.
An aspect of the embodiments of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the data distribution method of the data stream described above.
An aspect of the embodiments of the present application further provides a computer-readable storage medium, in which a computer program is stored, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the data distribution method of data streams as described above.
According to the data distribution method, the data distribution system, the data distribution equipment and the computer-readable storage medium of the data stream, the data file is subjected to fragmentation processing to obtain the fragment files, and the fragment files are subjected to merging operation, so that the problem of IO (input/output) blockage during data distribution is solved, the number of files processed by the Sink is reduced, and the data distribution capability and the data distribution efficiency of a data distribution layer to the data stream are improved.
Drawings
FIG. 1 schematically illustrates an application environment diagram according to an embodiment of the present application;
fig. 2 schematically shows a flow chart of a data distribution method of a data stream according to a first embodiment of the present application;
fig. 3 schematically shows a flow chart of a data distribution method of a data stream according to a second embodiment of the present application;
fig. 4 schematically shows a detailed flowchart of step S306;
fig. 5 schematically shows a detailed flowchart of step S406;
fig. 6 schematically shows a detailed flowchart of step S406;
FIG. 7 schematically illustrates an overall flowchart of the data file transfer route according to the second embodiment of the present application;
fig. 8 schematically shows a block diagram of a data distribution system of data flows according to a third embodiment of the present application; and
fig. 9 schematically shows a hardware architecture diagram of a computer device suitable for implementing a data distribution method of a data stream according to a fourth embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clearly understood, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the application and are not intended to limit the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the embodiments in the present application.
It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
The following are explanations of terms referred to in the present application:
LogId (stream ID) may be defined with three-segment semantics (e.g., department + project + business) so that the category to which the data belongs can be located quickly. The stream ID may also be defined together with other ancillary information, such as creator information. A data stream may be defined with a schema (the organization and structure of the data), such as its fields, their types, and whether each is required. The schema may be used for analysis and evaluation of the data stream. According to the defined schema, corresponding field values, such as the service scenario, may be written into the metadata of the data stream, and different service scenarios may be configured with different SLA (Service-Level Agreement) quality guarantees. It should be noted that these field values may be written and modified by a user or by an administrator.
Source, as a data input interface, is used to consume one or more data streams from corresponding topics (Topic) in the data caching layer 3.
Transform, as a data processing module, is used to perform data processing on one or more data streams received by the Source.
Sink, as a data output interface, is used to distribute the data obtained after Transform processing to a storage terminal of the data storage layer 5.
Fig. 1 schematically shows a streaming data transmission link according to an embodiment of the present application. The streaming data transmission link provides streaming data transmission services, such as data collection and distribution, for both real-time streaming and offline streaming scenarios. The real-time streaming scenario requires second-level data timeliness and is mainly used for writing data into databases such as Kafka and Hbase. The offline streaming scenario requires hour-level or day-level data timeliness and is mainly used for writing data into databases such as HDFS (Hadoop Distributed File System) and Hive. The streaming data transmission system may be composed of a BFE layer 1, a network routing layer 2, a data buffer layer 3, a data distribution layer 4, a data storage layer 5, and so on.
The BFE layer 1 may be implemented by one or more edge nodes, and is configured to receive, process, and output the reported data. The reporting data may be data from different data sources, for example, reporting data of APP and Web.
The network routing layer 2, which may be implemented by one or more gateway nodes, is configured to forward data provided by the BFE layer 1 to the data buffer layer 3. Specifically, the network routing layer 2 is connected to the BFE layer 1 and can be adapted to various service scenarios and data protocols, for example, APP and Web data carried over the HyperText Transfer Protocol (HTTP), and internal communication data carried over the gRPC protocol.
The data buffer layer 3 may be implemented by a publish-subscribe messaging system or a cluster of such systems. In some embodiments, the data buffer layer 3 may be composed of multiple Kafka clusters, which smooth out peaks and troughs in data traffic. Data of different importance, priority, and throughput can be routed to different Kafka clusters to protect the value of each type of data and to prevent a system fault from affecting all data.
The data distribution layer 4 may be implemented by a streaming data distribution system (composed of a plurality of traffic distribution nodes, called Collectors) and is used for content conversion and distribution to storage, that is, for ensuring that data is acquired from the data buffer layer 3 and written into the corresponding storage terminal in the data storage layer 5. Specifically, the data distribution layer 4 lands distributed data in storage; supported distribution scenarios include HDFS (Hadoop Distributed File System), Kafka, Hbase, ES (Elasticsearch), and the like. Different storage terminals have different timeliness requirements for data landing: for example, data written to HDFS is typically computed and used by day-level tasks, whereas data written to Kafka is typically computed and used by second-level tasks, in scenarios such as real-time recommendation and real-time computing. The data distribution layer 4 may manage its services in groups by storage terminal, according to the distribution requirements of different data scenarios. For example, the nodes may be divided into a Kafka Collector group, an HDFS Collector group, and the like. Each Collector group obtains the data of the corresponding topic from the data buffer layer 3 and distributes it downstream.
The data storage layer 5 is used for storing data and can be composed of different forms of databases, such as HDFS, ES, Hive, Kafka, Hbase and the like.
That is, data flows through the streaming data transmission link as follows: BFE layer 1 → network routing layer 2 → data buffer layer 3 → data distribution layer 4 → data storage layer 5. Through the streaming data transmission link, data in a data source can be transmitted to a target terminal. Specifically, the data source outputs data streams identified by LogId and reports the data to the edge nodes through protocols such as HTTP and RPC; the data then passes in turn through the network routing layer 2, the data buffer layer 3, and the data distribution layer 4, and finally enters a storage terminal in the data storage layer 5.
Example one
Fig. 2 schematically shows a flowchart of a data distribution method of a data stream according to a first embodiment of the present application. The present embodiment is exemplarily described with the computer device 40 as an execution subject. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed.
As shown in fig. 2, the data distribution method of the data stream may include steps S200 to S208, wherein:
step S200, a plurality of data files of the target data stream are acquired.
The target data stream comes from the data caching layer 3 and is distributed to the storage terminal of the data storage layer 5 through the computer device 40.
For the target data stream, the computer device 40 performs the following:
(1) receiving initial data of the target data stream through the Source;
(2) parsing, cleaning, converting, and otherwise processing the target data stream through the Transform to obtain a plurality of data files;
(3) processing the plurality of data files through the Sink so as to deliver them to a storage terminal of the data storage layer 5.
The method aims to improve the flow inside the Sink so as to address IO blocking and the difficulty of concurrency in data distribution. The Sink includes one or more Writers (write modules) and one or more Mergers (merge modules), and the Writers and the Mergers may be in one-to-one correspondence. A Writer may be used to split data files, and a Merger may be used to merge the resulting fragments. Details are given below.
In some embodiments, the data distribution layer 4 may obtain initial data of multiple data streams from the data buffer layer 3 through the Source at the same time, and analyze the initial data of each data stream through the Transform to obtain the multiple data files corresponding to each data stream; so as to write the plurality of data files corresponding to each data stream into the Sink.
Step S202, performing fragmentation processing on the plurality of data files to obtain a plurality of fragmented files.
After receiving the plurality of data files, the Sink may fragment them by means of its built-in Writer. For example, a contiguous plurality of data files may be split into a plurality of fragmented files based on the CheckPoint in the Writer.
It should be noted that, when the Sink receives multiple data streams, multiple Writers may be configured for the file splitting operation. Specifically, one Writer processes the plurality of data files of one data stream, and each Writer is configured with one CheckPoint. Each Writer may further be configured to record processing event information for the data files of its data stream and to send the processing event information to the next node.
Step S204, writing each fragment file into a file queue.
The Sink may write each obtained fragmented file into a file queue.
In an exemplary embodiment, the Writer may generate corresponding processing event information according to each fragmented file, and send the processing event information to the Merger, so that the Merger writes each fragmented file into a file queue according to the processing event information. The Merger is configured in advance, and the Merger may cache each fragmented file through the file queue, and may also perform merge operation on the fragmented files cached in the file queue. Wherein one Merger corresponds to one Writer.
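The Writer-to-Merger hand-off described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `ProcessingEvent` fields (`path`, `size`) and the method names `write_fragment` and `on_event` are hypothetical, since the source does not specify what the processing event information contains.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingEvent:
    """Hypothetical processing event information for one fragmented file."""
    path: str   # assumed: where the fragment was written
    size: int   # assumed: fragment size in bytes


class Merger:
    """Caches fragmented files in a file queue as events arrive."""

    def __init__(self):
        self.file_queue = deque()

    def on_event(self, event):
        # The Merger enqueues the fragment described by the event.
        self.file_queue.append(event)


class Writer:
    """Produces fragments and notifies its paired Merger (one Merger per Writer)."""

    def __init__(self, merger):
        self.merger = merger

    def write_fragment(self, path, payload):
        # A real Writer would persist `payload` at `path` before notifying.
        self.merger.on_event(ProcessingEvent(path, len(payload)))


merger = Merger()
writer = Writer(merger)
writer.write_fragment("part-0", b"abc")
writer.write_fragment("part-1", b"de")
print([e.path for e in merger.file_queue])
```

Passing only an event (rather than the payload itself) keeps the signaling path lightweight, which matches the separation between data flow and signaling flow described later for fig. 7.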
Step S206, merging the fragmented files in the file queue to obtain a target merged file.
The Merger may determine whether the fragmented files in the file queue meet a merge condition and, if so, perform a merge operation on them to obtain the target merged file. The merge condition may be, for example, that the sum of the file sizes of all fragmented files in the file queue reaches a preset value.
In some embodiments, after the Merger obtains the target merged file, the fragmented files corresponding to the target merged file may be deleted to ensure that no file data is duplicated.
And step S208, outputting the target merged file.
After the merge operation is completed by the Merger, the target merge file may be moved to a formal target of the Sink, so that the Sink outputs the target merge file to a storage terminal of the data storage layer 5.
In this embodiment, merging the fragmented files reduces the number of files processed by the Sink and improves the data distribution capability and efficiency of the data distribution layer.
Example two
Fig. 3 schematically shows a flowchart of a data distribution method of a data stream according to the second embodiment of the present application.
As shown in fig. 3, the data distribution method of the data stream may include steps S300 to S308, wherein:
in step S300, a plurality of data files of the target data stream are acquired.
Step S302, performing fragmentation processing on the plurality of data files at a preset time frequency to obtain a plurality of fragmented files.
Fragmenting the data files at a preset time frequency reduces the size of each individual file and thereby mitigates IO (input/output) blocking when the files are written to the next node.
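One way to picture fragmentation at a preset time frequency is to cut the incoming data into fixed-length time windows. The sketch below assumes, purely for illustration, that the data arrives as `(timestamp, payload)` records already sorted by timestamp and that the preset frequency is expressed as `interval_seconds`; the source only states that fragmentation is driven by a preset time frequency (via the CheckPoint in the Writer) and does not fix these details.

```python
from itertools import groupby


def shard_by_interval(records, interval_seconds):
    """Group (timestamp, payload) records into one fragment per time window.

    `records` must be sorted by timestamp. Returns a list of
    (window_start, payloads) pairs in time order.
    """
    def window(record):
        return int(record[0] // interval_seconds)

    return [
        (w * interval_seconds, [payload for _, payload in group])
        for w, group in groupby(records, key=window)
    ]


records = [(0, "a"), (30, "b"), (61, "c"), (125, "d")]
print(shard_by_interval(records, 60))
# [(0, ['a', 'b']), (60, ['c']), (120, ['d'])]
```

Because each fragment covers at most one window, a single oversized data file is broken into pieces whose size is bounded by the data rate times the window length.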
Step S304, writing each sliced file into the file queue according to the slicing time sequence of each sliced file.
Writing each fragmented file into the file queue for caching in fragmentation-time order preserves the correct order of the data in the target data stream and avoids out-of-order data when the fragmented files are merged.
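A minimal sketch of a queue that releases fragments in fragmentation-time order is shown below. It uses a heap so that order is preserved even if fragments are handed over out of order; that tolerance is an assumption added for illustration, and the class name `OrderedFileQueue` and its methods are hypothetical.

```python
import heapq


class OrderedFileQueue:
    """File queue that yields fragment names in fragmentation-time order."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal times keep arrival order

    def write(self, shard_time, name):
        heapq.heappush(self._heap, (shard_time, self._counter, name))
        self._counter += 1

    def drain(self):
        """Remove and return all fragment names, earliest fragment first."""
        ordered = []
        while self._heap:
            ordered.append(heapq.heappop(self._heap)[2])
        return ordered


queue = OrderedFileQueue()
queue.write(3, "c")
queue.write(1, "a")
queue.write(2, "b")
queue.write(1, "a2")
print(queue.drain())  # ['a', 'a2', 'b', 'c']
```

If fragments are always produced and written in time order, a plain FIFO queue gives the same result with less overhead.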
Step S306, performing a merge operation on the fragment files in the file queue to obtain a target merge file.
And step S308, outputting the target merged file.
As shown in fig. 4, the step S306 may further include steps S400 to S406, where:
step S400, monitoring the file queue;
step S402, when it is monitored that a new fragmented file is written in the file queue, all fragmented files in the file queue are detected to obtain the file size corresponding to each fragmented file in the file queue;
step S404, calculating the sum of the file sizes of all the fragmented files in the file queue; and
step S406, if the sum of the file sizes of the fragmented files in the file queue is not less than a preset value, performing a merging operation on one or more fragmented files in the file queue to obtain a target merged file.
Without file splitting, the sizes of the data files in a data stream vary greatly; in some embodiments, for example, the sizes of different data files may differ by a factor of tens or even hundreds. As a result, large data files are prone to IO blocking during data distribution, and the large number of small data files makes concurrent distribution more difficult. Therefore, in this embodiment, larger files are split and a preset value is configured in advance; if the sum of the file sizes of the fragmented files in the file queue is not smaller than the preset value, a merge operation is performed on one or more fragmented files in the file queue to obtain the target merged file.
In this embodiment, the size of the target merged file can be controlled by configuring the preset value, so that the size of any single data file is reduced, the size differences between data files are narrowed, and concurrency efficiency during data distribution is improved.
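Steps S400 to S406 can be sketched as a merger that re-checks the queue every time a new fragment is written. This is an illustrative sketch only: fragments are modeled as in-memory `bytes`, the whole queue is merged once the threshold is reached (the source allows "one or more" fragments), and the names `SizeTriggeredMerger` and `on_new_fragment` are hypothetical.

```python
from collections import deque


class SizeTriggeredMerger:
    """Merges queued fragments once their total size reaches a preset value."""

    def __init__(self, preset_value):
        self.preset_value = preset_value
        self.file_queue = deque()
        self.merged = []  # target merged files produced so far

    def on_new_fragment(self, fragment: bytes):
        # S400/S402: a new write into the queue is observed.
        self.file_queue.append(fragment)
        # S402: obtain the size of every fragment currently queued.
        sizes = [len(f) for f in self.file_queue]
        # S404: sum the sizes.
        total = sum(sizes)
        # S406: merge when the sum is not smaller than the preset value.
        if total >= self.preset_value:
            self.merged.append(b"".join(self.file_queue))
            self.file_queue.clear()


merger = SizeTriggeredMerger(preset_value=10)
for size in (4, 4, 4, 3, 9):
    merger.on_new_fragment(b"x" * size)
print([len(m) for m in merger.merged])  # [12, 12]
```

In this sketch every merged file is at least `preset_value` bytes and exceeds it by less than the size of the last fragment added, which is how the preset value bounds the spread of output file sizes.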
As shown in fig. 5, each fragmented file in the file queue carries a write time when writing into the file queue, and step S406 may further include steps S500 to S502, where:
step S500, detecting whether the time interval between the writing time of the earliest fragmented file in the file queue and the current time is greater than a preset time length; and
step S502, if the time interval between the writing time of the earliest fragmented file in the file queue and the current time is greater than the preset time length, performing merging operation on one or more fragmented files in the file queue to obtain a target merged file.
For example, suppose the current time is 4:10 and the file queue contains a fragmented file written at 3:10; this indicates that the data stream has been held up in the file queue for one hour. To address this, if the time interval between the write time of the earliest fragmented file in the file queue and the current time is greater than the preset duration, the fragmented files in the file queue may be forcibly merged to obtain a target merged file.
In this embodiment, checking whether the time interval between the write time of the earliest fragmented file in the file queue and the current time exceeds the preset duration handles the case where the fragmented files are so small that their accumulated size in the file queue fails to reach the preset value for a long time, and thus improves data distribution efficiency.
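The age check in steps S500 to S502 reduces to comparing the oldest write time with the current time. The sketch below reproduces the 3:10 / 4:10 example; the calendar date and the 30-minute preset duration are arbitrary values chosen for illustration, and the function name `should_force_merge` is hypothetical.

```python
from datetime import datetime, timedelta


def should_force_merge(write_times, now, preset_duration):
    """Return True if the earliest queued fragment has waited longer than allowed.

    `write_times` holds the write time carried by each fragment in the queue.
    """
    if not write_times:
        return False  # nothing queued, nothing to merge
    return now - min(write_times) > preset_duration


# Earliest fragment written at 3:10, current time 4:10 (date is arbitrary).
write_times = [datetime(2020, 1, 1, 3, 10), datetime(2020, 1, 1, 3, 50)]
now = datetime(2020, 1, 1, 4, 10)
print(should_force_merge(write_times, now, timedelta(minutes=30)))  # True
```

The comparison is strict ("greater than"), so a fragment that has waited exactly the preset duration does not yet trigger a forced merge.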
In an exemplary embodiment, the step S406 may further include: after the target merged file is obtained, deleting from the file queue the one or more fragmented files corresponding to the target merged file. This ensures that the Sink contains no duplicate file data and prevents the Merger from processing the same fragmented file repeatedly.
As shown in fig. 6, the step S406 may further include steps S600 to S602, where: step S600, generating corresponding target event information according to the target merged file; and step S602, deleting one or more fragmented files corresponding to the target merged file in the file queue according to the target event information.
After the target merged file is obtained, the Merger may generate corresponding target event information according to the target merged file and send the target event information to the Committer (forwarding module). After receiving the target event information, the Committer may generate corresponding file deletion information according to the target event information and send the file deletion information to the Discardingsink (deletion module). The Discardingsink then deletes, according to the file deletion information, the one or more fragmented files in the file queue that correspond to the target merged file.
In this embodiment, generating the target event information and deleting the corresponding one or more fragmented files according to it improves the timeliness and accuracy of fragmented-file deletion and avoids mistaken deletion.
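The Merger → Committer → Discardingsink deletion chain can be sketched as two message types passed between small handlers. This is a hedged illustration: the message classes `TargetEvent` and `DeleteRequest`, their fields, and the `handle` methods are hypothetical, and the file queue is modeled as a plain list of fragment names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetEvent:
    """Hypothetical target event information emitted by the Merger."""
    merged_name: str
    fragment_names: tuple  # fragments that went into the merged file


@dataclass(frozen=True)
class DeleteRequest:
    """Hypothetical file deletion information emitted by the Committer."""
    fragment_names: tuple


class DiscardingSink:
    """Deletion module: removes the named fragments from the file queue."""

    def __init__(self, file_queue):
        self.file_queue = file_queue

    def handle(self, request):
        doomed = set(request.fragment_names)
        # Mutate the shared queue in place, keeping only unmerged fragments.
        self.file_queue[:] = [n for n in self.file_queue if n not in doomed]


class Committer:
    """Forwarding module: turns a target event into a deletion request."""

    def __init__(self, discarding_sink):
        self.discarding_sink = discarding_sink

    def handle(self, event):
        self.discarding_sink.handle(DeleteRequest(event.fragment_names))


file_queue = ["f0", "f1", "f2"]
committer = Committer(DiscardingSink(file_queue))
committer.handle(TargetEvent("merged-0", ("f0", "f1")))
print(file_queue)  # ['f2']
```

Because the deletion request names exactly the fragments recorded in the target event, only fragments that were actually merged are removed, and fragments written after the merge stay in the queue.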
For ease of understanding, fig. 7 shows a flowchart of the data file transfer route of the present embodiment.
The solid arrows denote the data flow, which runs from the Source to the Transform and finally to the Writer. Specifically, the computer device 40 may receive the initial data of the target data stream through the Source and write it into the Transform; the Transform parses, cleans, converts, and otherwise processes the target data stream to obtain the plurality of data files and writes them into the Writer.
The dotted lines denote the signaling flow, which runs from the Writer to the Merger, from the Merger to the Committer, and from the Committer to the Discardingsink. The signaling flow may be used to transmit the processing event information, the target event information, and the file deletion information. Specifically, the computer device 40 fragments the plurality of data files through the Writer to obtain a plurality of fragmented files, records through the Writer the processing event information of the data files corresponding to each data stream, and sends the processing event information to the Merger. The Merger merges the fragmented files to obtain a target merged file, generates corresponding target event information according to the target merged file, and sends the target event information to the Committer (forwarding module). The Committer receives the target event information, generates corresponding file deletion information according to it, and sends the file deletion information to the Discardingsink (deletion module). The Discardingsink deletes, according to the file deletion information, the one or more fragmented files in the file queue that correspond to the target merged file.
EXAMPLE III
Fig. 8 schematically illustrates a block diagram of a data distribution system of a data stream according to a third embodiment of the present application. The system may be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the embodiments of the present application. The program modules referred to in the embodiments of the present application are series of computer program instruction segments capable of performing specific functions; the following description specifically describes the function of each program module in this embodiment.
As shown in fig. 8, the data distribution system 800 of the data stream may include a receiving module 810, a slicing module 820, a writing module 830, a merging module 840, and an outputting module 850, wherein:
A receiving module 810, configured to obtain a plurality of data files of the target data stream.
A fragmentation module 820, configured to perform fragmentation processing on the plurality of data files to obtain a plurality of fragmented files.
In an exemplary embodiment, the fragmentation module 820 is further configured to: perform fragmentation processing on the plurality of data files at a preset time frequency to obtain the plurality of fragmented files.
A writing module 830, configured to write each fragmented file into a file queue.
In an exemplary embodiment, the writing module 830 is further configured to: write each fragmented file into the file queue according to the fragmentation time order of the fragmented files.
A merging module 840, configured to perform a merging operation on the fragmented files in the file queue to obtain a target merged file.
In an exemplary embodiment, the merge module 840 is further configured to: monitoring the file queue; when it is monitored that a new fragmented file is written in the file queue, detecting all fragmented files in the file queue to obtain the file size corresponding to each fragmented file in the file queue; calculating the sum of the file sizes of all the fragmented files in the file queue; and if the sum of the file sizes of the fragmented files in the file queue is not smaller than a preset value, performing merging operation on one or more fragmented files in the file queue to obtain a target merged file.
In an exemplary embodiment, the fragmented file carries a write time when writing into the file queue; the merge module 840 is further configured to: detecting whether the time interval between the writing time of the earliest fragmented file in the file queue and the current time is greater than a preset time length; and if the time interval between the writing time of the earliest fragmented file in the file queue and the current time is greater than the preset time length, performing merging operation on one or more fragmented files in the file queue to obtain a target merged file.
An output module 850, configured to output the target merged file.
In an exemplary embodiment, the data distribution system 800 of the data stream may further include a deletion module, configured to: delete one or more fragmented files corresponding to the target merged file in the file queue after the target merged file is obtained.
In an exemplary embodiment, the deletion module is further configured to: generate corresponding target event information according to the target merged file; and delete one or more fragmented files corresponding to the target merged file in the file queue according to the target event information.
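The two merge conditions described for the merging module (total file size not smaller than a preset value, or the earliest fragmented file being older than a preset time length) can be illustrated with a short sketch. The class and parameter names (`FileQueue`, `size_threshold`, `max_age`) are hypothetical, and treating the two conditions as alternatives checked in a single method is an assumption of this example rather than something the embodiment prescribes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Fragment:
    name: str
    data: bytes
    write_time: float  # seconds; carried when the fragment enters the queue


class FileQueue:
    """Hypothetical file queue holding fragments in write order."""

    def __init__(self, size_threshold: int, max_age: float) -> None:
        self.size_threshold = size_threshold  # preset value, in bytes
        self.max_age = max_age                # preset time length, in seconds
        self.fragments: Deque[Fragment] = deque()

    def write(self, fragment: Fragment) -> None:
        # Fragments are assumed to arrive in fragmentation time order.
        self.fragments.append(fragment)

    def should_merge(self, now: float) -> bool:
        if not self.fragments:
            return False
        total = sum(len(f.data) for f in self.fragments)
        if total >= self.size_threshold:  # "not smaller than a preset value"
            return True
        earliest = self.fragments[0].write_time
        return now - earliest > self.max_age  # "greater than a preset time length"

    def merge(self, now: float) -> Optional[bytes]:
        """Return the target merged file, or None if no condition is met."""
        if not self.should_merge(now):
            return None
        merged = b"".join(f.data for f in self.fragments)
        self.fragments.clear()  # delete the fragments that were just merged
        return merged


q = FileQueue(size_threshold=8, max_age=60.0)
q.write(Fragment("part-0", b"abc", write_time=0.0))
first = q.merge(now=10.0)    # 3 bytes and 10 s old: neither condition holds
q.write(Fragment("part-1", b"defgh", write_time=11.0))
second = q.merge(now=11.0)   # 8 bytes in total: size condition holds
q.write(Fragment("part-2", b"x", write_time=20.0))
third = q.merge(now=90.0)    # 70 s old: time condition holds
```

The time condition keeps a small trailing fragment from waiting indefinitely for the size threshold to be reached, which is why the sketch checks it even when the queue holds very little data.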
EXAMPLE IV
Fig. 9 schematically shows a hardware architecture diagram of a computer device suitable for implementing a data distribution method of a data stream according to a fourth embodiment of the present application. In the present embodiment, the computer device 40 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance. For example, the computer device 40 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of a plurality of servers). As shown in fig. 9, the computer device 40 includes at least, but is not limited to, a memory 910, a processor 920, and a network interface 930, which may be communicatively linked to each other via a system bus. Wherein:
The memory 910 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 910 may be an internal storage module of the computer device 40, such as a hard disk or a memory of the computer device 40. In other embodiments, the memory 910 may also be an external storage device of the computer device 40, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device 40. Of course, the memory 910 may also include both internal and external storage modules of the computer device 40. In this embodiment, the memory 910 is generally used for storing an operating system installed in the computer device 40 and various types of application software, such as the program code of the data distribution method of a data stream. In addition, the memory 910 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 920 may be, in some embodiments, a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or other data Processing chip. The processor 920 is generally configured to control the overall operation of the computer device 40, such as performing control and processing related to data interaction or communication with the computer device 40. In this embodiment, the processor 920 is configured to execute program codes stored in the memory 910 or process data.
The network interface 930 may include a wireless network interface or a wired network interface, and is typically used to establish communication links between the computer device 40 and other computer devices. For example, the network interface 930 is used to connect the computer device 40 to an external terminal via a network and to establish a data transmission channel and a communication link between the computer device 40 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It is noted that fig. 9 only shows a computer device having components 910 to 930, but it is understood that not all of the shown components are required and that more or fewer components may be implemented instead.
In this embodiment, the data distribution method of the data stream stored in the memory 910 may be further divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 920) to complete the present invention.
EXAMPLE V
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data distribution method of data streams in the embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in the computer device, for example, the program codes of the data distribution method of the data stream in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present invention described above may be implemented by a general-purpose computing device, and may be concentrated on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method of data distribution of a data stream, the method comprising:
acquiring a plurality of data files of a target data stream;
carrying out fragmentation processing on the plurality of data files to obtain a plurality of fragmented files;
writing each fragmented file into a file queue;
merging the fragmented files in the file queue to obtain a target merged file; and
outputting the target merged file.
2. The method for distributing data of a data stream according to claim 1, wherein the fragmenting the plurality of data files to obtain a plurality of fragmented files comprises:
carrying out fragmentation processing on the plurality of data files at a preset time frequency to obtain the plurality of fragmented files.
3. The data distribution method of data stream according to claim 1, wherein writing each fragmented file into a file queue comprises:
writing each fragmented file into the file queue according to the fragmentation time order of the fragmented files.
4. The data distribution method of data stream according to claim 1, wherein performing a merge operation on the fragmented files in the file queue to obtain a target merged file comprises:
monitoring the file queue;
when it is monitored that a new fragmented file is written in the file queue, detecting all fragmented files in the file queue to obtain the file size corresponding to each fragmented file in the file queue;
calculating the sum of the file sizes of all the fragmented files in the file queue; and
if the sum of the file sizes of all the fragmented files in the file queue is not smaller than a preset value, merging one or more fragmented files in the file queue to obtain the target merged file.
5. The data distribution method of data stream according to claim 4, wherein each fragmented file in the file queue carries a write time of being written into the file queue, and merging one or more fragmented files in the file queue to obtain the target merged file comprises:
detecting whether the time interval between the write time of the earliest fragmented file in the file queue and the current time is greater than a preset time length; and
if the time interval between the write time of the earliest fragmented file in the file queue and the current time is greater than the preset time length, performing a merging operation on one or more fragmented files in the file queue to obtain the target merged file.
6. The data distribution method for data stream according to claim 4, further comprising:
deleting one or more fragmented files corresponding to the target merged file in the file queue after the target merged file is obtained.
7. The data distribution method of the data stream according to claim 6, wherein deleting one or more fragmented files in the file queue corresponding to the target merged file comprises:
generating corresponding target event information according to the target merged file; and
deleting one or more fragmented files corresponding to the target merged file in the file queue according to the target event information.
8. A data distribution system for a data stream, comprising:
a receiving module, configured to acquire a plurality of data files of a target data stream;
a fragmentation module, configured to perform fragmentation processing on the plurality of data files to obtain a plurality of fragmented files;
a writing module, configured to write each fragmented file into a file queue;
a merging module, configured to perform a merging operation on the fragmented files in the file queue to obtain a target merged file; and
an output module, configured to output the target merged file.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, is adapted to carry out the steps of the data distribution method of a data stream according to any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor for causing the at least one processor to carry out the steps of the data distribution method of a data stream according to any one of claims 1 to 7.
CN202010814942.9A 2020-08-13 2020-08-13 Data distribution method and system for data stream Active CN112019605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010814942.9A CN112019605B (en) 2020-08-13 2020-08-13 Data distribution method and system for data stream

Publications (2)

Publication Number Publication Date
CN112019605A true CN112019605A (en) 2020-12-01
CN112019605B CN112019605B (en) 2023-05-09

Family

ID=73506038

Country Status (1)

Country Link
CN (1) CN112019605B (en)

Also Published As

Publication number Publication date
CN112019605B (en) 2023-05-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant