CN116431063A - Streaming data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116431063A
CN116431063A (application CN202310220485.4A)
Authority
CN
China
Prior art keywords
storage space
data
streaming
streaming data
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310220485.4A
Other languages
Chinese (zh)
Inventor
韩旭东
刘勇成
胡志鹏
袁思思
程龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202310220485.4A priority Critical patent/CN116431063A/en
Publication of CN116431063A publication Critical patent/CN116431063A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/70 Game security or game management aspects
    • A63F13/77 Game security or game management aspects involving data related to game devices or game servers, e.g. configuration data, software version or amount of memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files

Abstract

The application provides a streaming data processing method, a streaming data processing device, an electronic device, and a storage medium, applied in the field of computer technology. The method comprises the following steps: upon arrival of a first checkpoint, detecting the data amount of streaming data stored in a first storage space of a storage system; when the data amount of the streaming data stored in the first storage space is smaller than a data amount threshold, copying the streaming data stored in the first storage space into a second storage space of the storage system; and updating the state of the first storage space to a readable state. The method merges storage spaces without manual operation, thereby reducing the number of small files generated while the storage system writes streaming data and improving the performance of the storage system.

Description

Streaming data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for processing streaming data, an electronic device, and a storage medium.
Background
A game log is a record of the running information of the game system during play. As game logs are generated, they are temporarily stored in a Kafka queue for stream processing. The stream processing engine Flink then periodically reads the game logs from the Kafka queue and writes all game logs read in one period to one file of the Hadoop Distributed File System (HDFS), where they can be queried by downstream applications (e.g., the Hive database).
While Flink is writing game logs to an HDFS file, the file's state is invisible to downstream tools. Only at the end of a cycle (the checkpoint) is the file's state updated to be visible downstream. Therefore, to ensure that downstream tools read the game logs in a timely fashion, a short cycle must be set. However, a short cycle means a small amount of data is written per file, potentially producing a large number of "small files" each holding little data.
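To make the trade-off concrete, here is a back-of-the-envelope sketch; the ingest rate and cycle lengths are illustrative assumptions, not figures from this application. It shows how a short checkpoint cycle yields files well under a 128 MB HDFS block:

```python
# Illustrative only: assumed log ingest rate and checkpoint cycles.
HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # a common HDFS block size

def bytes_per_file(ingest_bytes_per_sec: int, cycle_sec: int) -> int:
    """Amount of data one checkpoint cycle writes into one file."""
    return ingest_bytes_per_sec * cycle_sec

# Suppose game logs arrive at roughly 200 KB/s.
rate = 200 * 1024

# Even a 10-minute cycle fills less than one block: a "small file".
print(bytes_per_file(rate, 600) < HDFS_BLOCK_BYTES)    # True
# Only at about 11 minutes does one file reach a full block.
print(bytes_per_file(rate, 660) >= HDFS_BLOCK_BYTES)   # True
```

With realistic timeliness requirements of seconds to minutes, the per-file data amount shrinks further, which is exactly the small-file problem the application targets.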
Since each file occupies a storage slot and the HDFS must maintain an interface for each file, a large number of "small files" seriously wastes HDFS resources and degrades HDFS performance. Moreover, when reading data, a downstream tool must repeatedly jump from one file to the next, which severely hurts its reading efficiency. How to merge the "small files" in HDFS, and thereby reduce their number, is therefore a problem that urgently needs to be solved.
Disclosure of Invention
In view of this, the present application provides a streaming data processing method, an apparatus, an electronic device, and a storage medium, to solve the prior-art problem that the checkpoint mechanism generates too many small files while writing data, degrading the performance of the distributed file system HDFS.
A first aspect of the embodiments of the present application provides a method for processing streaming data, the method comprising:
processing streaming data by a stream processing engine, and writing the processed streaming data into a storage system according to the checkpoint time interval of the stream processing engine, wherein streaming data written in different checkpoint time intervals is stored in different storage spaces of the storage system;
detecting, when a first checkpoint arrives, the data amount of streaming data stored in a first storage space of the storage system, wherein the first storage space stores the streaming data written by the stream processing engine during a first checkpoint time interval, and the first checkpoint is the end time of the first checkpoint time interval;
copying, when the data amount of the streaming data stored in the first storage space is smaller than a data amount threshold, the streaming data stored in the first storage space into a second storage space of the storage system, wherein the second storage space stores the streaming data written by the stream processing engine during a second checkpoint time interval, and the second checkpoint time interval is the next checkpoint time interval adjacent to the first checkpoint time interval; and
updating the state of the first storage space to a readable state.
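Purely as an illustration of the three claimed steps, not an implementation from the application, the logic might be sketched as follows; the dictionary-based in-memory "storage system" and all names are hypothetical:

```python
THRESHOLD = 4  # data-amount threshold, in arbitrary record units for the sketch

def on_checkpoint(storage, first, second, threshold=THRESHOLD):
    """Run the claimed steps when the first checkpoint arrives.

    `storage` maps a space name to {"data": [...], "readable": bool};
    `first` is the space written during the ending interval, `second`
    the space of the next checkpoint time interval.
    """
    # Step 1: detect the data amount stored in the first storage space.
    amount = len(storage[first]["data"])
    # Step 2: if it is below the threshold, copy it into the second space.
    if amount < threshold:
        storage[second]["data"].extend(storage[first]["data"])
    # Step 3: update the first space to a readable state either way.
    storage[first]["readable"] = True
    return storage

storage = {
    "part-0": {"data": ["log1", "log2"], "readable": False},  # a "small file"
    "part-1": {"data": [], "readable": False},
}
on_checkpoint(storage, "part-0", "part-1")
print(storage["part-1"]["data"])       # ['log1', 'log2']
print(storage["part-0"]["readable"])   # True
```

Note that, as the description below explains, the first space is kept readable rather than deleted immediately, so downstream readers are never left waiting for the next checkpoint.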
A second aspect of the embodiments of the present application provides a stream processing engine, where the stream processing engine processes streaming data and writes the processed streaming data into a storage system according to its checkpoint time interval, wherein streaming data written in different checkpoint time intervals is stored in different storage spaces of the storage system. The stream processing engine includes:
a detection unit configured to detect, when a first checkpoint arrives, the data amount of streaming data stored in a first storage space of the storage system, wherein the first storage space stores the streaming data written by the stream processing engine during a first checkpoint time interval, and the first checkpoint is the end time of the first checkpoint time interval; and
a processing unit configured to copy, when the data amount of the streaming data stored in the first storage space is smaller than a data amount threshold, the streaming data stored in the first storage space into a second storage space of the storage system, wherein the second storage space stores the streaming data written by the stream processing engine during a second checkpoint time interval, and the second checkpoint time interval is the next checkpoint time interval adjacent to the first checkpoint time interval;
the processing unit being further configured to update the state of the first storage space to a readable state.
A third aspect of the embodiments of the present application provides an electronic device, including a processor and a memory, wherein:
the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions to cause the electronic device to perform the streaming data processing method of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the streaming data processing method of the first aspect.
According to the technical solution provided by the embodiments of the present application, when the current checkpoint arrives, the method checks whether the data amount of the streaming data stored in the storage space corresponding to that checkpoint time interval is smaller than the data amount threshold; if so, the streaming data in that storage space is copied into the storage space corresponding to the next checkpoint time interval. The streaming data of the two time intervals is thus stored together in the storage space of the next checkpoint time interval, merging the storage spaces and reducing the number of small files in the storage system. Meanwhile, the "small file" is temporarily retained and its state is updated to readable at the checkpoint, so that a downstream application can read its data during the next checkpoint time interval without data-reading delay. After the next checkpoint arrives and the writing process of the storage space corresponding to the next checkpoint time interval has finished, the "small file" is deleted. If the data amount of the streaming data stored in the storage space corresponding to the current checkpoint time interval is greater than or equal to the data amount threshold, enough data was written in that checkpoint time interval: the storage space is a "large file" that does not burden the management of the storage system, so it is simply retained. In this way, the storage system automatically merges small files during streaming data writing, greatly reducing their number and improving the management performance of the storage system.
Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a system architecture diagram of a streaming data storage system according to an embodiment of the present application;
fig. 2 is a flow chart of a method for processing streaming data according to an embodiment of the present application;
fig. 3 is a flow chart of another method for processing streaming data according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a stream processing engine according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In view of this, the present application provides a streaming data processing method, an apparatus, an electronic device, and a storage medium, to solve the prior-art problem that the checkpoint mechanism generates too many small files while writing data, degrading the performance of the distributed file system HDFS.
To enable those skilled in the art to better understand the technical solutions of the present application, the solutions are described below clearly and completely with reference to the accompanying drawings of the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application; the application is not limited to the particular embodiments disclosed, but covers all embodiments falling within the scope of the appended claims.
It should be noted that the terms "first," "second," "third," and the like in the claims, specification, and drawings are used to distinguish similar objects and do not necessarily describe a particular sequence or chronological order. Data so labeled may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprises," "comprising," and their variants are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to it.
First, the technical background of the present application will be described:
the game log is a game record and is used for recording the running information of the game system in the game process. During the running of the game, the game log is continuously generated, so that the game log can be managed using kafka which processes stream data, and the game log is temporarily stored in the queue. The processing engine flank may then need to read the game logs of the kafka queue, write the game logs of the kafka queue to the distributed file system HDFS for query by downstream applications.
Flink writes game logs into the HDFS as follows: Flink periodically reads the newly generated game logs in Kafka and writes all game logs read in one cycle to one file of the distributed file system HDFS. While Flink is writing game logs to an HDFS file, that file's state is invisible to downstream tools. Flink therefore writes data cycle by cycle, placing the game logs generated in different cycles into different files: at each interval, the newly generated game logs are written to a separate file, and once the cycle ends Flink stops writing to that file and writes to a new one. In this way the game logs become queryable at the end of each cycle. It will be appreciated that, to ensure timeliness of downstream reads of the game logs, a short cycle may be required; however, a short cycle means a small amount of data is written to each file, potentially producing a large number of "small files" each holding little data.
Since each file occupies a storage slot and the HDFS must maintain an interface for each file, a large number of "small files" seriously wastes HDFS resources and degrades HDFS performance; and when reading data, a downstream tool must repeatedly jump from one file to the next, which severely hurts its reading efficiency. If multiple "small files" in the HDFS could be merged, HDFS performance would improve. In the prior art, a common merging method is for a system administrator to manually call the relevant API to merge files once a partition has been fully written; for example, in a log directory partitioned by date, when no more files are being written to a partition, a command is invoked at a set time to merge the small files in that partition. This method requires manual merging and can only run after all files in a partition have been fully written, so merging is untimely and extremely inefficient. How to efficiently merge the "small files" in HDFS, and thereby reduce their number, is therefore a problem that urgently needs to be solved.
In view of these technical problems, the present application provides a streaming data processing method, a streaming data processing device, an electronic device, and a storage medium. In the embodiments of the present application, when a checkpoint arrives (one cycle ends), the method checks whether the data amount of the streaming data stored in the storage space (file) corresponding to that cycle is smaller than the data amount threshold. If so, the storage space is a "small file," and its streaming data is copied into the storage space corresponding to the next cycle, so that the streaming data generated in the two cycles is stored together in the next cycle's storage space. The merging of small files is thus completed effectively and in time, reducing their number in the storage system. The method, device, electronic device, and computer-readable storage medium of the present application are described in further detail below with reference to specific embodiments and the accompanying drawings.
The network architecture in the embodiments of the present application is briefly described below. Fig. 1 is a system architecture diagram of a streaming data storage system according to an embodiment of the present application, where, as shown in fig. 1, the system includes a Kafka queue, a processing engine Flink, a distributed file system HDFS, and a downstream application.
Kafka is a prominent piece of message middleware and is essentially a data storage platform. As open-source software, it has the advantages of supporting multi-language application development, real-time large-scale message processing, and more. The Kafka system has three main roles, producer, consumer, and broker, and a Kafka cluster is composed of multiple broker nodes. When a producer generates a new data message, the message is assigned to a topic. A topic may be split into multiple partitions, which can be deployed across multiple brokers. Producers continuously send data messages to a topic; the broker nodes temporarily store the data messages of the different topics and forward them to consumers, which finally use and process them.
As the volume of business data keeps growing, business requirements become more diverse. In today's big data environment, many scenarios use the Kafka system as a data message queue to relay data messages: the brokers in Kafka receive the data messages sent by multiple parties, and after data processing, the data lands on the distributed file system HDFS, where downstream applications query it directly.
In this system, the game system acts as the producer, continuously generating game logs and sending them to the Kafka queue for temporary storage. The stream processing engine Flink then reads the game logs from the Kafka queue and writes them to files in the distributed file system HDFS. It can be understood that game logs are written cycle by cycle: Flink periodically reads the newly added game logs in the Kafka queue, the game logs generated in each cycle are stored separately in one HDFS file, and when the cycle ends, Flink stops writing game logs to that cycle's file and continues writing to a newly created file. After the cycle ends, the file corresponding to the cycle is promptly updated to a readable state and becomes visible to downstream applications, so that downstream applications can conveniently and timely query the game logs in the file.
Based on the above network architecture, fig. 2 is a flow chart of a method for processing streaming data according to an embodiment of the present application. As shown in fig. 2, the method may include the steps of:
201. Upon arrival of a first checkpoint, detect the data amount of streaming data stored in a first storage space of a storage system.
The stream processing engine processes the streaming data, for example by data cleansing, data integration, and data transformation. Specifically, data cleansing includes filling in damaged values in part of the streaming data, deleting "outlier" data in the streaming data, correcting problem data in the streaming data, and the like. Data integration combines multiple pieces of streaming data according to certain rules to obtain merged data. Data transformation converts the format of the received streaming data so that it meets the data-type requirements of downstream applications. After the stream processing engine processes the received streaming data, the processed data is stored in the storage system for query or invocation by downstream applications.
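As a toy illustration of these processing stages, the record format and cleansing rules below are invented for the example; real cleansing logic is application-specific:

```python
def clean(records):
    """Toy data cleansing: fill damaged (None) values, drop 'outliers'."""
    filled = [r if r is not None else 0 for r in records]  # fill damaged values
    return [r for r in filled if 0 <= r <= 1000]           # drop outlier data

def transform(records):
    """Toy data transformation: convert to a downstream string format."""
    return [f"value={r}" for r in records]

# Cleansing then transformation, as a downstream application might require.
processed = transform(clean([5, None, 99999, 42]))
print(processed)  # ['value=5', 'value=0', 'value=42']
```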
When processing streaming data, the stream processing engine stores the streaming data according to checkpoints. Specifically, the stream processing engine periodically writes streaming data to the storage system, and streaming data generated during different checkpoint time intervals is written to different storage spaces of the storage system. That is, when a checkpoint arrives, the storage space corresponding to the previous checkpoint time interval stops receiving data, and the streaming data newly generated from that checkpoint onward is written to a new storage space.
Illustratively, in the system architecture depicted in fig. 1, the stream processing engine Flink reads the streaming data (such as game log data) in the Kafka queue and writes it to files (storage spaces) in the distributed file system HDFS. It will be appreciated that Flink writes streaming data to the HDFS periodically: the streaming data read in each cycle is written to its own HDFS file, and when the end time of the cycle (the checkpoint) arrives, Flink stops writing streaming data to the current file and writes the newly added streaming data to a new file. At the checkpoint, the HDFS updates the current file's state to readable so that downstream applications can query and invoke it. It will be appreciated that a file is unreadable while its data is being written and becomes visible to downstream applications only after writing has ended.
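This per-cycle rollover can be simulated in a few lines; the class below is a hypothetical in-memory stand-in, since an actual pipeline would rely on Flink's file sink and HDFS, whose APIs are not shown here:

```python
class CheckpointWriter:
    """Writes each checkpoint interval's records to its own 'file'."""

    def __init__(self):
        self.files = []  # each file: {"data": [...], "readable": bool}
        self._roll()

    def _roll(self):
        # Open a fresh file for the new interval; it starts unreadable.
        self.files.append({"data": [], "readable": False})

    def write(self, record):
        self.files[-1]["data"].append(record)

    def on_checkpoint(self):
        # The current file stops receiving data and becomes readable;
        # subsequent records go to a newly created file.
        self.files[-1]["readable"] = True
        self._roll()

w = CheckpointWriter()
w.write("log-a")
w.write("log-b")
w.on_checkpoint()   # end of interval 1: file becomes readable
w.write("log-c")    # interval 2 writes to a new, unreadable file
print([f["readable"] for f in w.files])  # [True, False]
print(w.files[0]["data"])                # ['log-a', 'log-b']
```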
In this step, the first storage space stores the streaming data written to the storage system by the stream processing engine during the current checkpoint time interval, and the first checkpoint is the end time of that interval. When the first checkpoint arrives, the stream processing engine stops writing streaming data to the first storage space and begins writing data to the second storage space corresponding to the next checkpoint interval. At this point, the amount of data written to the first storage space during the current checkpoint time interval must be detected, and the first storage space is judged to be a "large file" or a "small file" according to the amount of streaming data written. It will be appreciated that if the first storage space is a "large file," it does not burden the file management of the storage system; if it is a "small file," the storage spaces need to be merged.
202. Judge whether the data amount of the streaming data stored in the first storage space is smaller than the data amount threshold. If yes, perform step 203; if not, perform step 205.
After the data amount of the streaming data stored in the first storage space is obtained, it is compared with a preset data amount threshold. If the data amount stored in the first storage space is greater than or equal to the threshold, the first storage space is considered a "large file"; if it is smaller than the threshold, the first storage space is considered a "small file." Since too many small files seriously hinder cluster scaling, small files require subsequent merging, while large files are stored in the storage system on their own.
Illustratively, in the system architecture depicted in FIG. 1, the storage system is the distributed file system HDFS. The data amount threshold may be the amount of data a data block on the HDFS can store, commonly 128 MB; that is, each file (storage space) in the HDFS should store at least one block's worth of data to prevent too many "small files" from being generated. Therefore, if a file stores less than 128 MB of data after its read-write cycle ends, it is judged a "small file" and must be merged; if it stores 128 MB or more, it is stored normally. It will be appreciated that the amount of data Flink writes in a cycle depends on the rate at which the file data is generated: if the data is generated quickly, Flink writes a large amount, and if slowly, a small amount. It is therefore preferable to plan the data amount threshold rationally according to the data amounts of the file data stored in history files.
For example, the data amounts of multiple history files in the HDFS may be obtained, the average of those data amounts computed, and the average used as the data amount threshold. Alternatively, the data stored in the history files may first be deduplicated, the data amounts of the deduplicated history files obtained, and the average of those amounts used as the threshold. A constraint may also be added in the form of a minimum data amount: when the computed average exceeds the minimum data amount, the average is used as the data amount threshold; when the average is smaller than the minimum data amount, the minimum is used instead. It will be appreciated that the data amount threshold can be set flexibly according to storage requirements and is not specifically limited here.
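A minimal sketch of the constrained-average variant described above; the function name and the choice of an integer mean are assumptions for illustration:

```python
def data_amount_threshold(history_sizes, minimum=128 * 1024 * 1024):
    """Plan the data amount threshold from historical file sizes.

    Uses the mean size of past files, but never less than `minimum`
    (e.g. one 128 MB HDFS block), mirroring the minimum-data-amount
    constraint described above.
    """
    if not history_sizes:
        return minimum
    average = sum(history_sizes) // len(history_sizes)
    return max(average, minimum)

mb = 1024 * 1024
# Large history files: the 200 MB average exceeds the floor and wins.
print(data_amount_threshold([150 * mb, 250 * mb]) == 200 * mb)  # True
# Small history files: the 128 MB floor wins over the 20 MB average.
print(data_amount_threshold([10 * mb, 30 * mb]) == 128 * mb)    # True
```

The deduplicated variant would differ only in how `history_sizes` is computed before calling the function.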
203. Copy the streaming data stored in the first storage space into the second storage space.
If the data amount stored in the first storage space is smaller than the data amount threshold, the streaming data stored in the first storage space is copied into the second storage space, which corresponds to the next checkpoint time interval. After the checkpoint arrives, the stream processing engine begins writing the newly added streaming data to the second storage space, so the second storage space contains both the streaming data written during the previous checkpoint interval (copied from the first storage space) and the streaming data written during the following interval. This is equivalent to merging the storage spaces of two consecutive checkpoint time intervals: the data written in the two intervals is merged into one storage space (the second storage space), reducing the probability of generating small files.
204. Update the state of the first storage space to a readable state.
After the copy, the first storage space must still be retained, because the second storage space is still being written during the second checkpoint interval and is unreadable during that interval. The first storage space's data replicated into the second storage space therefore cannot be queried by downstream applications within the second checkpoint time interval. If the first storage space were deleted immediately after the first checkpoint arrives, downstream applications could read its streaming data from the second storage space only after the second checkpoint arrives, causing untimely data reading. Therefore, after one cycle ends, even though the data in that cycle's storage space has been copied and backed up into the next cycle's storage space, the cycle's own storage space is retained first, and its state is updated to readable so that downstream applications can query the data in it in a timely and effective manner.
205. Update the state of the first storage space to a readable state.
If the data amount stored in the first storage space is greater than or equal to the data amount threshold, enough data is considered to be stored in the first storage space and no merging is needed. Since a large file does not burden the management of the storage system, the state of the first storage space is directly updated to readable and the first storage space is kept in the storage system.
206. Write the newly added streaming data into the second storage space.
It will be appreciated that once the data stored in the first storage space has been copied to the second storage space, the streaming processing engine writes newly added streaming data to the second storage space when the next checkpoint interval begins; that is, the engine stops writing to the first storage space, whose state has changed, and starts writing to the second storage space. The second storage space then contains both the streaming data written during the previous checkpoint interval and the streaming data written during the current checkpoint interval. When the second checkpoint arrives, writing to the second storage space also stops. At this point, the state of the second storage space is changed to readable, and the second storage space contains the streaming data of the previous checkpoint interval (the streaming data originally written into the first storage space) together with the streaming data written during the next checkpoint interval. Downstream applications can then obtain the data of the first storage space by querying the second storage space, so the first storage space can be deleted. This reduces the number of small files in the storage system and improves the file-management performance of the storage space. When downstream applications query data, they no longer need to jump frequently between different storage spaces, so query efficiency is improved while read timeliness is preserved.
It will be appreciated that if, after writing to the second storage space stops, the total amount of streaming data it holds is still smaller than the data-amount threshold — that is, the second storage space, which now stores streaming data from multiple periods, is still a "small file" — all of its data is again copied to the storage space corresponding to the next checkpoint interval and merged once more. After that checkpoint interval ends, the state of the corresponding storage space is updated to readable and the second storage space is deleted. This continues until the storage space of some checkpoint interval has become a large file once data writing finishes; that large file is retained, and when the next checkpoint arrives no copying is performed — streaming data is simply written into a new storage space.
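The repeated merge-forward behavior just described can be simulated end to end. This is a hedged toy model: the function name, the record representation (one list element per unit of data), and the threshold of 10 are all illustrative assumptions.

```python
THRESHOLD = 10  # toy data-amount threshold (units of records, not bytes)

def run_intervals(writes_per_interval):
    """Simulate the merge-forward rule over successive checkpoint intervals.

    Each element of writes_per_interval is the list of records written during
    one interval. Returns the files that remain in the 'storage system'.
    """
    kept = []
    carried = []  # records carried forward from previous "small file" spaces
    for writes in writes_per_interval:
        space = carried + writes  # carried-over data plus newly written records
        if len(space) < THRESHOLD:
            carried = space       # still a small file: copy forward; old space
                                  # is deleted once the merged space is readable
        else:
            kept.append(space)    # large file: retain it, stop carrying
            carried = []
    if carried:
        kept.append(carried)      # any leftover small file survives at end of stream
    return kept

files = run_intervals([[1]*3, [2]*4, [3]*5, [4]*2])
print([len(f) for f in files])  # → [12, 2]
```

The first three intervals (3 + 4 + 5 records) accumulate into one 12-record large file; the final 2-record interval remains a small file only because the stream ends there.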
According to the technical solution provided by the embodiments of the present application, when the current checkpoint arrives, the engine checks whether the amount of streaming data stored in the storage space corresponding to that checkpoint interval is smaller than the data-amount threshold; if so, the streaming data in that storage space is copied into the storage space corresponding to the next checkpoint interval. The streaming data of both intervals is then stored in the storage space of the next checkpoint interval, so the storage spaces are merged and the number of small files in the storage system is reduced. Meanwhile, the "small file" must be retained temporarily, and its state is updated to readable at the checkpoint, so that downstream applications can read its data during the next checkpoint interval without read delay. After the next checkpoint arrives and the streaming-data write to the next interval's storage space finishes, the small file is deleted. If the amount of streaming data stored in the storage space of the current checkpoint interval is greater than or equal to the data-amount threshold, enough data was written during the interval and the storage space is already a large file, which does not burden the management of the storage system; that storage space is simply retained. In this way, the storage system automatically merges small files while streaming data is being written, the number of small files in the storage system is greatly reduced, and the management performance of the storage system is improved.
On the basis of the foregoing embodiments, fig. 3 is a flowchart of another streaming data processing method according to the embodiments of the present application, which describes the full set of steps that may be involved. As shown in fig. 3, the streaming data processing method may include the following steps:
301. Upon arrival of the first checkpoint, determine the target data in the streaming data stored in the first storage space.
It will be appreciated that, to conserve the storage resources of the storage system, the copy step cannot be performed an unlimited number of times. In some cases, streaming data is generated slowly and the amount generated is small, so even after several copies the storage space holding the streaming data of multiple periods remains a "small file". The merging procedure for the storage space must then be stopped: copying ceases and the "small file" is stored in the storage system as-is. If the management performance of the storage system needs to be improved later, the storage spaces can be merged again with other strategies once all streaming data has been written, so as to eliminate the small files.
When the first checkpoint arrives, it is first determined whether the first storage space, which corresponds to the first checkpoint interval, contains copied target data, and the state of that target data determines whether all data of the first storage space should be copied to the second storage space of the next cycle. It will be appreciated that the target data refers to streaming data in the first storage space that was copied from a history storage space corresponding to a history checkpoint interval.
302. Judge whether the state of the target data is a preset state. If yes, go to step 303; if not, go to step 304.
When the state of the target data is the preset state, the streaming data stored in the first storage space is not copied to the second storage space; even if the first storage space is a small file, it is stored directly in the storage system. The process of deciding, based on the target data, whether to copy the streaming data of the first storage space to the second storage space is described below for several cases.
Case one:
For example, the data storage record of the target data in the first storage space may be queried to obtain the copy counts of the target data. The highest of these copy counts is then determined; if the highest copy count has reached a first preset number, the streaming data of the first storage space is not copied to the second storage space corresponding to the next checkpoint interval.
It is appreciated that the target data in the first storage space may come from several history storage spaces. For example, suppose storage space a corresponds to period A, storage space b corresponds to period B, period A is the checkpoint interval preceding period B, and period B is the checkpoint interval preceding the first checkpoint interval. After period A ends, storage space a is a "small file", so the streaming data in storage space a is copied to storage space b. When period B ends, storage space b — now holding the data of a plus the streaming data newly written during period B — is still a "small file". The streaming data in storage space b is therefore copied to the first storage space, which then holds the data of a, the data written during period B, and the data written during the first checkpoint interval. The target data in the first storage space is the data of a and the data written during period B: the data of a has been copied 2 times, and the data written during period B has been copied 1 time, so the highest copy count of the target data is 2.
If the highest copy count of the target data in the first storage space has reached the first preset number, this indicates that after several read-write cycles (several checkpoint intervals), the merged storage space is still a "small file". The merging process is terminated, and the first storage space is retained directly in the storage system.
Case two:
For example, the data storage record of the target data in the first storage space may be queried to obtain the copy counts of the target data. The average copy count of the target data is then determined; if the average copy count has reached a second preset number, the streaming data stored in the first storage space is not copied to the second storage space corresponding to the next checkpoint interval.
Similarly, the target data in the first storage space may come from several history storage spaces. Using the same example as in case one, the target data in the first storage space is the data of a and the streaming file written during period B. The data of a has been copied 2 times and the streaming file written during period B has been copied 1 time, so the average copy count of the target data is 1.5.
If the average copy count of the target data in the first storage space has reached the second preset number, this likewise indicates that after several cycles the merged storage space is still a "small file", and that the original streaming data has lost its timeliness. The merging process is terminated, and the small file is retained directly in the storage system.
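Case two differs from case one only in the statistic used. A hedged sketch, with the same assumed record structure and a hypothetical limit of 2.0:

```python
AVG_LIMIT = 2.0  # hypothetical second preset number

def should_merge_by_avg_copies(storage_record):
    """True if merging may proceed; False once the average copy count hits the limit."""
    if not storage_record:
        return True  # no target data yet
    avg = sum(storage_record.values()) / len(storage_record)
    return avg < AVG_LIMIT

record = {"data_from_a": 2, "data_from_B": 1}   # average 1.5, as in the example above
print(should_merge_by_avg_copies(record))        # True: 1.5 < 2.0
print(should_merge_by_avg_copies({"data_from_a": 3, "data_from_B": 2}))  # False: avg 2.5
```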
Case three:
For example, the data storage record of the target data in the first storage space may also be queried to obtain the initial write time of the target data. The written duration of the target data is then determined from the initial write time and the current time; if the written duration has reached a preset duration, the streaming data of the first storage space is not copied to the second storage space corresponding to the next checkpoint interval.
Similarly, the target data in the first storage space may come from several history storage spaces. Using the same example as in case one, the target data in the first storage space is the data of a and the streaming data written during period B, and the write time of the data of a is the initial write time. Suppose the query shows that the data of a was written at 20:30 and the first checkpoint arrives at 20:40; the data of a has then been written for 10 minutes.
If the written duration of the data of a reaches the preset duration, the storage space merged over several rounds is still a small file, and the earliest data has low timeliness. The merging process must be terminated, and the first storage space — still a "small file" — is retained directly in the storage system.
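Case three compares the age of the oldest target data against a preset duration. A minimal sketch using the 20:30 / 20:40 example from the text; the 30-minute limit is an assumption.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(minutes=30)  # hypothetical preset duration

def should_merge_by_age(initial_write_time, checkpoint_time):
    """True if merging may proceed; False once the oldest target data is too old."""
    return checkpoint_time - initial_write_time < MAX_AGE

written = datetime(2023, 3, 1, 20, 30)      # initial write time of the data of a
checkpoint = datetime(2023, 3, 1, 20, 40)   # first checkpoint: written for 10 minutes
print(should_merge_by_age(written, checkpoint))  # True: 10 min < 30 min
late = datetime(2023, 3, 1, 21, 15)
print(should_merge_by_age(written, late))        # False: 45 min, terminate merging
```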
303. The state of the first storage space is updated to a readable state.
After the merging process is terminated, the first storage space is saved directly in the storage system even though it is still a small file, and the data it contains is no longer copied. Its state is updated directly to readable, so that downstream applications can access its data. It will be appreciated that once the next checkpoint interval begins, newly written streaming data is saved directly in a new storage space and is no longer associated with the first storage space.
304. Detect the data amount of the streaming data stored in the first storage space of the storage system.
If the state of the target data is not the preset state — that is, the data stored in the first storage space is new data with a short written duration — the storage-space merging step is still needed, to avoid storing too many small files in the storage system. For example, the data amount of the streaming data stored in the first storage space is obtained, and whether the first storage space needs to be merged with the storage space of the next checkpoint interval is determined from that data amount.
305. Judge whether the data amount of the streaming data stored in the first storage space is smaller than the data-amount threshold. If yes, go to step 306; if not, go to step 310.
After the data amount of the data stored in the first storage space is obtained, it is compared with a preset data-amount threshold. If the data amount is greater than or equal to the threshold, the first storage space is considered a "large file"; if it is smaller, the first storage space is considered a "small file". It will be appreciated that a "large file" stores enough data to be stored independently in the storage system, whereas "small files" hinder cluster scaling; so if the first storage space is a "small file", it must be merged with a subsequent storage space to reduce the number of "small files" in the storage system.
Illustratively, in the system architecture shown in fig. 1, the data-amount threshold may be the amount of data that a data block on HDFS can store, typically 128 MB. That is, a file in HDFS should occupy at least one full block, to prevent the generation of excessive "small files". If the amount of data stored in a file when its read-write cycle ends is less than 128 MB, the file is judged a "small file" and must be merged with the file of the next read-write cycle; if it is 128 MB or more, the file is stored normally.
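The block-size rule above reduces to one comparison. A minimal sketch, assuming the 128 MB default HDFS block size that the text cites (real clusters may configure a different block size):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # one HDFS block, per the example in the text

def is_small_file(size_bytes, threshold=BLOCK_SIZE):
    """A file smaller than one block is a 'small file' and should be merged forward."""
    return size_bytes < threshold

print(is_small_file(5 * 1024 * 1024))    # True: a 5 MB file should be merged
print(is_small_file(200 * 1024 * 1024))  # False: stored normally
```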
306. Copy the streaming data stored in the first storage space into the second storage space.
Concretely, before the next checkpoint interval begins receiving streaming data, the streaming data in the first storage space (corresponding to the previous checkpoint interval) is copied into the second storage space (corresponding to the next checkpoint interval), and the newly written streaming data is then stored in the second storage space. The second storage space thus contains the data written during the previous checkpoint interval together with the data newly written during the next checkpoint interval. In effect, the storage spaces corresponding to two consecutive checkpoint intervals are merged: the streaming data written during both intervals ends up in a single storage space (the second storage space). The probability of generating small files is thus reduced.
307. The state of the first storage space is updated to a readable state.
After the data of the first storage space is copied, the first storage space must still be retained for the next checkpoint interval, because data is being written to the second storage space during that interval and the second storage space is unreadable while writing is in progress. The copy of the first storage space's data held in the second storage space is therefore invisible to downstream applications for the duration of the next checkpoint interval. Downstream applications should be able to obtain the data of the first storage space as soon as the previous checkpoint interval ends; if the first storage space were deleted immediately, they could read its data only after the next checkpoint interval ends, which would delay data reads. Therefore, when a checkpoint arrives, the storage space corresponding to the interval that just ended is retained even though its data has been copied and backed up into the storage space of the next interval, and its state is updated to readable so that downstream applications can query it promptly and effectively during the next checkpoint interval.
308. The streaming data is written to the second storage space within a second checkpoint time interval.
It will be appreciated that after the next checkpoint interval begins, the streaming processing engine writes the newly added streaming data into the second storage space; that is, it stops writing streaming data to the first storage space, whose state has changed, and starts writing data to the second storage space. At this time, the second storage space contains the data written during the first checkpoint interval as well as the data written during the second checkpoint interval. The data-writing process for the second storage space stops when the second checkpoint arrives.
309. When the second checkpoint arrives, the first storage space is deleted.
After the second checkpoint interval ends, the state of the second storage space is changed to readable, and the data it contains — the data corresponding to the first checkpoint interval (the streaming data of the first storage space) and the data written during the second checkpoint interval — becomes visible to downstream applications. Since downstream applications can obtain the data of the first storage space by querying the second storage space, the first storage space can be deleted. Only the second storage space then needs to be kept in the storage system, which reduces the number of small files in the storage system and improves its management performance. When downstream applications query data, they no longer need to jump frequently between different storage spaces, so query efficiency is improved.
It will be appreciated that when the second checkpoint interval ends, the data amount of the second storage space must still be evaluated. If its total data amount is still smaller than the data-amount threshold — that is, the second storage space, now holding the data of multiple periods, is still a "small file" — all of its data is copied to the storage space corresponding to the next checkpoint interval and merged again. After that checkpoint interval ends, the state of its storage space is updated to readable and the second storage space is deleted. This continues until some storage space becomes a large file once data writing finishes; that "large file" is retained, and the data-copying process is no longer performed.
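Steps 306–309 can be walked through concretely. This toy sketch is an assumption-laden illustration (dictionary-backed "spaces", a threshold of 10 records), not the patent's implementation; the step numbers in the comments map to the flowchart above.

```python
spaces = {}  # name -> {"records": [...], "readable": bool}

def new_space(name):
    spaces[name] = {"records": [], "readable": False}

def end_interval(cur, nxt, threshold=10):
    """Run at a checkpoint: cur has just finished writing, nxt is about to start."""
    if len(spaces[cur]["records"]) < threshold:
        spaces[nxt]["records"].extend(spaces[cur]["records"])  # step 306: copy forward
    spaces[cur]["readable"] = True                              # step 307 / 310

def on_next_checkpoint(old, merged):
    """Second checkpoint: expose the merged space, then drop the old copy."""
    spaces[merged]["readable"] = True
    del spaces[old]                                             # step 309: delete

new_space("first"); new_space("second")
spaces["first"]["records"] = ["x", "y"]      # a "small file" (2 < 10)
end_interval("first", "second")
spaces["second"]["records"].append("z")      # newly written during the second interval
on_next_checkpoint("first", "second")
print(sorted(spaces), spaces["second"]["records"])  # ['second'] ['x', 'y', 'z']
```

Note the ordering: the first space is deleted only after the merged second space has become readable, so downstream applications never lose access to the data.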
310. The state of the first storage space is updated to a readable state.
If the data amount of the data stored in the first storage space is greater than or equal to the data-amount threshold, the first storage space is considered a "large file": it stores enough data that no merging is necessary. A large file does not burden the management of the storage system, so the state of the first storage space is updated directly to readable, and the first storage space is saved in the HDFS.
311. The streaming data is written to the second storage space within a second checkpoint time interval.
Finally, after the next cycle begins, the streaming processing engine writes the newly added streaming data into the new second storage space; that is, it stops writing streaming data to the first storage space, whose state has changed, and starts writing data to the second storage space.
In the embodiments of the present application, whether to perform the storage-space merging step is determined from the write state of the data in a storage space. If the data in a storage space has already been copied many times or has been written for a long time, it is not copied again even if the storage space is still a "small file"; instead, the "small file" is stored in the storage system promptly, which saves the storage resources of the storage system. If the data in the storage space has not been copied, or its written duration is short, file merging proceeds according to the amount of data stored in the storage space, reducing the number of small files in the storage system. Management efficiency is thereby improved while timely data reading is guaranteed: during data writing, the storage system automatically merges small files, the number of small files in the storage system is greatly reduced, and the file-management performance of the storage system is improved.
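The three termination checks from fig. 3 (steps 301–304) combine naturally into one decision. This is a hedged sketch: all limits and the argument shapes are illustrative assumptions, and a real engine would read them from its data storage records.

```python
def should_stop_merging(copy_counts, age_minutes,
                        max_copies=3, avg_limit=2.0, max_age_minutes=30):
    """Return True if the small file should be kept as-is (merging terminated)."""
    if not copy_counts:
        return False  # no carried-forward target data: merging may proceed
    if max(copy_counts) >= max_copies:
        return True   # case one: highest copy count reached the first preset number
    if sum(copy_counts) / len(copy_counts) >= avg_limit:
        return True   # case two: average copy count reached the second preset number
    if age_minutes >= max_age_minutes:
        return True   # case three: oldest target data written too long ago
    return False

print(should_stop_merging([2, 1], 10))  # False: every limit still respected
print(should_stop_merging([3, 1], 10))  # True: highest copy count hits the limit
print(should_stop_merging([1, 1], 45))  # True: written duration hits the limit
```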
On the basis of the above method embodiment, fig. 4 is a schematic structural diagram of a streaming processing engine according to the embodiment of the present application, and as shown in fig. 4, the streaming processing engine processes streaming data, and writes the processed streaming data into a storage system according to a checkpoint time interval of the streaming processing engine. Wherein the streaming data written in different checkpoint time intervals are stored in different storage spaces of the storage system, the streaming processing engine comprising:
the detecting unit 401 is configured to detect, when the first checkpoint arrives, a data amount of streaming data stored in a first storage space in the storage system. The first storage space is used for storing streaming data written by the streaming processing engine in a first check point time interval. The first checkpoint is the end time of the first checkpoint time interval.
The processing unit 402 is configured to copy the streaming data stored in the first storage space to the second storage space of the storage system when the data amount of the streaming data stored in the first storage space is smaller than the data amount threshold. The second storage space is used for storing streaming data written by the streaming processing engine in a second check point time interval. The second checkpoint time interval is the next checkpoint time interval adjacent to the first checkpoint time interval.
The processing unit 402 is further configured to update the state of the first storage space to a readable state.
In an alternative embodiment, the processing unit 402 is further configured to delete the first storage space when the second checkpoint arrives. The second checkpoint is the end time of the second checkpoint time interval. The state of the second storage space is updated to a readable state.
In an alternative embodiment, the streaming engine further comprises a storage unit 403.
The processing unit 402 is further configured to directly update the state of the first storage space to a readable state when the data amount of the streaming data stored in the first storage space is greater than or equal to the data amount threshold.
The storage unit 403 is further configured to store, from the first checkpoint, streaming data newly written by the streaming processing engine in the second storage space.
In an alternative embodiment, the processing unit 402 is further configured to reserve the first storage space when the second checkpoint arrives. The second checkpoint is the end time of the second checkpoint time interval. The state of the second storage space is updated to a readable state.
In an alternative embodiment, the streaming engine further comprises a determining unit 404.
A determining unit 404, configured to determine, when the first checkpoint arrives, target data in the streaming data stored in the first storage space. The target data is streaming data copied from the history storage space corresponding to the history checkpoint time interval.
The processing unit 402 is further configured to not copy the streaming data in the first storage space to the second storage space when the state of the target data is a preset state.
In an alternative embodiment, the processing unit 402 is specifically configured to query the data storage record, and obtain the highest replication number corresponding to the target data. When the highest copying times corresponding to the target data reach the first preset times, not copying all streaming data stored in the first storage space to the second storage space.
In an alternative embodiment, the processing unit 402 is specifically configured to query the data storage record, and obtain the average copy number corresponding to the target data. When the average replication times corresponding to the target data reach the second preset times, all the streaming data stored in the first storage space are not replicated to the second storage space.
In an alternative embodiment, the processing unit 402 is specifically configured to query the data storage record, and obtain the initial writing time corresponding to the target data.
The determining unit 404 is further configured to determine a written duration corresponding to the target data according to the initial writing time and the first checkpoint.
The processing unit 402 is specifically configured to, when the writing time reaches the preset time period, not copy all the streaming data stored in the first storage space to the second storage space.
In an alternative embodiment, the processing unit 402 is further configured to directly update the state of the first storage space to a readable state at the first checkpoint, and store the streaming data newly written by the streaming processing engine in the second storage space.
In an alternative embodiment, the determining unit 404 is further configured to obtain data amounts corresponding to a plurality of history storage spaces stored in the storage system. And determining an average value of the data quantity according to the data quantity corresponding to the plurality of historical storage spaces. And determining a data quantity threshold according to the data quantity average value.
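One way to derive the data-amount threshold from history storage spaces, as the determining unit 404 describes: average the sizes of previously stored spaces and use the mean as the small-file cutoff. The function name and the plain arithmetic mean are assumptions consistent with, but not mandated by, the text.

```python
def threshold_from_history(history_sizes):
    """Use the mean size of previously stored spaces as the data-amount threshold."""
    if not history_sizes:
        raise ValueError("need at least one history storage space")
    return sum(history_sizes) / len(history_sizes)

print(threshold_from_history([100, 140, 120]))  # 120.0
```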
According to the technical solution provided by the embodiments of the present application, when the current checkpoint arrives, the engine checks whether the amount of streaming data stored in the storage space corresponding to that checkpoint interval is smaller than the data-amount threshold; if so, the streaming data in that storage space is copied into the storage space corresponding to the next checkpoint interval. The streaming data of both intervals is then stored in the storage space of the next checkpoint interval, so the storage spaces are merged and the number of small files in the storage system is reduced. Meanwhile, the "small file" must be retained temporarily, and its state is updated to readable at the checkpoint, so that downstream applications can read its data during the next checkpoint interval without read delay. After the next checkpoint arrives and the streaming-data write to the next interval's storage space finishes, the small file is deleted. If the amount of streaming data stored in the storage space of the current checkpoint interval is greater than or equal to the data-amount threshold, enough data was written during the interval and the storage space is already a large file, which does not burden the management of the storage system; that storage space is simply retained. In this way, the storage system automatically merges small files while streaming data is being written, the number of small files in the storage system is greatly reduced, and the management performance of the storage system is improved.
It should be noted that the division of the above apparatus into modules is merely a division of logical functions; in actual implementation, the modules may be fully or partially integrated into one physical entity, or physically separated. These modules may all be implemented as software invoked by a processing element, or all in hardware, or some as software invoked by a processing element and some in hardware. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal-processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by integrated logic circuits of hardware in a processor element or by instructions in software form.
Next, referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 800 may be deployed with the stream processing engine described in the corresponding embodiment of fig. 4, for implementing the functions in the corresponding embodiments of fig. 1 to 3. Specifically, the electronic device 800 includes: a receiver 801, a transmitter 802, a processor 803, and a memory 804 (where the number of processors 803 in the execution device 800 may be one or more, one processor is exemplified in fig. 5), where the processor 803 may include an application processor 8031 and a communication processor 8032. In some embodiments of the present application, the receiver 801, transmitter 802, processor 803, and memory 804 may be connected by a bus or other means.
The memory 804 may include read-only memory and random access memory, and provides instructions and data to the processor 803. A portion of the memory 804 may also include non-volatile random access memory (NVRAM). The memory 804 stores processor-executable operating instructions, executable modules or data structures, or a subset or extended set thereof, where the operating instructions may include various operating instructions for performing various operations.
The processor 803 controls the operation of the electronic device. In a specific application, the individual components of the electronic device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are all labeled as the bus system in the figure.
The methods disclosed in the embodiments of the present application may be applied to the processor 803 or implemented by the processor 803. The processor 803 may be an integrated circuit chip with signal-processing capability. In implementation, the steps of the above method may be completed by an integrated logic circuit in hardware or by instructions in software form within the processor 803. The processor 803 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor 803 may implement or execute the methods, steps, and logical blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly as being completed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 804; the processor 803 reads the information in the memory 804 and completes the steps of the above method in combination with its hardware.
The receiver 801 may be used to receive input numeric or character information and to generate signal inputs related to the settings and function control of the electronic device. The transmitter 802 may be used to output numeric or character information through a first interface; the transmitter 802 may also be used to send instructions to a disk group through the first interface to modify data in the disk group; the transmitter 802 may further include a display device such as a display screen.
In the embodiments of the present application, the application processor 8031 in the processor 803 is configured to perform the streaming data processing method in the embodiments corresponding to figs. 1 to 3. It should be noted that the specific manner in which the application processor 8031 executes each step is based on the same concept as the method embodiments corresponding to figs. 1 to 3 and brings the same technical effects; for details, reference may be made to the descriptions in the foregoing method embodiments, which are not repeated here.
The embodiments of the present application also provide a chip for running instructions, which is used to execute the technical solution of the streaming data processing method in the above embodiments.
The embodiments of the present application also provide a computer-readable storage medium storing computer instructions which, when run on a server, cause the server to execute the technical solution of the streaming data processing method in the above embodiments.
The embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, is used to execute the technical solution of the streaming data processing method in the above embodiments.
The computer-readable storage medium described above may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. A readable storage medium may be any available medium that can be accessed by a general-purpose or special-purpose server.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
While preferred embodiments have been described, they are not intended to limit the invention; any person skilled in the art may make variations and modifications without departing from the spirit and scope of the present invention, so the scope of the present invention shall be defined by the claims of the present application.

Claims (13)

1. A method for processing streaming data, the method comprising:
processing the streaming data by a streaming processing engine, and writing the processed streaming data into a storage system according to the check point time interval of the streaming processing engine; wherein the streaming data written in different checkpoint time intervals are stored in different storage spaces of the storage system;
detecting the data volume of streaming data stored in a first storage space in the storage system when a first check point arrives; the first storage space is used for storing streaming data written by the streaming processing engine in a first check point time interval; the first check point is the ending time of the first check point time interval;
copying the streaming data stored in the first storage space to a second storage space of the storage system when the data amount of the streaming data stored in the first storage space is smaller than a data amount threshold; the second storage space is used for storing the streaming data written by the streaming processing engine in a second check point time interval; the second checkpoint time interval is a next checkpoint time interval adjacent to the first checkpoint time interval;
updating the state of the first storage space to be a readable state.
2. The method of claim 1, wherein after copying the streaming data stored in the first storage space into the second storage space of the storage system, the method further comprises:
deleting the first storage space when a second check point arrives; the second check point is the ending time of the second check point time interval;
updating the state of the second storage space to be a readable state.
3. The method according to claim 1, wherein the method further comprises:
when the data volume of the streaming data stored in the first storage space is greater than or equal to the data volume threshold, directly updating the state of the first storage space to be the readable state;
starting from the first check point, storing the streaming data newly written by the streaming processing engine in the second storage space.
4. A method according to claim 3, characterized in that the method further comprises:
reserving the first storage space when a second checkpoint arrives; the second check point is the ending time of the second check point time interval;
updating the state of the second storage space to be a readable state.
5. The method of any of claims 1 to 4, wherein prior to detecting the data amount of streaming data stored in a first storage space in the storage system, the method further comprises:
when the first check point arrives, determining target data in the streaming data stored in the first storage space; the target data is streaming data copied from a history storage space corresponding to a history check point time interval;
and when the state of the target data is a preset state, the streaming data in the first storage space is not copied to the second storage space.
6. The method of claim 5, wherein when the state of the target data is a preset state, not copying the streaming data in the first storage space to the second storage space comprises:
inquiring a data storage record to obtain the highest copying times corresponding to the target data;
and when the highest copying times corresponding to the target data reach a first preset times, not copying all streaming data stored in the first storage space to the second storage space.
7. The method of claim 5, wherein when the state of the target data is a preset state, not copying the streaming data in the first storage space to the second storage space comprises:
inquiring a data storage record to obtain the average copy times corresponding to the target data;
and when the average replication times corresponding to the target data reach a second preset times, not replicating all streaming data stored in the first storage space to the second storage space.
8. The method of claim 5, wherein when the state of the target data is a preset state, not copying the streaming data in the first storage space to the second storage space comprises:
inquiring a data storage record, and acquiring an initial writing time corresponding to the target data;
determining the written duration corresponding to the target data according to the initial writing time and the first check point;
and when the written time length reaches a preset time length, not copying all streaming data stored in the first storage space to the second storage space.
9. The method according to claim 1, wherein the method further comprises:
and directly updating the state of the first storage space into the readable state at the first check point, and storing the streaming data newly written by the streaming processing engine in the second storage space.
10. The method according to claim 9, wherein the method further comprises:
acquiring data amounts corresponding to a plurality of history storage spaces stored by the storage system;
determining a data quantity average value according to the data quantity corresponding to the plurality of history storage spaces;
and determining the data quantity threshold according to the data quantity average value.
11. A stream processing engine, wherein the stream processing engine processes streaming data and writes the processed streaming data into a storage system according to a checkpoint time interval of the stream processing engine; wherein the streaming data written in different checkpoint time intervals are stored in different storage spaces of the storage system; the stream processing engine comprises:
a detection unit, configured to detect a data amount of streaming data stored in a first storage space in the storage system when a first checkpoint arrives; the first storage space is used for storing streaming data written by the streaming processing engine in a first check point time interval; the first check point is the ending time of the first check point time interval;
a processing unit, configured to copy, when the data amount of the streaming data stored in the first storage space is smaller than a data amount threshold, the streaming data stored in the first storage space into a second storage space of the storage system; the second storage space is used for storing the streaming data written by the streaming processing engine in a second check point time interval; the second checkpoint time interval is a next checkpoint time interval adjacent to the first checkpoint time interval;
the processing unit is further configured to update a state of the first storage space to a readable state.
12. An electronic device, comprising: a processor, a memory, and computer program instructions stored on the memory and executable on the processor;
the processor, when executing the computer program instructions, implements a method of processing streaming data as claimed in any one of the preceding claims 1 to 10.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein computer executable instructions which, when executed by a processor, are adapted to implement a method of processing streaming data according to any of the preceding claims 1 to 10.
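Claims 5 through 8 bound how long a carried-over record can keep being merged forward: once its copy count (or written duration) reaches a preset limit, the small file is finalized instead of merged again. A minimal sketch of the copy-count guard of claim 6; the `StorageRecord` bookkeeping structure, field names, and the limit of 3 are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field

MAX_COPIES = 3  # the "first preset times" of claim 6; illustrative value


@dataclass
class StorageRecord:
    # Hypothetical per-storage-space bookkeeping: how many times each
    # piece of target data has already been copied forward.
    copy_counts: dict = field(default_factory=dict)


def should_copy_forward(record: StorageRecord, target_ids) -> bool:
    """Query the data storage record for the highest copy count among the
    target data; once it reaches MAX_COPIES, stop copying the storage
    space's streaming data to the next interval (claim 6)."""
    highest = max((record.copy_counts.get(i, 0) for i in target_ids), default=0)
    return highest < MAX_COPIES
```

Claims 7 and 8 follow the same shape, substituting the average copy count or the elapsed time since the initial write for the highest copy count.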
CN202310220485.4A 2023-03-03 2023-03-03 Streaming data processing method and device, electronic equipment and storage medium Pending CN116431063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310220485.4A CN116431063A (en) 2023-03-03 2023-03-03 Streaming data processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116431063A true CN116431063A (en) 2023-07-14



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination