WO2023109302A1 - Data processing method and device, and storage medium - Google Patents

Data processing method and device, and storage medium

Info

Publication number
WO2023109302A1
WO2023109302A1 (PCT/CN2022/125989; CN2022125989W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
processed
memory
disk
data stream
Application number
PCT/CN2022/125989
Other languages
French (fr)
Chinese (zh)
Inventor
杨伟伟 (Yang Weiwei)
占义忠 (Zhan Yizhong)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2023109302A1

Classifications

    • G06F (Electric digital data processing; within G06 Computing, calculating or counting; section G Physics)
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F 12/0868: Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
    • G06F 12/0871: Allocation or management of cache space
    • G06F 2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/1041: Resource optimization
    • G06F 2212/1044: Space efficiency improvement
    • G06F 2212/22: Employing cache memory using specific memory technology
    • G06F 2212/224: Disk storage
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (section Y02D, climate change mitigation technologies in information and communication technologies)

Definitions

  • The present application relates to the field of big data, and in particular to a data processing method, a corresponding device, and a storage medium.
  • Embodiments of the present application provide a data processing method and device thereof, and a storage medium.
  • The embodiment of the present application provides a data processing method, including: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each of the data stream slices from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set.
  • The embodiment of the present application also provides a data processing device, including: a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the data processing method described above is implemented when the processor executes the computer program.
  • the embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to execute the data processing method described in the first aspect above.
  • Fig. 1 is a flowchart of a data processing method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of the method of step S130 in Fig. 1;
  • Fig. 3 is a flowchart of the method of step S150 in Fig. 1;
  • Fig. 4 is a flowchart of the method of step S152 in Fig. 3;
  • Fig. 5 is a flowchart of the method of step S153 in Fig. 3;
  • Fig. 6 is a flowchart of a data processing method provided by another embodiment of the present application;
  • Fig. 7 is a flowchart of the method of step S140 in Fig. 1;
  • Fig. 8 is a flowchart of another embodiment of the method of step S130 in Fig. 1;
  • Fig. 9 is a flowchart of a data processing method provided by yet another embodiment of the present application;
  • Fig. 10 is a flowchart of another embodiment of the method of step S140 in Fig. 1;
  • Fig. 11 is an example diagram of a data structure to be processed in a data processing method provided by another embodiment of the present application;
  • Fig. 12 is an example diagram of data to be processed in a data processing method provided by another embodiment of the present application;
  • Fig. 13 is an example diagram of data to be processed in a data processing method provided by yet another embodiment of the present application;
  • Fig. 14 is an example diagram of data to be processed in a data processing method provided by still another embodiment of the present application;
  • Fig. 15 is an example diagram of a data flow of a data processing method provided by an embodiment of the present application.
  • The present application provides a data processing method, a corresponding device, and a storage medium. The data processing method includes: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each data stream slice from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set.
  • Because the data stream slices are cached on the disk, the storage cost of the memory is saved. The data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information to obtain the target data set, which reduces the need for the memory to read the data stream directly. After merging, the amount of data is smaller, so the data volume of the target data set is reduced and the memory no longer has to read a data set with a large amount of data directly, thereby improving the utilization of the memory.
  • FIG. 1 is a flow chart of a data processing method provided by an embodiment of the present application.
  • the data processing method may include but not limited to steps S110 , S120 , S130 , S140 , and S150 .
  • Step S110 Receive a data stream, wherein the data stream includes a plurality of data to be processed, and the data to be processed includes dimension information.
  • the data to be processed may be any data in related technologies, may be network data with key-value pairs, or may be relational data in a relational database.
  • The dimension information may be the data information corresponding to fields in an artificial division of the data. In one embodiment, referring to Fig. 11, there are n fields and m pieces of data in total, where data 11 to data 1n constitute one complete piece of data to be processed, the field refers to the field name, and the length refers to the length of the data in that field. If field 1 and field 2 are used as dimensions, the dimension information includes data 11 to data m1 and data 12 to data m2.
  • Alternatively, referring to Fig. 13, there are five pieces of data in total, and the data in the user number field and the cell number field may be preset as the dimension information; that is, the dimension information includes 44600001 and 25681 in data 1, 44600002 and 25682 in data 2, 44600001 and 25682 in data 3, 44600002 and 25684 in data 4, and 44600003 and 25683 in data 5. Or, referring to Fig. 14, there are four pieces of data in total, and the data in the service type number field and the district/county number field may be preset as the dimension information; that is, the dimension information includes 32 and 1 in data 1, 23 and 2 in data 2, 21 and 2 in data 3, and 15 and 3 in data 4.
  • It should be noted that, in one embodiment, the structure of the received data may be the data structure shown in Fig. 11, Fig. 13, or Fig. 14, which may first be processed and converted into an in-memory data format, such as a C-language structure or a TLV data structure used in the communications field, or the data structure shown in Fig. 12.
  • the data stream may be a data stream formed by receiving data from the network in real time, or a data stream formed when data is read from a database.
  • the data to be processed is obtained from the database.
  • In one embodiment, the fields and lengths included in the data structure of the data to be processed are shown in Fig. 13. Multiple pieces of data may correspond to different fields (see data 1 to data 5 in Fig. 13 or data 1 to data 4 in Fig. 14), and data records can be accessed through the fields.
  • the purpose of receiving the data stream is to facilitate slicing of the data stream in subsequent steps.
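To make the notion of dimension information concrete, the following minimal sketch (not part of the original disclosure; the field names follow the Fig. 14 example and are assumptions) models one piece of data to be processed as a field-to-value mapping and extracts its dimension information as a tuple:
```python
# Hypothetical sketch: one piece of data to be processed, keyed by field name.
record = {
    "service_type_no": 32,        # dimension field
    "district_no": 1,             # dimension field
    "avg_response_delay": 120,    # indicator (metric) field
    "total_traffic": 5342,        # indicator (metric) field
}

DIMENSION_FIELDS = ("service_type_no", "district_no")

def dimension_key(rec: dict) -> tuple:
    """Return the dimension information of a record as a hashable tuple."""
    return tuple(rec[f] for f in DIMENSION_FIELDS)

print(dimension_key(record))  # -> (32, 1)
```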
  • Step S120 Slice the data stream at preset time intervals to obtain multiple data stream slices.
  • the preset time refers to the artificially preset time.
  • the user sets the preset time according to the size of the disk, thereby controlling the size of the multiple data stream fragments obtained and reducing the problem of insufficient disk space.
  • the time for memory to read data flow fragments will also be reduced, achieving the purpose of improving memory utilization.
  • In one embodiment, to slice the data stream at preset time intervals, a start time may be set when the data stream is received and the current timestamp obtained in real time, so that the elapsed interval can be determined; when the elapsed interval equals the preset time, the data stream is sliced.
  • It can also be understood that the data volume of a data stream slice obtained by slicing is less than the data volume of the data stream, so the obtained data stream slices partition the data volume of the data stream and reduce the situations in which the data stream is read directly.
  • It can also be understood that, to slice the data stream at preset intervals, a waiting time may be set for the data stream: when a piece of data to be processed has been received and the interval until the next piece of data arrives exceeds the waiting time, the data stream may also be sliced.
  • In addition, a fixed slicing time may be set, and the data stream is sliced at every slicing-time interval, so as to divide the data stream into multiple data stream slices.
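A minimal sketch of slicing the data stream at preset time intervals, assuming an in-memory buffer and a wall-clock timer (the patent does not prescribe a particular implementation):
```python
import time

def slice_stream(stream, preset_interval_s: float):
    """Yield lists of records ("data stream slices"), cutting a new slice
    every preset_interval_s seconds of wall-clock time."""
    current_slice, start = [], time.monotonic()
    for record in stream:
        if time.monotonic() - start >= preset_interval_s and current_slice:
            yield current_slice
            current_slice, start = [], time.monotonic()
        current_slice.append(record)
    if current_slice:          # flush the final partial slice
        yield current_slice
```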
  • Step S130 Cache the multiple data stream slices on the disk.
  • In this step, the disk refers to any disk capable of storing data in the related art, which is not specifically limited here.
  • Caching multiple data stream fragments on the disk can be either separately caching the data stream fragments on the disk, or continuously caching multiple data stream fragments on the disk.
  • the purpose of caching multiple data stream fragments on disk is to reduce the memory usage of data stream fragments and achieve the purpose of improving memory utilization.
  • caching multiple data stream fragments on the disk may be saving the data stream fragments as files to the disk, or compressing the data stream fragments into compressed files and caching them on the disk.
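A minimal sketch of caching one slice on the disk as a compressed file, one of the options mentioned above; the gzip/JSON-lines format, the directory, and the function names are assumptions for illustration:
```python
import gzip, json, os

def cache_slice(slice_records, slice_id: int, cache_dir: str = "/tmp/slices") -> str:
    """Write one data stream slice to disk as a gzip-compressed JSON-lines file
    and return its path (the "address information" kept in memory later)."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"slice_{slice_id:06d}.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in slice_records:
            f.write(json.dumps(rec) + "\n")
    return path
```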
  • Step S140 Extract the data to be processed in each data flow fragment from the disk to the memory.
  • In this step, the data stream slices include the data to be processed.
  • The data to be processed in each data stream slice is extracted from the disk to the memory; the extraction may proceed sequentially from each data stream slice, reducing situations in which data stream slices are loaded directly into the memory without being merged, so as to improve the utilization of the memory.
  • the data flow fragments are extracted from the disk to the memory, and the data flow fragments can be directly read from the disk into the memory through the data reading method in the related art, which is not specifically limited here.
  • Step S150 Perform the first merging process on the data to be processed with the same dimension information in the memory to obtain the target data set.
  • the data to be processed includes dimension information.
  • the dimension information may include multiple dimension values, and having the same dimension information means that the multiple dimension values are all the same.
  • the first merging process refers to merging the data to be processed with the same dimension information into one piece of data.
  • The merging process may merge the other information in the data to be processed that does not belong to the dimension information, and the merging method may be addition, subtraction, division, or the like applied to that other information, as long as it can combine the data to be processed with the same dimension information into one piece of data to be processed; no specific limitation is made here.
  • In this step, the target data set refers to the data set obtained after the first merging process, which is also the data set cached in the memory. Because the received data stream is subjected to the first merging process to obtain the target data set, the data to be processed having the same dimension information is merged, the data volume of the originally received data stream is reduced, and the target data set read by the memory occupies less space, so as to improve the utilization of the memory.
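A minimal sketch of the first merging process, assuming, as one of the merge rules the text allows, that the indicator fields of records sharing the same dimension information are summed:
```python
def first_merge(records, dimension_fields, indicator_fields):
    """Merge records with identical dimension information into one record,
    summing their indicator fields; returns the target data set."""
    merged = {}
    for rec in records:
        key = tuple(rec[f] for f in dimension_fields)
        if key not in merged:
            merged[key] = dict(rec)
        else:
            for f in indicator_fields:
                merged[key][f] += rec[f]   # summation is just one permitted merge rule
    return list(merged.values())
```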
  • By adopting the data processing method including the above steps S110 to S150, a data stream is received, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; the data stream is sliced at preset time intervals to obtain multiple data stream slices; the multiple data stream slices are cached on the disk; the data to be processed in each data stream slice is extracted from the disk to the memory; and the data to be processed having the same dimension information is subjected to the first merging process in the memory to obtain the target data set.
  • Because the data stream slices are cached on the disk, the storage cost of the memory is saved. The data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information, which reduces the need for the memory to read the data stream directly; after merging, the amount of data is smaller, so the data volume of the target data set is reduced and the memory no longer has to read a data set with a large amount of data directly, thereby improving the utilization of the memory.
  • step S130 is further described, and step S130 may also include, but not limited to, step S210 and step S220 .
  • Step S210 Sort the data to be processed in each data flow fragment according to the dimension information.
  • sorting the data to be processed in each data flow slice according to the dimension information refers to sorting the data to be processed in each data slice according to the size of the dimension information.
  • the sorting can be sorted according to the dimension information from small to large, or according to the dimension information from large to small.
  • When the dimension information includes multiple dimension values, sorting is performed by each dimension value in turn. In one embodiment, the data in dimension a is sorted first, and then the data in dimension b is sorted, to obtain multiple ordered data stream slices. This facilitates the subsequent first merging of the data to be processed: before merging, the data in each data stream slice only needs to be read sequentially, which improves the merging efficiency.
  • Step S220 cache multiple sorted data stream fragments on disk.
  • The sorted data stream slices have an order, and caching multiple sorted data stream slices on the disk facilitates the first merging of the data to be processed in the subsequent steps: the data to be processed in each data stream slice only needs to be read sequentially to be merged, which saves the memory consumption of separate reads and comparisons, improves the merging efficiency, and thereby improves the utilization of the memory.
  • By adopting the method including the above steps S210 to S220, the data in each data stream slice is sorted according to the dimension information, and the multiple sorted data stream slices are cached on the disk. Because the data stream slices are sorted before being cached on the disk, the sorted slices facilitate the merging of each data stream slice in the subsequent steps and improve the merging efficiency, so as to improve the utilization of the memory.
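A minimal sketch of sorting the data to be processed in one slice by its dimension information before caching (ascending order shown; descending is equally possible per the text):
```python
def sort_slice(slice_records, dimension_fields):
    """Sort the records of one slice by their dimension tuple (ascending),
    so that later merging only needs sequential reads."""
    return sorted(slice_records, key=lambda r: tuple(r[f] for f in dimension_fields))
```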
  • step S150 is further described, and step S150 may also include, but not limited to, step S151 , step S152 , step S153 , and step S154 .
  • Step S151 Traverse the data to be processed in each data stream fragment.
  • In this step, each data stream slice may include multiple pieces of data to be processed. Traversing the data to be processed in each data stream slice refers to sequentially extracting the data to be processed in each data stream slice into the memory, which facilitates the merging of the data to be processed from each data stream slice in the subsequent steps.
  • Step S152 Obtain the data to be processed according to the dimension information.
  • In this step, the data to be processed refers to the data to be processed obtained from each data stream slice. Because the data to be processed in the data stream slices is sorted, the data to be processed in each data stream slice is ordered. Obtaining the data to be processed according to the dimension information means obtaining it in the order of the dimension information, so that the data to be processed can be merged directly, reducing situations in which the dimension information of an obtained piece of data has to be compared against the dimension information of all the data to be processed in every data stream slice, and making it convenient to obtain the target data in the subsequent steps.
  • In one embodiment, to obtain the data to be processed according to the dimension information, N pieces of data to be processed are read from the compressed file formed by each corresponding data stream slice, generating result sets {RS1, RS2, ..., RSk}, where k is the number of data stream slices. The value of N may differ; that is, the amount of data to be processed in each data stream slice may be different.
  • Step S153 Obtain the target data according to the quantity information of the data to be processed.
  • In this step, obtaining the target data according to the quantity information of the data to be processed means that, when there are multiple pieces of data to be processed with the same dimension information, they are merged into the target data; when there is only one piece of data to be processed with a given dimension information, that piece of data is directly determined as the target data.
  • Because the data to be processed in the data stream slices is sorted, the data to be processed can be obtained according to the dimension information and the target data can be obtained according to the quantity information of the data to be processed during traversal, which reduces situations in which the dimension information of an obtained piece of data has to be compared against the dimension information of all the data to be processed in every data stream slice, improving the efficiency of merging and thereby improving the utilization of the memory.
  • Step S154 Obtain the target data set according to the target data.
  • the target data refers to the data obtained after combining the data to be processed, and the target data set includes multiple target data.
  • In this step, the data to be processed is merged to obtain the target data, so occurrences of the same dimension information in the target data are reduced, thereby reducing the amount of data in the target data set read by the memory and improving the utilization of the memory.
  • the data to be processed in each data flow fragment is traversed, the data to be processed is obtained according to the dimension information, and the target data is obtained according to the quantity information of the data to be processed , get the target data set according to the target data.
  • The data to be processed is obtained according to the dimension information and the target data is obtained according to the quantity information of the data to be processed, so as to obtain the target data set; this simplifies the first merging of the data to be processed and reduces the memory consumed by the first merging operation, so as to improve the utilization of the memory.
  • In one embodiment, N pieces of data to be processed are read from the compressed file formed by each data stream slice, generating result sets {RS1, RS2, ..., RSk}. The value of N may be inconsistent; that is, the amount of data to be processed in the data stream slices may differ. The merged result set is recorded as RSmer. One piece of data to be processed is read in turn from each of the result sets {RS1, RS2, ..., RSk} (denoted R1, R2, ..., Ri); if some result sets in {RS1, RS2, ..., RSk} contain fewer than one piece of data to be processed, then i < k, otherwise i = k.
  • the data to be processed may include indicator information, and the indicator information will be combined and calculated during the merger.
  • the data flow fragments are merged according to the above steps, and the data to be processed in all the data flow fragments of the queue Qmer to be merged are merged to obtain the target data set.
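A minimal sketch of merging the sorted slices read back from the disk, in the spirit of the result sets {RS1, ..., RSk} and the merged result set RSmer described above; the heap-based k-way merge and the summing of indicator fields are illustrative assumptions:
```python
import heapq

def merge_sorted_slices(slice_iters, dimension_fields, indicator_fields):
    """k-way merge of sorted slice iterators; pieces of data sharing the same
    dimension tuple are combined into one target record."""
    def keyed(it):
        for rec in it:
            yield tuple(rec[f] for f in dimension_fields), rec

    target_set, current_key, current = [], None, None
    for key, rec in heapq.merge(*(keyed(it) for it in slice_iters),
                                key=lambda pair: pair[0]):
        if key != current_key:
            if current is not None:
                target_set.append(current)
            current_key, current = key, dict(rec)
        else:
            for f in indicator_fields:
                current[f] += rec[f]   # indicator information is combined during the merge
    if current is not None:
        target_set.append(current)
    return target_set
```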
  • step S152 is further described, and step S152 may also include, but not limited to, step S1521 and step S1522 .
  • Step S1521 When sorting is based on dimension information from small to large, obtain the data to be processed with the smallest dimension information.
  • sorting from small to large according to the dimension information refers to comparing the size of the dimension information through any sorting method in related technologies, and sorting according to the dimension information from small to large.
  • When the data type of the dimension information is not a numerical value, it may be sorted alphabetically from small to large, as long as the data to be processed can form a sequence ordered from small to large by dimension information; no specific limitation is made here.
  • Obtaining the data to be processed with the smallest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumption of the first merging process, so as to improve the utilization of the memory.
  • the read dimension information of the i pieces of data to be processed is compared to obtain the data to be processed with the smallest dimension data value.
  • Data may be organized through a data structure in which one keyword corresponds to multiple values, that is, one key corresponds to multiple values; after sorting, only the first element of the result needs to be taken out, which corresponds to the data to be processed with the smallest dimension information.
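A minimal sketch of the one-key-to-multiple-values organization just mentioned, where after an ascending sort only the first element needs to be taken out; the dict-of-lists layout and the sample records are assumptions:
```python
from collections import defaultdict

# One key (the dimension tuple) maps to every record read with that key.
by_key = defaultdict(list)
for rec in [{"service_type_no": 32, "district_no": 1, "total_traffic": 7},
            {"service_type_no": 23, "district_no": 2, "total_traffic": 10},
            {"service_type_no": 23, "district_no": 2, "total_traffic": 4}]:
    by_key[(rec["service_type_no"], rec["district_no"])].append(rec)

smallest_key = sorted(by_key)[0]            # first element after an ascending sort
print(smallest_key, by_key[smallest_key])   # -> (23, 2) and the records sharing that key
```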
  • Step S1522 When sorting is based on dimension information from large to small, obtain the data to be processed with the largest dimension information.
  • sorting according to the dimension information from large to small refers to comparing the size of the dimension information through any sorting method in related technologies, and sorting according to the dimension information from large to small.
  • When the data type of the dimension information is not a numerical value, it may be sorted alphabetically from large to small, as long as the data to be processed can form a sequence ordered from large to small by dimension information; no specific limitation is made here.
  • Obtaining the data to be processed with the largest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumption of the first merging process, so as to improve the utilization of the memory.
  • By adopting the method including the above steps S1521 to S1522, when sorting is from small to large by dimension information, the data to be processed with the smallest dimension information is obtained; or, when sorting is from large to small by dimension information, the data to be processed with the largest dimension information is obtained.
  • the data to be processed with different dimension information is selected according to different sorting methods and dimension information, thereby reducing the consumption of memory by the first merging process, thereby achieving the purpose of improving memory utilization.
  • step S153 is further described, and step S153 may also include but not limited to step S1531 and step S1532 .
  • Step S1531 When the number of data to be processed is greater than one, merge the data to be processed to obtain the target data.
  • In this step, because the data to be processed in the data stream slices is sorted, when the number of pieces of data to be processed with the same dimension information is greater than one, they can be merged to obtain the target data.
  • In one embodiment, a piece of data to be processed may be extracted first; when there is more than one piece of data to be processed with the same dimension information, they are merged to form the target data, the data stream slices from which the merged data was extracted are traversed to extract the next pieces of data to be processed, and the target data is merged with the data to be processed from the other data stream slices according to the dimension information until only one piece of data remains for each dimension information. The merging includes, but is not limited to, applying simple or complex mathematical operations, such as sum, average, count, or maximum/minimum, to the information in the data to be processed that does not belong to the dimension information; no specific limitation is made here. In this way the efficiency of the first merging process is improved, so as to improve the utilization of the memory.
  • Step S1532 When the number of data to be processed is equal to one, determine the data to be processed as target data.
  • By adopting the method including the above steps S1531 to S1532, when the number of pieces of data to be processed is greater than one, the data to be processed is merged to obtain the target data; or, when the number of pieces of data to be processed is equal to one, the data to be processed is determined as the target data.
  • In this way, whether there is data to be processed with the same dimension information in the data stream slices is judged according to the quantity information of the data to be processed, so as to determine the target data, reducing situations in which data extracted directly from the disk to the memory has to be compared against the data to be processed in every data stream slice, improving the efficiency of the first merging process and thereby improving the utilization of the memory.
  • the data processing method is further described, and the data processing method may also include but not limited to step S610 and step S620 .
  • Step S610 Acquiring the address information of each data flow fragment cached on the disk.
  • In this step, the address information of the data stream slices on the disk may be obtained after the data stream slices are cached on the disk; or, free address information on the disk may be obtained before the data stream slices are cached, and the data stream slices are then cached at the disk addresses corresponding to that address information.
  • the purpose of obtaining the address information of each data flow fragment cached on the disk is to facilitate the storage of the address information in the memory in the subsequent steps.
  • Step S620 Save the address information in memory.
  • The address information is stored in the memory, and the data stream slices are later obtained according to the address information, which reduces the consumption generated by using the memory to store the data stream slices themselves, so as to improve the utilization of the memory.
  • the address information of each data flow fragment cached on the disk is obtained; and the address information is stored in the memory.
  • Because the address information is stored in the memory and the memory obtains the data stream slices directly according to the address information, the consumption generated by using the memory to store the data stream slices is reduced, so as to improve the utilization of the memory.
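A minimal sketch of keeping only the address information of cached slices in memory; the variable names and the record counter are assumptions for illustration:
```python
# In-memory bookkeeping for cached slices: only addresses and counters, never the records.
slice_paths = []        # address information of each cached slice (e.g. file paths)
cached_count = 0        # total pieces of data to be processed currently cached on disk

def register_slice(path, record_count):
    """Record a newly cached slice's address information in memory."""
    global cached_count
    slice_paths.append(path)
    cached_count += record_count
```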
  • step S140 is further described, and step S140 may also include, but not limited to, step S710 and step S720 .
  • Step S710 Read the address information of each data stream fragment from the memory.
  • the address information refers to the address information of each data flow fragment in the disk.
  • In one embodiment, each data stream slice is saved on the disk in the form of a file, and the address information refers to the path information of the file formed by each data stream slice; or, the data stream slices may be stored in data tables of a database, in which case the address information refers to the path information of each data table.
  • Because the data stream slices are cached on the disk, the memory first needs to obtain the address information of the data stream slices cached on the disk and then read each data stream slice according to that address information; the memory only caches the address information, which saves the consumption of using the memory to store the data stream slices and improves the utilization of the memory.
  • Step S720 extract the data to be processed in each data flow fragment from the disk to the memory according to the address information.
  • In this step, the extracted data to be processed comes from different data stream slices. Extracting the data to be processed from the disk to the memory allows the data to be processed to be written into the cache queue of the memory in order, which facilitates the subsequent merging of the data to be processed.
  • The address information of each data stream slice is cached in the memory, and the data to be processed in each data stream slice is extracted from the disk to the memory according to the address information, reducing situations in which unmerged data stream slices enter the memory directly, facilitating the subsequent first merging of the extracted data to be processed, and reducing the consumption of using the memory for data storage, thereby improving the utilization of the memory.
  • By adopting the method including the above steps S710 to S720, the address information of each data stream slice is read from the memory, and the data to be processed in each data stream slice is extracted from the disk to the memory according to that address information.
  • the address information of each data stream fragment is cached in the memory, and the data to be processed in each data stream fragment is extracted from the disk to the memory through the address information, so as to facilitate the merging of the data to be processed in subsequent steps, Reduce the consumption of memory for data storage, so as to achieve the purpose of improving memory utilization.
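A minimal sketch of reading a cached slice back from the disk by its address information, matching the hypothetical compressed JSON-lines format assumed in the caching sketch above:
```python
import gzip, json

def read_slice(path: str):
    """Yield the data to be processed from one cached slice, in its stored order."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example (assumed names): feed every cached slice into the k-way merge sketched earlier.
# target_set = merge_sorted_slices([read_slice(p) for p in slice_paths],
#                                  DIMENSION_FIELDS, INDICATOR_FIELDS)
```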
  • step S130 is further described, and step S130 may also include but not limited to step S810 and step S820 .
  • Step S810 Perform a second merging process on the data to be processed with the same dimension information in each data stream slice, and obtain multiple merged data stream slices.
  • the data to be processed with the same dimension information in each data stream slice may be subjected to a second merging process.
  • The second merging process refers to merging the data to be processed within each data stream slice so that no two pieces of data to be processed in the same data stream slice have identical dimension information, thereby reducing the memory resources consumed by the first merging process in the subsequent steps and improving the utilization of the memory.
  • Step S820 Cache the multiple merged data stream slices on the disk.
  • In this step, the data stream slices cached on the disk have undergone the second merging process, which reduces the memory resources consumed by the first merging process in the subsequent steps; the data volume of the merged data stream slices is also reduced, thereby reducing the disk resources consumed by caching the data stream slices.
  • By adopting the method including the above steps S810 to S820, the data to be processed with the same dimension information in each data stream slice is subjected to the second merging process to obtain multiple merged data stream slices, and the multiple merged data stream slices are cached on the disk. Because each data stream slice is merged before being cached on the disk, the memory resources consumed by the first merging process in the subsequent steps are reduced, the data volume of the merged slices is smaller, and the disk resources consumed by caching the data stream slices are reduced.
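A minimal sketch of the second merging process, which applies the same kind of aggregation within a single slice (summing of indicator fields is again an assumed merge rule) and sorts the result before it is cached:
```python
def second_merge(slice_records, dimension_fields, indicator_fields):
    """Merge records inside one slice so that no two share dimension information,
    then sort the slice by dimension tuple ready for caching."""
    merged = {}
    for rec in slice_records:
        key = tuple(rec[f] for f in dimension_fields)
        if key in merged:
            for f in indicator_fields:
                merged[key][f] += rec[f]     # summing indicators is one permitted merge rule
        else:
            merged[key] = dict(rec)
    return sorted(merged.values(),
                  key=lambda r: tuple(r[f] for f in dimension_fields))
```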
  • the data processing method is further described, and the data processing method may also include but not limited to step S910.
  • Step S910 Filter the data to be processed according to a preset filter condition, wherein the preset filter condition includes: the indicator information of the data to be processed is less than a preset indicator value threshold.
  • In this step, the indicator information refers to the information in the non-dimension fields of the data to be processed.
  • In one embodiment, the data information corresponding to the number of TCP downstream packets, the TCP upstream traffic, the TCP downstream traffic, and the TCP service duration is all indicator information; or, referring to Fig. 14, when the service type number and the district/county number are preset as the dimension information, the data information corresponding to the average response delay, the average display delay, the response success rate, the display success rate, and the total traffic is all indicator information.
  • the received data stream may be directly receiving data from the network, or reading data in blocks from the database.
  • Filtering the data to be processed refers to removing the data to be processed that does not meet the preset filtering conditions, so as to reduce the impact of abnormal data to be processed on data processing.
  • The preset filter condition includes: the indicator information of the data to be processed is less than the preset indicator value threshold.
  • In one embodiment, the preset TCP service duration threshold may be set to 3; or, the preset response success rate threshold may be set to 90, in which case the data to be processed whose response success rate is 89 is removed, that is, data 3 in Fig. 14.
  • It should be noted that the filtering of the data to be processed according to the preset filter condition may be performed after the data stream is received, or before the data stream slices are cached on the disk. In this way, the amount of data stream slices cached on the disk is reduced, improving the utilization of the disk space.
  • By adopting the method including the above step S910, the data to be processed is filtered according to the preset filter condition, wherein the preset filter condition includes: the indicator information of the data to be processed is less than the preset indicator value threshold.
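A minimal sketch of the filtering step, assuming a preset response success rate threshold of 90 as in the example above; the field names are hypothetical:
```python
PRESET_THRESHOLDS = {"response_success_rate": 90}   # indicator field -> preset threshold

def passes_filter(rec: dict) -> bool:
    """Keep a record only if none of its indicator values falls below its preset threshold."""
    return all(rec.get(field, threshold) >= threshold
               for field, threshold in PRESET_THRESHOLDS.items())

print(passes_filter({"service_type_no": 21, "district_no": 2, "response_success_rate": 89}))
# -> False: this record (cf. data 3 in the Fig. 14 example) would be removed
```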
  • step S140 is further described, and step S140 may also include but not limited to step S1010 and step S1020 .
  • Step S1010 When the amount of data to be processed cached in the disk is greater than a preset number threshold, extract the data to be processed in each data flow fragment from the disk to the memory.
  • In this step, the preset number threshold is set manually, and the user can set the preset number threshold according to the size of the disk space and the size of the memory space, so as to reduce situations in which the data cache occupies too much disk space and keep the disk consumption reasonable.
  • Step S1020 Obtain the total time of the slice, and when the total time of the slice is greater than the preset time threshold, extract the data to be processed in each data flow slice from the disk to the memory.
  • the preset time threshold is set manually.
  • In one embodiment, the system receives network data from the network in real time, and the user needs to analyze the data within a preset time period. The total slicing time is obtained; when the total slicing time is greater than the preset time threshold, all the data to be processed that needs to be analyzed has been sliced, slicing is completed, and the data to be processed in each data stream slice is extracted from the disk to the memory, so that the data to be processed received from the data stream is handled in a reasonable manner.
  • In one embodiment, the preset time threshold is the time statistics granularity T0, and the preset time interval refers to the data aggregation slice granularity Tc.
  • By adopting the data processing method including the above steps S1010 to S1020, when the amount of data to be processed cached on the disk is greater than the preset number threshold, the data to be processed in each data stream slice is extracted from the disk to the memory; or, the total slicing time is obtained, and when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk to the memory. Extracting the data to be processed from the disk to the memory under these conditions reduces situations in which the data cache occupies too much disk space, facilitates the merging of the data to be processed in the subsequent steps, and improves the utilization of the memory.
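A minimal sketch of the two extraction triggers described in steps S1010 and S1020; the threshold values are placeholders, not values from the disclosure:
```python
def should_extract(cached_count: int, total_slice_time_s: float,
                   preset_count_threshold: int = 100_000,
                   preset_time_threshold_s: float = 300.0) -> bool:
    """Trigger extraction from disk to memory when either the cached amount of
    data to be processed or the total slicing time exceeds its preset threshold."""
    return (cached_count > preset_count_threshold
            or total_slice_time_s > preset_time_threshold_s)
```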
  • an embodiment of the present application also provides a data processing device, which includes: a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • the processor and memory can be connected by a bus or other means.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory may include memory located remotely from the processor, which remote memory may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the data processing method of the above embodiments are stored in the memory and, when executed by the processor, perform the data processing method in the above embodiments, for example, the method steps S110 to S150 in Fig. 1, the method steps S210 to S220 in Fig. 2, the method steps S151 to S154 in Fig. 3, the method steps S1521 to S1522 in Fig. 4, the method steps S1531 to S1532 in Fig. 5, the method steps S610 to S620 in Fig. 6, the method steps S710 to S720 in Fig. 7, the method steps S810 to S820 in Fig. 8, the method step S910 in Fig. 9, and the method steps S1010 to S1020 in Fig. 10 described above.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions; when the computer-executable instructions are executed by a processor or a controller, for example, by the processor in the above device embodiment, the processor is caused to execute the data processing method in the above embodiments, for example, the method steps S110 to S150 in Fig. 1, the method steps S210 to S220 in Fig. 2, the method steps S151 to S154 in Fig. 3, the method steps S1521 to S1522 in Fig. 4, the method steps S1531 to S1532 in Fig. 5, the method steps S610 to S620 in Fig. 6, the method steps S710 to S720 in Fig. 7, the method steps S810 to S820 in Fig. 8, the method step S910 in Fig. 9, and the method steps S1010 to S1020 in Fig. 10 described above.
  • The embodiment of the present application includes: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the data stream slices on the disk; extracting the data to be processed in each data stream slice from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information to obtain a target data set. Because the data stream slices are cached on the disk, the storage cost of the memory is saved; the data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information to obtain the target data set, which reduces the need for the memory to read the data stream directly; after merging, the number of data stream slices is smaller, so the data volume of the target data set is reduced and the memory no longer has to read the data to be processed in the data stream directly, thereby improving the utilization of the memory.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • In addition, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

The present application discloses a data processing method and device, and a storage medium. The data processing method comprises: receiving a data stream, the data stream comprising a plurality of pieces of data to be processed, and said data comprising dimension information (S110); slicing the data stream at a preset time interval to obtain a plurality of data stream fragments (S120); caching the plurality of data stream fragments in a disk (S130); extracting said data in each data stream fragment from the disk to a memory (S140); and performing first merging processing on said data having the same dimension information in the memory to obtain a target data set (S150).

Description

Data processing method and device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, the Chinese patent application No. 202111532433.8 filed on December 15, 2021, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present application relates to the field of big data, and in particular to a data processing method, a corresponding device, and a storage medium.
Background
With the advent of the Internet of Everything, devices such as sensors, smartphones, wearables, and smart home appliances have become part of the Internet of Everything and generate massive amounts of data. In traditional offline computing, data is generally first saved to a storage medium and then processed in batches according to a certain scheduling strategy. However, reading a large amount of data consumes a large amount of memory resources, and how to improve the utilization of the memory is a major technical problem.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
Embodiments of the present application provide a data processing method, a corresponding device, and a storage medium.
In a first aspect, an embodiment of the present application provides a data processing method, including: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each of the data stream slices from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set.
In a second aspect, an embodiment of the present application further provides a data processing device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the data processing method described in the first aspect.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the data processing method described in the first aspect.
Additional features and advantages of the application will be set forth in the description that follows and will in part be apparent from the description, or may be learned by practice of the application. The objectives and other advantages of the application can be realized and attained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not constitute a limitation of it.
Fig. 1 is a flowchart of a data processing method provided by an embodiment of the present application;
Fig. 2 is a flowchart of the method of step S130 in Fig. 1;
Fig. 3 is a flowchart of the method of step S150 in Fig. 1;
Fig. 4 is a flowchart of the method of step S152 in Fig. 3;
Fig. 5 is a flowchart of the method of step S153 in Fig. 3;
Fig. 6 is a flowchart of a data processing method provided by another embodiment of the present application;
Fig. 7 is a flowchart of the method of step S140 in Fig. 1;
Fig. 8 is a flowchart of another embodiment of the method of step S130 in Fig. 1;
Fig. 9 is a flowchart of a data processing method provided by yet another embodiment of the present application;
Fig. 10 is a flowchart of another embodiment of the method of step S140 in Fig. 1;
Fig. 11 is an example diagram of a data structure to be processed in a data processing method provided by another embodiment of the present application;
Fig. 12 is an example diagram of data to be processed in a data processing method provided by another embodiment of the present application;
Fig. 13 is an example diagram of data to be processed in a data processing method provided by yet another embodiment of the present application;
Fig. 14 is an example diagram of data to be processed in a data processing method provided by still another embodiment of the present application;
Fig. 15 is an example diagram of a data flow of a data processing method provided by an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The present application provides a data processing method, a corresponding device, and a storage medium. The data processing method includes: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each data stream slice from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set. According to the solution of the embodiments of the present application, the data stream slices are cached on the disk, which saves the storage cost of the memory; the data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information to obtain the target data set, which reduces the need for the memory to read the data stream directly; after merging, the amount of data is smaller, so the data volume of the target data set is reduced and the memory no longer has to read a data set with a large amount of data directly, thereby improving the utilization of the memory.
The embodiments of the present application are further described below with reference to the accompanying drawings.
如图1所示,图1是本申请一个实施例提供的数据处理方法的流程图,该数据处理方法可以包括但不限于有步骤S110、步骤S120、步骤S130、步骤S140、步骤S150。As shown in FIG. 1 , FIG. 1 is a flow chart of a data processing method provided by an embodiment of the present application. The data processing method may include but not limited to steps S110 , S120 , S130 , S140 , and S150 .
Step S110: receive a data stream, where the data stream includes a plurality of pieces of data to be processed, and the data to be processed includes dimension information.
In this step, the data to be processed may be any data in the related art; it may be network data with key-value pairs, or relational data in a relational database. The dimension information may be the data information corresponding to manually designated fields in the data. In one implementation, referring to FIG. 11, there are n fields and m records in total, where data 11 to data 1n constitute one complete piece of data to be processed, "field" refers to the field name, and "length" refers to the length of the data in that field. If field 1 and field 2 are used as dimensions, the dimension information includes data 11 to data m1 and data 12 to data m2. Alternatively, referring to FIG. 13, there are five records in total, and the data of the user number field and the data of the cell number field may be preset as dimension information; that is, the dimension information includes 44600001 and 25681 in data 1, 44600002 and 25682 in data 2, 44600001 and 25682 in data 3, 44600002 and 25684 in data 4, and 44600003 and 25683 in data 5. Alternatively, referring to FIG. 14, there are four records in total, and the data of the service type number field and the data of the district/county number field may be preset as dimension information; that is, the dimension information includes 32 and 1 in data 1, 23 and 2 in data 2, 21 and 2 in data 3, and 15 and 3 in data 4.
It should be noted that, in one implementation, the structure of the received data may be the data structure shown in FIG. 11, FIG. 13, or FIG. 14, and it may first be converted into an in-memory data format, such as a C-language struct, a TLV data structure used in the communication field, or the data structure shown in FIG. 12.
It should also be noted that the data stream may be a data stream formed by receiving data from a network in real time, or a data stream formed when data is read from a database. In one implementation, the data to be processed is obtained from a database, the fields and lengths included in the data structure of the data to be processed are as shown in FIG. 13, multiple records may correspond to different fields respectively (see data 1 to data 5 in FIG. 13 or data 1 to data 4 in FIG. 14), and data records can be accessed through the fields. The data stream is received so that it can be sliced in the subsequent steps.
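As a purely illustrative, non-limiting sketch (not part of the claimed method), a record of the FIG. 13 kind can be modelled in memory as a plain structure in which the dimension fields are kept separate from the indicator fields; the English field names below are stand-ins chosen for this example only.

```python
from dataclasses import dataclass

# Hypothetical in-memory form of one record from FIG. 13: the first two fields
# act as dimension information, the remaining fields are indicator information.
@dataclass
class Record:
    user_id: str       # dimension field, e.g. "44600001"
    cell_id: str       # dimension field, e.g. "25681"
    tcp_up_pkts: int
    tcp_down_pkts: int
    tcp_up_bytes: int
    tcp_down_bytes: int
    tcp_duration: int

    def dimension_key(self) -> tuple:
        # Records with equal dimension keys are candidates for merging.
        return (self.user_id, self.cell_id)
```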
Step S120: slice the data stream at preset time intervals to obtain a plurality of data stream slices.
In this step, the preset time refers to a manually preset duration. In one implementation, the user sets the preset time according to the disk size, thereby controlling the size of the resulting data stream slices and reducing the risk of running out of disk space; in addition, the time needed for memory to read a data stream slice is also reduced, which serves the purpose of improving memory utilization.
In one implementation, slicing the data stream at preset time intervals may be done by recording a start time when the data stream is received and obtaining the current timestamp in real time, so that the elapsed interval can be computed; when the elapsed interval equals the preset time, the data stream is sliced.
It can also be understood that the data volume of a data stream slice obtained by slicing the data stream is smaller than the data volume of the whole data stream; obtaining data stream slices serves to split the data volume of the data stream and reduce cases where the data stream is read directly.
It can also be understood that, for slicing the data stream at preset time intervals, a waiting time may be set for the data stream: after one piece of data to be processed is received, if the interval until the next piece of data to be processed arrives is greater than the waiting time, the data stream may also be sliced. In addition, a fixed slicing period may be set, and the data stream is sliced once every slicing period, so as to divide the data stream into multiple data stream slices.
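A minimal sketch of the fixed-interval variant described above, assuming records arrive through an iterator and a wall-clock timer is acceptable; the function names and the `emit_slice` callback are illustrative only.

```python
import time

def slice_stream(records, slice_seconds, emit_slice):
    """Cut an incoming record iterator into slices every `slice_seconds`."""
    current, started = [], time.monotonic()
    for record in records:
        current.append(record)
        # The elapsed interval is checked as each record arrives.
        if time.monotonic() - started >= slice_seconds:
            emit_slice(current)                # hand the finished slice to the next stage
            current, started = [], time.monotonic()
    if current:                                # flush the last, possibly partial, slice
        emit_slice(current)
```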
Step S130: cache the plurality of data stream slices on a disk.
In this step, the disk refers to any disk in the related art that is capable of storing data, which is not specifically limited here. Caching the plurality of data stream slices on the disk may mean caching the data stream slices on the disk separately, or caching the plurality of data stream slices on the disk contiguously. The purpose of caching the plurality of data stream slices on the disk is to reduce the memory occupied by the data stream slices and thereby improve memory utilization.
It can be understood that caching the plurality of data stream slices on the disk may mean saving the data stream slices to the disk in the form of files, or compressing the data stream slices into compressed files and caching them on the disk.
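One possible way to persist a slice as a compressed file, as allowed above, is sketched below; gzip and pickle are only stand-ins for whatever compression and serialization the concrete implementation actually uses, and the directory and file-name scheme is hypothetical.

```python
import gzip
import os
import pickle

def cache_slice(slice_records, spool_dir, slice_index):
    """Write one data stream slice to disk as a compressed file and return its path."""
    os.makedirs(spool_dir, exist_ok=True)
    path = os.path.join(spool_dir, f"slice_{slice_index:06d}.pkl.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(slice_records, f)
    # Only this path (the address information) needs to stay in memory.
    return path
```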
Step S140: extract the data to be processed in each data stream slice from the disk into memory.
In this step, since the data stream slices are obtained by slicing the data stream, each data stream slice includes data to be processed. Extracting the data to be processed in each data stream slice from the disk into memory may be done by extracting the data stream slices one by one, which reduces cases where all data stream slices are loaded into memory without being merged, so as to improve memory utilization.
It can be understood that extracting a data stream slice from the disk into memory may simply mean reading the data stream slice from the disk into memory through any data reading method in the related art, which is not specifically limited here.
Step S150: perform, in the memory, first merge processing on the data to be processed that has the same dimension information, to obtain a target data set.
In this step, every piece of data to be processed includes dimension information. In one implementation, the dimension information may include multiple dimension values, and having the same dimension information means that all of the dimension values are the same. The first merge processing refers to merging pieces of data to be processed that have the same dimension information into a single record; the merging may combine the other information in the data to be processed that does not belong to the dimension information, and the combining may be addition, subtraction, division, or the like of the respective pieces of other information, as long as pieces of data to be processed with the same dimension information can be merged into one piece of data to be processed, which is not specifically limited here.
It can be understood that the target data set refers to the data set obtained after the first merge processing, and it is also the data set cached in memory. Since the received data stream is subjected to the first merge processing to obtain the target data set, pieces of data to be processed with the same dimension information are merged, the data volume of the originally received data stream becomes smaller, and the target data set read into memory occupies less space, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S110 to S150, a data stream is received, where the data stream includes a plurality of pieces of data to be processed and the data to be processed includes dimension information; the data stream is sliced at preset time intervals to obtain a plurality of data stream slices; the plurality of data stream slices are cached on the disk; the data to be processed in each data stream slice is extracted from the disk into memory; and the data to be processed with the same dimension information is subjected to first merge processing in memory to obtain the target data set. According to the solution of the embodiment of the present application, caching the data stream slices on the disk saves memory storage cost; extracting the data to be processed in each data stream slice from the disk into memory and performing the first merge processing in memory on the data to be processed with the same dimension information to obtain the target data set reduces cases where the data stream is read directly in memory; after merging, the number of records becomes smaller, so the data volume of the target data set is reduced, which further reduces cases where a data set containing a large amount of data is read directly in memory, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 2, step S130 is further described. Step S130 may also include, but is not limited to, step S210 and step S220.
Step S210: sort the data to be processed in each data stream slice according to the dimension information.
In this step, sorting the data to be processed in each data stream slice according to the dimension information means ordering the data to be processed in each data stream slice by the value of the dimension information.
It should be noted that the sorting may be in ascending order of the dimension information or in descending order of the dimension information. When the dimension information includes multiple dimension values, the sorting is performed on each dimension value in turn. In one implementation, referring to FIG. 12, the data is first sorted by dimension a and then by dimension b, yielding multiple ordered data stream slices, which facilitates the subsequent first merge processing of the data to be processed: the data in each data stream slice only needs to be read sequentially to be merged, thereby improving merge efficiency.
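Sorting a slice by several dimension fields in turn can be expressed with a composite key, as in the sketch below; ascending order is assumed here, matching the "small to large" case, and the dictionary field names are illustrative only.

```python
def sort_slice(slice_records, dimension_fields):
    """Sort the records of one slice by their dimension values (ascending)."""
    return sorted(
        slice_records,
        key=lambda rec: tuple(rec[field] for field in dimension_fields),
    )

# Example: order first by dimension a, then by dimension b, as in FIG. 12.
# ordered = sort_slice(records, ["a", "b"])
```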
Step S220: cache the plurality of sorted data stream slices on the disk.
In this step, the sorted data stream slices are ordered. Caching the plurality of sorted data stream slices on the disk facilitates the first merge processing of the data to be processed in the subsequent steps: the data to be processed in each data stream slice only needs to be read sequentially to be merged, which saves the memory consumed by reading and pairwise comparison and improves merge efficiency, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S210 to S220, the data in each data stream slice is sorted according to the dimension information, and the plurality of sorted data stream slices are cached on the disk. According to the solution of the embodiment of the present application, the data stream slices are sorted before being cached on the disk; when a data stream slice is read into memory, the sorted data stream slices facilitate the merging of the data stream slices in subsequent steps and improve merge efficiency, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 3, step S150 is further described. Step S150 may also include, but is not limited to, step S151, step S152, step S153, and step S154.
Step S151: traverse the data to be processed in each data stream slice.
In this step, since the data stream slices are obtained by slicing the data stream, a data stream slice may also include multiple pieces of data to be processed. Traversing the data to be processed in each data stream slice means extracting the data to be processed in each data stream slice into memory in turn, which facilitates the merging of the data to be processed in each data stream slice in the subsequent steps.
Step S152: obtain the data to be processed according to the dimension information.
In this step, the data to be processed refers to the data to be processed obtained from each data stream slice respectively. Since the data to be processed in each data stream slice has been sorted, the data to be processed in each data stream slice is ordered. Obtaining the data to be processed according to the dimension information means obtaining the data to be processed according to the value of the dimension information, so that the data to be processed can be merged directly, which reduces cases where the dimension information of the obtained data to be processed has to be compared with the dimension information of all the data to be processed in every data stream slice, and facilitates obtaining the target data in the subsequent steps.
It can be understood that, for obtaining the data to be processed according to the dimension information, in one implementation, N pieces of data to be processed are read from the compressed file formed by each corresponding data stream slice, generating result sets {RS1, RS2, ..., RSk}, where k refers to the number of data stream slices. The value of N may differ for different files; that is, the amount of data to be processed in each data stream slice may be different.
Step S153: obtain target data according to the quantity information of the data to be processed.
In this step, obtaining the target data according to the quantity information of the data to be processed means that if there are multiple pieces of data to be processed with the same dimension information, those pieces of data to be processed are merged into the target data; if the dimension information of a piece of data to be processed is not the same as the dimension information of any other piece of data to be processed, that piece of data to be processed is directly determined as the target data.
It should be noted that, since the data to be processed in the data stream slices has been sorted and the data to be processed is obtained according to the dimension information, the target data can be obtained according to the quantity information of the data to be processed during the traversal, which reduces cases where the dimension information of the obtained data to be processed has to be compared with the dimension information of all the data to be processed in every data stream slice, improves merge efficiency, and thereby achieves the purpose of improving memory utilization.
Step S154: obtain the target data set according to the target data.
In this step, the target data refers to the data obtained after the data to be processed has been merged, and the target data set includes multiple pieces of target data.
It should be noted that, since the target data is obtained after merging the data to be processed, the occurrence of identical dimension information in the target data is reduced, which reduces the data volume of the target data set read into memory, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S151 to S154, the data to be processed in each data stream slice is traversed, the data to be processed is obtained according to the dimension information, the target data is obtained according to the quantity information of the data to be processed, and the target data set is obtained according to the target data. According to the solution of the embodiment of the present application, obtaining the data to be processed according to the dimension information and obtaining the target data according to the quantity information of the data to be processed, so as to obtain the target data set, simplifies the first merge processing of the data to be processed and reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
It is worth noting that, in one implementation, N pieces of data to be processed are read from the compressed file formed by each data stream slice, generating result sets {RS1, RS2, ..., RSk}; the value of N may differ for different files, that is, the amount of data to be processed in the data stream slices may be different. Denote the merged result set as RSmer. From the result sets {RS1, RS2, ..., RSk}, read i pieces of data to be processed in turn (denoted R1, R2, ..., Ri); if some result sets in {RS1, RS2, ..., RSk} contain fewer than one piece of data to be processed, then i < k, otherwise i = k. Compare the dimension information of the i pieces of data to be processed R1, R2, ..., Ri that have been read, denoted Procmin(R1, R2, ..., Ri), to obtain the data to be processed with the smallest dimension value, denoted RSmin. Considering that the number of pieces of data to be processed in the Procmin step may be large, an algorithm may be used to sort them and generate the required result; such a result may be organized by a data structure in which one key corresponds to multiple values, so that after sorting only the first piece of data to be processed in the result needs to be taken out, denoted RSmin. If RSmin corresponds to more than one piece of data to be processed, these pieces of data to be processed need to be merged, giving RSmin' = Procmerge(RSmin). The data to be processed may include indicator information, and the indicator information is combined during the merge; the combining calculation methods for the indicator information include, but are not limited to, simple or more complex mathematical algorithms such as summation, averaging, counting, and maximum/minimum values. If RSmin contains only one piece of data to be processed, then RSmin' = RSmin. RSmin' is then written into RSmer. Next, one more piece of data to be processed is read from each result set from which RSmin was taken, to refill the vacated positions, and the above merge operation is repeated until the number of pieces of data to be processed in the process is i = 0, which indicates that all data in the result sets {RS1, RS2, ..., RSk} has been processed. When the first merge processing is performed multiple times, the data stream slices are merged according to the above steps until the data to be processed in all data stream slices of the queue to be merged Qmer has been merged, yielding the target data set.
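The multi-way merge outlined above can be pictured with the standard heap-based k-way merge sketched below. This is only an illustrative sketch, not the claimed procedure: it assumes every slice file is already sorted by its dimension key and, unlike the refill-by-position description above, it relies on the standard-library heap merge; `dim_key`, `combine`, and the dimension field names are assumptions of the example.

```python
import heapq

def merge_sorted_slices(slice_iterators, dim_key, combine):
    """k-way merge of sorted slices, combining records that share a dimension key.

    slice_iterators: one iterator of records per slice, each sorted by dim_key
    dim_key:         function mapping a record to its dimension tuple
    combine:         function merging two records with equal dimension keys
    """
    merged_stream = heapq.merge(*slice_iterators, key=dim_key)
    current = None
    for record in merged_stream:
        if current is None:
            current = record
        elif dim_key(record) == dim_key(current):
            current = combine(current, record)   # same dimensions: merge indicators
        else:
            yield current                        # dimensions changed: emit target record
            current = record
    if current is not None:
        yield current

# Example combine step: keep the dimensions, sum the indicator fields.
def sum_indicators(a, b):
    out = dict(a)
    for field, value in b.items():
        if field not in ("user_id", "cell_id"):  # hypothetical dimension fields
            out[field] = out.get(field, 0) + value
    return out
```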
In one embodiment, as shown in FIG. 4, step S152 is further described. Step S152 may also include, but is not limited to, step S1521 and step S1522.
Step S1521: when the sorting is ascending order of the dimension information, obtain the data to be processed with the smallest dimension information.
In this step, sorting in ascending order of the dimension information means comparing the dimension information by any sorting method in the related art and ordering it from small to large. When the data type of the dimension information is not numeric, the ordering may be alphabetical, as long as the data to be processed forms an ordered sequence in which the dimension information goes from small to large, which is not specifically limited here.
It should be noted that, since the data to be processed extracted from the disk into memory comes from different data stream slices and the data to be processed in each data stream slice has been sorted in ascending order of the dimension information, obtaining the data to be processed with the smallest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
It should also be noted that, in one implementation, there are i data stream slices in total; the dimension information of the i pieces of data to be processed that have been read is compared, and the piece with the smallest dimension value is obtained. Such data to be processed may be organized by a data structure in which one key corresponds to multiple values, i.e. one key mapping to multiple values; after sorting, only the first element of the result needs to be taken out, which corresponds to the data to be processed with the smallest dimension information.
Step S1522: when the sorting is descending order of the dimension information, obtain the data to be processed with the largest dimension information.
In this step, sorting in descending order of the dimension information means comparing the dimension information by any sorting method in the related art and ordering it from large to small. When the data type of the dimension information is not numeric, the ordering may be reverse alphabetical, as long as the data to be processed forms an ordered sequence in which the dimension information goes from large to small, which is not specifically limited here.
It should be noted that, since the data to be processed extracted from the disk into memory comes from different data stream slices and the data to be processed in each data stream slice has been sorted in descending order of the dimension information, obtaining the data to be processed with the largest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S1521 to S1522, when the sorting is ascending order of the dimension information, the data to be processed with the smallest dimension information is obtained, or, when the sorting is descending order of the dimension information, the data to be processed with the largest dimension information is obtained. According to the solution of the embodiment of the present application, data to be processed with different dimension information is selected according to the sorting direction and the dimension information, which reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 5, step S153 is further described. Step S153 may also include, but is not limited to, step S1531 and step S1532.
Step S1531: when the quantity of the data to be processed is greater than one, merge the data to be processed to obtain the target data.
In this step, since the data to be processed in each data stream slice has been sorted, when the quantity of the data to be processed is greater than one, the data to be processed can be merged to obtain the target data. In one implementation, the data to be processed may first be extracted; when the quantity of data to be processed with the same dimension information is greater than one, those pieces are merged into the target data; the data stream slice from which the merged data to be processed was extracted is then traversed to extract the next piece of data to be processed, and according to the dimension information that piece is merged with the target data and with the data to be processed from the other data stream slices, until only one piece of data to be processed with that dimension information remains. The merging includes, but is not limited to, applying simple or more complex mathematical algorithms such as summation, averaging, counting, and maximum/minimum values to the other information in the data to be processed that does not belong to the dimension information, which is not specifically limited here. This improves the efficiency of the first merge processing, thereby achieving the purpose of improving memory utilization.
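The indicator aggregations mentioned above (summation, averaging, counting, maximum/minimum) can be sketched as follows; which function applies to which field is a deployment choice and is not fixed by the method, and the `rules` mapping is an assumption of this example.

```python
def merge_group(records, rules):
    """Merge records that share one dimension key.

    rules maps an indicator field name to 'sum', 'avg', 'max', 'min' or 'count'.
    Dimension fields are simply carried over from the first record.
    """
    merged = dict(records[0])
    for field, how in rules.items():
        values = [r[field] for r in records]
        if how == "sum":
            merged[field] = sum(values)
        elif how == "avg":
            merged[field] = sum(values) / len(values)
        elif how == "max":
            merged[field] = max(values)
        elif how == "min":
            merged[field] = min(values)
        elif how == "count":
            merged[field] = len(values)
    return merged
```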
Step S1532: when the quantity of the data to be processed is equal to one, determine the data to be processed as the target data.
In this step, when the quantity of the extracted data to be processed is equal to one, that is, there is only one piece of data to be processed with that dimension information among the data stream slices extracted into memory, that piece of data to be processed is used directly as the target data, which improves the efficiency of the first merge processing, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S1531 to S1532, when the quantity of the data to be processed is greater than one, the data to be processed is merged to obtain the target data; or, when the quantity of the data to be processed is equal to one, the data to be processed is determined as the target data. According to the solution of the embodiment of the present application, whether data to be processed with the same dimension information exists in the data stream slices is judged according to the quantity information of the data to be processed, so as to determine the target data; this reduces cases where the data to be processed extracted from the disk into memory is directly compared against all the data to be processed in every data stream slice, and improves the efficiency of the first merge processing, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 6, the data processing method is further described. The data processing method may also include, but is not limited to, step S610 and step S620.
Step S610: obtain the address information at which each data stream slice is cached on the disk.
In this step, the address information of a data stream slice on the disk may be obtained after the data stream slice is cached on the disk; alternatively, free address information on the disk may be obtained before the data stream slice is cached, and the data stream slice is then cached at the disk address corresponding to that address information. The address information at which each data stream slice is cached on the disk is obtained so that the address information can be saved in memory in the subsequent steps.
Step S620: save the address information in memory.
In this step, since the data stream slices are cached on the disk, the address information is saved in memory; when a data stream slice needs to be read, the data stream slice is obtained according to the address information held in memory, which reduces the memory consumed by storing the data stream slices themselves, thereby achieving the purpose of improving memory utilization.
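A minimal sketch of such an in-memory address index is given below; it assumes slices were written in the compressed-file format of the earlier `cache_slice` sketch, so the class name and file format are illustrative assumptions, not part of the method.

```python
import gzip
import pickle

class SliceIndex:
    """Keeps only the disk addresses (file paths) of cached slices in memory."""

    def __init__(self):
        self._paths = []

    def register(self, path):
        self._paths.append(path)

    def load(self, path):
        # Pull one slice back from disk only when it is actually needed.
        with gzip.open(path, "rb") as f:
            return pickle.load(f)

    def load_all(self):
        for path in self._paths:
            yield self.load(path)
```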
In this embodiment, by adopting the data processing method including the above steps S610 to S620, the address information at which each data stream slice is cached on the disk is obtained, and the address information is saved in memory. According to the solution of the embodiment of the present application, since the data stream slices are cached on the disk and the address information is saved in memory, when a data stream slice needs to be read, the memory obtains the data stream slice directly according to the address information, which reduces the memory consumed by storing the data stream slices, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 7, step S140 is further described. Step S140 may also include, but is not limited to, step S710 and step S720.
Step S710: read the address information of each data stream slice from memory.
In this step, the address information refers to the address information of each data stream slice on the disk. In one implementation, each data stream slice is saved on the disk in the form of a file, and the address information refers to the path information of the file formed by each data stream slice; alternatively, the data stream slices may be saved in data tables of a database, and the address information refers to the path information of each data table.
It should be noted that the data stream slices are cached on the disk; when a data stream slice needs to be read into memory, the address information at which the data stream slice is cached on the disk is obtained first, and each data stream slice is read according to the address information. The memory only caches the address information, which saves the consumption that would be caused by storing the data stream slices in memory, thereby achieving the purpose of improving memory utilization.
Step S720: extract the data to be processed in each data stream slice from the disk into memory according to the address information.
In this step, the extracted data to be processed comes from different data stream slices. Extracting the data to be processed from the disk into memory may mean writing the data to be processed, in order, into a cache queue in memory, which facilitates the subsequent first merge processing of the data to be processed.
It should be noted that the address information of each data stream slice is cached in memory, and the data to be processed in each data stream slice is extracted from the disk into memory according to the address information, which reduces cases where unmerged data stream slices enter memory directly, facilitates the subsequent first merge processing of the extracted data to be processed, and reduces the memory consumed by storing the data, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S710 to S720, the address information of each data stream slice is read from memory, and the data to be processed in each data stream slice is extracted from the disk into memory according to the address information. According to the solution of the embodiment of the present application, the address information of each data stream slice is cached in memory, and the data to be processed in each data stream slice is extracted from the disk into memory through the address information, which facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumed by storing the data, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 8, step S130 is further described. Step S130 may also include, but is not limited to, step S810 and step S820.
Step S810: perform second merge processing on the data to be processed that has the same dimension information within each data stream slice, to obtain a plurality of merged data stream slices.
In this step, after the data stream has been sliced, the data to be processed with the same dimension information within each data stream slice may be subjected to second merge processing. The second merge processing refers to merging the data to be processed within each data stream slice, so that the pieces of data to be processed within the same data stream slice all have different dimension information, which reduces the memory resources consumed by the first merge processing in the subsequent steps and achieves the purpose of improving memory utilization.
Step S820: cache the plurality of merged data stream slices on the disk.
In this step, the data stream slices cached on the disk have undergone the second merge processing, which reduces the memory resources consumed by the first merge processing in the subsequent steps; moreover, the data volume of the merged data stream slices is reduced, thereby reducing the disk resources consumed by caching the data stream slices.
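A sketch of such a within-slice pre-merge is shown below; `dim_key` and `combine` are the same illustrative hooks used in the earlier merge sketches and are assumptions of this example rather than fixed interfaces of the method.

```python
from collections import defaultdict

def pre_merge_slice(slice_records, dim_key, combine):
    """Second merge: collapse duplicate dimension keys inside one slice
    before the slice is written to disk, so the cached file is smaller."""
    groups = defaultdict(list)
    for record in slice_records:
        groups[dim_key(record)].append(record)
    merged_slice = []
    for recs in groups.values():
        merged = recs[0]
        for other in recs[1:]:
            merged = combine(merged, other)
        merged_slice.append(merged)
    return merged_slice
```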
In this embodiment, by adopting the data processing method including the above steps S810 to S820, the data to be processed with the same dimension information within each data stream slice is subjected to second merge processing to obtain a plurality of merged data stream slices, and the plurality of merged data stream slices are cached on the disk. According to the solution of the embodiment of the present application, performing the second merge processing on the data stream slices before caching them on the disk reduces the memory resources consumed by the first merge processing in the subsequent steps; moreover, the data volume of the merged data stream slices is reduced, thereby reducing the disk resources consumed by caching the data stream slices.
In one embodiment, as shown in FIG. 9, the data processing method is further described. The data processing method may also include, but is not limited to, step S910.
Step S910: filter the data to be processed according to a preset filter condition, where the preset filter condition includes: the indicator information of the data to be processed is smaller than a preset indicator value threshold.
In this step, the indicator information refers to the information in the non-dimension fields of the data to be processed. In one implementation, referring to FIG. 13, if the user number information and the cell number information are preset as dimension information, the data corresponding to the number of TCP uplink packets, the number of TCP downlink packets, the TCP uplink traffic, the TCP downlink traffic, and the TCP service duration all belong to indicator information; alternatively, referring to FIG. 14, if the service type number and the district/county number are preset as dimension information, the data corresponding to the average response delay, the average display delay, the response success rate, the display success rate, and the total traffic all belong to indicator information.
It should be noted that the received data stream may be data received directly from the network, or data read from a database in blocks. Filtering the data to be processed refers to removing the data to be processed that does not pass the preset filtering, so as to reduce the impact of abnormal data to be processed on the data processing.
It should also be noted that the preset filter condition includes that the indicator information of the data to be processed is smaller than a preset indicator value threshold. In one implementation, referring to FIG. 13, if the preset TCP service duration threshold is set to 3, the data to be processed whose TCP service duration equals 2, namely data 4 in FIG. 13, is removed; alternatively, referring to FIG. 14, if the preset response success rate threshold is 90, the data to be processed whose response success rate is 89, namely data 3 in FIG. 14, is removed.
It should also be noted that filtering the data to be processed according to the preset filter condition may be performed after the data stream is received, or before the data stream slices are cached on the disk, thereby reducing the data volume of the data stream slices cached on the disk and improving the utilization of the disk space.
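The threshold filter described above can be sketched in a few lines; the field name and threshold below simply mirror the FIG. 13 example (TCP service duration threshold of 3) and are not prescribed by the method.

```python
def filter_records(records, field, threshold):
    """Drop records whose indicator value falls below the preset threshold."""
    return [r for r in records if r[field] >= threshold]

# Example mirroring FIG. 13: remove records whose TCP service duration is below 3.
# kept = filter_records(records, "tcp_duration", 3)
```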
In this embodiment, by adopting the data processing method including the above step S910, the data to be processed is filtered according to a preset filter condition, where the preset filter condition includes: the indicator information of the data to be processed is smaller than a preset indicator value threshold. According to the solution of the embodiment of the present application, the data to be processed that does not pass the preset filtering is removed, which reduces the impact of abnormal data to be processed on the data processing and improves the accuracy of the data processing; moreover, the data volume of the data to be processed is also reduced, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 10, step S140 is further described. Step S140 may also include, but is not limited to, step S1010 and step S1020.
Step S1010: when the quantity of the data to be processed cached on the disk is greater than a preset quantity threshold, extract the data to be processed in each data stream slice from the disk into memory.
In this step, the preset quantity threshold is set manually; for example, the user may set the preset quantity threshold according to the size of the disk space and the size of the memory space, which reduces cases where the data cache occupies too much disk space and makes disk consumption and memory consumption more reasonable, thereby improving the utilization of both the disk and the memory.
Step S1020: obtain the total slicing time, and when the total slicing time is greater than a preset time threshold, extract the data to be processed in each data stream slice from the disk into memory.
In this step, the preset time threshold is set manually. In one implementation, the system receives network data from the network in real time, and the user needs to analyze the data within a preset period; the total slicing time is obtained, and when the total slicing time is greater than the preset time threshold, all the data to be processed that needs to be analyzed has been sliced and the slicing ends, so the data to be processed in each data stream slice is extracted from the disk into memory, which makes the processing of the data to be processed received from the data stream more reasonable.
It should be noted that, in one implementation, referring to FIG. 15, the preset time threshold is the time statistics granularity T0, and the preset slicing interval refers to the data aggregation slice granularity Tc. The data stream is received, and when the statistics granularity T0 ends, that is, when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk into memory and waits for the subsequent merging; the data to be processed within the statistics granularity T0 is exactly all the data that the user needs loaded into memory for merging.
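As a small illustrative sketch, the two trigger conditions above can be checked together; the parameter names are hypothetical, and `total_slice_time` corresponds to the statistics granularity T0 of FIG. 15 while the slicing interval itself (Tc) only controls slicing.

```python
def should_extract(cached_record_count, count_threshold,
                   total_slice_time, time_threshold):
    """Trigger the disk-to-memory extraction when either preset limit is reached."""
    return (cached_record_count > count_threshold
            or total_slice_time > time_threshold)
```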
In this embodiment, by adopting the data processing method including the above steps S1010 to S1020, when the quantity of the data to be processed cached on the disk is greater than the preset quantity threshold, the data to be processed in each data stream slice is extracted from the disk into memory; or, the total slicing time is obtained, and when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk into memory. According to the solution of the embodiment of the present application, when the quantity of the data to be processed cached on the disk is greater than the preset quantity threshold, or when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk into memory, which reduces cases where the data cache occupies too much disk space and facilitates the merging of the data to be processed in the subsequent steps, achieving the purpose of improving memory utilization.
In addition, an embodiment of the present application further provides a data processing device, which includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
The processor and the memory may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may include memories remote from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the data processing method of the above embodiments are stored in the memory, and when executed by the processor, perform the data processing method of the above embodiments, for example, performing the above-described method steps S110 to S150 in FIG. 1, method steps S210 to S220 in FIG. 2, method steps S151 to S154 in FIG. 3, method steps S1521 to S1522 in FIG. 4, method steps S1531 to S1532 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method step S910 in FIG. 9, and method steps S1010 to S1020 in FIG. 10.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or a controller, for example by a processor in the above device embodiment, cause the processor to perform the data processing method of the above embodiments, for example, performing the above-described method steps S110 to S150 in FIG. 1, method steps S210 to S220 in FIG. 2, method steps S151 to S154 in FIG. 3, method steps S1521 to S1522 in FIG. 4, method steps S1531 to S1532 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method step S910 in FIG. 9, and method steps S1010 to S1020 in FIG. 10.
The embodiments of the present application include: receiving a data stream, where the data stream includes a plurality of pieces of data to be processed, and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain a plurality of data stream slices; caching the plurality of data stream slices on a disk; extracting the data to be processed in each data stream slice from the disk into memory; and performing, in the memory, first merge processing on the data to be processed with the same dimension information, to obtain a target data set. According to the solution of the embodiments of the present application, caching the data stream slices on the disk saves memory storage cost; extracting the data to be processed in each data stream slice from the disk into memory and performing the first merge processing in memory on the data to be processed with the same dimension information to obtain the target data set reduces cases where the data stream is read directly in memory; after merging, the number of records becomes smaller, so the data volume of the target data set is reduced, which further reduces cases where the data to be processed in the data stream is read directly in memory, thereby achieving the purpose of improving memory utilization.
Those of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The above is a specific description of several embodiments of the present application, but the present application is not limited to the above embodiments. Those skilled in the art can also make various equivalent variations or substitutions without departing from the spirit of the present application, and such equivalent variations or substitutions are all included within the scope defined by the claims of the present application.

Claims (12)

  1. A data processing method, comprising:
    receiving a data stream, wherein the data stream comprises a plurality of pieces of data to be processed, and the data to be processed comprises dimension information;
    slicing the data stream at preset time intervals to obtain a plurality of data stream slices;
    caching the plurality of data stream slices on a disk;
    extracting the data to be processed in each of the data stream slices from the disk into a memory; and
    performing, in the memory, first merge processing on the data to be processed having the same dimension information, to obtain a target data set.
  2. The data processing method according to claim 1, wherein caching the plurality of data stream slices on the disk comprises:
    sorting the data to be processed in each of the data stream slices according to the dimension information; and
    caching the plurality of sorted data stream slices on the disk.
  3. The data processing method according to claim 2, wherein the data to be processed in the data stream slices has been sorted, and performing, in the memory, the first merge processing on the data to be processed having the same dimension information to obtain the target data set comprises:
    traversing the data to be processed in each of the data stream slices;
    obtaining the data to be processed according to the dimension information;
    obtaining target data according to quantity information of the data to be processed; and
    obtaining the target data set according to the target data.
  4. The data processing method of claim 3, wherein acquiring the data to be processed according to the sorting and the dimension information comprises:
    when the sorting is in ascending order of the dimension information, acquiring the data to be processed having the smallest dimension information;
    or,
    when the sorting is in descending order of the dimension information, acquiring the data to be processed having the largest dimension information.
  5. The data processing method of claim 3, wherein obtaining the target data according to the quantity information of the data to be processed comprises:
    when the quantity of the data to be processed is greater than one, merging the data to be processed to obtain the target data;
    or,
    when the quantity of the data to be processed is equal to one, determining the data to be processed as the target data.
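
For illustration only: a sketch of the traversal described in claims 3 to 5, assuming each slice is already sorted in ascending order of its dimension information (claims 2 and 4) and that merging several records with the same dimension means summing their values (claim 5). A heapq-based k-way merge is one possible way to always pick the record with the smallest dimension next; it is not mandated by the claims.

```python
import heapq

def merge_sorted_slices(sorted_slices):
    """Traverse slices sorted ascending by dimension and merge duplicate dimensions."""
    # heapq.merge always yields the record with the smallest dimension next,
    # matching the ascending-order branch of claim 4.
    stream = heapq.merge(*sorted_slices, key=lambda rec: rec[0])
    target_data_set = []
    current_dim, current_val, count = None, 0, 0
    for dim, val in stream:
        if dim != current_dim and count:
            target_data_set.append((current_dim, current_val))  # one or more records merged
            current_val, count = 0, 0
        current_dim = dim
        current_val += val   # assumed merge rule: accumulate the values
        count += 1
    if count:
        target_data_set.append((current_dim, current_val))
    return target_data_set

# Two slices, each sorted ascending by dimension information
print(merge_sorted_slices([[("a", 1), ("c", 2)], [("a", 4), ("b", 3)]]))
# [('a', 5), ('b', 3), ('c', 2)]
```
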
  6. The data processing method of claim 1, further comprising:
    acquiring address information of each of the data stream slices cached on the disk; and
    saving the address information in the memory.
  7. The data processing method of claim 1, wherein extracting the data to be processed in each of the data stream slices from the disk into the memory comprises:
    reading the address information of each of the data stream slices from the memory; and
    extracting the data to be processed in each of the data stream slices from the disk into the memory according to the address information.
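
For illustration only: a sketch of claims 6 and 7 under the assumption that all slices are appended to a single cache file and that the address information kept in memory is each slice's byte offset and length. The single-file layout and the pickle serialization are assumptions made for the example.

```python
import pickle

class SliceCache:
    """Append each slice to one cache file; keep its (offset, length) in memory."""

    def __init__(self, path):
        self.path = path
        self.addresses = []        # address information kept in memory (claim 6)
        self._offset = 0
        open(path, "wb").close()   # start with an empty cache file

    def cache_slice(self, records):
        blob = pickle.dumps(records)
        with open(self.path, "ab") as f:
            f.write(blob)
        self.addresses.append((self._offset, len(blob)))
        self._offset += len(blob)

    def load_slices(self):
        slices = []
        with open(self.path, "rb") as f:
            for offset, length in self.addresses:   # claim 7: extract by address
                f.seek(offset)
                slices.append(pickle.loads(f.read(length)))
        return slices
```

With such a cache, extraction never has to scan the disk file; each slice is read back directly from its recorded offset.
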
  8. The data processing method of claim 1, wherein caching the plurality of data stream slices on a disk comprises:
    performing a second merging process on the data to be processed having the same dimension information in each of the data stream slices, to obtain a plurality of merged data stream slices; and
    caching the plurality of merged data stream slices on the disk.
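
For illustration only: a sketch of the second merging process of claim 8, again assuming (dimension, value) records merged by summing. Aggregating inside each slice before it is written reduces the amount of data that has to be cached on the disk.

```python
from collections import defaultdict

def second_merge(slice_records):
    """Merge records sharing a dimension inside a single slice before caching."""
    acc = defaultdict(int)
    for dimension, value in slice_records:
        acc[dimension] += value    # assumed merge rule: sum the values
    return list(acc.items())

print(second_merge([("a", 1), ("a", 2), ("b", 7)]))  # [('a', 3), ('b', 7)]
```
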
  9. The data processing method of claim 1, wherein the data to be processed further comprises indicator information, and the data processing method further comprises:
    filtering the data to be processed according to a preset filter condition, wherein the preset filter condition comprises: the indicator information of the data to be processed being smaller than a preset indicator value threshold.
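
For illustration only: a sketch of the filtering of claim 9, assuming the indicator information is a numeric field, that filtering means discarding the records that satisfy the condition, and that the threshold value shown is a placeholder.

```python
INDICATOR_THRESHOLD = 10  # preset indicator value threshold (placeholder)

def filter_records(records, threshold=INDICATOR_THRESHOLD):
    """Discard records whose indicator information falls below the threshold."""
    return [r for r in records if r["indicator"] >= threshold]

records = [{"dimension": "a", "indicator": 3}, {"dimension": "b", "indicator": 42}]
print(filter_records(records))  # [{'dimension': 'b', 'indicator': 42}]
```
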
  10. The data processing method of claim 1, wherein extracting the data to be processed in each of the data stream slices from the disk into the memory comprises:
    when the quantity of the data to be processed cached on the disk is greater than a preset quantity threshold, extracting the data to be processed in each of the data stream slices from the disk into the memory;
    or,
    acquiring a total slicing time, and when the total slicing time is greater than a preset time threshold, extracting the data to be processed in each of the data stream slices from the disk into the memory.
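
For illustration only: a sketch of the two extraction triggers of claim 10, assuming the caller tracks the number of records cached on disk and the time at which slicing started; both threshold values are placeholders.

```python
import time

COUNT_THRESHOLD = 100_000   # preset quantity threshold (placeholder)
TIME_THRESHOLD_S = 60.0     # preset time threshold (placeholder)

def should_extract(cached_record_count, slicing_started_at):
    """Return True when the cached data should be pulled from disk into memory."""
    too_many = cached_record_count > COUNT_THRESHOLD
    too_long = (time.monotonic() - slicing_started_at) > TIME_THRESHOLD_S
    return too_many or too_long
```
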
  11. A data processing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data processing method of any one of claims 1 to 10.
  12. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the data processing method of any one of claims 1 to 10.
PCT/CN2022/125989 2021-12-15 2022-10-18 Data processing method and device, and storage medium WO2023109302A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111532433.8 2021-12-15
CN202111532433.8A CN116263747A (en) 2021-12-15 2021-12-15 Data processing method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2023109302A1 true WO2023109302A1 (en) 2023-06-22

Family

ID=86722493

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125989 WO2023109302A1 (en) 2021-12-15 2022-10-18 Data processing method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN116263747A (en)
WO (1) WO2023109302A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326461A (en) * 2016-08-30 2017-01-11 杭州东方通信软件技术有限公司 Real time processing guarantee method and system based on network signaling record
US20170083378A1 (en) * 2015-09-18 2017-03-23 Salesforce.Com, Inc. Managing processing of long tail task sequences in a stream processing framework
CN109726209A (en) * 2018-09-07 2019-05-07 网联清算有限公司 Log aggregation method and device
CN111651510A (en) * 2020-05-14 2020-09-11 拉扎斯网络科技(上海)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN112685368A (en) * 2020-12-30 2021-04-20 成都科来网络技术有限公司 Method and system for processing complete session of super-large data packet file and readable storage medium


Also Published As

Publication number Publication date
CN116263747A (en) 2023-06-16

Similar Documents

Publication Publication Date Title
WO2020024799A1 (en) Method for aggregation optimization of time series data
US9690842B2 (en) Analyzing frequently occurring data items
EP3812915A1 (en) Big data statistics at data-block level
US11138183B2 (en) Aggregating data in a mediation system
WO2016141735A1 (en) Cache data determination method and device
CN107704203B (en) Deletion method, device and equipment for aggregated large file and computer storage medium
CN106033324B (en) Data storage method and device
WO2015024474A1 (en) Rapid calculation method for electric power reliability index based on multithread processing of cache data
CN111782707B (en) Data query method and system
US11625412B2 (en) Storing data items and identifying stored data items
CN111522786A (en) Log processing system and method
WO2023155849A1 (en) Sample deletion method and apparatus based on time decay, and storage medium
EP3726397A1 (en) Join query method and system for multiple time sequences under columnar storage
CN106990914B (en) Data deleting method and device
CN114328545A (en) Data storage and query method, device and database system
CN115408149A (en) Time sequence storage engine memory design and distribution method and device
Zhang et al. Efficient incremental computation of aggregations over sliding windows
CN107346270B (en) Method and system for real-time computation based radix estimation
WO2023109302A1 (en) Data processing method and device, and storage medium
US11789639B1 (en) Method and apparatus for screening TB-scale incremental data
CN106599005B (en) Data archiving method and device
WO2023071367A1 (en) Processing method and apparatus for communication service data, and computer storage medium
CN108153805A (en) A kind of method, the system of efficient cleaning Hbase time series datas
CN110990394B (en) Method, device and storage medium for counting number of rows of distributed column database table
CN113625959B (en) Data processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22906054

Country of ref document: EP

Kind code of ref document: A1