WO2023109302A1 - Data processing method and device, and storage medium - Google Patents

Data processing method and device, and storage medium

Info

Publication number
WO2023109302A1
WO2023109302A1 (PCT/CN2022/125989; CN2022125989W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
processed
memory
disk
data stream
Application number
PCT/CN2022/125989
Other languages
French (fr)
Chinese (zh)
Inventor
杨伟伟 (Yang Weiwei)
占义忠 (Zhan Yizhong)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2023109302A1

Classifications

    • G06F (Electric digital data processing; within G06 Computing, calculating or counting; section G Physics)
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F 12/0868: Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
    • G06F 12/0871: Allocation or management of cache space
    • G06F 2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/1041: Resource optimization
    • G06F 2212/1044: Space efficiency improvement
    • G06F 2212/22: Employing cache memory using specific memory technology
    • G06F 2212/224: Disk storage
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (section Y02D, climate change mitigation technologies in information and communication technologies)

Definitions

  • The present application relates to the field of big data, and in particular to a data processing method, a corresponding device, and a storage medium.
  • Embodiments of the present application provide a data processing method and device thereof, and a storage medium.
  • The embodiment of the present application provides a data processing method, including: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each of the data stream slices from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set.
  • The embodiment of the present application also provides a data processing device, including: a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the data processing method described above is implemented when the processor executes the computer program.
  • the embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to execute the data processing method described in the first aspect above.
  • Fig. 1 is a flowchart of a data processing method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of the method of step S130 in Fig. 1;
  • Fig. 3 is a flowchart of the method of step S150 in Fig. 1;
  • Fig. 4 is a flowchart of the method of step S152 in Fig. 3;
  • Fig. 5 is a flowchart of the method of step S153 in Fig. 3;
  • Fig. 6 is a flowchart of a data processing method provided by another embodiment of the present application;
  • Fig. 7 is a flowchart of the method of step S140 in Fig. 1;
  • Fig. 8 is a flowchart of another embodiment of the method of step S130 in Fig. 1;
  • Fig. 9 is a flowchart of a data processing method provided by yet another embodiment of the present application;
  • Fig. 10 is a flowchart of another embodiment of the method of step S140 in Fig. 1;
  • Fig. 11 is an example diagram of a data structure to be processed in a data processing method provided by another embodiment of the present application;
  • Fig. 12 is an example diagram of data to be processed in a data processing method provided by another embodiment of the present application;
  • Fig. 13 is an example diagram of data to be processed in a data processing method provided by yet another embodiment of the present application;
  • Fig. 14 is an example diagram of data to be processed in a data processing method provided by still another embodiment of the present application;
  • Fig. 15 is an example diagram of a data flow of a data processing method provided by an embodiment of the present application.
  • The present application provides a data processing method, a corresponding device, and a storage medium. The data processing method includes: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each data stream slice from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set.
  • Because the data stream slices are cached on the disk, the storage cost of the memory is saved. The data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information to obtain the target data set, which reduces the need for the memory to read the data stream directly. After merging, the amount of data is smaller, so the data volume of the target data set is reduced and the memory no longer has to read a data set with a large amount of data directly, thereby improving the utilization of the memory.
  • FIG. 1 is a flow chart of a data processing method provided by an embodiment of the present application.
  • the data processing method may include but not limited to steps S110 , S120 , S130 , S140 , and S150 .
  • Step S110 Receive a data stream, wherein the data stream includes a plurality of data to be processed, and the data to be processed includes dimension information.
  • the data to be processed may be any data in related technologies, may be network data with key-value pairs, or may be relational data in a relational database.
  • The dimension information may be the data information corresponding to fields in an artificial division of the data. In one embodiment, referring to Fig. 11, there are n fields and m pieces of data in total, where data 11 to data 1n constitute one complete piece of data to be processed, the field refers to the field name, and the length refers to the length of the data in that field. If field 1 and field 2 are used as dimensions, the dimension information includes data 11 to data m1 and data 12 to data m2.
  • Alternatively, referring to Fig. 13, there are five pieces of data in total, and the data in the user number field and the cell number field may be preset as the dimension information; that is, the dimension information includes 44600001 and 25681 in data 1, 44600002 and 25682 in data 2, 44600001 and 25682 in data 3, 44600002 and 25684 in data 4, and 44600003 and 25683 in data 5. Or, referring to Fig. 14, there are four pieces of data in total, and the data in the service type number field and the district/county number field may be preset as the dimension information; that is, the dimension information includes 32 and 1 in data 1, 23 and 2 in data 2, 21 and 2 in data 3, and 15 and 3 in data 4.
  • It should be noted that, in one embodiment, the structure of the received data may be the data structure shown in Fig. 11, Fig. 13, or Fig. 14, which may first be processed and converted into an in-memory data format, such as a C-language structure or a TLV data structure used in the communications field, or the data structure shown in Fig. 12.
  • the data stream may be a data stream formed by receiving data from the network in real time, or a data stream formed when data is read from a database.
  • the data to be processed is obtained from the database.
  • In one embodiment, the fields and lengths included in the data structure of the data to be processed are shown in Fig. 13. Multiple pieces of data may correspond to different fields (see data 1 to data 5 in Fig. 13 or data 1 to data 4 in Fig. 14), and data records can be accessed through the fields.
  • the purpose of receiving the data stream is to facilitate slicing of the data stream in subsequent steps.
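To make the notion of dimension information concrete, the following minimal sketch (not part of the original disclosure; the field names follow the Fig. 14 example and are assumptions) models one piece of data to be processed as a field-to-value mapping and extracts its dimension information as a tuple:
```python
# Hypothetical sketch: one piece of data to be processed, keyed by field name.
record = {
    "service_type_no": 32,        # dimension field
    "district_no": 1,             # dimension field
    "avg_response_delay": 120,    # indicator (metric) field
    "total_traffic": 5342,        # indicator (metric) field
}

DIMENSION_FIELDS = ("service_type_no", "district_no")

def dimension_key(rec: dict) -> tuple:
    """Return the dimension information of a record as a hashable tuple."""
    return tuple(rec[f] for f in DIMENSION_FIELDS)

print(dimension_key(record))  # -> (32, 1)
```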
  • Step S120 Slice the data stream at preset time intervals to obtain multiple data stream slices.
  • the preset time refers to the artificially preset time.
  • the user sets the preset time according to the size of the disk, thereby controlling the size of the multiple data stream fragments obtained and reducing the problem of insufficient disk space.
  • the time for memory to read data flow fragments will also be reduced, achieving the purpose of improving memory utilization.
  • In one embodiment, to slice the data stream at preset time intervals, a start time may be set when the data stream is received and the current timestamp obtained in real time, so that the elapsed interval can be determined; when the elapsed interval equals the preset time, the data stream is sliced.
  • It can also be understood that the data volume of a data stream slice obtained by slicing is less than the data volume of the data stream, so the obtained data stream slices partition the data volume of the data stream and reduce the situations in which the data stream is read directly.
  • It can also be understood that, to slice the data stream at preset intervals, a waiting time may be set for the data stream: when a piece of data to be processed has been received and the interval until the next piece of data arrives exceeds the waiting time, the data stream may also be sliced.
  • In addition, a fixed slicing time may be set, and the data stream is sliced at every slicing-time interval, so as to divide the data stream into multiple data stream slices.
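A minimal sketch of slicing the data stream at preset time intervals, assuming an in-memory buffer and a wall-clock timer (the patent does not prescribe a particular implementation):
```python
import time

def slice_stream(stream, preset_interval_s: float):
    """Yield lists of records ("data stream slices"), cutting a new slice
    every preset_interval_s seconds of wall-clock time."""
    current_slice, start = [], time.monotonic()
    for record in stream:
        if time.monotonic() - start >= preset_interval_s and current_slice:
            yield current_slice
            current_slice, start = [], time.monotonic()
        current_slice.append(record)
    if current_slice:          # flush the final partial slice
        yield current_slice
```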
  • Step S130 Cache the multiple data stream slices on the disk.
  • In this step, the disk refers to any disk capable of storing data in the related art, which is not specifically limited here.
  • Caching multiple data stream fragments on the disk can be either separately caching the data stream fragments on the disk, or continuously caching multiple data stream fragments on the disk.
  • the purpose of caching multiple data stream fragments on disk is to reduce the memory usage of data stream fragments and achieve the purpose of improving memory utilization.
  • caching multiple data stream fragments on the disk may be saving the data stream fragments as files to the disk, or compressing the data stream fragments into compressed files and caching them on the disk.
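A minimal sketch of caching one slice on the disk as a compressed file, one of the options mentioned above; the gzip/JSON-lines format, the directory, and the function names are assumptions for illustration:
```python
import gzip, json, os

def cache_slice(slice_records, slice_id: int, cache_dir: str = "/tmp/slices") -> str:
    """Write one data stream slice to disk as a gzip-compressed JSON-lines file
    and return its path (the "address information" kept in memory later)."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"slice_{slice_id:06d}.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in slice_records:
            f.write(json.dumps(rec) + "\n")
    return path
```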
  • Step S140 Extract the data to be processed in each data flow fragment from the disk to the memory.
  • In this step, the data stream slices include the data to be processed.
  • The data to be processed in each data stream slice is extracted from the disk to the memory; the extraction may proceed sequentially from each data stream slice, reducing situations in which data stream slices are loaded directly into the memory without being merged, so as to improve the utilization of the memory.
  • the data flow fragments are extracted from the disk to the memory, and the data flow fragments can be directly read from the disk into the memory through the data reading method in the related art, which is not specifically limited here.
  • Step S150 Perform the first merging process on the data to be processed with the same dimension information in the memory to obtain the target data set.
  • the data to be processed includes dimension information.
  • the dimension information may include multiple dimension values, and having the same dimension information means that the multiple dimension values are all the same.
  • the first merging process refers to merging the data to be processed with the same dimension information into one piece of data.
  • The merging process may merge the other information in the data to be processed that does not belong to the dimension information, and the merging method may be addition, subtraction, division, or the like applied to that other information, as long as it can combine the data to be processed with the same dimension information into one piece of data to be processed; no specific limitation is made here.
  • In this step, the target data set refers to the data set obtained after the first merging process, which is also the data set cached in the memory. Because the received data stream is subjected to the first merging process to obtain the target data set, the data to be processed having the same dimension information is merged, the data volume of the originally received data stream is reduced, and the target data set read by the memory occupies less space, so as to improve the utilization of the memory.
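A minimal sketch of the first merging process, assuming, as one of the merge rules the text allows, that the indicator fields of records sharing the same dimension information are summed:
```python
def first_merge(records, dimension_fields, indicator_fields):
    """Merge records with identical dimension information into one record,
    summing their indicator fields; returns the target data set."""
    merged = {}
    for rec in records:
        key = tuple(rec[f] for f in dimension_fields)
        if key not in merged:
            merged[key] = dict(rec)
        else:
            for f in indicator_fields:
                merged[key][f] += rec[f]   # summation is just one permitted merge rule
    return list(merged.values())
```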
  • By adopting the data processing method including the above steps S110 to S150, a data stream is received, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; the data stream is sliced at preset time intervals to obtain multiple data stream slices; the multiple data stream slices are cached on the disk; the data to be processed in each data stream slice is extracted from the disk to the memory; and the data to be processed having the same dimension information is subjected to the first merging process in the memory to obtain the target data set.
  • Because the data stream slices are cached on the disk, the storage cost of the memory is saved. The data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information, which reduces the need for the memory to read the data stream directly; after merging, the amount of data is smaller, so the data volume of the target data set is reduced and the memory no longer has to read a data set with a large amount of data directly, thereby improving the utilization of the memory.
  • step S130 is further described, and step S130 may also include, but not limited to, step S210 and step S220 .
  • Step S210 Sort the data to be processed in each data flow fragment according to the dimension information.
  • sorting the data to be processed in each data flow slice according to the dimension information refers to sorting the data to be processed in each data slice according to the size of the dimension information.
  • the sorting can be sorted according to the dimension information from small to large, or according to the dimension information from large to small.
  • When the dimension information includes multiple dimension values, sorting is performed by each dimension value in turn. In one embodiment, the data in dimension a is sorted first, and then the data in dimension b is sorted, to obtain multiple ordered data stream slices. This facilitates the subsequent first merging of the data to be processed: before merging, the data in each data stream slice only needs to be read sequentially, which improves the merging efficiency.
  • Step S220 cache multiple sorted data stream fragments on disk.
  • The sorted data stream slices have an order, and caching multiple sorted data stream slices on the disk facilitates the first merging of the data to be processed in the subsequent steps: the data to be processed in each data stream slice only needs to be read sequentially to be merged, which saves the memory consumption of separate reads and comparisons, improves the merging efficiency, and thereby improves the utilization of the memory.
  • By adopting the method including the above steps S210 to S220, the data in each data stream slice is sorted according to the dimension information, and the multiple sorted data stream slices are cached on the disk. Because the data stream slices are sorted before being cached on the disk, the sorted slices facilitate the merging of each data stream slice in the subsequent steps and improve the merging efficiency, so as to improve the utilization of the memory.
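A minimal sketch of sorting the data to be processed in one slice by its dimension information before caching (ascending order shown; descending is equally possible per the text):
```python
def sort_slice(slice_records, dimension_fields):
    """Sort the records of one slice by their dimension tuple (ascending),
    so that later merging only needs sequential reads."""
    return sorted(slice_records, key=lambda r: tuple(r[f] for f in dimension_fields))
```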
  • step S150 is further described, and step S150 may also include, but not limited to, step S151 , step S152 , step S153 , and step S154 .
  • Step S151 Traverse the data to be processed in each data stream fragment.
  • In this step, each data stream slice may include multiple pieces of data to be processed. Traversing the data to be processed in each data stream slice refers to sequentially extracting the data to be processed in each data stream slice into the memory, which facilitates the merging of the data to be processed from each data stream slice in the subsequent steps.
  • Step S152 Obtain the data to be processed according to the dimension information.
  • In this step, the data to be processed refers to the data to be processed obtained from each data stream slice. Because the data to be processed in the data stream slices is sorted, the data to be processed in each data stream slice is ordered. Obtaining the data to be processed according to the dimension information means obtaining it in the order of the dimension information, so that the data to be processed can be merged directly, reducing situations in which the dimension information of an obtained piece of data has to be compared against the dimension information of all the data to be processed in every data stream slice, and making it convenient to obtain the target data in the subsequent steps.
  • In one embodiment, to obtain the data to be processed according to the dimension information, N pieces of data to be processed are read from the compressed file formed by each corresponding data stream slice, generating result sets {RS1, RS2, ..., RSk}, where k is the number of data stream slices. The value of N may differ; that is, the amount of data to be processed in each data stream slice may be different.
  • Step S153 Obtain the target data according to the quantity information of the data to be processed.
  • In this step, obtaining the target data according to the quantity information of the data to be processed means that, when there are multiple pieces of data to be processed with the same dimension information, they are merged into the target data; when there is only one piece of data to be processed with a given dimension information, that piece of data is directly determined as the target data.
  • Because the data to be processed in the data stream slices is sorted, the data to be processed can be obtained according to the dimension information and the target data can be obtained according to the quantity information of the data to be processed during traversal, which reduces situations in which the dimension information of an obtained piece of data has to be compared against the dimension information of all the data to be processed in every data stream slice, improving the efficiency of merging and thereby improving the utilization of the memory.
  • Step S154 Obtain the target data set according to the target data.
  • the target data refers to the data obtained after combining the data to be processed, and the target data set includes multiple target data.
  • In this step, the data to be processed is merged to obtain the target data, so occurrences of the same dimension information in the target data are reduced, thereby reducing the amount of data in the target data set read by the memory and improving the utilization of the memory.
  • the data to be processed in each data flow fragment is traversed, the data to be processed is obtained according to the dimension information, and the target data is obtained according to the quantity information of the data to be processed , get the target data set according to the target data.
  • The data to be processed is obtained according to the dimension information and the target data is obtained according to the quantity information of the data to be processed, so as to obtain the target data set; this simplifies the first merging of the data to be processed and reduces the memory consumed by the first merging operation, so as to improve the utilization of the memory.
  • In one embodiment, N pieces of data to be processed are read from the compressed file formed by each data stream slice, generating result sets {RS1, RS2, ..., RSk}. The value of N may be inconsistent; that is, the amount of data to be processed in the data stream slices may differ. The merged result set is recorded as RSmer. One piece of data to be processed is read in turn from each of the result sets {RS1, RS2, ..., RSk} (denoted R1, R2, ..., Ri); if some result sets in {RS1, RS2, ..., RSk} contain fewer than one piece of data to be processed, then i < k, otherwise i = k.
  • the data to be processed may include indicator information, and the indicator information will be combined and calculated during the merger.
  • the data flow fragments are merged according to the above steps, and the data to be processed in all the data flow fragments of the queue Qmer to be merged are merged to obtain the target data set.
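A minimal sketch of merging the sorted slices read back from the disk, in the spirit of the result sets {RS1, ..., RSk} and the merged result set RSmer described above; the heap-based k-way merge and the summing of indicator fields are illustrative assumptions:
```python
import heapq

def merge_sorted_slices(slice_iters, dimension_fields, indicator_fields):
    """k-way merge of sorted slice iterators; pieces of data sharing the same
    dimension tuple are combined into one target record."""
    def keyed(it):
        for rec in it:
            yield tuple(rec[f] for f in dimension_fields), rec

    target_set, current_key, current = [], None, None
    for key, rec in heapq.merge(*(keyed(it) for it in slice_iters),
                                key=lambda pair: pair[0]):
        if key != current_key:
            if current is not None:
                target_set.append(current)
            current_key, current = key, dict(rec)
        else:
            for f in indicator_fields:
                current[f] += rec[f]   # indicator information is combined during the merge
    if current is not None:
        target_set.append(current)
    return target_set
```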
  • step S152 is further described, and step S152 may also include, but not limited to, step S1521 and step S1522 .
  • Step S1521 When sorting is based on dimension information from small to large, obtain the data to be processed with the smallest dimension information.
  • sorting from small to large according to the dimension information refers to comparing the size of the dimension information through any sorting method in related technologies, and sorting according to the dimension information from small to large.
  • When the data type of the dimension information is not a numerical value, it may be sorted alphabetically from small to large, as long as the data to be processed can form a sequence ordered from small to large by dimension information; no specific limitation is made here.
  • Obtaining the data to be processed with the smallest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumption of the first merging process, so as to improve the utilization of the memory.
  • the read dimension information of the i pieces of data to be processed is compared to obtain the data to be processed with the smallest dimension data value.
  • Data may be organized through a data structure in which one keyword corresponds to multiple values, that is, one key corresponds to multiple values; after sorting, only the first element of the result needs to be taken out, which corresponds to the data to be processed with the smallest dimension information.
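A minimal sketch of the one-key-to-multiple-values organization just mentioned, where after an ascending sort only the first element needs to be taken out; the dict-of-lists layout and the sample records are assumptions:
```python
from collections import defaultdict

# One key (the dimension tuple) maps to every record read with that key.
by_key = defaultdict(list)
for rec in [{"service_type_no": 32, "district_no": 1, "total_traffic": 7},
            {"service_type_no": 23, "district_no": 2, "total_traffic": 10},
            {"service_type_no": 23, "district_no": 2, "total_traffic": 4}]:
    by_key[(rec["service_type_no"], rec["district_no"])].append(rec)

smallest_key = sorted(by_key)[0]            # first element after an ascending sort
print(smallest_key, by_key[smallest_key])   # -> (23, 2) and the records sharing that key
```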
  • Step S1522 When sorting is based on dimension information from large to small, obtain the data to be processed with the largest dimension information.
  • sorting according to the dimension information from large to small refers to comparing the size of the dimension information through any sorting method in related technologies, and sorting according to the dimension information from large to small.
  • When the data type of the dimension information is not a numerical value, it may be sorted alphabetically from large to small, as long as the data to be processed can form a sequence ordered from large to small by dimension information; no specific limitation is made here.
  • Obtaining the data to be processed with the largest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumption of the first merging process, so as to improve the utilization of the memory.
  • By adopting the method including the above steps S1521 to S1522, when sorting is from small to large by dimension information, the data to be processed with the smallest dimension information is obtained; or, when sorting is from large to small by dimension information, the data to be processed with the largest dimension information is obtained.
  • the data to be processed with different dimension information is selected according to different sorting methods and dimension information, thereby reducing the consumption of memory by the first merging process, thereby achieving the purpose of improving memory utilization.
  • step S153 is further described, and step S153 may also include but not limited to step S1531 and step S1532 .
  • Step S1531 When the number of data to be processed is greater than one, merge the data to be processed to obtain the target data.
  • In this step, because the data to be processed in the data stream slices is sorted, when the number of pieces of data to be processed with the same dimension information is greater than one, they can be merged to obtain the target data.
  • In one embodiment, a piece of data to be processed may be extracted first; when there is more than one piece of data to be processed with the same dimension information, they are merged to form the target data, the data stream slices from which the merged data was extracted are traversed to extract the next pieces of data to be processed, and the target data is merged with the data to be processed from the other data stream slices according to the dimension information until only one piece of data remains for each dimension information. The merging includes, but is not limited to, applying simple or complex mathematical operations, such as sum, average, count, or maximum/minimum, to the information in the data to be processed that does not belong to the dimension information; no specific limitation is made here. In this way the efficiency of the first merging process is improved, so as to improve the utilization of the memory.
  • Step S1532 When the number of data to be processed is equal to one, determine the data to be processed as target data.
  • By adopting the method including the above steps S1531 to S1532, when the number of pieces of data to be processed is greater than one, the data to be processed is merged to obtain the target data; or, when the number of pieces of data to be processed is equal to one, the data to be processed is determined as the target data.
  • In this way, whether there is data to be processed with the same dimension information in the data stream slices is judged according to the quantity information of the data to be processed, so as to determine the target data, reducing situations in which data extracted directly from the disk to the memory has to be compared against the data to be processed in every data stream slice, improving the efficiency of the first merging process and thereby improving the utilization of the memory.
  • the data processing method is further described, and the data processing method may also include but not limited to step S610 and step S620 .
  • Step S610 Acquiring the address information of each data flow fragment cached on the disk.
  • In this step, the address information of the data stream slices on the disk may be obtained after the data stream slices are cached on the disk; or, free address information on the disk may be obtained before the data stream slices are cached, and the data stream slices are then cached at the disk addresses corresponding to that address information.
  • the purpose of obtaining the address information of each data flow fragment cached on the disk is to facilitate the storage of the address information in the memory in the subsequent steps.
  • Step S620 Save the address information in memory.
  • The address information is stored in the memory, and the data stream slices are later obtained according to the address information, which reduces the consumption generated by using the memory to store the data stream slices themselves, so as to improve the utilization of the memory.
  • the address information of each data flow fragment cached on the disk is obtained; and the address information is stored in the memory.
  • Because the address information is stored in the memory and the memory obtains the data stream slices directly according to the address information, the consumption generated by using the memory to store the data stream slices is reduced, so as to improve the utilization of the memory.
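A minimal sketch of keeping only the address information of cached slices in memory; the variable names and the record counter are assumptions for illustration:
```python
# In-memory bookkeeping for cached slices: only addresses and counters, never the records.
slice_paths = []        # address information of each cached slice (e.g. file paths)
cached_count = 0        # total pieces of data to be processed currently cached on disk

def register_slice(path, record_count):
    """Record a newly cached slice's address information in memory."""
    global cached_count
    slice_paths.append(path)
    cached_count += record_count
```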
  • step S140 is further described, and step S140 may also include, but not limited to, step S710 and step S720 .
  • Step S710 Read the address information of each data stream fragment from the memory.
  • the address information refers to the address information of each data flow fragment in the disk.
  • In one embodiment, each data stream slice is saved on the disk in the form of a file, and the address information refers to the path information of the file formed by each data stream slice; or, the data stream slices may be stored in data tables of a database, in which case the address information refers to the path information of each data table.
  • Because the data stream slices are cached on the disk, the memory first needs to obtain the address information of the data stream slices cached on the disk and then read each data stream slice according to that address information; the memory only caches the address information, which saves the consumption of using the memory to store the data stream slices and improves the utilization of the memory.
  • Step S720 extract the data to be processed in each data flow fragment from the disk to the memory according to the address information.
  • In this step, the extracted data to be processed comes from different data stream slices. Extracting the data to be processed from the disk to the memory allows the data to be processed to be written into the cache queue of the memory in order, which facilitates the subsequent merging of the data to be processed.
  • The address information of each data stream slice is cached in the memory, and the data to be processed in each data stream slice is extracted from the disk to the memory according to the address information, reducing situations in which unmerged data stream slices enter the memory directly, facilitating the subsequent first merging of the extracted data to be processed, and reducing the consumption of using the memory for data storage, thereby improving the utilization of the memory.
  • By adopting the method including the above steps S710 to S720, the address information of each data stream slice is read from the memory, and the data to be processed in each data stream slice is extracted from the disk to the memory according to that address information.
  • the address information of each data stream fragment is cached in the memory, and the data to be processed in each data stream fragment is extracted from the disk to the memory through the address information, so as to facilitate the merging of the data to be processed in subsequent steps, Reduce the consumption of memory for data storage, so as to achieve the purpose of improving memory utilization.
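A minimal sketch of reading a cached slice back from the disk by its address information, matching the hypothetical compressed JSON-lines format assumed in the caching sketch above:
```python
import gzip, json

def read_slice(path: str):
    """Yield the data to be processed from one cached slice, in its stored order."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example (assumed names): feed every cached slice into the k-way merge sketched earlier.
# target_set = merge_sorted_slices([read_slice(p) for p in slice_paths],
#                                  DIMENSION_FIELDS, INDICATOR_FIELDS)
```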
  • step S130 is further described, and step S130 may also include but not limited to step S810 and step S820 .
  • Step S810 Perform a second merging process on the data to be processed with the same dimension information in each data stream slice, and obtain multiple merged data stream slices.
  • the data to be processed with the same dimension information in each data stream slice may be subjected to a second merging process.
  • The second merging process refers to merging the data to be processed within each data stream slice so that no two pieces of data to be processed in the same data stream slice have identical dimension information, thereby reducing the memory resources consumed by the first merging process in the subsequent steps and improving the utilization of the memory.
  • Step S820 Cache the multiple merged data stream slices on the disk.
  • In this step, the data stream slices cached on the disk have undergone the second merging process, which reduces the memory resources consumed by the first merging process in the subsequent steps; the data volume of the merged data stream slices is also reduced, thereby reducing the disk resources consumed by caching the data stream slices.
  • By adopting the method including the above steps S810 to S820, the data to be processed with the same dimension information in each data stream slice is subjected to the second merging process to obtain multiple merged data stream slices, and the multiple merged data stream slices are cached on the disk. Because each data stream slice is merged before being cached on the disk, the memory resources consumed by the first merging process in the subsequent steps are reduced, the data volume of the merged slices is smaller, and the disk resources consumed by caching the data stream slices are reduced.
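A minimal sketch of the second merging process, which applies the same kind of aggregation within a single slice (summing of indicator fields is again an assumed merge rule) and sorts the result before it is cached:
```python
def second_merge(slice_records, dimension_fields, indicator_fields):
    """Merge records inside one slice so that no two share dimension information,
    then sort the slice by dimension tuple ready for caching."""
    merged = {}
    for rec in slice_records:
        key = tuple(rec[f] for f in dimension_fields)
        if key in merged:
            for f in indicator_fields:
                merged[key][f] += rec[f]     # summing indicators is one permitted merge rule
        else:
            merged[key] = dict(rec)
    return sorted(merged.values(),
                  key=lambda r: tuple(r[f] for f in dimension_fields))
```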
  • the data processing method is further described, and the data processing method may also include but not limited to step S910.
  • Step S910 Filter the data to be processed according to a preset filter condition, wherein the preset filter condition includes: the indicator information of the data to be processed is less than a preset indicator value threshold.
  • In this step, the indicator information refers to the information in the non-dimension fields of the data to be processed.
  • In one embodiment, the data information corresponding to the number of TCP downstream packets, the TCP upstream traffic, the TCP downstream traffic, and the TCP service duration is all indicator information; or, referring to Fig. 14, when the service type number and the district/county number are preset as the dimension information, the data information corresponding to the average response delay, the average display delay, the response success rate, the display success rate, and the total traffic is all indicator information.
  • the received data stream may be directly receiving data from the network, or reading data in blocks from the database.
  • Filtering the data to be processed refers to removing the data to be processed that does not meet the preset filtering conditions, so as to reduce the impact of abnormal data to be processed on data processing.
  • The preset filter condition includes: the indicator information of the data to be processed is less than the preset indicator value threshold.
  • In one embodiment, the preset TCP service duration threshold may be set to 3; or, the preset response success rate threshold may be set to 90, in which case the data to be processed whose response success rate is 89 is removed, that is, data 3 in Fig. 14.
  • It should be noted that the filtering of the data to be processed according to the preset filter condition may be performed after the data stream is received, or before the data stream slices are cached on the disk. In this way, the amount of data stream slices cached on the disk is reduced, improving the utilization of the disk space.
  • By adopting the method including the above step S910, the data to be processed is filtered according to the preset filter condition, wherein the preset filter condition includes: the indicator information of the data to be processed is less than the preset indicator value threshold.
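A minimal sketch of the filtering step, assuming a preset response success rate threshold of 90 as in the example above; the field names are hypothetical:
```python
PRESET_THRESHOLDS = {"response_success_rate": 90}   # indicator field -> preset threshold

def passes_filter(rec: dict) -> bool:
    """Keep a record only if none of its indicator values falls below its preset threshold."""
    return all(rec.get(field, threshold) >= threshold
               for field, threshold in PRESET_THRESHOLDS.items())

print(passes_filter({"service_type_no": 21, "district_no": 2, "response_success_rate": 89}))
# -> False: this record (cf. data 3 in the Fig. 14 example) would be removed
```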
  • step S140 is further described, and step S140 may also include but not limited to step S1010 and step S1020 .
  • Step S1010 When the amount of data to be processed cached in the disk is greater than a preset number threshold, extract the data to be processed in each data flow fragment from the disk to the memory.
  • In this step, the preset number threshold is set manually, and the user can set the preset number threshold according to the size of the disk space and the size of the memory space, so as to reduce situations in which the data cache occupies too much disk space and keep the disk consumption reasonable.
  • Step S1020 Obtain the total time of the slice, and when the total time of the slice is greater than the preset time threshold, extract the data to be processed in each data flow slice from the disk to the memory.
  • the preset time threshold is set manually.
  • In one embodiment, the system receives network data from the network in real time, and the user needs to analyze the data within a preset time period. The total slicing time is obtained; when the total slicing time is greater than the preset time threshold, all the data to be processed that needs to be analyzed has been sliced, slicing is completed, and the data to be processed in each data stream slice is extracted from the disk to the memory, so that the data to be processed received from the data stream is handled in a reasonable manner.
  • In one embodiment, the preset time threshold is the time statistics granularity T0, and the preset time interval refers to the data aggregation slice granularity Tc.
  • By adopting the data processing method including the above steps S1010 to S1020, when the amount of data to be processed cached on the disk is greater than the preset number threshold, the data to be processed in each data stream slice is extracted from the disk to the memory; or, the total slicing time is obtained, and when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk to the memory. Extracting the data to be processed from the disk to the memory under these conditions reduces situations in which the data cache occupies too much disk space, facilitates the merging of the data to be processed in the subsequent steps, and improves the utilization of the memory.
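A minimal sketch of the two extraction triggers described in steps S1010 and S1020; the threshold values are placeholders, not values from the disclosure:
```python
def should_extract(cached_count: int, total_slice_time_s: float,
                   preset_count_threshold: int = 100_000,
                   preset_time_threshold_s: float = 300.0) -> bool:
    """Trigger extraction from disk to memory when either the cached amount of
    data to be processed or the total slicing time exceeds its preset threshold."""
    return (cached_count > preset_count_threshold
            or total_slice_time_s > preset_time_threshold_s)
```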
  • an embodiment of the present application also provides a data processing device, which includes: a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • the processor and memory can be connected by a bus or other means.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory may include memory located remotely from the processor, which remote memory may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the data processing method of the above embodiments are stored in the memory and, when executed by the processor, perform the data processing method in the above embodiments, for example, the method steps S110 to S150 in Fig. 1, the method steps S210 to S220 in Fig. 2, the method steps S151 to S154 in Fig. 3, the method steps S1521 to S1522 in Fig. 4, the method steps S1531 to S1532 in Fig. 5, the method steps S610 to S620 in Fig. 6, the method steps S710 to S720 in Fig. 7, the method steps S810 to S820 in Fig. 8, the method step S910 in Fig. 9, and the method steps S1010 to S1020 in Fig. 10 described above.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions; when the computer-executable instructions are executed by a processor or a controller, for example, by the processor in the above device embodiment, the processor is caused to execute the data processing method in the above embodiments, for example, the method steps S110 to S150 in Fig. 1, the method steps S210 to S220 in Fig. 2, the method steps S151 to S154 in Fig. 3, the method steps S1521 to S1522 in Fig. 4, the method steps S1531 to S1532 in Fig. 5, the method steps S610 to S620 in Fig. 6, the method steps S710 to S720 in Fig. 7, the method steps S810 to S820 in Fig. 8, the method step S910 in Fig. 9, and the method steps S1010 to S1020 in Fig. 10 described above.
  • The embodiment of the present application includes: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the data stream slices on the disk; extracting the data to be processed in each data stream slice from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information to obtain a target data set. Because the data stream slices are cached on the disk, the storage cost of the memory is saved; the data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information to obtain the target data set, which reduces the need for the memory to read the data stream directly; after merging, the number of data stream slices is smaller, so the data volume of the target data set is reduced and the memory no longer has to read the data to be processed in the data stream directly, thereby improving the utilization of the memory.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • In addition, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

The present application discloses a data processing method and device, and a storage medium. The data processing method comprises: receiving a data stream, the data stream comprising a plurality of pieces of data to be processed, and said data comprising dimension information (S110); slicing the data stream at a preset time interval to obtain a plurality of data stream fragments (S120); caching the plurality of data stream fragments in a disk (S130); extracting said data in each data stream fragment from the disk to a memory (S140); and performing first merging processing on said data having the same dimension information in the memory to obtain a target data set (S150).

Description

Data processing method and device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, the Chinese patent application No. 202111532433.8 filed on December 15, 2021, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present application relates to the field of big data, and in particular to a data processing method, a corresponding device, and a storage medium.
Background
With the advent of the Internet of Everything, devices such as sensors, smartphones, wearables, and smart home appliances have become part of the Internet of Everything and generate massive amounts of data. In traditional offline computing, data is generally first saved to a storage medium and then processed in batches according to a certain scheduling strategy. However, reading a large amount of data consumes a large amount of memory resources, and how to improve the utilization of the memory is a major technical problem.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
Embodiments of the present application provide a data processing method, a corresponding device, and a storage medium.
In a first aspect, an embodiment of the present application provides a data processing method, including: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each of the data stream slices from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set.
In a second aspect, an embodiment of the present application further provides a data processing device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the data processing method described in the first aspect.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the data processing method described in the first aspect.
Additional features and advantages of the application will be set forth in the description that follows and will in part be apparent from the description, or may be learned by practice of the application. The objectives and other advantages of the application can be realized and attained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not constitute a limitation of it.
Fig. 1 is a flowchart of a data processing method provided by an embodiment of the present application;
Fig. 2 is a flowchart of the method of step S130 in Fig. 1;
Fig. 3 is a flowchart of the method of step S150 in Fig. 1;
Fig. 4 is a flowchart of the method of step S152 in Fig. 3;
Fig. 5 is a flowchart of the method of step S153 in Fig. 3;
Fig. 6 is a flowchart of a data processing method provided by another embodiment of the present application;
Fig. 7 is a flowchart of the method of step S140 in Fig. 1;
Fig. 8 is a flowchart of another embodiment of the method of step S130 in Fig. 1;
Fig. 9 is a flowchart of a data processing method provided by yet another embodiment of the present application;
Fig. 10 is a flowchart of another embodiment of the method of step S140 in Fig. 1;
Fig. 11 is an example diagram of a data structure to be processed in a data processing method provided by another embodiment of the present application;
Fig. 12 is an example diagram of data to be processed in a data processing method provided by another embodiment of the present application;
Fig. 13 is an example diagram of data to be processed in a data processing method provided by yet another embodiment of the present application;
Fig. 14 is an example diagram of data to be processed in a data processing method provided by still another embodiment of the present application;
Fig. 15 is an example diagram of a data flow of a data processing method provided by an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The present application provides a data processing method, a corresponding device, and a storage medium. The data processing method includes: receiving a data stream, wherein the data stream includes a plurality of data to be processed and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain multiple data stream slices; caching the multiple data stream slices on a disk; extracting the data to be processed in each data stream slice from the disk to the memory; and performing a first merging process in the memory on the data to be processed having the same dimension information, to obtain a target data set. According to the solution of the embodiments of the present application, the data stream slices are cached on the disk, which saves the storage cost of the memory; the data to be processed in each data stream slice is extracted from the disk to the memory and merged there by dimension information to obtain the target data set, which reduces the need for the memory to read the data stream directly; after merging, the amount of data is smaller, so the data volume of the target data set is reduced and the memory no longer has to read a data set with a large amount of data directly, thereby improving the utilization of the memory.
The embodiments of the present application are further described below with reference to the accompanying drawings.
如图1所示,图1是本申请一个实施例提供的数据处理方法的流程图,该数据处理方法可以包括但不限于有步骤S110、步骤S120、步骤S130、步骤S140、步骤S150。As shown in FIG. 1 , FIG. 1 is a flow chart of a data processing method provided by an embodiment of the present application. The data processing method may include but not limited to steps S110 , S120 , S130 , S140 , and S150 .
Step S110: receive a data stream, where the data stream includes a plurality of pieces of data to be processed, and the data to be processed includes dimension information.
In this step, the data to be processed may be any data in the related art; it may be network data with key-value pairs, or relational data in a relational database. The dimension information may be the data information corresponding to manually designated fields in the data. In one implementation, referring to FIG. 11, there are n fields and m records in total, where data 11 to data 1n constitute one complete piece of data to be processed, "field" refers to the field name, and "length" refers to the length of the data in that field. If field 1 and field 2 are used as dimensions, the dimension information includes data 11 to data m1 and data 12 to data m2. Alternatively, referring to FIG. 13, there are five records in total, and the data of the user number field and the data of the cell number field may be preset as dimension information; that is, the dimension information includes 44600001 and 25681 in data 1, 44600002 and 25682 in data 2, 44600001 and 25682 in data 3, 44600002 and 25684 in data 4, and 44600003 and 25683 in data 5. Alternatively, referring to FIG. 14, there are four records in total, and the data of the service type number field and the data of the district/county number field may be preset as dimension information; that is, the dimension information includes 32 and 1 in data 1, 23 and 2 in data 2, 21 and 2 in data 3, and 15 and 3 in data 4.
It should be noted that, in one implementation, the structure of the received data may be the data structure shown in FIG. 11, FIG. 13, or FIG. 14, and it may first be converted into an in-memory data format, such as a C-language struct, a TLV data structure used in the communication field, or the data structure shown in FIG. 12.
It should also be noted that the data stream may be a data stream formed by receiving data from a network in real time, or a data stream formed when data is read from a database. In one implementation, the data to be processed is obtained from a database, the fields and lengths included in the data structure of the data to be processed are as shown in FIG. 13, multiple records may correspond to different fields respectively (see data 1 to data 5 in FIG. 13 or data 1 to data 4 in FIG. 14), and data records can be accessed through the fields. The data stream is received so that it can be sliced in the subsequent steps.
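As a purely illustrative, non-limiting sketch (not part of the claimed method), a record of the FIG. 13 kind can be modelled in memory as a plain structure in which the dimension fields are kept separate from the indicator fields; the English field names below are stand-ins chosen for this example only.

```python
from dataclasses import dataclass

# Hypothetical in-memory form of one record from FIG. 13: the first two fields
# act as dimension information, the remaining fields are indicator information.
@dataclass
class Record:
    user_id: str       # dimension field, e.g. "44600001"
    cell_id: str       # dimension field, e.g. "25681"
    tcp_up_pkts: int
    tcp_down_pkts: int
    tcp_up_bytes: int
    tcp_down_bytes: int
    tcp_duration: int

    def dimension_key(self) -> tuple:
        # Records with equal dimension keys are candidates for merging.
        return (self.user_id, self.cell_id)
```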
Step S120: slice the data stream at preset time intervals to obtain a plurality of data stream slices.
In this step, the preset time refers to a manually preset duration. In one implementation, the user sets the preset time according to the disk size, thereby controlling the size of the resulting data stream slices and reducing the risk of running out of disk space; in addition, the time needed for memory to read a data stream slice is also reduced, which serves the purpose of improving memory utilization.
In one implementation, slicing the data stream at preset time intervals may be done by recording a start time when the data stream is received and obtaining the current timestamp in real time, so that the elapsed interval can be computed; when the elapsed interval equals the preset time, the data stream is sliced.
It can also be understood that the data volume of a data stream slice obtained by slicing the data stream is smaller than the data volume of the whole data stream; obtaining data stream slices serves to split the data volume of the data stream and reduce cases where the data stream is read directly.
It can also be understood that, for slicing the data stream at preset time intervals, a waiting time may be set for the data stream: after one piece of data to be processed is received, if the interval until the next piece of data to be processed arrives is greater than the waiting time, the data stream may also be sliced. In addition, a fixed slicing period may be set, and the data stream is sliced once every slicing period, so as to divide the data stream into multiple data stream slices.
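A minimal sketch of the fixed-interval variant described above, assuming records arrive through an iterator and a wall-clock timer is acceptable; the function names and the `emit_slice` callback are illustrative only.

```python
import time

def slice_stream(records, slice_seconds, emit_slice):
    """Cut an incoming record iterator into slices every `slice_seconds`."""
    current, started = [], time.monotonic()
    for record in records:
        current.append(record)
        # The elapsed interval is checked as each record arrives.
        if time.monotonic() - started >= slice_seconds:
            emit_slice(current)                # hand the finished slice to the next stage
            current, started = [], time.monotonic()
    if current:                                # flush the last, possibly partial, slice
        emit_slice(current)
```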
Step S130: cache the plurality of data stream slices on a disk.
In this step, the disk refers to any disk in the related art that is capable of storing data, which is not specifically limited here. Caching the plurality of data stream slices on the disk may mean caching the data stream slices on the disk separately, or caching the plurality of data stream slices on the disk contiguously. The purpose of caching the plurality of data stream slices on the disk is to reduce the memory occupied by the data stream slices and thereby improve memory utilization.
It can be understood that caching the plurality of data stream slices on the disk may mean saving the data stream slices to the disk in the form of files, or compressing the data stream slices into compressed files and caching them on the disk.
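One possible way to persist a slice as a compressed file, as allowed above, is sketched below; gzip and pickle are only stand-ins for whatever compression and serialization the concrete implementation actually uses, and the directory and file-name scheme is hypothetical.

```python
import gzip
import os
import pickle

def cache_slice(slice_records, spool_dir, slice_index):
    """Write one data stream slice to disk as a compressed file and return its path."""
    os.makedirs(spool_dir, exist_ok=True)
    path = os.path.join(spool_dir, f"slice_{slice_index:06d}.pkl.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(slice_records, f)
    # Only this path (the address information) needs to stay in memory.
    return path
```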
Step S140: extract the data to be processed in each data stream slice from the disk into memory.
In this step, since the data stream slices are obtained by slicing the data stream, each data stream slice includes data to be processed. Extracting the data to be processed in each data stream slice from the disk into memory may be done by extracting the data stream slices one by one, which reduces cases where all data stream slices are loaded into memory without being merged, so as to improve memory utilization.
It can be understood that extracting a data stream slice from the disk into memory may simply mean reading the data stream slice from the disk into memory through any data reading method in the related art, which is not specifically limited here.
Step S150: perform, in the memory, first merge processing on the data to be processed that has the same dimension information, to obtain a target data set.
In this step, every piece of data to be processed includes dimension information. In one implementation, the dimension information may include multiple dimension values, and having the same dimension information means that all of the dimension values are the same. The first merge processing refers to merging pieces of data to be processed that have the same dimension information into a single record; the merging may combine the other information in the data to be processed that does not belong to the dimension information, and the combining may be addition, subtraction, division, or the like of the respective pieces of other information, as long as pieces of data to be processed with the same dimension information can be merged into one piece of data to be processed, which is not specifically limited here.
It can be understood that the target data set refers to the data set obtained after the first merge processing, and it is also the data set cached in memory. Since the received data stream is subjected to the first merge processing to obtain the target data set, pieces of data to be processed with the same dimension information are merged, the data volume of the originally received data stream becomes smaller, and the target data set read into memory occupies less space, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S110 to S150, a data stream is received, where the data stream includes a plurality of pieces of data to be processed and the data to be processed includes dimension information; the data stream is sliced at preset time intervals to obtain a plurality of data stream slices; the plurality of data stream slices are cached on the disk; the data to be processed in each data stream slice is extracted from the disk into memory; and the data to be processed with the same dimension information is subjected to first merge processing in memory to obtain the target data set. According to the solution of the embodiment of the present application, caching the data stream slices on the disk saves memory storage cost; extracting the data to be processed in each data stream slice from the disk into memory and performing the first merge processing in memory on the data to be processed with the same dimension information to obtain the target data set reduces cases where the data stream is read directly in memory; after merging, the number of records becomes smaller, so the data volume of the target data set is reduced, which further reduces cases where a data set containing a large amount of data is read directly in memory, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 2, step S130 is further described. Step S130 may also include, but is not limited to, step S210 and step S220.
Step S210: sort the data to be processed in each data stream slice according to the dimension information.
In this step, sorting the data to be processed in each data stream slice according to the dimension information means ordering the data to be processed in each data stream slice by the value of the dimension information.
It should be noted that the sorting may be in ascending order of the dimension information or in descending order of the dimension information. When the dimension information includes multiple dimension values, the sorting is performed on each dimension value in turn. In one implementation, referring to FIG. 12, the data is first sorted by dimension a and then by dimension b, yielding multiple ordered data stream slices, which facilitates the subsequent first merge processing of the data to be processed: the data in each data stream slice only needs to be read sequentially to be merged, thereby improving merge efficiency.
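Sorting a slice by several dimension fields in turn can be expressed with a composite key, as in the sketch below; ascending order is assumed here, matching the "small to large" case, and the dictionary field names are illustrative only.

```python
def sort_slice(slice_records, dimension_fields):
    """Sort the records of one slice by their dimension values (ascending)."""
    return sorted(
        slice_records,
        key=lambda rec: tuple(rec[field] for field in dimension_fields),
    )

# Example: order first by dimension a, then by dimension b, as in FIG. 12.
# ordered = sort_slice(records, ["a", "b"])
```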
Step S220: cache the plurality of sorted data stream slices on the disk.
In this step, the sorted data stream slices are ordered. Caching the plurality of sorted data stream slices on the disk facilitates the first merge processing of the data to be processed in the subsequent steps: the data to be processed in each data stream slice only needs to be read sequentially to be merged, which saves the memory consumed by reading and pairwise comparison and improves merge efficiency, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S210 to S220, the data in each data stream slice is sorted according to the dimension information, and the plurality of sorted data stream slices are cached on the disk. According to the solution of the embodiment of the present application, the data stream slices are sorted before being cached on the disk; when a data stream slice is read into memory, the sorted data stream slices facilitate the merging of the data stream slices in subsequent steps and improve merge efficiency, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 3, step S150 is further described. Step S150 may also include, but is not limited to, step S151, step S152, step S153, and step S154.
Step S151: traverse the data to be processed in each data stream slice.
In this step, since the data stream slices are obtained by slicing the data stream, a data stream slice may also include multiple pieces of data to be processed. Traversing the data to be processed in each data stream slice means extracting the data to be processed in each data stream slice into memory in turn, which facilitates the merging of the data to be processed in each data stream slice in the subsequent steps.
Step S152: obtain the data to be processed according to the dimension information.
In this step, the data to be processed refers to the data to be processed obtained from each data stream slice respectively. Since the data to be processed in each data stream slice has been sorted, the data to be processed in each data stream slice is ordered. Obtaining the data to be processed according to the dimension information means obtaining the data to be processed according to the value of the dimension information, so that the data to be processed can be merged directly, which reduces cases where the dimension information of the obtained data to be processed has to be compared with the dimension information of all the data to be processed in every data stream slice, and facilitates obtaining the target data in the subsequent steps.
It can be understood that, for obtaining the data to be processed according to the dimension information, in one implementation, N pieces of data to be processed are read from the compressed file formed by each corresponding data stream slice, generating result sets {RS1, RS2, ..., RSk}, where k refers to the number of data stream slices. The value of N may differ for different files; that is, the amount of data to be processed in each data stream slice may be different.
Step S153: obtain target data according to the quantity information of the data to be processed.
In this step, obtaining the target data according to the quantity information of the data to be processed means that if there are multiple pieces of data to be processed with the same dimension information, those pieces of data to be processed are merged into the target data; if the dimension information of a piece of data to be processed is not the same as the dimension information of any other piece of data to be processed, that piece of data to be processed is directly determined as the target data.
It should be noted that, since the data to be processed in the data stream slices has been sorted and the data to be processed is obtained according to the dimension information, the target data can be obtained according to the quantity information of the data to be processed during the traversal, which reduces cases where the dimension information of the obtained data to be processed has to be compared with the dimension information of all the data to be processed in every data stream slice, improves merge efficiency, and thereby achieves the purpose of improving memory utilization.
Step S154: obtain the target data set according to the target data.
In this step, the target data refers to the data obtained after the data to be processed has been merged, and the target data set includes multiple pieces of target data.
It should be noted that, since the target data is obtained after merging the data to be processed, the occurrence of identical dimension information in the target data is reduced, which reduces the data volume of the target data set read into memory, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S151 to S154, the data to be processed in each data stream slice is traversed, the data to be processed is obtained according to the dimension information, the target data is obtained according to the quantity information of the data to be processed, and the target data set is obtained according to the target data. According to the solution of the embodiment of the present application, obtaining the data to be processed according to the dimension information and obtaining the target data according to the quantity information of the data to be processed, so as to obtain the target data set, simplifies the first merge processing of the data to be processed and reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
It is worth noting that, in one implementation, N pieces of data to be processed are read from the compressed file formed by each data stream slice, generating result sets {RS1, RS2, ..., RSk}; the value of N may differ for different files, that is, the amount of data to be processed in the data stream slices may be different. Denote the merged result set as RSmer. From the result sets {RS1, RS2, ..., RSk}, read i pieces of data to be processed in turn (denoted R1, R2, ..., Ri); if some result sets in {RS1, RS2, ..., RSk} contain fewer than one piece of data to be processed, then i < k, otherwise i = k. Compare the dimension information of the i pieces of data to be processed R1, R2, ..., Ri that have been read, denoted Procmin(R1, R2, ..., Ri), to obtain the data to be processed with the smallest dimension value, denoted RSmin. Considering that the number of pieces of data to be processed in the Procmin step may be large, an algorithm may be used to sort them and generate the required result; such a result may be organized by a data structure in which one key corresponds to multiple values, so that after sorting only the first piece of data to be processed in the result needs to be taken out, denoted RSmin. If RSmin corresponds to more than one piece of data to be processed, these pieces of data to be processed need to be merged, giving RSmin' = Procmerge(RSmin). The data to be processed may include indicator information, and the indicator information is combined during the merge; the combining calculation methods for the indicator information include, but are not limited to, simple or more complex mathematical algorithms such as summation, averaging, counting, and maximum/minimum values. If RSmin contains only one piece of data to be processed, then RSmin' = RSmin. RSmin' is then written into RSmer. Next, one more piece of data to be processed is read from each result set from which RSmin was taken, to refill the vacated positions, and the above merge operation is repeated until the number of pieces of data to be processed in the process is i = 0, which indicates that all data in the result sets {RS1, RS2, ..., RSk} has been processed. When the first merge processing is performed multiple times, the data stream slices are merged according to the above steps until the data to be processed in all data stream slices of the queue to be merged Qmer has been merged, yielding the target data set.
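The multi-way merge outlined above can be pictured with the standard heap-based k-way merge sketched below. This is only an illustrative sketch, not the claimed procedure: it assumes every slice file is already sorted by its dimension key and, unlike the refill-by-position description above, it relies on the standard-library heap merge; `dim_key`, `combine`, and the dimension field names are assumptions of the example.

```python
import heapq

def merge_sorted_slices(slice_iterators, dim_key, combine):
    """k-way merge of sorted slices, combining records that share a dimension key.

    slice_iterators: one iterator of records per slice, each sorted by dim_key
    dim_key:         function mapping a record to its dimension tuple
    combine:         function merging two records with equal dimension keys
    """
    merged_stream = heapq.merge(*slice_iterators, key=dim_key)
    current = None
    for record in merged_stream:
        if current is None:
            current = record
        elif dim_key(record) == dim_key(current):
            current = combine(current, record)   # same dimensions: merge indicators
        else:
            yield current                        # dimensions changed: emit target record
            current = record
    if current is not None:
        yield current

# Example combine step: keep the dimensions, sum the indicator fields.
def sum_indicators(a, b):
    out = dict(a)
    for field, value in b.items():
        if field not in ("user_id", "cell_id"):  # hypothetical dimension fields
            out[field] = out.get(field, 0) + value
    return out
```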
In one embodiment, as shown in FIG. 4, step S152 is further described. Step S152 may also include, but is not limited to, step S1521 and step S1522.
Step S1521: when the sorting is ascending order of the dimension information, obtain the data to be processed with the smallest dimension information.
In this step, sorting in ascending order of the dimension information means comparing the dimension information by any sorting method in the related art and ordering it from small to large. When the data type of the dimension information is not numeric, the ordering may be alphabetical, as long as the data to be processed forms an ordered sequence in which the dimension information goes from small to large, which is not specifically limited here.
It should be noted that, since the data to be processed extracted from the disk into memory comes from different data stream slices and the data to be processed in each data stream slice has been sorted in ascending order of the dimension information, obtaining the data to be processed with the smallest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
It should also be noted that, in one implementation, there are i data stream slices in total; the dimension information of the i pieces of data to be processed that have been read is compared, and the piece with the smallest dimension value is obtained. Such data to be processed may be organized by a data structure in which one key corresponds to multiple values, i.e. one key mapping to multiple values; after sorting, only the first element of the result needs to be taken out, which corresponds to the data to be processed with the smallest dimension information.
Step S1522: when the sorting is descending order of the dimension information, obtain the data to be processed with the largest dimension information.
In this step, sorting in descending order of the dimension information means comparing the dimension information by any sorting method in the related art and ordering it from large to small. When the data type of the dimension information is not numeric, the ordering may be reverse alphabetical, as long as the data to be processed forms an ordered sequence in which the dimension information goes from large to small, which is not specifically limited here.
It should be noted that, since the data to be processed extracted from the disk into memory comes from different data stream slices and the data to be processed in each data stream slice has been sorted in descending order of the dimension information, obtaining the data to be processed with the largest dimension information facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S1521 to S1522, when the sorting is ascending order of the dimension information, the data to be processed with the smallest dimension information is obtained, or, when the sorting is descending order of the dimension information, the data to be processed with the largest dimension information is obtained. According to the solution of the embodiment of the present application, data to be processed with different dimension information is selected according to the sorting direction and the dimension information, which reduces the memory consumed by the first merge processing, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 5, step S153 is further described. Step S153 may also include, but is not limited to, step S1531 and step S1532.
Step S1531: when the quantity of the data to be processed is greater than one, merge the data to be processed to obtain the target data.
In this step, since the data to be processed in each data stream slice has been sorted, when the quantity of the data to be processed is greater than one, the data to be processed can be merged to obtain the target data. In one implementation, the data to be processed may first be extracted; when the quantity of data to be processed with the same dimension information is greater than one, those pieces are merged into the target data; the data stream slice from which the merged data to be processed was extracted is then traversed to extract the next piece of data to be processed, and according to the dimension information that piece is merged with the target data and with the data to be processed from the other data stream slices, until only one piece of data to be processed with that dimension information remains. The merging includes, but is not limited to, applying simple or more complex mathematical algorithms such as summation, averaging, counting, and maximum/minimum values to the other information in the data to be processed that does not belong to the dimension information, which is not specifically limited here. This improves the efficiency of the first merge processing, thereby achieving the purpose of improving memory utilization.
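The indicator aggregations mentioned above (summation, averaging, counting, maximum/minimum) can be sketched as follows; which function applies to which field is a deployment choice and is not fixed by the method, and the `rules` mapping is an assumption of this example.

```python
def merge_group(records, rules):
    """Merge records that share one dimension key.

    rules maps an indicator field name to 'sum', 'avg', 'max', 'min' or 'count'.
    Dimension fields are simply carried over from the first record.
    """
    merged = dict(records[0])
    for field, how in rules.items():
        values = [r[field] for r in records]
        if how == "sum":
            merged[field] = sum(values)
        elif how == "avg":
            merged[field] = sum(values) / len(values)
        elif how == "max":
            merged[field] = max(values)
        elif how == "min":
            merged[field] = min(values)
        elif how == "count":
            merged[field] = len(values)
    return merged
```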
Step S1532: when the quantity of the data to be processed is equal to one, determine the data to be processed as the target data.
In this step, when the quantity of the extracted data to be processed is equal to one, that is, there is only one piece of data to be processed with that dimension information among the data stream slices extracted into memory, that piece of data to be processed is used directly as the target data, which improves the efficiency of the first merge processing, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S1531 to S1532, when the quantity of the data to be processed is greater than one, the data to be processed is merged to obtain the target data; or, when the quantity of the data to be processed is equal to one, the data to be processed is determined as the target data. According to the solution of the embodiment of the present application, whether data to be processed with the same dimension information exists in the data stream slices is judged according to the quantity information of the data to be processed, so as to determine the target data; this reduces cases where the data to be processed extracted from the disk into memory is directly compared against all the data to be processed in every data stream slice, and improves the efficiency of the first merge processing, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 6, the data processing method is further described. The data processing method may also include, but is not limited to, step S610 and step S620.
Step S610: obtain the address information at which each data stream slice is cached on the disk.
In this step, the address information of a data stream slice on the disk may be obtained after the data stream slice is cached on the disk; alternatively, free address information on the disk may be obtained before the data stream slice is cached, and the data stream slice is then cached at the disk address corresponding to that address information. The address information at which each data stream slice is cached on the disk is obtained so that the address information can be saved in memory in the subsequent steps.
Step S620: save the address information in memory.
In this step, since the data stream slices are cached on the disk, the address information is saved in memory; when a data stream slice needs to be read, the data stream slice is obtained according to the address information held in memory, which reduces the memory consumed by storing the data stream slices themselves, thereby achieving the purpose of improving memory utilization.
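A minimal sketch of such an in-memory address index is given below; it assumes slices were written in the compressed-file format of the earlier `cache_slice` sketch, so the class name and file format are illustrative assumptions, not part of the method.

```python
import gzip
import pickle

class SliceIndex:
    """Keeps only the disk addresses (file paths) of cached slices in memory."""

    def __init__(self):
        self._paths = []

    def register(self, path):
        self._paths.append(path)

    def load(self, path):
        # Pull one slice back from disk only when it is actually needed.
        with gzip.open(path, "rb") as f:
            return pickle.load(f)

    def load_all(self):
        for path in self._paths:
            yield self.load(path)
```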
In this embodiment, by adopting the data processing method including the above steps S610 to S620, the address information at which each data stream slice is cached on the disk is obtained, and the address information is saved in memory. According to the solution of the embodiment of the present application, since the data stream slices are cached on the disk and the address information is saved in memory, when a data stream slice needs to be read, the memory obtains the data stream slice directly according to the address information, which reduces the memory consumed by storing the data stream slices, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 7, step S140 is further described. Step S140 may also include, but is not limited to, step S710 and step S720.
Step S710: read the address information of each data stream slice from memory.
In this step, the address information refers to the address information of each data stream slice on the disk. In one implementation, each data stream slice is saved on the disk in the form of a file, and the address information refers to the path information of the file formed by each data stream slice; alternatively, the data stream slices may be saved in data tables of a database, and the address information refers to the path information of each data table.
It should be noted that the data stream slices are cached on the disk; when a data stream slice needs to be read into memory, the address information at which the data stream slice is cached on the disk is obtained first, and each data stream slice is read according to the address information. The memory only caches the address information, which saves the consumption that would be caused by storing the data stream slices in memory, thereby achieving the purpose of improving memory utilization.
Step S720: extract the data to be processed in each data stream slice from the disk into memory according to the address information.
In this step, the extracted data to be processed comes from different data stream slices. Extracting the data to be processed from the disk into memory may mean writing the data to be processed, in order, into a cache queue in memory, which facilitates the subsequent first merge processing of the data to be processed.
It should be noted that the address information of each data stream slice is cached in memory, and the data to be processed in each data stream slice is extracted from the disk into memory according to the address information, which reduces cases where unmerged data stream slices enter memory directly, facilitates the subsequent first merge processing of the extracted data to be processed, and reduces the memory consumed by storing the data, thereby achieving the purpose of improving memory utilization.
In this embodiment, by adopting the data processing method including the above steps S710 to S720, the address information of each data stream slice is read from memory, and the data to be processed in each data stream slice is extracted from the disk into memory according to the address information. According to the solution of the embodiment of the present application, the address information of each data stream slice is cached in memory, and the data to be processed in each data stream slice is extracted from the disk into memory through the address information, which facilitates the merging of the data to be processed in the subsequent steps and reduces the memory consumed by storing the data, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 8, step S130 is further described. Step S130 may also include, but is not limited to, step S810 and step S820.
Step S810: perform second merge processing on the data to be processed that has the same dimension information within each data stream slice, to obtain a plurality of merged data stream slices.
In this step, after the data stream has been sliced, the data to be processed with the same dimension information within each data stream slice may be subjected to second merge processing. The second merge processing refers to merging the data to be processed within each data stream slice, so that the pieces of data to be processed within the same data stream slice all have different dimension information, which reduces the memory resources consumed by the first merge processing in the subsequent steps and achieves the purpose of improving memory utilization.
Step S820: cache the plurality of merged data stream slices on the disk.
In this step, the data stream slices cached on the disk have undergone the second merge processing, which reduces the memory resources consumed by the first merge processing in the subsequent steps; moreover, the data volume of the merged data stream slices is reduced, thereby reducing the disk resources consumed by caching the data stream slices.
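A sketch of such a within-slice pre-merge is shown below; `dim_key` and `combine` are the same illustrative hooks used in the earlier merge sketches and are assumptions of this example rather than fixed interfaces of the method.

```python
from collections import defaultdict

def pre_merge_slice(slice_records, dim_key, combine):
    """Second merge: collapse duplicate dimension keys inside one slice
    before the slice is written to disk, so the cached file is smaller."""
    groups = defaultdict(list)
    for record in slice_records:
        groups[dim_key(record)].append(record)
    merged_slice = []
    for recs in groups.values():
        merged = recs[0]
        for other in recs[1:]:
            merged = combine(merged, other)
        merged_slice.append(merged)
    return merged_slice
```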
In this embodiment, by adopting the data processing method including the above steps S810 to S820, the data to be processed with the same dimension information within each data stream slice is subjected to second merge processing to obtain a plurality of merged data stream slices, and the plurality of merged data stream slices are cached on the disk. According to the solution of the embodiment of the present application, performing the second merge processing on the data stream slices before caching them on the disk reduces the memory resources consumed by the first merge processing in the subsequent steps; moreover, the data volume of the merged data stream slices is reduced, thereby reducing the disk resources consumed by caching the data stream slices.
In one embodiment, as shown in FIG. 9, the data processing method is further described. The data processing method may also include, but is not limited to, step S910.
Step S910: filter the data to be processed according to a preset filter condition, where the preset filter condition includes: the indicator information of the data to be processed is smaller than a preset indicator value threshold.
In this step, the indicator information refers to the information in the non-dimension fields of the data to be processed. In one implementation, referring to FIG. 13, if the user number information and the cell number information are preset as dimension information, the data corresponding to the number of TCP uplink packets, the number of TCP downlink packets, the TCP uplink traffic, the TCP downlink traffic, and the TCP service duration all belong to indicator information; alternatively, referring to FIG. 14, if the service type number and the district/county number are preset as dimension information, the data corresponding to the average response delay, the average display delay, the response success rate, the display success rate, and the total traffic all belong to indicator information.
It should be noted that the received data stream may be data received directly from the network, or data read from a database in blocks. Filtering the data to be processed refers to removing the data to be processed that does not pass the preset filtering, so as to reduce the impact of abnormal data to be processed on the data processing.
It should also be noted that the preset filter condition includes that the indicator information of the data to be processed is smaller than a preset indicator value threshold. In one implementation, referring to FIG. 13, if the preset TCP service duration threshold is set to 3, the data to be processed whose TCP service duration equals 2, namely data 4 in FIG. 13, is removed; alternatively, referring to FIG. 14, if the preset response success rate threshold is 90, the data to be processed whose response success rate is 89, namely data 3 in FIG. 14, is removed.
It should also be noted that filtering the data to be processed according to the preset filter condition may be performed after the data stream is received, or before the data stream slices are cached on the disk, thereby reducing the data volume of the data stream slices cached on the disk and improving the utilization of the disk space.
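The threshold filter described above can be sketched in a few lines; the field name and threshold below simply mirror the FIG. 13 example (TCP service duration threshold of 3) and are not prescribed by the method.

```python
def filter_records(records, field, threshold):
    """Drop records whose indicator value falls below the preset threshold."""
    return [r for r in records if r[field] >= threshold]

# Example mirroring FIG. 13: remove records whose TCP service duration is below 3.
# kept = filter_records(records, "tcp_duration", 3)
```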
In this embodiment, by adopting the data processing method including the above step S910, the data to be processed is filtered according to a preset filter condition, where the preset filter condition includes: the indicator information of the data to be processed is smaller than a preset indicator value threshold. According to the solution of the embodiment of the present application, the data to be processed that does not pass the preset filtering is removed, which reduces the impact of abnormal data to be processed on the data processing and improves the accuracy of the data processing; moreover, the data volume of the data to be processed is also reduced, thereby achieving the purpose of improving memory utilization.
In one embodiment, as shown in FIG. 10, step S140 is further described. Step S140 may also include, but is not limited to, step S1010 and step S1020.
Step S1010: when the quantity of the data to be processed cached on the disk is greater than a preset quantity threshold, extract the data to be processed in each data stream slice from the disk into memory.
In this step, the preset quantity threshold is set manually; for example, the user may set the preset quantity threshold according to the size of the disk space and the size of the memory space, which reduces cases where the data cache occupies too much disk space and makes disk consumption and memory consumption more reasonable, thereby improving the utilization of both the disk and the memory.
Step S1020: obtain the total slicing time, and when the total slicing time is greater than a preset time threshold, extract the data to be processed in each data stream slice from the disk into memory.
In this step, the preset time threshold is set manually. In one implementation, the system receives network data from the network in real time, and the user needs to analyze the data within a preset period; the total slicing time is obtained, and when the total slicing time is greater than the preset time threshold, all the data to be processed that needs to be analyzed has been sliced and the slicing ends, so the data to be processed in each data stream slice is extracted from the disk into memory, which makes the processing of the data to be processed received from the data stream more reasonable.
It should be noted that, in one implementation, referring to FIG. 15, the preset time threshold is the time statistics granularity T0, and the preset slicing interval refers to the data aggregation slice granularity Tc. The data stream is received, and when the statistics granularity T0 ends, that is, when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk into memory and waits for the subsequent merging; the data to be processed within the statistics granularity T0 is exactly all the data that the user needs loaded into memory for merging.
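As a small illustrative sketch, the two trigger conditions above can be checked together; the parameter names are hypothetical, and `total_slice_time` corresponds to the statistics granularity T0 of FIG. 15 while the slicing interval itself (Tc) only controls slicing.

```python
def should_extract(cached_record_count, count_threshold,
                   total_slice_time, time_threshold):
    """Trigger the disk-to-memory extraction when either preset limit is reached."""
    return (cached_record_count > count_threshold
            or total_slice_time > time_threshold)
```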
In this embodiment, by adopting the data processing method including the above steps S1010 to S1020, when the quantity of the data to be processed cached on the disk is greater than the preset quantity threshold, the data to be processed in each data stream slice is extracted from the disk into memory; or, the total slicing time is obtained, and when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk into memory. According to the solution of the embodiment of the present application, when the quantity of the data to be processed cached on the disk is greater than the preset quantity threshold, or when the total slicing time is greater than the preset time threshold, the data to be processed in each data stream slice is extracted from the disk into memory, which reduces cases where the data cache occupies too much disk space and facilitates the merging of the data to be processed in the subsequent steps, achieving the purpose of improving memory utilization.
In addition, an embodiment of the present application further provides a data processing device, which includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
The processor and the memory may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may include memories remote from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the data processing method of the above embodiments are stored in the memory, and when executed by the processor, perform the data processing method of the above embodiments, for example, performing the above-described method steps S110 to S150 in FIG. 1, method steps S210 to S220 in FIG. 2, method steps S151 to S154 in FIG. 3, method steps S1521 to S1522 in FIG. 4, method steps S1531 to S1532 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method step S910 in FIG. 9, and method steps S1010 to S1020 in FIG. 10.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or a controller, for example by a processor in the above device embodiment, cause the processor to perform the data processing method of the above embodiments, for example, performing the above-described method steps S110 to S150 in FIG. 1, method steps S210 to S220 in FIG. 2, method steps S151 to S154 in FIG. 3, method steps S1521 to S1522 in FIG. 4, method steps S1531 to S1532 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method step S910 in FIG. 9, and method steps S1010 to S1020 in FIG. 10.
The embodiments of the present application include: receiving a data stream, where the data stream includes a plurality of pieces of data to be processed, and the data to be processed includes dimension information; slicing the data stream at preset time intervals to obtain a plurality of data stream slices; caching the plurality of data stream slices on a disk; extracting the data to be processed in each data stream slice from the disk into memory; and performing, in the memory, first merge processing on the data to be processed with the same dimension information, to obtain a target data set. According to the solution of the embodiments of the present application, caching the data stream slices on the disk saves memory storage cost; extracting the data to be processed in each data stream slice from the disk into memory and performing the first merge processing in memory on the data to be processed with the same dimension information to obtain the target data set reduces cases where the data stream is read directly in memory; after merging, the number of records becomes smaller, so the data volume of the target data set is reduced, which further reduces cases where the data to be processed in the data stream is read directly in memory, thereby achieving the purpose of improving memory utilization.
Those of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The above is a specific description of several embodiments of the present application, but the present application is not limited to the above embodiments. Those skilled in the art can also make various equivalent variations or substitutions without departing from the spirit of the present application, and such equivalent variations or substitutions are all included within the scope defined by the claims of the present application.

Claims (12)

  1. A data processing method, comprising:
    receiving a data stream, wherein the data stream comprises a plurality of pieces of data to be processed, and the data to be processed comprises dimension information;
    slicing the data stream at preset time intervals to obtain a plurality of data stream slices;
    caching the plurality of data stream slices on a disk;
    extracting the data to be processed in each of the data stream slices from the disk into a memory; and
    performing, in the memory, first merge processing on the data to be processed having the same dimension information, to obtain a target data set.
  2. The data processing method according to claim 1, wherein caching the plurality of data stream slices on the disk comprises:
    sorting the data to be processed in each of the data stream slices according to the dimension information; and
    caching the plurality of sorted data stream slices on the disk.
  3. The data processing method according to claim 2, wherein the data to be processed in the data stream slices has been sorted, and performing, in the memory, the first merge processing on the data to be processed having the same dimension information to obtain the target data set comprises:
    traversing the data to be processed in each of the data stream slices;
    obtaining the data to be processed according to the dimension information;
    obtaining target data according to quantity information of the data to be processed; and
    obtaining the target data set according to the target data.
  4. The data processing method of claim 3, wherein acquiring the data to be processed according to the sorting and the dimension information comprises:
    when the sorting is in ascending order of the dimension information, acquiring the data to be processed having the smallest dimension information;
    or,
    when the sorting is in descending order of the dimension information, acquiring the data to be processed having the largest dimension information.
  5. The data processing method of claim 3, wherein obtaining the target data according to the quantity information of the data to be processed comprises:
    when the quantity of the data to be processed is greater than one, merging the data to be processed to obtain the target data;
    or,
    when the quantity of the data to be processed is equal to one, determining the data to be processed as the target data.
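
For illustration only: a sketch of the traversal described in claims 3 to 5, assuming each slice is already sorted in ascending order of its dimension information (claims 2 and 4) and that merging several records with the same dimension means summing their values (claim 5). A heapq-based k-way merge is one possible way to always pick the record with the smallest dimension next; it is not mandated by the claims.

```python
import heapq

def merge_sorted_slices(sorted_slices):
    """Traverse slices sorted ascending by dimension and merge duplicate dimensions."""
    # heapq.merge always yields the record with the smallest dimension next,
    # matching the ascending-order branch of claim 4.
    stream = heapq.merge(*sorted_slices, key=lambda rec: rec[0])
    target_data_set = []
    current_dim, current_val, count = None, 0, 0
    for dim, val in stream:
        if dim != current_dim and count:
            target_data_set.append((current_dim, current_val))  # one or more records merged
            current_val, count = 0, 0
        current_dim = dim
        current_val += val   # assumed merge rule: accumulate the values
        count += 1
    if count:
        target_data_set.append((current_dim, current_val))
    return target_data_set

# Two slices, each sorted ascending by dimension information
print(merge_sorted_slices([[("a", 1), ("c", 2)], [("a", 4), ("b", 3)]]))
# [('a', 5), ('b', 3), ('c', 2)]
```
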
  6. The data processing method of claim 1, further comprising:
    acquiring address information of each of the data stream slices cached on the disk; and
    saving the address information in the memory.
  7. The data processing method of claim 1, wherein extracting the data to be processed in each of the data stream slices from the disk into the memory comprises:
    reading the address information of each of the data stream slices from the memory; and
    extracting the data to be processed in each of the data stream slices from the disk into the memory according to the address information.
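
For illustration only: a sketch of claims 6 and 7 under the assumption that all slices are appended to a single cache file and that the address information kept in memory is each slice's byte offset and length. The single-file layout and the pickle serialization are assumptions made for the example.

```python
import pickle

class SliceCache:
    """Append each slice to one cache file; keep its (offset, length) in memory."""

    def __init__(self, path):
        self.path = path
        self.addresses = []        # address information kept in memory (claim 6)
        self._offset = 0
        open(path, "wb").close()   # start with an empty cache file

    def cache_slice(self, records):
        blob = pickle.dumps(records)
        with open(self.path, "ab") as f:
            f.write(blob)
        self.addresses.append((self._offset, len(blob)))
        self._offset += len(blob)

    def load_slices(self):
        slices = []
        with open(self.path, "rb") as f:
            for offset, length in self.addresses:   # claim 7: extract by address
                f.seek(offset)
                slices.append(pickle.loads(f.read(length)))
        return slices
```

With such a cache, extraction never has to scan the disk file; each slice is read back directly from its recorded offset.
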
  8. The data processing method of claim 1, wherein caching the plurality of data stream slices on a disk comprises:
    performing a second merging process on the data to be processed having the same dimension information in each of the data stream slices, to obtain a plurality of merged data stream slices; and
    caching the plurality of merged data stream slices on the disk.
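
For illustration only: a sketch of the second merging process of claim 8, again assuming (dimension, value) records merged by summing. Aggregating inside each slice before it is written reduces the amount of data that has to be cached on the disk.

```python
from collections import defaultdict

def second_merge(slice_records):
    """Merge records sharing a dimension inside a single slice before caching."""
    acc = defaultdict(int)
    for dimension, value in slice_records:
        acc[dimension] += value    # assumed merge rule: sum the values
    return list(acc.items())

print(second_merge([("a", 1), ("a", 2), ("b", 7)]))  # [('a', 3), ('b', 7)]
```
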
  9. The data processing method of claim 1, wherein the data to be processed further comprises indicator information, and the data processing method further comprises:
    filtering the data to be processed according to a preset filter condition, wherein the preset filter condition comprises: the indicator information of the data to be processed being smaller than a preset indicator value threshold.
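
For illustration only: a sketch of the filtering of claim 9, assuming the indicator information is a numeric field, that filtering means discarding the records that satisfy the condition, and that the threshold value shown is a placeholder.

```python
INDICATOR_THRESHOLD = 10  # preset indicator value threshold (placeholder)

def filter_records(records, threshold=INDICATOR_THRESHOLD):
    """Discard records whose indicator information falls below the threshold."""
    return [r for r in records if r["indicator"] >= threshold]

records = [{"dimension": "a", "indicator": 3}, {"dimension": "b", "indicator": 42}]
print(filter_records(records))  # [{'dimension': 'b', 'indicator': 42}]
```
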
  10. The data processing method of claim 1, wherein extracting the data to be processed in each of the data stream slices from the disk into the memory comprises:
    when the quantity of the data to be processed cached on the disk is greater than a preset quantity threshold, extracting the data to be processed in each of the data stream slices from the disk into the memory;
    or,
    acquiring a total slicing time, and when the total slicing time is greater than a preset time threshold, extracting the data to be processed in each of the data stream slices from the disk into the memory.
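
For illustration only: a sketch of the two extraction triggers of claim 10, assuming the caller tracks the number of records cached on disk and the time at which slicing started; both threshold values are placeholders.

```python
import time

COUNT_THRESHOLD = 100_000   # preset quantity threshold (placeholder)
TIME_THRESHOLD_S = 60.0     # preset time threshold (placeholder)

def should_extract(cached_record_count, slicing_started_at):
    """Return True when the cached data should be pulled from disk into memory."""
    too_many = cached_record_count > COUNT_THRESHOLD
    too_long = (time.monotonic() - slicing_started_at) > TIME_THRESHOLD_S
    return too_many or too_long
```
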
  11. A data processing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data processing method of any one of claims 1 to 10.
  12. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the data processing method of any one of claims 1 to 10.
PCT/CN2022/125989 2021-12-15 2022-10-18 Data processing method and device, and storage medium WO2023109302A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111532433.8 2021-12-15
CN202111532433.8A CN116263747A (en) 2021-12-15 2021-12-15 Data processing method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2023109302A1 true WO2023109302A1 (en) 2023-06-22

Family

ID=86722493

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125989 WO2023109302A1 (en) 2021-12-15 2022-10-18 Data processing method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN116263747A (en)
WO (1) WO2023109302A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326461A (en) * 2016-08-30 2017-01-11 杭州东方通信软件技术有限公司 Real time processing guarantee method and system based on network signaling record
US20170083378A1 (en) * 2015-09-18 2017-03-23 Salesforce.Com, Inc. Managing processing of long tail task sequences in a stream processing framework
CN109726209A (en) * 2018-09-07 2019-05-07 网联清算有限公司 Log aggregation method and device
CN111651510A (en) * 2020-05-14 2020-09-11 拉扎斯网络科技(上海)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN112685368A (en) * 2020-12-30 2021-04-20 成都科来网络技术有限公司 Method and system for processing complete session of super-large data packet file and readable storage medium


Also Published As

Publication number Publication date
CN116263747A (en) 2023-06-16

Similar Documents

Publication Publication Date Title
WO2020024799A1 (en) Method for aggregation optimization of time series data
US9690842B2 (en) Analyzing frequently occurring data items
EP3812915A1 (en) Big data statistics at data-block level
US11138183B2 (en) Aggregating data in a mediation system
WO2016141735A1 (en) Cache data determination method and device
CN107704203B (en) Deletion method, device and equipment for aggregated large file and computer storage medium
CN106033324B (en) Data storage method and device
WO2015024474A1 (en) Rapid calculation method for electric power reliability index based on multithread processing of cache data
CN111782707B (en) Data query method and system
US11625412B2 (en) Storing data items and identifying stored data items
CN111522786A (en) Log processing system and method
WO2023155849A1 (en) Sample deletion method and apparatus based on time decay, and storage medium
EP3726397A1 (en) Join query method and system for multiple time sequences under columnar storage
CN106990914B (en) Data deleting method and device
CN114328545A (en) Data storage and query method, device and database system
CN115408149A (en) Time sequence storage engine memory design and distribution method and device
Zhang et al. Efficient incremental computation of aggregations over sliding windows
CN107346270B (en) Method and system for real-time computation based radix estimation
WO2023109302A1 (en) Data processing method and device, and storage medium
US11789639B1 (en) Method and apparatus for screening TB-scale incremental data
CN106599005B (en) Data archiving method and device
WO2023071367A1 (en) Processing method and apparatus for communication service data, and computer storage medium
CN108153805A (en) A kind of method, the system of efficient cleaning Hbase time series datas
CN110990394B (en) Method, device and storage medium for counting number of rows of distributed column database table
CN113625959B (en) Data processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22906054

Country of ref document: EP

Kind code of ref document: A1