CA3144051A1 - Data sorting method, device, and system - Google Patents

Data sorting method, device, and system

Info

Publication number
CA3144051A1
CA3144051A1 CA3144051A CA3144051A CA3144051A1 CA 3144051 A1 CA3144051 A1 CA 3144051A1 CA 3144051 A CA3144051 A CA 3144051A CA 3144051 A CA3144051 A CA 3144051A CA 3144051 A1 CA3144051 A1 CA 3144051A1
Authority
CA
Canada
Prior art keywords
data
sampling
data block
cutting
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3144051A
Other languages
French (fr)
Inventor
Weijian Yu
Zhiwei Wang
Qiao XIE
Qian Sun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Pertaining to the field of big data processing technology, the present invention makes public a data sorting method, and corresponding device and system. The method comprises: partitioning received to-be-processed data into at least two first data blocks; sampling each first data block, and obtaining sampling data to which each first data block corresponds; calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block; determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data; partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data; and sorting data in each second data block, and obtaining a sorting result.

Description

DATA SORTING METHOD, DEVICE, AND SYSTEM
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present invention relates to the field of big data processing technology, and more particularly to a data sorting method, a data sorting device, and a corresponding system.
Description of Related Art
[0002] Scenarios of grouping and sorting, or of grouping to obtain the top N items, are frequently encountered in data processing such as data cleaning and data analysis. It is usual in the state of the art to partition data into a plurality of data blocks according to grouped fields and to place the various data blocks in corresponding threads to be sorted therein. However, when the data volume is relatively large and the grouped fields are relatively few, this practice tends to cause an imbalanced distribution of data volumes among the threads: threads assigned relatively large data volumes take longer to process, and OOM (out of memory) occurs in severe cases.
SUMMARY OF THE INVENTION
[0003] In order to overcome problems pending in the state of the art, embodiments of the present invention provide a data sorting method, a data sorting device, and a corresponding system. The technical solutions are as follows.
[0004] According to the first aspect, there is provided a data sorting method that comprises:
[0005] partitioning received to-be-processed data into at least two first data blocks;

Date Recue/Date Received 2021-12-24
[0006] sampling each first data block, and obtaining sampling data to which each first data block corresponds;
[0007] calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block;
[0008] determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data;
[0009] partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data; and
[0010] sorting data in each second data block, and obtaining a sorting result.
[0011] Further, the step of determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data includes:
[0012] grouping and sorting all sampling data according to grouped fields and sorted fields of the to-be-processed data;
[0013] traversing the sorted sampling data, and accumulating weight values of the sampling data currently traversed, to obtain an accumulated weight;
[0014] comparing the accumulated weight with a weight maximum to which each second data block corresponds; and
[0015] determining, when the accumulated weight is not smaller than the weight maximum, the last sampling data currently traversed as the cutting-point data, and determining the currently compared second data block as the second data block to which the cutting-point data corresponds.
[0016] Further, a process of calculating a weight maximum to which each second data block corresponds includes:
[0017] calculating a cut step according to a sum of weight values of the sampling data and the number of the second data blocks; and
[0018] calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds.
[0019] Further, the number of the second data blocks is determined according to a data volume threshold of each second data block and a total data volume of the to-be-processed data.
[0020] Further, the step of sorting data in each second data block, and obtaining a sorting result includes:
[0021] determining a starting location of the grouped fields in each second data block; and
[0022] grouping and sorting data in the second data block according to the grouped fields and the sorted fields in accordance with the starting location of the grouped fields in the second data block.
[0023] Further, the step of determining a starting location of the grouped fields in each second data block includes:
[0024] obtaining all the second data blocks and a total data volume to which the grouped fields correspond; and
[0025] determining a starting location of the grouped fields in each second data block according to the total data volume to which the grouped fields correspond.
[0026] According to the second aspect, there is provided a data sorting device that comprises:
[0027] a first data block generating module, for partitioning received to-be-processed data into at least two first data blocks;
[0028] a sampling module, for sampling each first data block, and obtaining sampling data to which each first data block corresponds;
[0029] a weight value calculating module, for calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block;
[0030] a cutting-point data determining module, for determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data;
[0031] a second data block generating module, for partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data; and
[0032] a sorting result generating module, for sorting data in each second data block, and obtaining a sorting result.
[0033] Further, the cutting-point data determining module comprises:
[0034] a sampling data sorting module, for grouping and sorting all sampling data according to grouped fields and sorted fields of the to-be-processed data;
[0035] an accumulation calculating module, for traversing the sorted sampling data, and accumulating weight values of the sampling data currently traversed, to obtain an accumulated weight; and
[0036] a comparing module, for comparing the accumulated weight with a weight maximum to which each second data block corresponds, determining, when the accumulated weight reaches the weight maximum, the last sampling data currently traversed as the cutting-point data, and determining the currently compared second data block as the second data block to which the cutting-point data corresponds.
[0037] Further, the cutting-point data determining module further includes a weight maximum calculating module for:
[0038] calculating a cut step according to a sum of weight values of the sampling data and the number of the second data blocks; and
[0039] calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds.
[0040] Further, the cutting-point data determining module further includes a second data block number determining module, for:
[0041] determining the number of the second data blocks according to a data volume threshold of each second data block and a total data volume of the to-be-processed data.
[0042] Further, the sorting result generating module includes:
[0043] a data starting location determining module, for determining a starting location of the grouped fields in each second data block; and
[0044] a sorting module, for grouping and sorting data in the second data block according to the grouped fields and the sorted fields in accordance with the starting location of the grouped fields in the second data block.
[0045] Further, the data starting location determining module is specifically employed for:
[0046] obtaining all the second data blocks and a total data volume to which the grouped fields correspond; and
[0047] determining a starting location of the grouped fields in each second data block according to the total data volume to which the grouped fields correspond.
[0048] According to the third aspect, there is provided a computer system that comprises:
[0049] one or more processor(s); and
[0050] a memory, associated with the one or more processor(s), wherein the memory is employed to store a program instruction, and the program instruction executes the method according to any item of the aforementioned first aspect when it is read and executed by the one or more processor(s).
[0051] The technical solutions provided by the embodiments of the present invention bring about the following advantageous effects.
[0052] By calculating weights of sampling data to determine cutting information to partition and sort the to-be-processed data, technical solutions made public by the present invention solve the problem of imbalanced grouping and sorting distribution caused by imbalanced data volumes in first data blocks.
[0053] By solving the problem of imbalanced grouping and sorting distribution, technical solutions made public by the present invention hence solve the problem of OOM (out of memory) caused by excessive data being involved in one or more data blocks.
[0054] By determining cutting information to partition and sort the to-be-processed data through the sampling data, technical solutions made public by the present invention greatly reduce the computational amount of the sorting, and support grouping and sorting of massive data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0055] To more clearly describe the technical solutions in the embodiments of the present invention, drawings required to illustrate the embodiments will be briefly introduced below. Apparently, the drawings introduced below are merely directed to some embodiments of the present invention, while persons ordinarily skilled in the art may further acquire other drawings on the basis of these drawings without spending creative effort in the process.
[0056] Fig. 1 is a flowchart illustrating a data sorting method provided by an embodiment of the present invention;
[0057] Fig. 2 is a view illustrating a data sorting process provided by an embodiment of the present invention;
[0058] Fig. 3 is a view illustrating a process of generating second data blocks provided by an embodiment of the present invention;
[0059] Fig. 4 is a view illustrating a process of sorting second data blocks provided by an embodiment of the present invention;
[0060] Fig. 5 is a view schematically illustrating the structure of a data sorting device provided by an embodiment of the present invention; and
[0061] Fig. 6 is a view schematically illustrating the structure of a computer system provided by an embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0062] To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and comprehensively described below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the embodiments as described are merely partial, rather than the entire, embodiments of the present invention. Any other embodiments that persons ordinarily skilled in the art can make on the basis of the embodiments in the present invention without creative effort shall all fall within the protection scope of the present invention.
[0063] In order to overcome the problem of heavy processing load that occurs when sorting data with a relatively large data volume and relatively few grouped fields, the present invention proposes a data sorting method and corresponding device and system, with technical solutions specified as follows.
[0064] As shown in Fig. 1, the data sorting method comprises the following steps.
[0065] S1 - partitioning received to-be-processed data into at least two first data blocks.
[0066] The to-be-processed data can be data read from a database or a data file. Cutting the to-be-processed data into first data blocks in this step can take the data file as the unit, namely one data file per data block, and it is also possible to cut in a self-defined manner.
[0067] Specifically, the sorting scenario disclosed in the embodiments of the present invention can be processed by using a spark big data component, but this is not limited to spark. As shown in Fig. 2, data blocks in the spark component are partitions, the first data block is P1, data file 1 is cut into partition 10 that contains 30 pieces of data, data file 2 is cut into partition 11 that contains 20 pieces of data, and data file 3 is cut into partition 12 that contains 10 pieces of data.
[0068] Take, for example, to-be-processed data that is score data of male and female users; the to-be-processed data can be as shown in the following Table 1:
Table 1
User ID   Gender   Score   First Data Block Serial Number
A         male     92      partition 10
...       ...      ...     partition 10
B         female   21      partition 10
...       ...      ...     partition 10
C         male     45      partition 11
...       ...      ...     partition 11
D         female   95      partition 11
...       ...      ...     partition 11
E         male     86      partition 12
...       ...      ...     partition 12
F         female   88      partition 12
...       ...      ...     partition 12
[0069] S2 - sampling each first data block, and obtaining sampling data to which each first data block corresponds.
[0070] The reservoir sampling method is employed to sample the first data block, and the reservoir sampling method is mainly applied to equal probability sampling in which the data stream is extremely long or unknown, and data in the data stream can be accessed only once. Moreover, the sampling quantity is identical in the various first data blocks.
[0071] Specifically, as shown in Fig. 2, the various first data blocks are respectively sampled by using the reservoir sampling method to obtain sampling data.
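The reservoir sampling mentioned above can be sketched as follows. This is a minimal illustration of the classic single-pass technique (often called Algorithm R), not the implementation used in the embodiment; the function name reservoir_sample and the sample size of 2 per first data block are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items drawn with equal probability from a stream read once."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: the same sample size for every first data block.
sample = reservoir_sample(range(30), k=2, rng=random.Random(0))
```

Because every first data block contributes the same number of samples regardless of its size, the samples have to be weighted afterwards, which is what step S3 does.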
[0072] Continuing the aforementioned example of the to-be-processed data being score data of male and female users, the sampling data as obtained are as shown in the following Table 2:
Table 2
User ID   Gender   Score   First Data Block Serial Number
A         male     92      partition10
B         female   21      partition10
C         male     45      partition11
D         female   95      partition11
E         male     86      partition12
F         female   88      partition12
[0073] S3 - calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block.
[0074] The specific calculation expression of the weight value is:
$$\mathrm{weight}_i = \frac{N_i}{\mathrm{TotalN}}$$
[0075] where weight_i is the weight value of the sampling data to which each first data block corresponds, N_i is the data volume in each first data block, TotalN is the total data volume of all first data blocks, namely the total data volume of the to-be-processed data, and i is the code of a first data block.
[0076] Continuing the aforementioned example of the to-be-processed data being score data of male and female users, among the aforementioned first data blocks, the weight value to which partition10 corresponds is 1/2, the weight value to which partition11 corresponds is 1/3, and the weight value to which partition12 corresponds is 1/6.
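As a check on these values, the weight expression can be evaluated directly. The sketch below assumes the data volumes given for Fig. 2 (30, 20 and 10 pieces of data); the function name sample_weights is hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

# Data volume N_i of each first data block, as given for Fig. 2.
block_sizes = {"partition10": 30, "partition11": 20, "partition12": 10}


def sample_weights(sizes):
    """weight_i = N_i / TotalN for every first data block."""
    total = sum(sizes.values())  # TotalN
    return {name: Fraction(n, total) for name, n in sizes.items()}


weights = sample_weights(block_sizes)
for name, weight in weights.items():
    print(name, weight)
```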
[0077] S4 - determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data.
[0078] The second data blocks are the finally partitioned data blocks of the to-be-processed data, and the cutting-point data is the piece of data sorted last in the various second data blocks when the to-be-processed data is partitioned into second data blocks.
Accordingly, the cutting-point data is a cutting point between second data blocks. As shown in Fig. 3, the to-be-processed data is partitioned into the second data blocks P2 according to the cutting-point data determined from the sampling data of the first data blocks P1.
[0079] In one embodiment, the specific method of determining the cutting information includes:
[0080] grouping and sorting all sampling data according to grouped fields and sorted fields of the to-be-processed data;
[0081] traversing the sorted sampling data, and accumulating weight values of the sampling data currently traversed, to obtain an accumulated weight;
[0082] comparing the accumulated weight with a weight maximum to which each second data block corresponds; and
[0083] determining, when the accumulated weight is not smaller than the weight maximum, the last sampling data currently traversed as the cutting-point data, and determining the currently compared second data block as the second data block to which the cutting-point data corresponds.
[0084] With respect to the aforementioned grouping and sorting of all sampling data according to grouped fields and sorted fields of the to-be-processed data, and continuing the aforementioned example of the to-be-processed data being scores of males and females, the grouped fields are male and female, the sorted field is the score, and the grouping and sorting result of the sampling data is as shown in the following Table 3:
Table 3
User ID   Gender   Score   First Data Block Serial Number   weight_i
A         male     92      partition10                      1/2
E         male     86      partition12                      1/6
C         male     45      partition10                      1/2
D         female   95      partition11                      1/3
F         female   88      partition12                      1/6
B         female   21      partition11                      1/3
[0085] The process of accumulating the weight values of the sampling data is to accumulate sequentially according to the arrangement sequence of the sampling data. During the accumulating process the accumulated weight is compared with the weight maximums sequentially, in increasing order of the weight maximums. The weight maximums to which the various second data blocks correspond are not equal; they can be set manually, and can also be obtained through calculation.
[0086] Preferably, in an embodiment, the process of calculating a weight maximum to which each second data block corresponds includes:
[0087] calculating a cut step according to a sum of weight values of the sampling data and the number of the second data blocks; and
[0088] calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds.
[0089] The cut step is the difference between the weight maximums of two adjacent second data blocks, and is the basic unit for partitioning the to-be-processed data into second data blocks. Its specific calculation expression is as follows:
$$\mathrm{step} = \frac{\mathrm{SUM}(\mathrm{weight}_i)}{\mathrm{PN}}$$
[0090] where step is the cut step, SUM(weight_i) is the sum of weight values of the sampling data to which the various first data blocks correspond, and PN is the number of the second data blocks.
[0091] The serial numbers to which the second data blocks correspond are related to the number of the second data blocks; for instance, the serial number of the first of the second data blocks is 1, the serial number of the second of the second data blocks is 2, and so forth.
[0092] The step of calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds is specifically: taking the product of the cut step and the serial number to which each second data block corresponds as the weight maximum of that second data block. The calculation expression is step*n_i, where n_i is the serial number to which each second data block corresponds, and the value range of n_i is [1, PN].
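The cut step and the weight maximums can be sketched as follows. The function name weight_maximums is hypothetical, and the input assumes two samples per first data block with weights 1/2, 1/3 and 1/6 and three second data blocks, in line with the running example.

```python
from fractions import Fraction


def weight_maximums(sample_weights, block_count):
    """Return (step, [step * n_i for n_i in 1..PN])."""
    # step = SUM(weight_i) / PN
    step = sum(sample_weights, Fraction(0)) / block_count
    # The weight maximum of the second data block with serial number n_i is step * n_i.
    return step, [step * n for n in range(1, block_count + 1)]


# Assumed sample weights: two samples from each of the three first data blocks.
all_weights = [Fraction(1, 2)] * 2 + [Fraction(1, 3)] * 2 + [Fraction(1, 6)] * 2
step, maximums = weight_maximums(all_weights, block_count=3)
print(step, maximums)
```

The last weight maximum always equals the sum of all sample weights, so the final sample in sorted order closes the last second data block.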
[0093] In one embodiment, the number of the second data blocks is determined according to a data volume threshold of each second data block and a total data volume of the to-be-processed data. The specific calculation expression is as follows:
$$\mathrm{PN} = \frac{\mathrm{TotalN}}{k}$$
[0094] where PN is the number of the second data blocks, TotalN is the total data volume of the to-be-processed data, k is the data volume threshold of the second data blocks, and k is the same for the various second data blocks. Continuing the aforementioned example of the to-be-processed data being score data of male and female users, if k is 20, then PN = 60/20 = 3.
[0095] The data volume threshold of the second data blocks is the maximum data volume, preset by the user, that each second data block can store, so the number of the second data blocks is calculated in combination with the total data volume of the to-be-processed data.
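A small sketch of this calculation is given below. Rounding up when TotalN is not an exact multiple of k is an assumption made here so that no block exceeds the threshold; the text only gives the quotient. The function name second_block_count and the threshold of 20 are likewise hypothetical, chosen so that the 60 pieces of data in the example yield the three second data blocks used in the rest of the walkthrough.

```python
def second_block_count(total_volume: int, threshold: int) -> int:
    """PN = TotalN / k, rounded up (the rounding is an assumption of this sketch)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_volume // threshold)


pn = second_block_count(total_volume=60, threshold=20)
print(pn)
```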
[0096] Continuing the aforementioned example of the to-be-processed data being score data of male and female users, the cut step as calculated is:
$$\mathrm{step} = \frac{\frac{1}{2}\times 2 + \frac{1}{3}\times 2 + \frac{1}{6}\times 2}{3} = \frac{2}{3}$$
[0097] The grouped and sorted sampling data are traversed and the weight values corresponding thereto are accumulated. When the accumulated value is greater than or equal to step*n_i, the sampling data whose weight value is the last addend of the accumulated value is determined as the cutting-point data.
[0098] Likewise taking the example of the to-be-processed data being scores of males and females, the accumulating process is: $\frac{1}{2} + \frac{1}{6} = \frac{2}{3}$, so user E is the cutting-point data of the first data block in the second data blocks, namely the last data in partition20.
[0099] Accumulation is continued, $\frac{1}{2} + \frac{1}{6} + \frac{1}{2} + \frac{1}{3} > \frac{2}{3} \times 2$, so user D is the cutting-point data of the second data block in the second data blocks, namely the last data in partition21.
[0100] Accumulation is continued, $\frac{1}{2} + \frac{1}{6} + \frac{1}{2} + \frac{1}{3} + \frac{1}{6} + \frac{1}{3} = 2$, so user B is the cutting-point data of the third data block in the second data blocks, namely the last data in partition22.
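The traversal described above can be sketched as follows. The sample list reproduces the grouped-and-sorted order and the weights used in the accumulation (A, E, C, D, F, B); the function name find_cut_points is hypothetical, and advancing by at most one second data block per sample is a simplification assumed for this sketch.

```python
from fractions import Fraction as F

# (user, gender, score, weight) in grouped-and-sorted order, weights as in the accumulation.
sorted_samples = [
    ("A", "male", 92, F(1, 2)),
    ("E", "male", 86, F(1, 6)),
    ("C", "male", 45, F(1, 2)),
    ("D", "female", 95, F(1, 3)),
    ("F", "female", 88, F(1, 6)),
    ("B", "female", 21, F(1, 3)),
]


def find_cut_points(samples, block_count):
    """Return the sample that closes each second data block."""
    step = sum((s[3] for s in samples), F(0)) / block_count
    cuts = []
    accumulated = F(0)
    serial = 1  # serial number n_i of the second data block being filled
    for sample in samples:
        accumulated += sample[3]
        # Cut when the accumulated weight is not smaller than step * n_i.
        if serial <= block_count and accumulated >= step * serial:
            cuts.append(sample)
            serial += 1
    return cuts


cuts = find_cut_points(sorted_samples, block_count=3)
print([c[0] for c in cuts])
```

With these inputs the cut points are users E, D and B, matching the three accumulations above.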
[0101] The calculation principle for determining the cutting information according to the sampling data in the embodiments of the present invention is as follows.
[0102] The weight value embodies the proportion of the data volume of each first data block in the total data volume, so with respect to a first data block with relatively small data volume, the weight value to which its sampling data corresponds is also correspondingly relatively small. Determination of the cutting-point data through the accumulated weight value mainly takes into consideration the influence on the determination of the cutting-point data through the sampling data by different data volumes in the various first data blocks, whereby the to-be-processed data is more balanced in its partition into the second data blocks.
[0103] S5 - partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data.
[0104] After the cutting-point data has been determined, each piece of the to-be-processed data is matched against the cutting-point data. The grouped fields and the sorted fields are matched at the same time: to-be-processed data whose grouped field is consistent with that of a cutting-point data and whose sorted-field value is greater than or equal to the sorted-field value of that cutting-point data is partitioned into the second data block to which the cutting-point data corresponds.
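One way to read this matching step is as a range lookup over the cut points. The sketch below is an interpretation rather than the embodiment's implementation: it assumes that groups are ordered male before female, that scores are sorted in descending order within a group, and that records sorting after the last cut point fall into the last block. The names GROUP_ORDER, sort_key and assign_block are hypothetical; the cut points are users E, D and B from the example.

```python
from bisect import bisect_left

GROUP_ORDER = {"male": 0, "female": 1}  # assumed ordering of the grouped field


def sort_key(gender, score):
    """Composite key: group first, then descending score."""
    return (GROUP_ORDER[gender], -score)


# Cut points E (male, 86), D (female, 95), B (female, 21), already in sorted order.
cut_keys = [sort_key("male", 86), sort_key("female", 95), sort_key("female", 21)]


def assign_block(gender, score):
    """Index of the second data block whose cut point is the first one not before the record."""
    index = bisect_left(cut_keys, sort_key(gender, score))
    # Assumption: anything sorting after the last cut point stays in the last block.
    return min(index, len(cut_keys) - 1)


records = [("A", "male", 92), ("C", "male", 45), ("D", "female", 95), ("F", "female", 88)]
blocks = {user: assign_block(gender, score) for user, gender, score in records}
print(blocks)
```

Under this reading a second data block can hold the tail of one group and the head of the next, which is why the in-block sorting step has to locate where each grouped field starts.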

[0105] S6 - sorting data in each second data block, and obtaining a sorting result.
[0106] After the processing of steps S1 to S5, the to-be-processed data is distributed in order across the various second data blocks, but the to-be-processed data within each second data block is unordered. Therefore, as shown in Fig. 4, it is required to perform in-block sorting on the to-be-processed data in the second data blocks.
[0107] In one embodiment, data with different grouped fields may exist in one second data block. It is therefore required to determine the location at which sorting of each grouped field starts in the second data block, and the specific technical solution includes:
[0108] determining a starting location of the grouped fields in each second data block; and
[0109] grouping and sorting data in the second data block according to the grouped fields and the sorted fields in accordance with the starting location of the grouped fields in each second data block, to obtain a sorting result.
[0110] With respect to the method of determining a starting location of the grouped fields of the to-be-processed data in the second data block, the present invention makes public the following two embodiments.
[0111] In one embodiment, obtaining a starting location of the grouped fields of the to-be-processed data in the second data block includes:
[0112] determining a second data block that contains a plurality of grouped fields according to the number of grouped fields contained in each second data block; and
[0113] determining a starting location of the grouped fields in the second data block according to the number of grouped fields contained in the second data block that contains a plurality of grouped fields.
[0114] After the data volumes of the grouped fields contained in the second data block have been determined, the position of the last piece of data to which one grouped field corresponds is the ending location of that grouped field in the second data block, and the next position is the starting location of another grouped field in the second data block. Take for example the grouped fields being male and female, and the sorted field being scores:
[0115] counting the data volume of the grouped fields in each second data block, for instance, 0->[{male, 500,000}], N->[{male, 200,000}, {female, 300,000}], .... Seen as such, the second data block with the serial number N contains data whose grouped field is male and also data whose grouped field is female. The starting location of the data whose grouped field is female is then 200,000+1 in this second data block.
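The position arithmetic in this example can be sketched as follows, assuming 1-based positions and that the grouped fields of a block are listed in their sort order. The function name group_start_positions is hypothetical.

```python
def group_start_positions(block_counts):
    """Map each grouped field in one second data block to its 1-based starting position.

    block_counts is an ordered list of (group, data volume) pairs for that block.
    """
    starts = {}
    position = 1
    for group, count in block_counts:
        starts[group] = position
        # The next grouped field begins right after the last row of this one.
        position += count
    return starts


# Second data block N from the example: 200,000 male rows, then 300,000 female rows.
starts = group_start_positions([("male", 200_000), ("female", 300_000)])
print(starts)
```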
[0116] In one embodiment, obtaining a starting location of the grouped fields of the to-be-processed data in each second data block includes:
[0117] obtaining all the second data blocks and a total data volume to which the grouped fields correspond; and
[0118] determining a starting location of the grouped fields in each second data block according to the total data volume to which the grouped fields correspond.
[0119] Take for example the grouped fields being male and female, and the sorted field being scores:
[0120] counting the data volume of the grouped fields in each second data block, for instance, 0->[{male, 500,000}], N->[{male, 200,000}, {female, 300,000}], .... The above information is converted into the information {male->[{0,0}, {1,500,000}, ...], female->[{N,0}, {N+1,300,000}]}, whence can be obtained the starting locations of the grouped fields male and female in each second data block.
[0121] Compared with the previous method of determining the starting location of grouped fields in a second data block, this embodiment does not need to search for second data blocks that contain a plurality of grouped fields; instead, the various second data blocks are processed uniformly.
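The conversion in this second embodiment resembles a per-group running total over the blocks. The sketch below is one possible reading: for each grouped field it records, per second data block, how many rows of that field sit in earlier blocks. The function name group_offsets is hypothetical, block N is taken to be block 1, and the third block's count of 400,000 is invented purely for illustration.

```python
from collections import defaultdict


def group_offsets(per_block_counts):
    """Return {group: {block serial number: rows of that group in earlier blocks}}."""
    running = defaultdict(int)   # rows seen so far per grouped field
    offsets = defaultdict(dict)
    for block, counts in enumerate(per_block_counts):
        for group, count in counts.items():
            offsets[group][block] = running[group]
            running[group] += count
    return {group: dict(by_block) for group, by_block in offsets.items()}


# Blocks 0 and 1 follow the example; block 2 is a made-up continuation.
counts = [
    {"male": 500_000},
    {"male": 200_000, "female": 300_000},
    {"female": 400_000},
]
offsets = group_offsets(counts)
print(offsets)
```

Every block is handled by the same loop, so no separate search for blocks holding several grouped fields is needed.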

[0122] As shown in Fig. 5, based on the aforementioned data sorting method, the present invention further discloses a data sorting device that comprises the following modules.
[0123] A first data block generating module 501 is employed for partitioning received to-be-processed data into at least two first data blocks.
[0124] The to-be-processed data can be data read from a database or a data file. Cutting the to-be-processed data into first data blocks in this step can be the cutting with data file as unit, and it is also possible to cut by self-definition.
[0125] A sampling module 502 is employed for sampling each first data block, and obtaining sampling data to which each first data block corresponds.
[0126] The sampling module 502 mainly employs the reservoir sampling method to equally sample various first data blocks.
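Reservoir sampling draws a fixed-size uniform sample from a block in a single pass, without knowing the block size in advance. The sketch below is a standard form of the algorithm (Algorithm R) in Python, given for illustration only; the function name, the returned row count, and the `rng` parameter are assumptions and are not taken from the original disclosure.

```python
import random

def reservoir_sample(rows, k, rng=None):
    """Return (row_count, sample) for one first data block.

    rows: any iterable of records.
    k:    the sample size, identical for every first data block.
    rng:  optional random.Random instance for reproducible runs.
    """
    rng = rng or random.Random()
    sample = []
    count = 0
    for row in rows:
        count += 1
        if len(sample) < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / count.
            j = rng.randrange(count)
            if j < k:
                sample[j] = row
    return count, sample

n_rows, sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Returning the row count together with the sample is convenient here because the data volume of each first data block is needed afterwards to compute the weight value.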
[0127] A weight value calculating module 503 is employed for calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block.
[0128] The weight value of the sampling data to which the first data block corresponds is a proportion of the data volume of the first data block to the total data volume of the to-be-processed data, and the specific calculation expression is:
$$\mathrm{weight}_i = \frac{N_i}{\mathrm{TotalN}}$$
[0129] where weight_i is the weight value of the sampling data to which each first data block corresponds, N_i is the data volume in each first data block, TotalN is the total data volume of all first data blocks, namely the total data volume of the to-be-processed data, and i is the code of a first data block.
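As a small worked illustration of the weight calculation, the following Python sketch computes one weight per first data block from the block sizes. The function name `block_weights` and the sample sizes are hypothetical; the sketch assumes that every sampling datum drawn from a block carries that block's weight.

```python
def block_weights(block_sizes):
    """Map each first data block to its share of the total data volume.

    block_sizes: dict {block code i: data volume N_i}.
    """
    total_n = sum(block_sizes.values())
    if total_n == 0:
        raise ValueError("the to-be-processed data is empty")
    return {block_id: n_i / total_n for block_id, n_i in block_sizes.items()}

# Hypothetical, deliberately imbalanced first data blocks.
sizes = {0: 600_000, 1: 300_000, 2: 100_000}
weights = block_weights(sizes)
```

Since every block contributes the same number of samples, a sample from the largest block stands for more of the underlying data than a sample from the smallest block, and the weights express that difference.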
[0130] A cutting-point data determining module 504 is employed for determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data.
[0131] The cutting-point data is the piece of data sorted last in the various second data blocks when the to-be-processed data is partitioned into second data blocks.
Accordingly, the cutting-point data is a cutting point between second data blocks.
[0132] In one embodiment, the cutting-point data determining module 504 includes:
[0133] a sampling data sorting module, for grouping and sorting all sampling data according to grouped fields and sorted fields of the to-be-processed data;
[0134] an accumulation calculating module, for traversing the sorted sampling data, and accumulating weight values of the sampling data currently traversed, to obtain an accumulated weight; and
[0135] a comparing module, for comparing the accumulated weight with a weight maximum to which each second data block corresponds, determining, when the accumulated weight reaches the weight maximum, the last sampling data currently traversed as the cutting-point data, and determining the currently compared second data block as the second data block to which the cutting-point data corresponds.
[0136] In one embodiment, the cutting-point data determining module 504 further includes a weight maximum calculating module for:
[0137] calculating a cut step according to a sum of weight values of the sampling data and the number of the second data blocks; and
[0138] calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds.

Date Recue/Date Received 2021-12-24
[0139] The cut step is the difference between the weight maximums of two adjacent second data blocks, and is the basic unit for partitioning the to-be-processed data into second data blocks. Its specific calculation expression is as follows:
$$\mathrm{step} = \frac{\mathrm{SUM}(\mathrm{weight}_i)}{\mathrm{PN}}$$
[0140] where step is the cut step, SUM(weight) is the sum of weight values of the sampling data to which the various first data blocks correspond, and PN is the number of the second data blocks.
[0141] Calculating the weight maximum according to the cut step and the serial number to which each second data block corresponds is specifically: taking the product of the cut step and the serial number to which each second data block corresponds as the weight maximum. The calculation expression is step*ni, where ni is the serial number to which each second data block corresponds, and the value range of ni is [1,PN].
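The interplay of the accumulated weight, the cut step, and the weight maximum can be sketched as follows. This Python example is one possible reading, not the definitive implementation: the function name `find_cut_points` and the tuple layout are assumptions, exact fractions are used to avoid floating-point effects at the thresholds, and a sample that crosses several thresholds at once is recorded only once.

```python
from fractions import Fraction

def find_cut_points(samples, num_blocks):
    """Determine cutting-point data among weighted samples.

    samples:    list of (group, sort_value, weight) tuples.
    num_blocks: PN, the number of second data blocks.
    Returns a list of (block serial number, (group, sort_value)).
    """
    # Group and sort all sampling data by grouped field, then sorted field.
    ordered = sorted(samples, key=lambda s: (s[0], s[1]))
    step = sum(w for _, _, w in ordered) / num_blocks  # cut step
    cuts = []
    accumulated = 0
    n = 1  # serial number of the second data block being compared
    for group, value, weight in ordered:
        accumulated += weight
        if n <= num_blocks and accumulated >= step * n:
            cuts.append((n, (group, value)))
            # Skip every weight maximum this sample has already reached.
            while n <= num_blocks and accumulated >= step * n:
                n += 1
    return cuts

big, small = Fraction(3, 4), Fraction(1, 4)  # hypothetical block weights
samples = [
    ("f", 10, big), ("f", 30, big), ("m", 20, big),
    ("f", 20, small), ("m", 5, small), ("m", 40, small),
]
cuts = find_cut_points(samples, 3)
```

In this hypothetical run the weights sum to 3, so the cut step is 1 and the weight maximums are 1, 2, and 3; the cutting points found are (f, 20) for block 1, (m, 5) for block 2, and (m, 40) for block 3.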
[0142] In one embodiment, the cutting-point data determining module 504 further includes:
[0143] a second data block number determining module, for determining the number of the second data blocks according to a data volume threshold of each second data block and a total data volume of the to-be-processed data.
[0144] The specific calculation expression is as follows:
$$\mathrm{PN} = \frac{\mathrm{TotalN}}{k}$$
[0145] where PN is the number of the second data blocks, TotalN is the total data volume of the to-be-processed data, k is the data volume threshold of the second data blocks, and k is consistent with respect to the various second data blocks.
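A short sketch of this calculation in Python is given below. The original text does not state how a fractional quotient is handled; rounding up to a whole number of blocks is an assumption made here, as is the function name `second_block_count`.

```python
def second_block_count(total_n, k):
    """Number of second data blocks for a data volume threshold k.

    Rounding up is an assumption: it keeps every block at or below
    the threshold when total_n is not a multiple of k.
    """
    if k <= 0:
        raise ValueError("the data volume threshold must be positive")
    return max(1, (total_n + k - 1) // k)  # integer ceiling division

pn = second_block_count(1_000_000, 300_000)
```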
[0146] A second data block generating module 505 is employed for partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data.
[0147] After the cutting-point data has been determined, the various data in the to-be-processed data is respectively matched with the cutting-point data, such matching is simultaneous with the matching of the grouped fields and the sorted fields, in which the to-be-processed data whose grouped fields are consistent and whose values of sorted fields are greater than or equal to the values of the cutting-point sorted fields is partitioned in the second data blocks to which the cutting-point data corresponds.
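The matching step can be illustrated with a binary search over the cutting points. The Python sketch below is a hedged illustration only: it assumes an ascending order in which each cutting point is the last key of its second data block, so a row goes to the first block whose cutting point is not smaller than the row's (grouped field, sorted field) key; for a descending order the comparison would be reversed. The function name `assign_block` is hypothetical.

```python
from bisect import bisect_left

def assign_block(row_key, cut_keys):
    """Return the serial number (starting at 1) of the second data block.

    row_key:  (group, sort_value) of one piece of to-be-processed data.
    cut_keys: ascending list of (group, sort_value) cutting points,
              one per second data block.
    """
    # Index of the first cutting point that is >= row_key.
    idx = bisect_left(cut_keys, row_key)
    # Keys beyond the last cutting point fall into the last block.
    return min(idx, len(cut_keys) - 1) + 1

cut_keys = [("f", 20), ("m", 5), ("m", 40)]  # hypothetical cutting points
blocks = [assign_block(key, cut_keys)
          for key in [("f", 10), ("f", 20), ("f", 30), ("m", 6), ("m", 99)]]
```

Comparing the grouped field and the sorted field together as one tuple corresponds to matching both fields simultaneously, so rows of the same group stay contiguous across adjacent blocks.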
[0148] A sorting result generating module 506 is employed for sorting data in each second data block, and obtaining a sorting result.
[0149] In one embodiment, the sorting result generating module 506 includes:
[0150] a data starting location determining module, for determining a starting location of the grouped fields in each second data block; and
[0151] a sorting module, for grouping and sorting data in the second data block according to the grouped fields and the sorted fields in accordance with the starting location of the grouped fields in the second data block.
[0152] In one embodiment, the data starting location determining module is specifically employed for:
[0153] determining a second data block that contains a plurality of grouped fields according to the number of grouped fields contained in each second data block; and
[0154] determining a starting location of the grouped fields in the second data block according to the number of grouped fields contained in the second data block that contains a plurality of grouped fields.
[0155] In one embodiment, the data starting location determining module is specifically employed for:
[0156] obtaining all the second data blocks and a total data volume to which the grouped fields correspond; and
[0157] determining a starting location of the grouped fields in each second data block according to the total data volume to which the grouped fields correspond.
[0158] The device disclosed in the embodiments of the present invention can specifically be formed by a Driver node and executor nodes. Specifically, the work is divided between the two as described below.
[0159] After the received to-be-processed data has been partitioned into at least two first data blocks, the first data blocks are forwarded to the executor nodes to be equally sampled, and data is generated in the format (partitionIdxId1, N_i, samArray<T>) and transmitted to the Driver node, wherein partitionIdxId1 is the partition number of a first data block, N_i represents the data volume contained in the first data block, and samArray<T> is a sampling data set, in which T is a data format with grouped fields and sorted fields.
[0160] The Driver node sums the N_i values to obtain the total data volume TotalN of the to-be-processed data, and calculates the number PN of the second data blocks according to TotalN and a preset data volume threshold of the second data blocks.
The weight of each sampling data is calculated via N_i/TotalN, and the cut step is calculated via SUM(weight)/PN. The sampling data set samArray<T> is traversed and the corresponding weights are accumulated at the same time. When the accumulated value is greater than or equal to step*n (n ∈ [1,PN]) and the current data is inconsistent with the previously obtained cutting-point data, the data to which the last accumulated weight corresponds is determined to be the cutting-point data, and the corresponding second data block serial number is the serial number of the second data block in the cutting information. The process continues in this manner until all cutting information array<index,T> in the sampling data set has been found, in which index is the serial number of the second data block to which the cutting-point data corresponds.

[0161] The Driver node broadcasts the cutting information array<index,T> to each executor node, where each piece of to-be-processed data in the first data blocks is matched against the cutting information array<index,T>, and the to-be-processed data is partitioned to the corresponding second data block according to the matching result. The executor node traverses the second data blocks to obtain (partitionIdxId2,array<M,N>) and reports it to the Driver node, in which partitionIdxId2 is the serial number of a second data block, M is a grouped field, and N is the data volume to which the grouped field corresponds.
[0162] The Driver node summarizes and generates array(partitionIdxId2,array<M,N>), converts it into map<M,map<pidex,N0>> and broadcasts it to each executor node, in which N0 is the starting order of group M sorted in partition number pidex.
[0163] The executor node sorts the data in the various second data blocks according to the grouped fields and sorted fields, obtains ordered data within each partition, traverses the M value and serial number in each second data block, obtains the starting order N0 of the current group in the current data block, and obtains the serial number of group M in this partition according to N0.
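The final step on each executor node can be sketched as adding the broadcast starting order to the position of a row within its group in the locally sorted block. The Python example below is an illustrative assumption rather than the original implementation: the function name `global_ranks`, the 1-based numbering, and the nested dictionary standing in for map<M,map<pidex,N0>> are all hypothetical.

```python
def global_ranks(block_id, sorted_rows, start_orders):
    """Assign each row its serial number within its group.

    block_id:     serial number of this second data block.
    sorted_rows:  list of (group, sort_value), already sorted by
                  grouped field and then by sorted field.
    start_orders: dict {group: {block serial number: N0}}.
    Returns a list of (group, sort_value, serial number in group).
    """
    seen_in_block = {}  # rows of each group already ranked in this block
    ranked = []
    for group, value in sorted_rows:
        local = seen_in_block.get(group, 0)
        seen_in_block[group] = local + 1
        ranked.append((group, value, start_orders[group][block_id] + local + 1))
    return ranked

# Hypothetical broadcast map: three female rows sit in block 0.
start_orders = {"female": {0: 0, 1: 3}, "male": {1: 0}}
rows = [("female", 80), ("female", 95), ("male", 60)]
ranked = global_ranks(1, rows, start_orders)
```

In this hypothetical case the two female rows in block 1 receive serial numbers 4 and 5 because three female rows precede them in block 0, while the first male row receives serial number 1.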
[0164] Based on the aforementioned data sorting method, the present invention further provides a computer system that comprises:
[0165] one or more processor(s); and
[0166] a memory, associated with the one or more processor(s), wherein the memory is employed to store a program instruction, and the program instruction executes the aforementioned data sorting method when it is read and executed by the one or more processor(s).
[0167] Fig. 6 exemplarily illustrates the framework of the computer system that can specifically include a processor 610, a video display adapter 611, a magnetic disk driver 612, an input/output interface 613, a network interface 614, and a memory 620. The processor 610, the video display adapter 611, the magnetic disk driver 612, the input/output interface 613, the network interface 614, and the memory 620 can be communicably connected with one another via a communication bus 630.
[0168] The processor 610 can be embodied as a general CPU (Central Processing Unit), a microprocessor, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuit(s) for executing relevant program(s) to realize the technical solutions provided by the present application.
[0169] The memory 620 can be embodied in such a form as a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, or a dynamic storage device.
The memory 620 can store an operating system 621 for controlling the running of an electronic equipment 600, and a basic input/output system 622 (BIOS) for controlling lower-level operations of the electronic equipment 600. In addition, the memory 620 can also store a web browser 623, a data storage management system 624, and an equipment identification information processing system 625, etc. The equipment identification information processing system 625 can be an application program that specifically realizes the aforementioned various step operations in the embodiments of the present application. To sum it up, when the technical solutions provided by the present application are to be realized via software or firmware, the relevant program codes are stored in the memory 620, and invoked and executed by the processor 610.
[0170] The input/output interface 613 is employed to connect with an input/output module to realize input and output of information. The input/output module can be equipped in the device as a component part (not shown in the drawings), and can also be externally connected with the device to provide corresponding functions. The input means can include a keyboard, a mouse, a touch screen, a microphone, and various sensors etc., and the output means can include a display screen, a loudspeaker, a vibrator, an indicator light etc.
[0171] The network interface 614 is employed to connect to a communication module (not shown in the drawings) to realize intercommunication between the current device and other devices. The communication module can realize communication in a wired mode (via USB, network cable, for example) or in a wireless mode (via mobile network, WIFI, Bluetooth, etc.).
[0172] The bus 630 includes a passageway transmitting information between various component parts of the device (such as the processor 610, the video display adapter 611, the magnetic disk driver 612, the input/output interface 613, the network interface 614, and the memory 620).
[0173] Additionally, the electronic equipment 600 may further obtain information of specific collection conditions from a virtual resource object collection condition information database for judgment on conditions, and so on.
[0174] As should be noted, although merely the processor 610, the video display adapter 611, the magnetic disk driver 612, the input/output interface 613, the network interface 614, the memory 620, and the bus 630 are illustrated for the aforementioned device, the device may further include other component parts prerequisite for realizing normal running during specific implementation. In addition, as can be understood by persons skilled in the art, the aforementioned device may as well only include component parts necessary for realizing the solutions of the present application, without including the entire component parts as illustrated.
[0175] As can be known through the description of the aforementioned embodiments, persons skilled in the art can clearly understand that the present application can be realized through software plus a general hardware platform. Based on such understanding, the technical solutions of the present application, or the contributions made thereby over the state of the art, can be essentially embodied in the form of a software product, and such a computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk etc., and includes plural instructions enabling a computer equipment (such as a personal computer, a server, or a network device etc.) to execute the methods described in various embodiments or some sections of the embodiments of the present application.
[0176] The various embodiments are progressively described in the Description, identical or similar sections among the various embodiments can be inferred from one another, and each embodiment stresses what is different from other embodiments.
Particularly, with respect to the system or system embodiment, since it is essentially similar to the method embodiment, its description is relatively simple, and the relevant sections thereof can be inferred from the corresponding sections of the method embodiment. The system or system embodiment as described above is merely exemplary in nature, units therein described as separate parts can be or may not be physically separate, parts displayed as units can be or may not be physical units, that is to say, they can be located in a single site, or distributed over a plurality of network units. It is possible to base on practical requirements to select partial modules or the entire modules to realize the objectives of the embodied solutions. It is understandable and implementable by persons ordinarily skilled in the art without spending creative effort in the process.
[0177] Technical solutions provided by the embodiments of the present invention bring about the following advantageous effects.
[0178] By calculating weights of sampling data to determine cutting information to partition and sort the to-be-processed data, technical solutions made public by the present invention solve the problem of imbalanced grouping and sorting distribution caused by imbalanced data volumes in first data blocks.
[0179] By solving the problem of imbalanced grouping and sorting distribution, technical solutions made public by the present invention hence solve the problem of OOM (out of memory) caused by an excessive amount of data in one or more data blocks.
[0180] By determining cutting information to partition and sort the to-be-processed data through the sampling data, technical solutions made public by the present invention greatly reduce the computational amount of the sorting, and support grouping and sorting of massive data.
[0181] All the above optional technical solutions can be arbitrarily combined to form optional embodiments of the present invention, which are not described one by one here.
[0182] What is described above is merely directed to preferred embodiments of the present invention, and is not meant to restrict the present invention. Any amendment, equivalent replacement and improvement makeable within the spirit and principle of the present invention shall all fall within the protection scope of the present invention.


Claims (10)

What is claimed is:
1. A data sorting method, characterized in comprising:
partitioning received to-be-processed data into at least two first data blocks;
sampling each first data block, and obtaining sampling data to which each first data block corresponds;
calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block;
determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data;
partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data; and
sorting data in each second data block, and obtaining a sorting result.
2. The method according to Claim 1, characterized in that the step of determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data includes:
grouping and sorting all sampling data according to grouped fields and sorted fields of the to-be-processed data;
traversing the sorted sampling data, and accumulating weight values of the sampling data currently traversed, to obtain an accumulated weight;
comparing the accumulated weight with a weight maximum to which each second data block corresponds; and
determining, when the accumulated weight is not smaller than the weight maximum, the last sampling data currently traversed as the cutting-point data, and determining the currently compared second data block as the second data block to which the cutting-point data corresponds.

3. The method according to Claim 2, characterized in that a process of calculating a weight maximum to which each second data block corresponds includes:
calculating a cut step according to a sum of weight values of the sampling data and the number of the second data blocks; and
calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds.
4. The method according to Claim 3, characterized in that the number of the second data blocks is determined according to a data volume threshold of each second data block and a total data volume of the to-be-processed data.
5. The method according to any one of Claims 1 to 4, characterized in that the step of sorting data in each second data block, and obtaining a sorting result includes:
determining a starting location of the grouped fields in each second data block; and
grouping and sorting data in the second data block according to the grouped fields and the sorted fields in accordance with the starting location of the grouped fields in the second data block.
6. The method according to Claim 5, characterized in that the step of determining a starting location of the grouped fields in each second data block includes:
obtaining all the second data blocks and a total data volume to which the grouped fields correspond; and
determining a starting location of the grouped fields in each second data block according to the total data volume to which the grouped fields correspond.
7. A data sorting device, characterized in comprising:
a first data block generating module, for partitioning received to-be-processed data into at least two first data blocks;

a sampling module, for sampling each first data block, and obtaining sampling data to which each first data block corresponds;
a weight value calculating module, for calculating a weight value of the sampling data to which each first data block corresponds according to a data volume of each first data block;
a cutting-point data determining module, for determining, in the sampling data, cutting-point data for partitioning the to-be-processed data to generate at least two second data blocks based on the weight value of each sampling data;
a second data block generating module, for partitioning the to-be-processed data to generate the second data blocks by the use of the cutting-point data; and
a sorting result generating module, for sorting data in each second data block, and obtaining a sorting result.
8. The device according to Claim 7, characterized in that the cutting-point data determining module includes:
a sampling data sorting module, for grouping and sorting all sampling data according to grouped fields and sorted fields of the to-be-processed data;
an accumulation calculating module, for traversing the sorted sampling data, and accumulating weight values of the sampling data currently traversed, to obtain an accumulated weight; and
a comparing module, for comparing the accumulated weight with a weight maximum to which each second data block corresponds, determining, when the accumulated weight reaches the weight maximum, the last sampling data currently traversed as the cutting-point data, and determining the currently compared second data block as the second data block to which the cutting-point data corresponds.
9. The device according to Claim 8, characterized in that the cutting-point data determining module further includes a weight maximum calculating module for:
calculating a cut step according to a sum of weight values of the sampling data and the number of the second data blocks; and
calculating the weight maximum according to the cut step and a serial number to which each second data block corresponds.
10. A computer system, characterized in comprising:
one or more processor(s); and
a memory, associated with the one or more processor(s), wherein the memory is employed to store a program instruction, and the program instruction executes the method according to any one of Claims 1 to 6 when it is read and executed by the one or more processor(s).
CA3144051A 2020-12-28 2021-12-24 Data sorting method, device, and system Pending CA3144051A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011579026.8 2020-12-28
CN202011579026.8A CN112612614A (en) 2020-12-28 2020-12-28 Data sorting method, device and system

Publications (1)

Publication Number Publication Date
CA3144051A1 true CA3144051A1 (en) 2022-06-28

Family

ID=75248287

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3144051A Pending CA3144051A1 (en) 2020-12-28 2021-12-24 Data sorting method, device, and system

Country Status (2)

Country Link
CN (1) CN112612614A (en)
CA (1) CA3144051A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672619B (en) * 2021-08-17 2024-02-06 天津南大通用数据技术股份有限公司 Method for segmenting data according to hash rule to make data more uniform

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954967B2 (en) * 2011-05-31 2015-02-10 International Business Machines Corporation Adaptive parallel data processing
CN110263059B (en) * 2019-05-24 2021-05-11 湖南大学 Spark-Streaming intermediate data partitioning method and device, computer equipment and storage medium
CN111104225A (en) * 2019-12-23 2020-05-05 杭州安恒信息技术股份有限公司 Data processing method, device, equipment and medium based on MapReduce
CN112000467A (en) * 2020-07-24 2020-11-27 广东技术师范大学 Data tilt processing method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN112612614A (en) 2021-04-06


Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
