CN111930720A

CN111930720A - Data skew processing method, system, electronic device and medium

Info

Publication number: CN111930720A
Application number: CN202010863679.2A
Authority: CN
Inventors: 沈佳佳; 于美丽; 张帆
Original assignee: Ctrip Computer Technology Shanghai Co Ltd
Current assignee: Ctrip Computer Technology Shanghai Co Ltd
Priority date: 2020-08-25
Filing date: 2020-08-25
Publication date: 2020-11-13

Abstract

The invention discloses a data skew processing method, system, electronic device, and medium. The data skew processing method comprises the following steps: acquiring a skewed key value; splicing the skewed key value with a random number to obtain a spliced key value; expanding the second large table according to the spliced key value to obtain an expanded large table; and associating the first large table with the expanded large table. By processing the data skew, the invention makes the distribution of key-value data more uniform, which addresses the data skew that arises when one large table is associated with another, saves computing resources, speeds up computation, and shortens running time.

Description

Data skew processing method, system, electronic device and medium

Technical Field

The invention belongs to the technical field of big data computation optimization, and particularly relates to a data skew processing method, a data skew processing system, an electronic device, and a medium.

Background

When one large table is associated (joined) with another in offline computing based on HDFS (Hadoop Distributed File System), the data is often skewed. In the prior art, skewed key values are usually computed directly, without any special handling. Moreover, the underlying data is divided into partitions and tasks according to key value. When the number of records per key value is unevenly distributed, the processing time, memory usage, and central processing unit (CPU) usage of the tasks are unbalanced, and the total processing time is determined by the slowest task. This ultimately wastes computing resources and lengthens computation time.
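The imbalance described above can be illustrated with a minimal sketch. The data, the partition count, and the helper name `partition_sizes` are hypothetical and do not come from the patent; routing by `hash(key) % num_partitions` merely stands in for however the execution engine assigns keys to tasks.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count the rows each partition receives when rows are routed by key."""
    sizes = Counter()
    for key in keys:
        # For small non-negative ints, Python's hash() is the int itself,
        # so this routing is deterministic.
        sizes[hash(key) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


# Hypothetical data: key 7 is heavily skewed, keys 0..3 are light.
keys = [7] * 1000 + [0, 1, 2, 3] * 10
sizes = partition_sizes(keys, 4)

# If cost is proportional to row count, the job finishes only when the
# largest partition finishes, while the other tasks sit idle.
slowest = max(sizes)
average = sum(sizes) / len(sizes)
print(sizes, slowest, average)
```

In this toy case one partition holds almost four times the average load, which is the situation the method is meant to avoid.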

Disclosure of Invention

To overcome the defect of long processing time caused by an uneven distribution of the number of key values in the prior art, the invention provides a data skew processing method, system, electronic device, and medium.

The invention solves the technical problems through the following technical scheme:

The invention provides a data skew processing method, which comprises the following steps:

acquiring a skewed key value;

splicing the skewed key value with a random number to obtain a spliced key value;

expanding the second large table according to the spliced key value to obtain an expanded large table;

and associating the first large table with the expanded large table.

Preferably, the step of acquiring the skewed key value includes:

if the statistic of an ID is greater than a preset threshold, setting the ID as a skewed key value, where the statistic is the count of the ID in the second large table.

Preferably, the step of splicing the skewed key value with a random number to obtain a spliced key value includes:

obtaining a random number according to a preset random number generation function, and connecting the skewed key value with the random number by a preset splicing symbol to obtain the spliced key value;

the preset random number generation function is characterized as follows:

Ran = random((v1 + t) % 1000000000) × v2, where Ran denotes the random number, v1 the first parameter, t the timestamp, v2 the second parameter, random() the random operation, and % the remainder operator; Ran, v1, and v2 are all positive integers.

Preferably, the step of expanding the second large table according to the spliced key value to obtain the expanded large table includes:

adding the spliced key value as a newly added ID to the second large table to obtain the expanded large table.

The invention also provides a data skew processing system, which comprises a key value acquisition unit, a splicing unit, a capacity expansion unit, and an association unit;

the key value acquisition unit is used for acquiring a skewed key value;

the splicing unit is used for splicing the skewed key value with a random number to obtain a spliced key value;

the capacity expansion unit is used for expanding the second large table according to the spliced key value to obtain an expanded large table;

the association unit is used for associating the first large table with the expanded large table.

Preferably, if the statistic of an ID is greater than a preset threshold, the key value acquisition unit sets the ID as a skewed key value, where the statistic is the count of the ID in the second large table.

Preferably, the splicing unit obtains a random number according to a preset random number generation function, and connects the skewed key value with the random number by a preset splicing symbol to obtain the spliced key value;

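the preset random number generation function is characterized as follows:

Ran = random((v1 + t) % 1000000000) × v2, where Ran denotes the random number, v1 the first parameter, t the timestamp, v2 the second parameter, random() the random operation, and % the remainder operator; Ran, v1, and v2 are all positive integers.

Preferably, the capacity expansion unit adds the spliced key value as a newly added ID to the second large table to obtain the expanded large table.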

The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data skew processing method of the invention when executing the computer program.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data skew processing method of the present invention.

The beneficial effects of the invention are as follows: by processing the data skew, the invention makes the distribution of key-value data more uniform, which addresses the data skew that arises when one large table is associated with another, saves computing resources, speeds up computation, and shortens running time.

Drawings

Fig. 1 is a flowchart of a data skew processing method according to embodiment 1 of the present invention.

Fig. 2 is a schematic structural diagram of a data skew processing system according to embodiment 2 of the present invention.

Fig. 3 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.

Example 1

The embodiment provides a data skew processing method. Referring to fig. 1, the data skew processing method includes the steps of:

Step S1: acquire the skewed key value.

Step S2: splice the skewed key value with a random number to obtain a spliced key value.

Step S3: expand the second large table according to the spliced key value to obtain an expanded large table.

Step S4: associate the first large table with the expanded large table.

In a specific implementation, when a first large table (t1) and a second large table (t2) stored on HDFS are associated, and the association field is ID, the data volume in table t2 is counted per ID, and the ID values with a large data volume are the skewed key values. That is, in step S1, if the statistic of an ID is greater than a preset threshold, the ID is set as a skewed key value, where the statistic is the count of that ID in the second large table.
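Step S1 can be sketched as a per-ID count followed by a threshold filter. This is a minimal in-memory illustration only; the function name `find_skewed_ids`, the sample IDs, and the threshold are assumptions, and a production job would run the equivalent aggregation in the execution engine rather than in a Python list.

```python
from collections import Counter


def find_skewed_ids(t2_ids, threshold):
    """Return the IDs whose row count in t2 is strictly greater than threshold."""
    counts = Counter(t2_ids)
    return {key for key, count in counts.items() if count > threshold}


# Hypothetical sample of the ID column of table t2.
t2_ids = ["s1"] * 5 + ["s2"] * 2 + ["s3"]
skewed = find_skewed_ids(t2_ids, threshold=3)
print(skewed)
```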

In step S2, a random number is obtained according to a preset random number generation function, and the skewed key value is connected with the random number by a preset splicing symbol to obtain the spliced key value.

As an optional implementation, the preset random number generation function is characterized as:

Ran = random((v1 + t) % 1000000000) × v2, where Ran denotes the random number, v1 the first parameter, t the timestamp, v2 the second parameter, random() the random operation, and % the remainder operator; Ran, v1, and v2 are all positive integers. The random() operation calls the nextDouble method.
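One possible reading of this formula is sketched below. It assumes that (v1 + t) % 1000000000 is used as the seed of a pseudo-random generator, that the generator yields a double in [0, 1), and that the product with v2 is truncated to an integer. Python's `random.Random(seed).random()` stands in for the nextDouble call mentioned in the text, so the concrete values will differ from those of any Java implementation. The function name `generate_random` and the sample arguments are hypothetical.

```python
import random
import time


def generate_random(v1, v2, t=None):
    """Sketch of Ran = random((v1 + t) % 1000000000) * v2.

    v1: first parameter (for example an order number), used in the seed.
    v2: second parameter, the size of the integer range.
    t:  timestamp in milliseconds; defaults to the current time.
    """
    if t is None:
        t = int(time.time() * 1000)
    seed = (v1 + t) % 1000000000
    # Stand-in for nextDouble: a double in [0, 1) from a seeded generator.
    fraction = random.Random(seed).random()
    # Truncation is an assumption; it yields an integer in [0, v2).
    return int(fraction * v2)


ran = generate_random(v1=12345, v2=100, t=1600000000000)
print(ran)
```

With truncation the result can be 0, so the expansion in step S3 has to cover 0 as well.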

In another optional implementation, the preset random number generation function takes a single parameter: an object of the Random class calls the nextDouble method directly, the result is multiplied by the input parameter, and the resulting integer is returned.

In step S3, the spliced key value is added to the second large table as a newly added ID to obtain the expanded large table. That is, for each skewed ID, rows are added to table t2 whose IDs are the skewed ID spliced with every integer in the range of the random numbers generated in step S2, and whose remaining fields have the same contents as the row of the original ID.
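The expansion of step S3 can be sketched as follows, under the assumption that rows are plain dictionaries with an `id` field and that "@" is the splicing symbol. The function name `expand_table` and the sample rows are hypothetical.

```python
def expand_table(t2_rows, skewed_ids, n, sep="@"):
    """Return t2 plus, for every skewed ID, one copy per integer in [0, n).

    Each copy keeps all other fields and only replaces the ID with
    '<original id><sep><integer>'. Original rows are retained.
    """
    expanded = []
    for row in t2_rows:
        expanded.append(row)
        if row["id"] in skewed_ids:
            for i in range(n):
                copy = dict(row)
                copy["id"] = f"{row['id']}{sep}{i}"
                expanded.append(copy)
    return expanded


# Hypothetical second large table with one skewed ID.
t2 = [{"id": "s1", "name": "A"}, {"id": "s2", "name": "B"}]
t2_expanded = expand_table(t2, skewed_ids={"s1"}, n=3)
print([row["id"] for row in t2_expanded])
```

The expanded table grows by n rows per skewed ID, so the range n trades extra rows in t2 against a finer spread of the skewed key.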

In step S4, the first large table is associated with the expanded large table. The processed t1 is associated with the expanded t2, so that each skewed ID is matched with the rows added to t2 by the expansion, thereby obtaining the required information.
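A small end-to-end sketch of step S4 shows why the association still returns the right information after the keys are changed. It assumes an inner join on an `id` field and uses a hypothetical in-memory `hash_join` helper in place of the engine's join; the table contents, `salt_key`, and the range n = 4 are likewise illustrative.

```python
import random


def salt_key(key, skewed_ids, n, rng, sep="@"):
    """Append a random integer in [0, n) to skewed keys; leave others as-is."""
    if key in skewed_ids:
        return f"{key}{sep}{rng.randrange(n)}"
    return key


def hash_join(left_rows, right_rows):
    """Inner join on 'id' using an in-memory index of the right side."""
    index = {}
    for row in right_rows:
        index.setdefault(row["id"], []).append(row)
    return [(l, r) for l in left_rows for r in index.get(l["id"], [])]


skewed = {"s1"}
n = 4
rng = random.Random(0)

# Hypothetical tables: t1 has many rows for the skewed ID "s1".
t1 = [{"order": k, "id": "s1"} for k in range(6)] + [{"order": 6, "id": "s2"}]
t2 = [{"id": "s1", "name": "A"}, {"id": "s2", "name": "B"}]

# Expanded t2: original rows plus one copy per salt value for skewed IDs.
t2_expanded = list(t2) + [
    {**row, "id": f"{row['id']}@{i}"}
    for row in t2 if row["id"] in skewed
    for i in range(n)
]
# Processed t1: skewed IDs get a random salt, so they spread over n keys.
t1_salted = [{**row, "id": salt_key(row["id"], skewed, n, rng)} for row in t1]

joined = hash_join(t1_salted, t2_expanded)
names = {l["order"]: r["name"] for l, r in joined}
print(names)
```

Every salted key in t1 has exactly one counterpart in the expanded t2, so each order is matched once and receives the same fields it would have received from the unsalted join.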

In an optional embodiment, take as an example a job that calculates hotel financial prepayment revenue. The prepaid orders require the first large table to be associated with the second large table in order to retrieve supplier dimensions such as the supplier name, settlement method, and hotel group. The tables are stored on HDFS, the job is written in HiveSQL, the scheduling system is the Zeus system, and the execution engine is Spark.

In step S1, the skewed supplier IDs are extracted and written to a table. The criterion is that the data volume of the supplier ID in the most recent daily partition exceeds 1 million.

In step S2, the prepaid order master table is associated with the list of skewed supplier IDs from step S1. Where a supplier ID matches, a random number is spliced onto it, with "@" as the splicing symbol, an integer range of 100, and the order number as the seed parameter.
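The splicing in this example can be sketched as below. Seeding a generator with the order number, drawing a double in [0, 1), and truncating the product with the range to an integer are assumptions about how "the order number is used as the seed parameter"; the function name `salted_supplier_id` and the sample IDs are hypothetical.

```python
import random


def salted_supplier_id(supplier_id, order_no, skewed_ids, n=100, sep="@"):
    """Splice a seeded pseudo-random integer in [0, n) onto skewed supplier IDs."""
    if supplier_id not in skewed_ids:
        return supplier_id
    # The order number seeds the generator, so the same order always
    # receives the same salt when the job is rerun.
    salt = int(random.Random(order_no).random() * n)
    return f"{supplier_id}{sep}{salt}"


skewed_ids = {"1001"}
new_id = salted_supplier_id("1001", order_no=987654321, skewed_ids=skewed_ids)
same_id = salted_supplier_id("2002", order_no=987654321, skewed_ids=skewed_ids)
print(new_id, same_id)
```

Using a per-order seed keeps the result reproducible across reruns while still spreading the orders of one supplier over many keys.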

In step S3, a supplier expansion preprocessing table is built: each skewed supplier ID is spliced with "@" and then with every integer from 0 to 100 to form the supplier IDs of the expanded portion, and the table also retains the original supplier ID. The supplier table is then left-joined to the expansion preprocessing table on the original supplier ID; the matched new supplier IDs are taken, and the remaining dimension fields are kept unchanged, forming the expanded supplier table.

In step S4, the prepaid order master table, in which the skewed supplier IDs have been processed, is associated with the expanded supplier table, and the required supplier dimension information, such as the supplier name, settlement method, and hotel group, is retrieved from the supplier table.

By processing the data skew, the data skew processing method of this embodiment makes the distribution of key-value data more uniform, which addresses the data skew that arises when one large table is associated with another, saves computing resources, speeds up computation, and shortens running time.
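The effect on the key distribution can be checked with a small sketch. It assumes rows are routed to partitions by a hash of the key; `zlib.crc32` is used only because it is stable across runs, and the partition count, row count, and helper names are hypothetical. Salts are assigned round-robin here instead of randomly to keep the example deterministic.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Route a string key to a partition with a run-stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_size(keys, num_partitions):
    """Size of the most heavily loaded partition."""
    sizes = Counter(partition_of(k, num_partitions) for k in keys)
    return max(sizes.values())


rows = 1000
# Before: every row of the skewed supplier carries the same key.
before = ["s1"] * rows
# After: the key is spliced with "@" and an integer in [0, 100).
after = [f"s1@{i % 100}" for i in range(rows)]

max_before = max_partition_size(before, 4)
max_after = max_partition_size(after, 4)
print(max_before, max_after)
```

Before splicing, all rows of the skewed key fall into a single partition; after splicing, they are spread over several partitions, so the largest task shrinks.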

Example 2

The embodiment provides a data skew processing system. Referring to fig. 2, the data skew processing system includes a key value obtaining unit 201, a splicing unit 202, a capacity expansion unit 203, and an association unit 204.

The key value acquisition unit 201 is configured to acquire a skewed key value; the splicing unit 202 is configured to splice the skewed key value with a random number to obtain a spliced key value; the capacity expansion unit 203 is configured to expand the second large table according to the spliced key value to obtain an expanded large table; the association unit 204 is configured to associate the first large table with the expanded large table.

In a specific implementation, when a first large table (t1) and a second large table (t2) stored on HDFS are associated, and the association field is ID, the data volume in table t2 is counted per ID, and the ID values with a large data volume are the skewed key values. That is, if the statistic of an ID is greater than a preset threshold, the key value acquisition unit 201 sets the ID as a skewed key value, where the statistic is the count of that ID in the second large table.

The splicing unit 202 obtains a random number according to a preset random number generation function, and connects the skewed key value with the random number by a preset splicing symbol to obtain the spliced key value.

The capacity expansion unit 203 adds the spliced key value as a newly added ID to the second large table to obtain the expanded large table. That is, for each skewed ID, rows are added to table t2 whose IDs are the skewed ID spliced with every integer in the range of the random numbers generated by the splicing unit 202, and whose remaining fields have the same contents as the row of the original ID.

The association unit 204 associates the first large table with the expanded large table. The processed t1 is associated with the expanded t2, so that each skewed ID is matched with the rows added to t2 by the expansion, thereby obtaining the required information.

Taking the job that calculates hotel financial prepayment revenue as an example: first, the key value acquisition unit 201 extracts the skewed supplier IDs and writes them to a table. The criterion is that the data volume of the supplier ID in the most recent daily partition exceeds 1 million.

Then, the splicing unit 202 associates the prepaid order master table with the list of skewed supplier IDs. Where a supplier ID matches, a random number is spliced onto it, with "@" as the splicing symbol, an integer range of 100, and the order number as the seed parameter.

Then, the capacity expansion unit 203 builds a supplier expansion preprocessing table: each skewed supplier ID is spliced with "@" and then with every integer from 0 to 100 to form the supplier IDs of the expanded portion, and the table also retains the original supplier ID. The supplier table is then left-joined to the expansion preprocessing table on the original supplier ID; the matched new supplier IDs are taken, and the remaining dimension fields are kept unchanged, forming the expanded supplier table.

Finally, the association unit 204 associates the prepaid order master table, in which the skewed supplier IDs have been processed, with the expanded supplier table, and retrieves the required supplier dimension information, such as the supplier name, settlement method, and hotel group, from the supplier table.

By processing the data skew, the data skew processing system of this embodiment makes the distribution of key-value data more uniform, which addresses the data skew that arises when one large table is associated with another, saves computing resources, speeds up computation, and shortens running time.

Example 3

Fig. 3 is a schematic structural diagram of an electronic device provided in this embodiment. The electronic device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data skew processing method of embodiment 1 when executing the program. The electronic device 30 shown in fig. 3 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in fig. 3, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).

The bus 33 includes a data bus, an address bus, and a control bus.

The memory 32 may include volatile memory, such as Random Access Memory (RAM)321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.

Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The processor 31 executes various functional applications and data processing, such as a processing method of data skew according to embodiment 1 of the present invention, by running the computer program stored in the memory 32.

The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may take place through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via a network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.

It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.

Example 4

The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the steps of the processing method of data skew of embodiment 1.

More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In a possible implementation, the present invention can also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the steps of the data skew processing method of embodiment 1.

The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.

While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims

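1. A method for processing data skew, comprising the steps of:

acquiring a skewed key value;

splicing the skewed key value with a random number to obtain a spliced key value;

expanding a second large table according to the spliced key value to obtain an expanded large table;

and associating a first large table with the expanded large table.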

2. The method for processing data skew of claim 1, wherein the step of acquiring the skewed key value comprises:

if the statistic of an ID is greater than a preset threshold, setting the ID as the skewed key value, wherein the statistic is the count of the ID in the second large table.

3. The method for processing data skew of claim 1, wherein the step of splicing the skewed key value with a random number to obtain a spliced key value comprises:

obtaining the random number according to a preset random number generation function, and connecting the skewed key value with the random number by a preset splicing symbol to obtain the spliced key value;

wherein the preset random number generation function is characterized as:

Ran = random((v1 + t) % 1000000000) × v2, where Ran denotes the random number, v1 the first parameter, t the timestamp, v2 the second parameter, random() the random operation, and % the remainder operator; Ran, v1, and v2 are all positive integers.

4. The method for processing data skew of claim 1, wherein the step of expanding the second large table according to the spliced key value to obtain an expanded large table comprises:

adding the spliced key value as a newly added ID to the second large table to obtain the expanded large table.

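5. A processing system for data skew, comprising a key value acquisition unit, a splicing unit, a capacity expansion unit, and an association unit;

the key value acquisition unit is used for acquiring a skewed key value;

the splicing unit is used for splicing the skewed key value with a random number to obtain a spliced key value;

the capacity expansion unit is used for expanding a second large table according to the spliced key value to obtain an expanded large table;

the association unit is used for associating a first large table with the expanded large table.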

6. The data skew processing system of claim 5, wherein the key value acquisition unit sets an ID as the skewed key value if the statistic of the ID is greater than a preset threshold, the statistic being the count of the ID in the second large table.

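7. The data skew processing system according to claim 5, wherein the splicing unit obtains the random number according to a preset random number generation function, and connects the skewed key value with the random number by a preset splicing symbol to obtain the spliced key value;

wherein the preset random number generation function is characterized as:

Ran = random((v1 + t) % 1000000000) × v2, where Ran denotes the random number, v1 the first parameter, t the timestamp, v2 the second parameter, random() the random operation, and % the remainder operator; Ran, v1, and v2 are all positive integers.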

8. The data skew processing system of claim 5, wherein the capacity expansion unit adds the spliced key value as a newly added ID to the second large table to obtain the expanded large table.

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for processing data skew according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for processing data skew of any one of claims 1 to 4.