CN111459937A - Data table association method, device, server and storage medium - Google Patents

Publication number
CN111459937A
Authority
CN
China
Prior art keywords
data table
data
association
stage
target
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010227679.3A
Other languages
Chinese (zh)
Inventor
徐翔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2282: Tablespace storage structures; Management thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24553: Query execution of query operations
    • G06F16/24558: Binary matching operations
    • G06F16/2456: Join operations

Abstract

The embodiments of the application relate to the technical field of data processing and provide a data table association method, a device, a server and a storage medium. The method comprises: when a calculation request is received, determining a plurality of association stages in an instruction set corresponding to the calculation request; identifying a first data table and a second data table in a first association stage; if the data volume of at least one of the first data table and the second data table is smaller than a preset data threshold, associating the data tables in a broadcast manner; when the association task of a non-first association stage is executed, counting the data volume of the associated data table obtained in the first association stage; and if that data volume is smaller than the data threshold, broadcasting the associated data table obtained in the first association stage to the target data table for association. This reduces the amount of data transmitted over the network, improves program execution efficiency, and mitigates, to a certain extent, program memory overflow caused by data skew.

Description

Data table association method, device, server and storage medium
Technical Field
The present application belongs to the technical field of data processing, and in particular, to a data table association method, apparatus, server, and storage medium.
Background
As part of the Spark big data framework, Spark SQL supports reading and writing data with standard SQL queries and HiveQL. It can be used for distributed processing of structured data and allows Spark data to be queried in an SQL-like manner, helping developers create and run Spark programs more quickly.
The major performance bottleneck when processing big data with Spark SQL is the Shuffle, whose performance directly affects the performance and throughput of the whole program. Shuffle describes the process of moving data output by one node to another node. It comprises data preparation in the output stage and data copying in the input stage, namely Shuffle Write at the output end and Shuffle Read at the input end.
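The two halves of a Shuffle can be illustrated with a minimal, framework-free sketch. It is not Spark's implementation: the function names `shuffle_write` and `shuffle_read`, the modulo partitioning rule, and the sample rows are all assumptions chosen for illustration.

```python
from collections import defaultdict

def shuffle_write(rows, num_partitions):
    """Output side: bucket one node's rows by key, one bucket per downstream partition."""
    buckets = defaultdict(list)
    for key, value in rows:
        # Hypothetical partitioning rule: integer key modulo the partition count.
        buckets[key % num_partitions].append((key, value))
    return buckets

def shuffle_read(all_buckets, partition_id):
    """Input side: pull the bucket for partition_id from every upstream node."""
    pulled = []
    for buckets in all_buckets:
        pulled.extend(buckets.get(partition_id, []))
    return pulled

# Two upstream nodes each write their output, then two partitions read it back.
node_outputs = [
    shuffle_write([(1, "a"), (2, "b"), (3, "c")], num_partitions=2),
    shuffle_write([(2, "d"), (4, "e")], num_partitions=2),
]
partition_0 = shuffle_read(node_outputs, 0)
partition_1 = shuffle_read(node_outputs, 1)
```

In this sketch every row may cross the network once between the write and the read, which is the cost the broadcast approach described later tries to avoid.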
For example, when two data tables are associated, if the data of the large table is pulled to the node holding the small table, more data has to be pulled, which increases the amount of data transmitted over the network, reduces program execution efficiency, and may even cause program memory overflow due to data skew.
Disclosure of Invention
In view of this, embodiments of the present application provide a data table associating method, an apparatus, a server, and a storage medium, so as to solve the problem in the prior art that when data in a large table is pulled to a small table, the execution efficiency of a program is low due to a large amount of data, and even a program memory overflow due to data skew may occur.
A first aspect of an embodiment of the present application provides a data table association method, including:
when a calculation request aiming at a data table is received, acquiring an instruction set corresponding to the calculation request;
determining a plurality of association stages in the instruction set, the association stages including a first association stage and a non-first association stage adjacent to the first association stage;
when the association task corresponding to the first association stage is executed, identifying a first data table and a second data table to be associated in the first association stage;
if the data amount of at least one data table in the first data table and the second data table is smaller than a preset data threshold, associating the first data table with the second data table in a broadcast mode to obtain an associated data table in the first association stage;
when the associated task corresponding to the non-first associated stage is executed, counting the data volume of the associated data table obtained in the first associated stage;
if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, the associated data table obtained in the first association stage is broadcasted to a target data table, and the associated data table and the target data table are associated through the same key value.
A second aspect of the embodiments of the present application provides a data table association apparatus, including:
the instruction set acquisition module is used for acquiring an instruction set corresponding to a calculation request when the calculation request aiming at a data table is received;
an association stage determination module to determine a plurality of association stages in the instruction set, the association stages including a first association stage and a non-first association stage adjacent to the first association stage;
the data table identification module is used for identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
an initial association module, configured to associate the first data table with the second data table in a broadcast manner if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, so as to obtain an associated data table in the first association stage;
the data volume counting module is used for counting the data volume of the associated data table obtained in the first associated stage when the associated task corresponding to the non-first associated stage is executed;
and the target association module is used for broadcasting the associated data table obtained in the first association stage to a target data table if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, and associating the associated data table with the target data table through the same key value.
A third aspect of embodiments of the present application provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the data table association method described in the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the data table association method described in the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product, which when run on a server, causes the server to execute the data table association method of the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the embodiment of the application, when a calculation request for a data table is received, an instruction set corresponding to the calculation request is obtained, and each association operation is assigned to a separate association stage according to the instruction set. Before each association, the data volumes of the two data tables are counted and checked; if an associated data table smaller than the preset data threshold exists, it is broadcast to the data node where the target data table is located, and the association is performed on the node holding the target data table, which has the larger data volume. This reduces the amount of data transmitted over the network, improves program execution efficiency, and mitigates, to a certain extent, program memory overflow caused by data skew.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a flow chart illustrating steps of a method for associating data tables according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating steps of another method for associating data tables according to one embodiment of the present application;
FIG. 3 is a schematic diagram of a data table association apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a server according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In the prior art, the Spark computing framework provides a BroadCast mode, in which a small table is broadcast, to avoid Shuffle. In this mode, the user specifies a threshold, any small table whose data volume is below that threshold is identified and broadcast to the node where the large table is located, and the association is then performed locally. However, a data table produced by an association is not broadcast again in the original Spark execution plan. For example, after two small tables are associated by broadcasting, a temporary small table is registered. When this temporary small table is later associated with a large table, the Spark program does not broadcast it, even if its data volume is smaller than the broadcast threshold set by the user, but performs the association by Shuffle instead. This increases the amount of data transmitted over the network, reduces program execution efficiency, and may even cause program memory overflow due to data skew.
Therefore, in view of the above problems, the core idea of the present embodiment is to determine, each time data tables are associated, the data volume of the data table obtained by the previous association, and, if that data volume is smaller than the BroadCast threshold, to continue associating the data tables by broadcasting the small table. By adding this dynamic check to the execution process, the embodiment reduces the amount of data transmitted over the network, improves program execution efficiency, and avoids, to a certain extent, program memory overflow caused by data skew.
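The claimed reduction in network traffic can be made concrete with a deliberately simplified cost model. The model, the function names, and the row counts below are assumptions for illustration only: a Shuffle is taken to move every row of both tables in the worst case, while a broadcast moves one copy of the small table to each node holding part of the large table.

```python
def shuffle_rows_moved(left_rows, right_rows):
    """Worst case for a Shuffle-based association: every row of both tables crosses the network."""
    return left_rows + right_rows

def broadcast_rows_moved(small_rows, num_nodes):
    """Broadcast-based association: only the small table moves, one copy per node."""
    return small_rows * num_nodes

# Hypothetical sizes: a temporary table from a previous association and a large table.
temp_table_rows = 1_000
large_table_rows = 50_000_000
nodes = 20

shuffle_cost = shuffle_rows_moved(temp_table_rows, large_table_rows)
broadcast_cost = broadcast_rows_moved(temp_table_rows, nodes)
```

Under these assumed numbers the broadcast moves 20,000 rows against roughly 50 million for the Shuffle. The advantage shrinks as the small table grows or the number of nodes increases, which is why a threshold is needed.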
The technical solution of the present application will be described below by way of specific examples.
Referring to fig. 1, a schematic flow chart illustrating steps of a data table association method according to an embodiment of the present application is shown, which may specifically include the following steps:
S101, when a calculation request for a data table is received, acquiring an instruction set corresponding to the calculation request;
It should be noted that the method can be applied to a server. That is, the execution subject of the present embodiment is a server, and the association of data tables is realized through processing by the server. The server in this embodiment may be a cloud server or a server cluster composed of a plurality of computing devices; the specific type of the server is not limited in this embodiment.
Specifically, the request may be sent by a worker from a terminal device to a server where the distributed cluster is located according to a certain calculation rule, or may be automatically generated according to a set trigger condition.
Data in the distributed cluster is usually organized in a table form, and in the process of realizing business requirements, correlation calculation is usually performed on data in different data tables so as to meet corresponding business requirements.
In the embodiment, when a calculation request for a data table is received, an instruction set corresponding to the calculation request, namely a set of a plurality of instructions which need to be executed in the subsequent calculation process, can be obtained firstly.
S102, determining a plurality of association stages in the instruction set, wherein the association stages comprise a first association stage and a non-first association stage adjacent to the first association stage;
Generally, the result data required by the business is not stored in one data table, but is obtained by associating a plurality of data tables. In business logic, the association process can be divided into multiple stages, such as a first association stage, a second association stage, a third association stage, and so on.
In this embodiment, any two adjacent association stages may be treated as a first association stage and a non-first association stage. That is, once a certain association process is referred to as the first association stage, the subsequent association process adjacent to it is referred to as the non-first association stage.
For example, suppose a Spark SQL calculation request requires data table A to be associated with data table B first, and the resulting data table C to be associated with another data table D to obtain the data table E finally required by the business. The association of data table A with data table B is then the first association stage, and the association of data table C with data table D is the non-first association stage.
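The pairing of adjacent stages in the A, B, C, D, E example can be sketched as follows. The `AssociationStage` type and the `adjacent_pairs` helper are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple

class AssociationStage(NamedTuple):
    left: str     # first input table
    right: str    # second input table
    result: str   # table produced by the association

# The two stages from the example: A with B gives C, then C with D gives E.
stages = [
    AssociationStage("A", "B", "C"),
    AssociationStage("C", "D", "E"),
]

def adjacent_pairs(stages):
    """Pair each stage with its successor as (first stage, non-first stage)."""
    return list(zip(stages, stages[1:]))

pairs = adjacent_pairs(stages)
first, non_first = pairs[0]
# The non-first stage consumes the table produced by the first stage.
consumes_previous = first.result in (non_first.left, non_first.right)
```

With three or more stages, the same helper would yield overlapping pairs, so each stage after the very first one plays the non-first role once.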
S103, when the association task corresponding to the first association stage is executed, identifying a first data table and a second data table to be associated in the first association stage;
In the instruction set corresponding to the Spark SQL calculation request, the object of each calculation is written in the form of an instruction, which indicates which data or objects need to be processed in the calculation.
For the association of data tables, the two data tables to be associated in each association operation can likewise be obtained directly from the instruction. That is, when a certain association task in the calculation instructions is executed, the two data tables that need to be associated can be determined directly from the instruction.
For example, if policy information is stored in data table A and personal information of applicants is stored in data table B, and the business requires policy information and applicant information to be collected in the same data table, data table A and data table B need to be associated using Spark SQL.
S104, if the data quantity of at least one data table in the first data table and the second data table is smaller than a preset data threshold value, associating the first data table with the second data table in a broadcast mode to obtain an associated data table in a first association stage;
In this embodiment, the two initial data tables can be associated by executing a preset Spark SQL instruction.
In a specific implementation, when the Spark SQL instruction associates two different initial data tables, it may first judge whether the two tables satisfy the condition of a set data threshold. The data threshold may refer to a data volume, compared against the data volume recorded in the metadata of each data table, and its specific value may be set according to the actual condition of the system.
Therefore, it may first be judged whether either of the two initial data tables to be associated has a data volume not exceeding the data threshold; if so, that data table can be broadcast to the data node where the other data table is located.
For example, if the data volume of data table A is smaller than the preset data threshold and the data volume of data table B is larger than the preset data threshold, data table A may be broadcast to the data node where data table B is located; if the data volume of data table A is larger than the preset data threshold and the data volume of data table B is smaller than the preset data threshold, data table B may be broadcast to the data node where data table A is located; if the data volumes of data table A and data table B are both smaller than the preset data threshold, the table with the smaller data volume can be broadcast to the data node where the table with the larger data volume is located.
Of course, if the data volumes of data table A and data table B are both greater than the preset data threshold, the data tables cannot be associated in a broadcast manner and must instead be associated through a Shuffle operation.
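The four cases above reduce to a small decision rule, sketched here under stated assumptions. The function name `choose_strategy` and the return format are hypothetical. Treating a table whose size equals the threshold as too large to broadcast, and broadcasting table A on a tie, are choices made for this sketch.

```python
def choose_strategy(size_a, size_b, threshold):
    """Return ("broadcast", table_to_send) or ("shuffle", None) for tables A and B."""
    # Neither table is below the threshold: fall back to a Shuffle-based association.
    if min(size_a, size_b) >= threshold:
        return ("shuffle", None)
    # At least one table is below the threshold: send the smaller one to the other's node.
    return ("broadcast", "A" if size_a <= size_b else "B")

only_a_small = choose_strategy(10, 1000, threshold=100)
only_b_small = choose_strategy(1000, 10, threshold=100)
both_small = choose_strategy(10, 20, threshold=100)
both_large = choose_strategy(1000, 2000, threshold=100)
```

Because a table below the threshold is necessarily the smaller of the two whenever the other is above it, picking the smaller table covers all three broadcast cases with one comparison.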
In this embodiment, the association processing for the two initial data tables can be realized by the same key (key) value.
For example, for data table A storing policy information and data table B storing personal information of applicants, the key of both data tables may be the customer number. When the association is performed, rows with the same customer number in the two tables are associated to obtain one associated data table.
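Association through the same key value can be sketched as a hash join in which a lookup is built from the smaller table. The function name `broadcast_join`, the column names, and the sample rows are hypothetical and stand in for the policy and applicant tables of the example.

```python
def broadcast_join(small_table, large_table, key):
    """Join two lists of dict rows on `key`, building the lookup from the small table."""
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in large_table:
        # Rows of the large table without a matching key are dropped (inner join).
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical data: table A holds policies, table B holds applicants.
policies = [
    {"customer_no": 1, "policy": "P-100"},
    {"customer_no": 2, "policy": "P-200"},
    {"customer_no": 1, "policy": "P-300"},
]
applicants = [
    {"customer_no": 1, "name": "Li"},
    {"customer_no": 2, "name": "Wang"},
]
associated = broadcast_join(applicants, policies, "customer_no")
```

Here the applicant table is the smaller one, so it plays the role of the broadcast table and each policy row is matched against it locally.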
S105, when the associated task corresponding to the non-first associated stage is executed, counting the data volume of the associated data table obtained in the first associated stage;
In this embodiment, the instruction of the non-first association stage may be set in the Spark SQL program. That is, when a plurality of data tables need to be associated, the program can specify which two data tables are associated first and which data table the resulting associated data table is then associated with.
In the prior art, when the second association is executed, Spark SQL does not perform it in a broadcast manner but in a Shuffle manner. To reduce the amount of data transmitted over the network and improve program execution efficiency, an execution mode may be set in the Spark SQL instruction in advance, so that before the two data tables of the non-first association stage are associated, the data volume of the associated data table obtained in the previous stage is compared with the preset data threshold.
In a specific implementation, a detection condition may be set in the Spark SQL instruction: when execution of a task in a non-first association stage is detected, it is determined whether an instruction has been triggered that requires the data table obtained in the previous association stage to be associated with another data table.
S106, if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, broadcasting the associated data table obtained in the first association stage to a target data table, and associating the associated data table with the target data table through the same key value.
If the data volume of the associated data table obtained in the previous association stage, i.e. the first association stage, is smaller than the data threshold, the association task of the current stage can continue to be executed by broadcasting the small table: the associated data table is broadcast to the target data table for association, where the target data table is the data table with which the association task of the current stage is executed.
Similarly, when the association processing is performed on the association data table and the target data table, the association processing may be performed by the same key value.
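Steps S103 to S106 can be put together in a small two-stage sketch that re-checks the size of the intermediate table before the second association. All names, the row-count threshold, and the sample tables are assumptions. The sketch only records which strategy would be selected, since in a single process the join result is the same either way.

```python
def join_on_key(left, right, key):
    """Inner join of two lists of dict rows on `key`."""
    lookup = {}
    for row in left:
        lookup.setdefault(row[key], []).append(row)
    return [{**m, **r} for r in right for m in lookup.get(r[key], [])]

def associate(left, right, key, threshold, log):
    """Join two tables and record whether broadcast or Shuffle would be chosen."""
    if min(len(left), len(right)) < threshold:
        log.append("broadcast")
    else:
        log.append("shuffle")
    return join_on_key(left, right, key)

THRESHOLD = 3  # hypothetical threshold, measured in rows for simplicity
table_a = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
table_b = [{"k": 1, "b": "p"}, {"k": 2, "b": "q"}]
table_d = [{"k": 1, "d": i} for i in range(5)]

log = []
table_c = associate(table_a, table_b, "k", THRESHOLD, log)  # first association stage
table_e = associate(table_c, table_d, "k", THRESHOLD, log)  # non-first stage re-checks table_c
```

The second call measures `table_c`, the output of the first stage, so the intermediate table is eligible for broadcast again instead of being sent through a Shuffle by default.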
In the embodiment of the application, when a calculation request for a data table is received, an instruction set corresponding to the calculation request is obtained, and each association operation is assigned to a separate association stage according to the instruction set. Before each association, the data volumes of the two data tables are counted and checked; if an associated data table smaller than the preset data threshold exists, it is broadcast to the data node where the target data table is located, and the association is performed on the node holding the target data table, which has the larger data volume. This reduces the amount of data transmitted over the network, improves program execution efficiency, and mitigates, to a certain extent, program memory overflow caused by data skew.
Referring to fig. 2, a schematic flow chart illustrating steps of another data table association method according to an embodiment of the present application is shown, which may specifically include the following steps:
S201, when a calculation request for a data table is received, acquiring an instruction set corresponding to the calculation request;
The execution subject of the present embodiment is a server, and the association of data tables is realized through processing by the server.
The instruction set in this embodiment is a set of multiple instructions that need to be executed in the subsequent calculation process. The computational process required by the computational request can be accomplished by executing the individual instructions in the instruction set.
S202, identifying a plurality of associated instructions contained in the instruction set, wherein each associated instruction corresponds to an associated stage;
Typically, one association operation on any two data tables belongs to one association stage. The association instructions can be identified from the instruction set corresponding to the calculation request. Generally, each association instruction corresponds to one association stage, so the number of association instructions in the instruction set is essentially the number of association operations the calculation request needs to complete.
S203, determining the execution sequence of each associated instruction, identifying a first associated instruction at the first position of the execution sequence, and determining an associated stage corresponding to the first associated instruction as a first associated stage;
In this embodiment, when determining how many association stages the entire process includes, the execution order of the association instructions may be determined according to the execution order of all instructions.
Generally, the association instruction that is first in the execution order may be identified as the first association instruction, and the association stage corresponding to it is the first association stage; the next association instruction executed after the first association stage is taken as the second association instruction, and the association stage corresponding to it is the second association stage. In this way the execution order of the association stages is determined one by one.
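Steps S202 and S203 can be sketched as filtering the association instructions out of an instruction set and numbering them by execution order. The dictionary layout of an instruction (`order`, `op`, `tables`) is purely hypothetical; the source does not specify how instructions are represented.

```python
# Hypothetical instruction set, deliberately listed out of execution order.
instruction_set = [
    {"order": 3, "op": "join", "tables": ("C", "D")},
    {"order": 1, "op": "scan", "tables": ("A",)},
    {"order": 2, "op": "join", "tables": ("A", "B")},
    {"order": 4, "op": "write", "tables": ("E",)},
]

def association_stages(instructions):
    """Keep only association (join) instructions and number them by execution order."""
    joins = sorted(
        (ins for ins in instructions if ins["op"] == "join"),
        key=lambda ins: ins["order"],
    )
    return {stage_no: ins["tables"] for stage_no, ins in enumerate(joins, start=1)}

stages = association_stages(instruction_set)
```

Stage 1 in the resulting mapping is the first association stage, and each later stage is a non-first association stage relative to the one before it.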
S204, when the association task corresponding to the first association stage is executed, identifying a first data table and a second data table to be associated in the first association stage;
For the association of data tables, the two data tables to be associated in each association operation can be obtained directly from the instruction. That is, when a certain association instruction is executed, the two data tables that need to be associated can be determined directly from the instruction.
S205, if the data quantity of at least one data table in the first data table and the second data table is smaller than a preset data threshold value, associating the first data table and the second data table in a broadcast mode to obtain an associated data table in a first association stage;
In this embodiment, when the first association instruction is executed, it may first be determined whether the first data table and the second data table to be associated satisfy the condition of the set data threshold. To this end, the first data amount of the first data table and the second data amount of the second data table may be counted respectively.
Generally, the data size of a data table can be determined directly by looking at the metadata information recorded in the data table.
Therefore, in this embodiment, the data table information of the first data table and the second data table may be obtained separately, and then, by reading the first data amount of the first data table and the second data amount of the second data table described in the data table information, it is determined whether there is a case where the data amount of one data table is smaller than the preset data threshold in the first data table and the second data table.
If the data volume of at least one data table in the first data table and the second data table is smaller than the preset data threshold and the first data volume is larger than the second data volume, it indicates that the second data volume of at least the second data table is smaller than the data threshold, at this time, the second data table can be broadcasted to the first data node where the first data table is located, and the first data table and the second data table are associated at the first data node to obtain an associated data table in a first association stage.
If at least one data table in the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is smaller than or equal to the second data amount, it indicates that the first data amount of at least the first data table is smaller than the data threshold, at this time, the first data table may be broadcasted to a second data node where the second data table is located, and the second data node associates the first data table with the second data table to obtain an associated data table in a first association stage.
Of course, if the data amount of the first data table and the data amount of the second data table are both greater than the preset data threshold, the data table association cannot be performed in a broadcast manner. At this time, the association between the two needs to be realized through the Shuffle operation.
S206, when the associated task corresponding to the non-first associated stage is executed, determining the size of a memory occupied by the associated data table obtained in the first associated stage, and taking the size of the memory as the data volume of the associated data table obtained in the first associated stage;
Because the associated data table obtained in each stage results from associating two different data tables, it is not persisted as a physical table but exists only as a temporary table in the program. Its data volume therefore cannot be determined directly from metadata recorded in a data table. Instead, the data volume of the associated data table may be determined from the size of the memory it occupies.
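How the occupied memory might be measured is not specified in the source. The following is one rough, single-process approximation using Python's `sys.getsizeof`; the helper name, the tuple row layout, and the 10 MiB threshold are assumptions, and a distributed engine would rely on its own size statistics instead.

```python
import sys

def estimated_size_bytes(rows):
    """Rough in-memory size of a list of tuple rows: container, rows, and fields."""
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row)
        # Shared objects such as small integers are counted once per use,
        # so this is an upper-leaning estimate rather than an exact figure.
        total += sum(sys.getsizeof(field) for field in row)
    return total

# Hypothetical temporary table produced by a previous association stage.
temp_table = [(i, f"policy-{i}") for i in range(100)]
THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB broadcast threshold

can_broadcast = estimated_size_bytes(temp_table) < THRESHOLD_BYTES
```

The estimate is then compared with the same threshold used for physical tables, so the temporary table goes through the same broadcast check.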
The previous association stage adjacent to the non-first association stage refers to the association stage immediately preceding the current association stage, i.e., the first association stage. For example, if the current association stage is the second association stage, the adjacent previous association stage is the first association stage.
S207, if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, broadcasting the associated data table obtained in the first association stage to a target data node where the target data table is located;
After the data volume of the associated data table is determined, it can then be judged whether that data volume is smaller than the preset data threshold. If so, the current association operation can be executed by broadcasting the small table.
In a specific implementation, for an associated data table with a data amount smaller than a preset data threshold, the associated data table may be broadcast to a target data node where the target data table is located. The target data table is the data table that needs to be associated with the associated data table of the previous stage at this stage.
S208, associating the associated data table obtained in the first association stage and the data corresponding to the same key value in the target data table at the target data node.
When the target data node associates the associated data table with the target data table, the association can be performed on the same key value. The same key value may refer to a field that exists in both the associated data table and the target data table and that uniquely identifies a record.
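The association on the same key value at the target data node can be pictured as a hash join: the broadcast table is indexed by the key, and each row of the target table looks up its matches. The sketch below is illustrative only; the function name `join_on_key`, the field names, and the inner-join semantics are assumptions, since the text does not specify the join type.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def join_on_key(broadcast_rows: List[Row], target_rows: List[Row], key: str) -> List[Row]:
    """Inner-join a small broadcast table with the local target table on `key`."""
    # Build an in-memory index over the small (broadcast) table.
    index = defaultdict(list)
    for row in broadcast_rows:
        index[row[key]].append(row)

    # Probe the index with every row of the target table held on this node.
    result = []
    for target in target_rows:
        for match in index.get(target[key], []):
            # On a field-name collision, the target table's value wins.
            result.append({**match, **target})
    return result


# Hypothetical data: the associated table from the previous stage and the target table.
associated = [{"policy_id": 1, "holder": "A"}, {"policy_id": 2, "holder": "B"}]
target = [{"policy_id": 1, "premium": 100}, {"policy_id": 3, "premium": 50}]

print(join_on_key(associated, target, "policy_id"))
```

Only `policy_id` 1 appears in both tables, so the result contains a single combined row.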
It should be noted that, in describing the association task corresponding to a non-first association stage, this embodiment takes only one association operation as an example. Those skilled in the art should understand that, each time an association task is executed, this embodiment may run a corresponding data detection process through an instruction to count the data amount of the associated data table to be associated in the current stage. If that data amount is smaller than the preset data threshold, the associated data table is broadcast to the data node where the target data table of the current stage is located, which reduces the amount of data transmitted over the network and improves the efficiency of program execution.
In the embodiment of the application, each time a data table association task is executed, it is judged whether the data amount of the associated data table obtained in the previous stage is smaller than the preset data threshold. If so, the associated data table can be sent to the data node where the target data table is located, and the association is performed on that node. By adding this dynamic judgment to the execution process, the data amount of the associated data table is checked each time an association instruction is executed, so that the association can be performed by broadcasting whenever the data threshold requirement is met. This reduces the number of Shuffle operations and the amount of data transmitted over the network, improves the efficiency of program execution, and minimizes the risk of data skew during data transmission.
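The per-stage dynamic judgment can be sketched as a loop that re-measures the intermediate table before every association. The sketch runs in a single process, so both strategies produce the same joined rows and only the decision is recorded. The names `run_stages` and `hash_join`, and the use of the row count (`len`) as a stand-in for the data amount, are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Table = List[Row]


def hash_join(left: Table, right: Table, key: str) -> Table:
    """Inner join of two row lists on a shared key field."""
    index: Dict[object, List[Row]] = {}
    for lrow in left:
        index.setdefault(lrow[key], []).append(lrow)
    return [{**lrow, **rrow} for rrow in right for lrow in index.get(rrow[key], [])]


def run_stages(tables: List[Table], key: str, threshold: int,
               measure: Callable[[Table], int] = len) -> Tuple[Table, List[str]]:
    """Join the tables in order, deciding broadcast vs. shuffle at every stage."""
    current = tables[0]
    decisions: List[str] = []
    for stage, target in enumerate(tables[1:], start=1):
        if stage == 1:
            # First stage: either input table may be the small one.
            small_enough = min(measure(current), measure(target)) < threshold
        else:
            # Later stages: measure the intermediate result of the previous stage.
            small_enough = measure(current) < threshold
        decisions.append("broadcast" if small_enough else "shuffle")
        current = hash_join(current, target, key)
    return current, decisions


# Hypothetical tables: two of 10 rows each that overlap on only two keys,
# followed by a 100-row target table.
a = [{"id": i, "a": i} for i in range(10)]
b = [{"id": i, "b": i} for i in range(8, 18)]
c = [{"id": i, "c": i} for i in range(100)]

result, decisions = run_stages([a, b, c], key="id", threshold=5)
print(decisions)
```

With these inputs the first stage has to shuffle, but its result has only two rows, so the second stage can broadcast. A decision fixed before execution, based only on the original table sizes, would not see that the intermediate table had become small.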
It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Referring to fig. 3, a schematic diagram of a data table association apparatus according to an embodiment of the present application is shown, which may specifically include the following modules:
an instruction set obtaining module 301, configured to, when a computation request for a data table is received, obtain an instruction set corresponding to the computation request;
an association stage determination module 302 configured to determine a plurality of association stages in the instruction set, where the association stages include a first association stage and a non-first association stage adjacent to the first association stage;
a data table identifying module 303, configured to identify a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
an initial association module 304, configured to associate the first data table and the second data table in a broadcast manner if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, so as to obtain an associated data table in the first association stage;
a data amount counting module 305, configured to count the data amount of the associated data table obtained in the first association stage when the associated task corresponding to the non-first association stage is executed;
the target association module 306 is configured to broadcast the associated data table obtained at the first association stage to a target data table if the data amount of the associated data table obtained at the first association stage is smaller than the data threshold, and associate the associated data table with the target data table by using the same key value.
In this embodiment of the application, the association stage determining module 302 may specifically include the following sub-modules:
and the association instruction identification submodule is used for identifying a plurality of association instructions contained in the instruction set, and each association instruction corresponds to one association stage.
In this embodiment of the present application, the association stage determining module 302 may further include the following sub-modules:
an execution order determination submodule for determining an execution order of each of the associated instructions;
and the first association stage identification submodule is used for identifying the first association instruction at the first position of the execution order and determining the association stage corresponding to that instruction as the first association stage.
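The work of these sub-modules can be sketched as filtering the instruction set for association instructions and ordering them by execution order. The instruction representation used here (dictionaries with `op` and `order` fields) and the function name `find_association_stages` are hypothetical; the application does not define a concrete instruction format.

```python
from typing import Dict, List


def find_association_stages(instruction_set: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Map each association (join) instruction to a stage, ordered by execution order."""
    joins = [ins for ins in instruction_set if ins["op"] == "join"]
    ordered = sorted(joins, key=lambda ins: ins["order"])
    return [
        # The instruction in the first position defines the first association stage.
        {"stage": n, "is_first": n == 1, "instruction": ins}
        for n, ins in enumerate(ordered, start=1)
    ]


# Hypothetical instruction set for one calculation request.
instructions = [
    {"op": "select", "order": 0},
    {"op": "join", "order": 3, "tables": ("stage1_result", "t3")},
    {"op": "join", "order": 1, "tables": ("t1", "t2")},
    {"op": "filter", "order": 2},
]

stages = find_association_stages(instructions)
for s in stages:
    print(s["stage"], s["is_first"], s["instruction"]["tables"])
```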
In this embodiment of the present application, the initial association module 304 may specifically include the following sub-modules:
the data quantity counting submodule is used for respectively counting a first data quantity of the first data table and a second data quantity of the second data table;
a first association submodule, configured to broadcast the second data table to a first data node where the first data table is located if at least one of the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is greater than the second data amount, and associate the first data table and the second data table at the first data node to obtain an association data table in the first association stage;
a second association submodule, configured to broadcast the first data table to a second data node where the second data table is located if at least one of the first data table and the second data table has a data amount smaller than a preset data threshold and the first data amount is smaller than or equal to the second data amount, and associate the first data table and the second data table at the second data node to obtain an association data table in the first association stage.
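The direction rule implemented by the first and second association sub-modules can be sketched as a small function: the smaller table is sent to the node that holds the larger one, and a tie sends the first table to the second node. The function name `choose_broadcast_direction` and the string labels it returns are assumptions for illustration.

```python
from typing import Optional, Tuple


def choose_broadcast_direction(first_amount: int, second_amount: int,
                               threshold: int) -> Optional[Tuple[str, str]]:
    """Return (table to broadcast, node performing the association), or None.

    None means neither table is below the threshold, so broadcasting does not apply.
    """
    if not (first_amount < threshold or second_amount < threshold):
        return None
    if first_amount > second_amount:
        # The second table is smaller: send it to the node holding the first table.
        return ("second data table", "first data node")
    # The first table is smaller or equal: send it to the node holding the second table.
    return ("first data table", "second data node")


print(choose_broadcast_direction(100, 5, 10))
print(choose_broadcast_direction(5, 100, 10))
print(choose_broadcast_direction(100, 100, 10))
```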
In this embodiment of the present application, the data amount statistics sub-module may specifically include the following units:
a data table information acquiring unit configured to acquire data table information of the first data table and the second data table, respectively;
a data amount reading unit configured to read a first data amount of the first data table and a second data amount of the second data table described in the data table information.
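For physical tables, the two units above amount to a metadata lookup rather than a scan of the data. A minimal sketch follows, assuming a hypothetical catalog that records a `size_bytes` entry per table; the catalog structure, table names, and field names are not taken from the application.

```python
from typing import Dict, Tuple

# Hypothetical data table information, as might be recorded by a metadata store.
CATALOG: Dict[str, Dict[str, int]] = {
    "policy": {"row_count": 2_000_000, "size_bytes": 850_000_000},
    "branch": {"row_count": 300, "size_bytes": 40_000},
}


def read_data_amounts(catalog: Dict[str, Dict[str, int]],
                      first_table: str, second_table: str) -> Tuple[int, int]:
    """Read the recorded data amounts of both tables from their table information."""
    first_info = catalog[first_table]    # data table information acquiring unit
    second_info = catalog[second_table]
    # data amount reading unit: use the size recorded in the table information
    return first_info["size_bytes"], second_info["size_bytes"]


first_amount, second_amount = read_data_amounts(CATALOG, "policy", "branch")
print(first_amount, second_amount)
```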
In this embodiment, the data amount statistics module 305 may specifically include the following sub-modules:
and the memory counting submodule is used for determining the size of a memory occupied by the associated data table obtained in the first association stage when the associated task corresponding to the non-first association stage is executed, and taking the size of the memory as the data volume of the associated data table obtained in the first association stage.
In this embodiment of the application, the target association module 306 may specifically include the following sub-modules:
the associated data table broadcasting submodule is used for broadcasting the associated data table obtained in the first associated stage to a target data node where the target data table is located;
and the target data table association submodule is used for associating the association data table obtained in the first association stage with the data corresponding to the same key value in the target data table at the target data node.
According to the embodiment of the application, when a calculation request for a data table is received, the instruction set corresponding to the calculation request is obtained, and each association operation is assigned to a separate association stage according to the instruction set. Before each association, the data amounts of the two data tables are counted and compared with the preset data threshold. If a data table smaller than the threshold exists, it is broadcast to the data node where the target data table is located, and the association is performed on the node holding the target data table with the larger data amount. This reduces the amount of data transmitted over the network, improves the efficiency of program execution, and to some extent alleviates program memory overflow caused by data skew.
For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to the description of the method embodiment section for relevant points.
Referring to fig. 4, a schematic diagram of a server of one embodiment of the present application is shown. As shown in fig. 4, the server 400 of the present embodiment includes: a processor 410, a memory 420, and a computer program 421 stored in the memory 420 and executable on the processor 410. The processor 410 executes the computer program 421 to implement the steps in the various embodiments of the data table association method, such as the steps S101 to S106 shown in fig. 1. Alternatively, the processor 410, when executing the computer program 421, implements the functions of each module/unit in the above-mentioned device embodiments, for example, the functions of the modules 301 to 306 shown in fig. 3.
Illustratively, the computer program 421 may be partitioned into one or more modules/units, which are stored in the memory 420 and executed by the processor 410 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which may be used to describe the execution of the computer program 421 in the server 400. For example, the computer program 421 may be divided into an instruction set obtaining module, an association stage determining module, a data table identifying module, an initial association module, a data amount counting module, and a target association module, where the specific functions of the modules are as follows:
the instruction set acquisition module is used for acquiring an instruction set corresponding to a calculation request when the calculation request aiming at a data table is received;
an association stage determination module to determine a plurality of association stages in the instruction set, the association stages including a first association stage and a non-first association stage adjacent to the first association stage;
the data table identification module is used for identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
an initial association module, configured to associate the first data table and the second data table in a broadcast manner if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, so as to obtain an associated data table in the first association stage;
the data volume counting module is used for counting the data volume of the associated data table obtained in the first associated stage when the associated task corresponding to the non-first associated stage is executed;
and the target association module is used for broadcasting the associated data table obtained in the first association stage to a target data table if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, and associating the associated data table with the target data table through the same key value.
The server 400 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The server 400 may include, but is not limited to, the processor 410 and the memory 420. Those skilled in the art will appreciate that fig. 4 is merely an example of the server 400 and does not limit it; the server 400 may include more or fewer components than those shown, may combine some components, or may use different components. For example, the server 400 may also include input/output devices, network access devices, buses, and the like.
The processor 410 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 420 may be an internal storage unit of the server 400, such as a hard disk or internal memory of the server 400. The memory 420 may also be an external storage device of the server 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the server 400. Further, the memory 420 may include both an internal storage unit and an external storage device of the server 400. The memory 420 is used for storing the computer program 421 and other programs and data required by the server 400. The memory 420 may also be used to temporarily store data that has been output or is to be output.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for associating data tables, comprising:
when a calculation request aiming at a data table is received, acquiring an instruction set corresponding to the calculation request;
determining a plurality of association stages in the instruction set, the association stages including a first association stage and a non-first association stage adjacent to the first association stage;
when the association task corresponding to the first association stage is executed, identifying a first data table and a second data table to be associated in the first association stage;
if the data amount of at least one data table in the first data table and the second data table is smaller than a preset data threshold, associating the first data table with the second data table in a broadcast mode to obtain an associated data table in the first association stage;
when the associated task corresponding to the non-first associated stage is executed, counting the data volume of the associated data table obtained in the first associated stage;
if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, the associated data table obtained in the first association stage is broadcasted to a target data table, and the associated data table and the target data table are associated through the same key value.
2. The method of claim 1, wherein determining the plurality of association stages in the instruction set comprises:
a plurality of associated instructions contained in the instruction set are identified, each associated instruction corresponding to an associated stage.
3. The method of claim 2, further comprising, after said identifying a plurality of associated instructions contained in said set of instructions:
determining an execution order of each of the associated instructions;
and identifying the first association instruction at the first position of the execution order, and determining the association stage corresponding to that instruction as the first association stage.
4. The method according to claim 1, wherein if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, associating the first data table with the second data table in a broadcast manner to obtain the associated data table in the first association stage, includes:
respectively counting a first data volume of the first data table and a second data volume of the second data table;
if the data volume of at least one data table in the first data table and the second data table is smaller than a preset data threshold value, and the first data volume is larger than the second data volume, broadcasting the second data table to a first data node where the first data table is located, and associating the first data table and the second data table at the first data node to obtain an associated data table in a first association stage;
if the data amount of at least one data table in the first data table and the second data table is smaller than a preset data threshold value and the first data amount is smaller than or equal to the second data amount, broadcasting the first data table to a second data node where the second data table is located, and associating the first data table and the second data table at the second data node to obtain an associated data table in the first association stage.
5. The method of claim 4, wherein the separately counting a first amount of data of the first data table and a second amount of data of the second data table comprises:
respectively acquiring data table information of the first data table and the second data table;
and reading a first data size of the first data table and a second data size of the second data table recorded in the data table information.
6. The method according to any one of claims 1 to 5, wherein the counting data amount of the associated data table obtained in the first association stage when the associated task corresponding to the non-first association stage is executed comprises:
and when the associated task corresponding to the non-first associated stage is executed, determining the size of a memory occupied by the associated data table obtained in the first associated stage, and taking the size of the memory as the data volume of the associated data table obtained in the first associated stage.
7. The method according to claim 6, wherein the broadcasting the association data table obtained in the first association stage to a target data table, and associating the association data table and the target data table by the same key value comprises:
broadcasting the associated data table obtained in the first association stage to a target data node where a target data table is located;
and associating the associated data table obtained in the first association stage and the data corresponding to the same key value in the target data table at the target data node.
8. A data table association apparatus, comprising:
the instruction set acquisition module is used for acquiring an instruction set corresponding to a calculation request when the calculation request aiming at a data table is received;
an association stage determination module to determine a plurality of association stages in the instruction set, the association stages including a first association stage and a non-first association stage adjacent to the first association stage;
the data table identification module is used for identifying a first data table and a second data table to be associated in the first association stage when the association task corresponding to the first association stage is executed;
an initial association module, configured to associate the first data table and the second data table in a broadcast manner if the data amount of at least one of the first data table and the second data table is smaller than a preset data threshold, so as to obtain an associated data table in the first association stage;
the data volume counting module is used for counting the data volume of the associated data table obtained in the first associated stage when the associated task corresponding to the non-first associated stage is executed;
and the target association module is used for broadcasting the associated data table obtained in the first association stage to a target data table if the data volume of the associated data table obtained in the first association stage is smaller than the data threshold, and associating the associated data table with the target data table through the same key value.
9. A server comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the data table associating method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a data table associating method according to any one of claims 1 to 7.
CN202010227679.3A 2020-03-27 2020-03-27 Data table association method, device, server and storage medium Pending CN111459937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010227679.3A CN111459937A (en) 2020-03-27 2020-03-27 Data table association method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010227679.3A CN111459937A (en) 2020-03-27 2020-03-27 Data table association method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN111459937A true CN111459937A (en) 2020-07-28

Family

ID=71682481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010227679.3A Pending CN111459937A (en) 2020-03-27 2020-03-27 Data table association method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN111459937A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268586A (en) * 2017-09-22 2018-07-10 广东神马搜索科技有限公司 Across the data processing method of more tables of data, device, medium and computing device
CN108897796A (en) * 2018-06-12 2018-11-27 平安科技(深圳)有限公司 A kind of operation system calls method, storage medium and the server of influxdb database
WO2019033519A1 (en) * 2017-08-17 2019-02-21 平安科技(深圳)有限公司 User permission data query method and apparatus, electronic device, and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181831A (en) * 2020-09-28 2021-01-05 中国平安财产保险股份有限公司 Script performance verification method, device and equipment based on keywords and storage medium
CN112732715A (en) * 2020-12-31 2021-04-30 星环信息科技(上海)股份有限公司 Data table association method, device and storage medium
CN112732715B (en) * 2020-12-31 2023-08-25 星环信息科技(上海)股份有限公司 Data table association method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination